
As you suggest, bcache and flashcache seem to offer a way around this for mdadm, but I've never used either of them - I was already using ZFS by the time they became available. I don't think the SATA interface speed is a deal-breaker for them, because the only way around that is spending huge amounts of money.
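For reference, the basic bcache setup looks roughly like this (an untested sketch; the device names are made up):

    # format the SSD as a cache device and the md array as the backing
    # device in one go; they are attached to each other automatically
    make-bcache -C /dev/sdg -B /dev/md0
    # the cached device then shows up as /dev/bcache0 and is used in
    # place of /dev/md0
    mkfs.ext4 /dev/bcache0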
There were some not-too-expensive battery backed PCI ramdisks available a while ago. Not anymore though.
- Battery-backed write cache. Bcache/flashcache offer this, but they have their shortcomings - in particular, most of the available cache devices still sit on the SATA channel.
This is the only real advantage of hardware RAID over mdadm. IMO, ZFS's ability to use an SSD or other fast block device as a cache completely eliminates this last remaining superiority of hardware RAID over software RAID.
Yes I've now been enlightened on this subject :)
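For reference, adding an SSD to a pool as a read cache (L2ARC) and as a log device for synchronous writes looks roughly like this (a sketch; the pool and device names are made up):

    # read cache (L2ARC) on a single SSD
    zpool add tank cache /dev/sdf
    # separate intent log (SLOG) mirrored across two SSDs, so a single
    # SSD failure doesn't lose synchronous writes
    zpool add tank log mirror /dev/sdg /dev/sdh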
- Online resize/reconfigure
Both BTRFS and ZFS offer this.
Can it seamlessly continue over a reboot? Obviously it can't make progress while the system is rebooting, the way a hardware RAID can, but I'd hope it could pick up where it left off automatically.
- BIOS boot support (see recent thread "RAID, again" by me)
This is a misfeature of a crappy BIOS rather than a fault with software RAID.
Any decent BIOS not only lets you choose which disk to boot from (rather than hard-coding it to boot only from whichever disk is plugged into the first disk port), but will also let you specify a boot order so that it tries disk 1, then disk 2, then disk 3, and so on. They'll also typically let you press F2 or F12 or whatever at boot time to pop up a boot device selection menu.
Even server motherboards like Supermicro's let you choose the boot device and have a boot menu option accessible over IPMI.
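For software RAID that does assume the boot loader has actually been installed on every member disk, so that whichever disk the BIOS falls back to can start the boot. Roughly (a sketch; the device names are made up):

    # put GRUB on each member of the array so any of them is bootable
    for disk in /dev/sda /dev/sdb /dev/sdc; do
        grub-install "$disk"
    done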
This is where a lot of people get this wrong. Once the BIOS has succeeded in reading the boot sector from a boot disk, it's committed. If the boot sector reads okay (even after a long time on a failing disk) but anything between the boot sector and the OS fails, your boot has failed. This 'anything between' includes the GRUB bootstrap, the Xen hypervisor, the Linux kernel, and the initramfs, so it's a substantial amount of data to read from a disk that may be on its last legs. A good hardware RAID will have long since failed the disk by this point and booting will succeed.

My last remaining reservation about going ahead with some testing is: is there an equivalent of clvm for ZFS? Or is that the right approach for ZFS? My main server cluster is:

- 2 machines, each running 2 x 2TB disks with DRBD, with the primary exporting the whole disk as an iSCSI volume
- 2 machines, each importing the iSCSI volume, running LVM (clvm) on top, and using the LVs as backing stores for Xen VMs.

How would this best be done using ZFS?

Thanks

James

On Fri, 12 Apr 2013, James Harper <james.harper@bendigoit.com.au> wrote:
- Online resize/reconfigure
Both BTRFS and ZFS offer this.
Can it seamlessly continue over a reboot? Obviously it can't make progress while the system is rebooting, the way a hardware RAID can, but I'd hope it could pick up where it left off automatically.
Traditional RAID systems (hardware and software) have fixed sectors on each disk for RAID stripes, so if you have 5 disks in a RAID-5 then every 5th block is a parity block. I believe that both ZFS and BTRFS do it differently. In both cases, if you write an amount of data corresponding to an entire stripe then it will be written in the traditional manner, but if you have a small write that needs to be synchronous (i.e. filesystem metadata) then it may be written in a RAID-1 manner instead of RAID-5 (or some similar construct that involves a subset of the disks).

In that case, adding a new disk doesn't really require that all data be shuffled around on all disks. For example, if you had a BTRFS or ZFS RAID-Z type array that was 10% used and you added some more disks, there wouldn't be a great need to balance it immediately. You could just let the filesystem allocate new data across all disks and balance itself gradually.
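To illustrate the "balance later" point, the BTRFS workflow is roughly the following (a sketch; the device name and mount point are made up):

    # add a new device to an existing filesystem; existing data stays
    # where it is and new allocations can use the new device
    btrfs device add /dev/sde /mnt/pool
    # optionally spread the existing data across all devices later,
    # whenever it's convenient
    btrfs balance start /mnt/pool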
My last remaining reservation about going ahead with some testing is: is there an equivalent of clvm for ZFS? Or is that the right approach for ZFS? My main server cluster is:
- 2 machines, each running 2 x 2TB disks with DRBD, with the primary exporting the whole disk as an iSCSI volume
- 2 machines, each importing the iSCSI volume, running LVM (clvm) on top, and using the LVs as backing stores for Xen VMs.
How would this best be done using ZFS?
There are scripts to use zfs send/receive in a tight loop, synchronising it as often as every minute. This is about as good as DRBD. DRBD allows a synchronous write to be delayed until the data is committed to disk on both servers; the zfs send/receive option would allow a write to succeed before it appears on the other system. This could be bad for a database, but for other tasks it wouldn't necessarily be so bad.

I've considered using zfs send/receive for a mail server. If the primary failed then some delivered mail could disappear when the secondary became active (which would be bad). But if the primary's failure didn't involve enough disks dying to break its RAID, then after recovering from the problem the extra email could be copied across. Email is in some ways an easier problem to solve because the important operation is file creation, and the files that matter aren't modified. If a file exists somewhere then it can be copied across and everything's good.

-- 
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/
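A minimal sketch of the kind of loop such scripts run (the pool, dataset, and host names are made up, and a real script would also need proper error handling and snapshot pruning on the receiving side):

    #!/bin/sh
    # assumes an initial full send has already been done, e.g.
    #   zfs snapshot tank/vm@0 && zfs send tank/vm@0 | ssh standby zfs receive tank/vm
    prev=0
    while true; do
        now=$(date +%s)
        zfs snapshot tank/vm@$now
        if zfs send -i tank/vm@$prev tank/vm@$now | ssh standby zfs receive -F tank/vm; then
            zfs destroy tank/vm@$prev
            prev=$now
        else
            # send failed; drop the new snapshot and retry from the old one
            zfs destroy tank/vm@$now
        fi
        sleep 60
    done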