
On Fri, Apr 12, 2013 at 03:12:23AM +0000, James Harper wrote:
> > in particular, the reason why RAID-Z is so much better than mdadm RAID (which is, in turn, IMO much better than most hardware RAID)
> Disagree about the hardware RAID comment. You say "most hardware RAID", but if you consider the set of hardware RAID implementations that you would actually use on a server, mdadm is pretty feature-poor. In particular, the advantages of hardware RAID are:
With the sole exception of non-volatile write cache for HW RAID5/6, I'm including those in the comparison too. RAID5's write performance sucks without a good write cache - every small write turns into a read-modify-write of both data and parity - and the cache has to be non-volatile or battery-backed for safety; RAID6, with its second parity block, is even worse. As you suggest, bcache and flashcache seem to offer a way around this for mdadm, but I've never used either of them - I was already using ZFS by the time they became available. I don't think the SATA interface speed is a deal-breaker for them, because the only way around that is spending huge amounts of money.
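From what I've read (untested - the device names below are just examples), putting bcache over an md array looks roughly like this:

    # format the SSD partition as the cache device and the md array as
    # the backing device in one go (bcache-tools)
    make-bcache -C /dev/sdc1 -B /dev/md0

    # the cached device appears as /dev/bcache0; put the filesystem
    # (or LVM PV) on that rather than on /dev/md0 directly
    mkfs.ext4 /dev/bcache0

    # bcache defaults to write-through; write-back is what actually
    # helps RAID5/6 write performance
    echo writeback > /sys/block/bcache0/bcache/cache_mode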
> . Battery backed write cache. Bcache/flashcache offer this but they have their shortcomings, in particular that most available cache modules are still on top of the SATA channel.
This is the only real advantage of hardware RAID over mdadm. IMO, ZFS's ability to use an SSD or other fast block device as cache completely eliminates this last remaining superiority of hardware RAID over software RAID. I personally don't see any technical reason to choose hardware RAID over ZFS (although CYA managerial reasons and specifications written by technologically-illiterate buffoons who have picked up some cool buzzwords like "RAID" will often override technical best practice).
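As a rough sketch (the pool name and device paths here are made up - adjust for real hardware), a RAID-Z pool using two partitions of one SSD for log and cache would look something like this:

    # three spinning disks in a raidz vdev, plus a small SSD partition
    # for the ZIL (synchronous write log) and a larger one for L2ARC
    zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd \
        log /dev/sde1 cache /dev/sde2

    # 'logs' and 'cache' show up as separate sections in the layout
    zpool status tank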
> . Online resize/reconfigure
Both btrfs and ZFS offer this.
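For example (pool/filesystem names and devices below are placeholders), both can be grown while mounted and in use:

    # ZFS: grow the pool into space gained by replacing/enlarging an
    # underlying device
    zpool set autoexpand=on tank
    zpool online -e tank /dev/sdb

    # btrfs: add another device to a mounted filesystem and rebalance,
    # or grow into newly-available space on an existing device
    btrfs device add /dev/sdc /mnt/data
    btrfs balance start /mnt/data
    btrfs filesystem resize max /mnt/data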
> . BIOS boot support (see recent thread "RAID, again" by me)
This is a misfeature of a crappy BIOS rather than a fault with software RAID. Any decent BIOS not only lets you choose which disk to boot from (rather than hard-coding it to boot only from whichever disk is plugged into the first disk port), but will also let you specify a boot order so that it tries disk 1, then disk 2, then disk 3, or whatever. They'll also typically let you press F2 or F12 or whatever at boot time to pop up a boot device selection menu. Even server motherboards like Supermicro's let you choose the boot device and have a boot menu accessible over IPMI.
> Does ZFS have any native support for battery or flash backed write cache? With mdadm I can run bcache on top of it, then lvm on top of that, but with ZFS's tight integration of everything I'm not sure that would be possible and I'd have to run bcache on top of the component disks or the individual zvols, maybe (I'm probably mucking up the zfs terminology here but I hope you know what I mean).
Yep. ZFS has built-in support for both "log" devices, AKA synchronous write cache (the ZIL, or ZFS Intent Log), and "cache" devices for read caching (AKA L2ARC, "2nd-level Adaptive Replacement Cache", which is block-device based). The ZIL acts as a kind of write-back cache for synchronous writes, and is what makes RAID-Z performance not suck the way software RAID5 does. L2ARC is in addition to the RAM-based ARC, which is pretty much like your common garden-variety Linux disk caching; ARC and L2ARC are used for read caching of frequently/recently accessed data, as well as to support de-duplication by keeping block hashes in ARC or L2ARC.

Anyway, ZFS can use any block device(s) for "log" or "cache", including a disk or partition on a SATA or SAS or whatever interface, and also faster devices such as PCI-e SSDs like the Fusion-io cards or PCI-e battery-backed RAM disks. Those are out of my personal budget range so I haven't bothered checking on availability for x86/etc systems, but I know that in the Sun world there were extremely expensive specialised cache devices sold specifically for use with ZFS. In the x86/PC world you'd just use a generic super-fast block device like the Fusion-io, I guess - if you could afford it.

Dunno what they cost now, but I looked into the Fusion-io PCI-e cards for someone at work a year or two ago (they needed an extremely fast device to write massive amounts of data as it was captured by a scientific instrument camera). IIRC it was about $5K for a 1TB drive, which isn't bad considering what it was, but it was out of the budget for that researcher. Instead, I cobbled together a system with loads of RAM for a large ramdisk... clumsy, and they had to manually copy data onto more permanent storage later, but it was fast enough to keep up with the camera. Hmmm, it seems to be $1995 for 420GB (read speed of 1.4GB/s, write speed 700MB/s - their higher-end models get 1.1GB/s write speed) at the moment, according to these articles:

http://www.techworld.com.au/article/458387/fusion-io_releases_1_6tb_flash_ca...
http://www.computerworld.com/s/article/9226103/Fusion_io_releases_cheaper_io...

I expect that bcache and flashcache could use high-speed devices like these too. For those on a tighter budget, a current SATA SSD offers about 500-550MB/s read and slightly lower write speeds, at around $1/GB. From my own experience, even a small 4GB partition on a SATA SSD for the ZIL makes a massive performance difference, and a larger cache partition also helps (the actual commands are in the p.s. below).

BTW, the principal author of btrfs (Chris Mason, IIRC) left Oracle last year and is now working at Fusion-io. He's still working on btrfs.

craig

--
craig sanders <cas@taz.net.au>
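p.s. for completeness, retro-fitting an SSD onto an existing pool is just a couple of commands (the pool name "tank" and the partitions here are made-up examples):

    # small partition as a dedicated ZIL (log), larger one as L2ARC (cache)
    zpool add tank log /dev/sdf1
    zpool add tank cache /dev/sdf2

    # log and cache devices can be removed again later if needed
    zpool remove tank /dev/sdf1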