
On Wed, 3 Jul 2013, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
>>> Having looked at it a bit more, it seems better suited to the SSD-caching scenario than ZFS; there are auto-tuning parameters in bcache to detect at what point to just bypass the cache and go straight to the disks, saving more cache room for blocks that will benefit.
>> This is precisely what the ZFS L2ARC is supposed to do.
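For anyone following along who hasn't looked at either system, the decision both of them are making is roughly the following. This is only a toy sketch of the idea, not bcache's or the L2ARC's actual code, and the thresholds are made-up example numbers:

# Toy illustration of the "bypass the cache for some I/O" idea that both
# bcache and the ZFS L2ARC implement in their own ways.  This is NOT the
# real algorithm of either project; the thresholds are made-up examples.

SEQUENTIAL_CUTOFF = 4 * 1024 * 1024   # bypass streams larger than 4MB
CONGESTION_LIMIT_US = 2000            # bypass if the cache device is this slow

def should_bypass_cache(request_bytes, contiguous_so_far, cache_latency_us):
    """Return True if this I/O should go straight to the backing disks."""
    # Big sequential streams are served well by spinning disks anyway,
    # so don't waste cache space (and SSD write endurance) on them.
    if contiguous_so_far + request_bytes > SEQUENTIAL_CUTOFF:
        return True
    # If the SSD itself is the bottleneck then sending more work to it
    # only makes things worse; fall through to the backing disks instead.
    if cache_latency_us > CONGESTION_LIMIT_US:
        return True
    return False

# Example: a 16MB streaming read gets bypassed, a 64KB random read is cached.
print(should_bypass_cache(16 * 1024 * 1024, 0, 500))   # True
print(should_bypass_cache(64 * 1024, 0, 500))          # False

From what I've read bcache's sequential_cutoff and congestion threshold tunables are the real knobs for this sort of thing, and the L2ARC feed logic makes a broadly similar call internally, but the real implementations are of course more involved.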
>>> And the write-ahead logging is limited only by the size of the cache. (Whereas ZFS' ZIL can't grow very large.)
>> I don't know enough about bcache writes to make a comparison, but the maximum ZIL size would only be dictated by write throughput.
> As I understand it, ZFS flushes the ZIL after at most five seconds. FAQs recommend the ZIL be sized at 10x your backing disk(s)'s maximum per-second write performance. (So if 200MB/sec, then a ZIL of 2GB.)
> So my understanding of that is that if you get a burst of small writes to the ZIL that the backing disks can't write out anywhere near as fast, you'll hit a wall in less than ten seconds.
http://en.wikipedia.org/wiki/ZFS#ZFS_cache:_ARC_.28L1.29.2C_L2ARC.2C_ZIL

One thing to note is that if you don't have separate devices for the ZIL (which could be a pair of SSDs or a pair of fast disks) then part of the zpool will be used as the ZIL. So the writes that go into the ZIL will be using up the precious IO capacity of your main storage (which probably isn't that great if you use RAID-Z).

http://en.wikipedia.org/wiki/ZFS#Copy-on-write_transactional_model

My understanding is that the transaction groups committed to the zpool will be larger because the ZIL allows them to be coalesced. The general concept of log-based filesystems is that you can increase performance by making all those small writes end up mostly contiguous on main storage.
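To put rough numbers on the sizing advice quoted above, this is just the arithmetic behind the rule of thumb using the figures already mentioned; the factor of two for outstanding transaction groups is my assumption about where the 10x comes from, not something I've verified in the code:

# Back-of-the-envelope ZIL sizing from the figures quoted above.
# Not authoritative - just the arithmetic behind the "10x per-second
# write performance" rule of thumb.

backing_write_mb_per_sec = 200   # example figure from the quoted FAQ advice
txg_commit_interval_sec = 5      # the five second flush interval quoted above

# Assumption: up to two transaction groups' worth of dirty data can be
# outstanding at once (one being written out, one still filling), which
# would explain the 10x figure.
zil_size_mb = backing_write_mb_per_sec * txg_commit_interval_sec * 2
print(f"Suggested separate ZIL size: ~{zil_size_mb} MB ({zil_size_mb / 1024:.1f} GB)")
# -> Suggested separate ZIL size: ~2000 MB (2.0 GB)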
> Whereas, if I understand bcache's design correctly, it will continue to write data to the SSD until it fills up, without a maximum dirty time. Because it's accumulating more writes before streaming them to the backing disks, there's a better chance of random writes being aggregated into linear ones. (From http://bcache.evilpiepirate.org/Design/ )
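The win being described there comes from writing the accumulated dirty blocks back in block order rather than arrival order. A quick toy simulation (nothing to do with bcache's real writeback code) shows the difference in head movement:

import random

# Toy illustration of why accumulating dirty blocks and writing them back
# in LBA order helps: the disk head sweeps mostly in one direction instead
# of seeking all over the platter.  Not bcache's actual writeback code.

random.seed(1)
dirty_blocks = [random.randrange(1_000_000) for _ in range(1000)]  # random LBAs

def total_seek_distance(blocks):
    """Sum of head movement if blocks are written in the given order."""
    return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))

unsorted_seeks = total_seek_distance(dirty_blocks)          # write-through order
sorted_seeks = total_seek_distance(sorted(dirty_blocks))    # writeback in LBA order

print(f"seek distance, arrival order: {unsorted_seeks}")
print(f"seek distance, sorted order:  {sorted_seeks}")
# The sorted pass covers roughly the span of the disk once, instead of
# bouncing back and forth for every write.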
The problem is that a filesystem like Ext4 isn't designed to make writes contiguous in the first place. It is designed to keep some metadata near the data (which reduces seeks and is a good thing) but it will always incur some significant seeks. The elevator algorithm should be more efficient when it can write more data in each pass, but it still shouldn't compare with a filesystem that is designed to make small writes contiguous.

That said, the last time I compared BTRFS to Ext4 with Bonnie++, Ext4 won easily on write performance; when I used Postal the results were fairly similar. But that was a while ago and BTRFS has improved a lot since. I'll have to do another test with a recent version of BTRFS and also see how it compares to ZFS.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/