bcache hits mainstream kernel

I can't be the only one who's been waiting for the bcache stuff to hit mainstream kernels. I rebooted into a stable 3.10 kernel yesterday. Due to the requirement to reformat disks, I haven't started using bcache yet. Is anyone else here already onto it? I'd be curious to hear how it compares to the zfs+l2arc setup some of us have been using previously.

bcache.txt from the linux kernel:
https://github.com/torvalds/linux/blob/master/Documentation/bcache.txt

Cheers,
Toby
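
For anyone wondering what the "reformat" step involves, here is a minimal sketch of the setup described in Documentation/bcache.txt, driven from Python. It assumes /dev/sdb is an empty backing disk and /dev/sdc is the SSD (both names are placeholders), with bcache-tools installed and root access.

    # A minimal sketch of the bcache setup steps from Documentation/bcache.txt.
    # /dev/sdb (backing disk) and /dev/sdc (SSD cache) are placeholders;
    # both devices get wiped. Requires bcache-tools and root.
    import glob
    import os
    import subprocess

    BACKING, CACHE = "/dev/sdb", "/dev/sdc"

    # Format the backing device and the cache device (this is the
    # "reformat disks" part -- existing data is destroyed).
    subprocess.run(["make-bcache", "-B", BACKING], check=True)
    subprocess.run(["make-bcache", "-C", CACHE], check=True)

    # Register both devices with the kernel; udev normally does this
    # automatically, in which case the write fails harmlessly.
    for dev in (BACKING, CACHE):
        try:
            with open("/sys/fs/bcache/register", "w") as f:
                f.write(dev)
        except OSError:
            pass

    # Attach the backing device to the cache set via its UUID, after which
    # the cached volume shows up as /dev/bcache0, ready for mkfs.
    cset_uuid = os.path.basename(glob.glob("/sys/fs/bcache/*-*-*-*-*")[0])
    with open("/sys/block/bcache0/bcache/attach", "w") as f:
        f.write(cset_uuid)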

On 02/07/13 10:33, Toby Corkindale wrote:
I can't be the only one who's been waiting for the bcache stuff to hit mainstream kernels. I rebooted into a stable 3.10 kernel yesterday. Due to the requirement to reformat disks, I haven't started using bcache yet. Is anyone else here already onto it? I'd be curious to hear how it compares to the zfs+l2arc setup some of us have been using previously.
bcache.txt from the linux kernel: https://github.com/torvalds/linux/blob/master/Documentation/bcache.txt
I do wonder if this has landed a bit too late, though. Back when they started, good SSDs were expensive and small; but now you can pick up relatively large and fast drives relatively cheaply. You can afford to use one as your primary drive, and just offload big media files to spinning drive arrays (which are fine for that access pattern of linear reads and writes).

Even the documentation is showing its age, using Intel X-25 drives as the example, which are now four years old.

I'm sure there's still a place for this technology when you don't *want* to have to manually choose where to store different categories of files - such as in NAS/storage appliances.

Some database loads might benefit, although for PostgreSQL at least you can (and should) configure it to use SSDs for the transaction logging and such anyway, which gives you most of the benefit.

Having looked at it a bit more, it seems better suited to the SSD-caching scenario than ZFS; there are auto-tuning parameters in bcache to detect at what point to just bypass the cache and go straight to the disks, saving more cache room for blocks that will benefit. And the write-ahead logging is limited only by the size of the cache (whereas ZFS' ZIL can't grow very large).

tjc
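
For what it's worth, those bypass heuristics are exposed as sysfs knobs, so you can inspect and tune the cut-off points yourself. A small sketch below, assuming the cached device came up as bcache0 and a single cache set is registered; the attribute names are the ones listed in Documentation/bcache.txt, and the values are examples, not recommendations.

    # Inspect (and optionally tweak) bcache's cache-bypass heuristics.
    # Assumes the cached device is bcache0 and one cache set is registered;
    # attribute names per Documentation/bcache.txt, values are examples only.
    import glob

    bdev = "/sys/block/bcache0/bcache"
    cset = glob.glob("/sys/fs/bcache/*-*-*-*-*")[0]

    def show(path):
        with open(path) as f:
            print(path, "=", f.read().strip())

    # Sequential IO larger than this threshold bypasses the cache entirely.
    show(bdev + "/sequential_cutoff")

    # If the cache device itself gets congested past these latencies,
    # IO starts going straight to the backing disks instead.
    show(cset + "/congested_read_threshold_us")
    show(cset + "/congested_write_threshold_us")

    # Example tweak: raise the sequential cutoff so that moderately large
    # streaming writes are still cached.
    with open(bdev + "/sequential_cutoff", "w") as f:
        f.write("16M")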

On 2 July 2013 20:20, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote: <...>
Having looked at it a bit more, it seems better suited to the SSD-caching scenario than ZFS; there are auto-tuning parameters in bcache to detect at what point to just bypass the cache and go straight to the disks, saving more cache room for blocks that will benefit.
This is precisely what the ZFS L2ARC is supposed to do.
And the write-ahead logging is limited only by the size of the cache (whereas ZFS' ZIL can't grow very large).
I don't know enough about bcache writes to make a comparison, but the maximum ZIL size would only be dictated by write throughput.

At least bcache is filesystem agnostic and doesn't suffer from the NIH syndrome.

--
Joel Shea <jwshea@gmail.com>

On 03/07/13 15:22, Joel W Shea wrote:
On 2 July 2013 20:20, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote: <...>
Having looked at it a bit more, it seems better suited to the SSD-caching scenario than ZFS; there are auto-tuning parameters in bcache to detect at what point to just bypass the cache and go straight to the disks, saving more cache room for blocks that will benefit.
This is precisely what the ZFS L2ARC is supposed to do.
And the write-ahead logging is limited only by the size of the cache (whereas ZFS' ZIL can't grow very large).
I don't know enough about bcache writes to make a comparison, but the maximum ZIL size would only be dictated by write throughput.
As I understand it, ZFS flushes the ZIL after at most five seconds. FAQs recommend the ZIL be sized at 10x your backing disks' maximum per-second write performance (so for 200 MB/sec, a ZIL of 2 GB).

So my understanding is that if you get a burst of small writes into the ZIL that the backing disks can't write out quickly enough, you'll hit a wall in less than ten seconds.

Whereas, if I understand bcache's design correctly, it will continue to write data to the SSD until it fills up, without a maximum dirty time. Because it's accumulating more writes before streaming them to the backing disks, there's a better chance of random writes being aggregated into linear ones. (From http://bcache.evilpiepirate.org/Design/)
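
To put numbers on that sizing rule: the log only ever needs to hold what the backing disks can destage before the next flush, so its useful size is bounded by their throughput rather than by the SSD's capacity. A back-of-envelope version, using the 200 MB/sec example above:

    # Back-of-envelope ZIL sizing from the rule of thumb above: roughly
    # 10 seconds' worth (~2x the 5-second flush interval) of the backing
    # disks' sustained write throughput. Figures are just the example above.
    backing_throughput_mb_s = 200
    flush_window_s = 10
    zil_size_gb = backing_throughput_mb_s * flush_window_s / 1024
    print("suggested log device size: ~%.1f GB" % zil_size_gb)  # ~2.0 GB
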
At least bcache is filesystem agnostic and doesn't suffer from the NIH syndrome.
Agreed.

On Wed, 3 Jul 2013, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
Having looked at it a bit more, it seems better suited to the SSD-caching scenario than ZFS; there are auto-tuning parameters in bcache to detect at what point to just bypass the cache and go straight to the disks, saving more cache room for blocks that will benefit.
This is precisely what the ZFS L2ARC is supposed to do.
And the write-ahead logging is limited only by the size of the cache (whereas ZFS' ZIL can't grow very large).
I don't know enough about bcache writes to make a comparison, but the maximum ZIL size would only be dictated by write throughput.
As I understand it, ZFS flushes the ZIL after at most five seconds. FAQs recommend the ZIL be sized at 10x your backing disks' maximum per-second write performance (so for 200 MB/sec, a ZIL of 2 GB).
So my understanding is that if you get a burst of small writes into the ZIL that the backing disks can't write out quickly enough, you'll hit a wall in less than ten seconds.
http://en.wikipedia.org/wiki/ZFS#ZFS_cache:_ARC_.28L1.29.2C_L2ARC.2C_ZIL

One thing to note is that if you don't have separate devices for the ZIL (which could be a pair of SSDs or a pair of fast disks) then part of the zpool will be used as the ZIL. So the writes that go into the ZIL will be using up the precious IO capacity of your main storage (which probably isn't that great if you use RAID-Z) if you don't have an external ZIL.

http://en.wikipedia.org/wiki/ZFS#Copy-on-write_transactional_model

My understanding is that the transaction groups committed to the zpool will be larger because the ZIL allows them to be coalesced. The general concept of log-based filesystems is that you can increase performance by making all those small writes end up mostly contiguous on main storage.
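
For reference, adding a dedicated (mirrored) log device to an existing pool is a one-liner; a sketch below, assuming a pool called "tank" and two spare SSD partitions -- all of the names are placeholders.

    # Move the ZIL onto a dedicated mirrored log device so synchronous
    # writes stop consuming IO capacity on the main (possibly RAID-Z) vdevs.
    # Pool name and device paths are placeholders.
    import subprocess

    POOL = "tank"
    LOG_DEVS = ["/dev/disk/by-id/ssd-a-part1", "/dev/disk/by-id/ssd-b-part1"]

    subprocess.run(["zpool", "add", POOL, "log", "mirror"] + LOG_DEVS, check=True)

    # Confirm where the log vdev ended up.
    subprocess.run(["zpool", "status", POOL], check=True)
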
Whereas, if I understand bcache's design correctly, it will continue to write data to the SSD until it fills up, without a maximum dirty time. Because it's accumulating more writes before streaming them to the backing disks, there's a better chance of random writes being aggregated into linear ones. (From http://bcache.evilpiepirate.org/Design/)
The problem here is that a filesystem like Ext4 isn't designed to make writes contiguous. It is designed to have some metadata near the data (which reduces seeks and is a good thing), but it will always have some significant seeks. The elevator algorithm should be more efficient if it writes more data in each pass, but it still shouldn't compare with a filesystem that is designed to make small writes contiguous.

That said, last time I compared BTRFS to Ext4 with Bonnie++, Ext4 won easily for write performance. When I used Postal the results were fairly similar. But that was a while ago and BTRFS has improved a lot since then. I'll have to do another test with a recent version of BTRFS and also see how it compares to ZFS.

--
My Main Blog        http://etbe.coker.com.au/
My Documents Blog   http://doc.coker.com.au/
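
Something along these lines, perhaps -- a rough sketch of running the same Bonnie++ workload against each filesystem under test. The mount points are placeholders, and bonnie++ needs to be installed and run against pre-mounted filesystems.

    # Run the same bonnie++ workload against each filesystem under test.
    # Mount points are placeholders; each must already be formatted and
    # mounted with the filesystem named in the key.
    import subprocess

    TARGETS = {"ext4": "/mnt/test-ext4", "btrfs": "/mnt/test-btrfs"}

    for fs, mountpoint in TARGETS.items():
        print("=== %s ===" % fs)
        subprocess.run(["bonnie++", "-d", mountpoint, "-u", "nobody"], check=True)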

Russell Coker <russell@coker.com.au> wrote:
I'll have to do another test with a recent version of BTRFS and also see how it compares to ZFS.
When you do this, could you include a test of the new "skinny extents" feature introduced in 3.10? http://btrfs.wiki.kernel.org/
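
If it helps, the feature appears to be selectable at mkfs time; a sketch below, assuming a btrfs-progs new enough to know about the feature flag and a scratch device (the device name is a placeholder and gets wiped).

    # Create a btrfs filesystem with skinny metadata extent refs enabled,
    # for comparison against a default-format filesystem. The device name
    # is a placeholder and will be reformatted.
    import subprocess

    DEV = "/dev/sdX1"

    # List the feature names this mkfs.btrfs understands.
    subprocess.run(["mkfs.btrfs", "-O", "list-all"], check=True)

    # Enable the skinny-extents ("skinny-metadata") feature at mkfs time.
    subprocess.run(["mkfs.btrfs", "-O", "skinny-metadata", DEV], check=True)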

On 02/07/13 10:33, Toby Corkindale wrote:
I can't be the only one who's been waiting for the bcache stuff to hit mainstream kernels. I rebooted into a stable 3.10 kernel yesterday. Due to the requirement to reformat disks, I haven't started using bcache yet. Is anyone else here already onto it? I'd be curious to hear how it compares to the zfs+l2arc setup some of us have been using previously.
bcache.txt from the linux kernel:
https://github.com/torvalds/linux/blob/master/Documentation/bcache.txt
I do wonder if this has landed a bit too late, though. Back when they started, good SSDs were expensive and small; but now you can pick up relatively large and fast drives relatively cheaply. You can afford to use one as your primary drive, and just offload big media files to spinning drive arrays (which are fine for that access pattern of linear reads and writes).
Even the documentation is showing its age, using Intel X-25 drives as the example, which are now four years old.
I'm sure there's still a place for this technology when you don't *want* to have to manually choose where to store different categories of files - such as in NAS/storage appliances.
Some database loads might benefit, although for PostgreSQL at least you can (and should) configure it to use SSDs for the transaction logging and such anyway, which gives you most of the benefit.
Having looked at it a bit more, it seems better suited to the SSD-caching scenario than ZFS; there are auto-tuning parameters in bcache to detect at what point to just bypass the cache and go straight to the disks, saving more cache room for blocks that will benefit. And the write-ahead logging is limited only by the size of the cache (whereas ZFS' ZIL can't grow very large).
The case I can see where bcache wins is for multiple VMs, where there may be several linear writes in progress that turn out not to be linear because they are interleaved with other VMs' writes.

And you say SSDs are cheap, but they still aren't cheap if you want TBs of data. TBs of data plus a 200GB bcache SSD is cheap though, and then you don't have to think too hard about what data to place on your precious SSD.

Also, bcache has some awareness of underlying block device metrics, in particular RAID5, where it attempts to ensure that whole stripes are written out where possible. I don't use RAID5 anymore except in some very special circumstances that aren't performance heavy anyway, but it's good to know. Bcache also optimises its access patterns for SSDs, which in theory reduces write amplification.

I ran bcache a while ago and it was a bit of a struggle getting it all going (bcache defaults to a 4K block size, which Windows on Xen had problems with...), and I didn't go any further than a bit of testing because there was just too much overhead in patching my own kernel and keeping Xen going. Debian is now up to 3.9 in sid and 3.10-rcX in experimental, so hopefully I'll be able to get back to testing again.

I hope PCI-E SSDs become affordable soon.

I attended a Dell tech class ("Master Class") recently and they were talking about how their SANs move "hot" extents (they call them pages) to faster storage, and also try to balance things out so that each array has equal load by moving extents around between arrays, which removes all the guesswork about where to put data files and transaction logs. It would be very cool if Linux could do this sort of thing natively.

James
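
For what it's worth, the block size is selectable when the devices are formatted, which looks like the way around the 4K issue mentioned above; a sketch below, with placeholder device names, going by the options documented for make-bcache.

    # Format the bcache devices with an explicit 512-byte block size rather
    # than the 4K default mentioned above, so guests that expect 512-byte
    # sectors keep working. Device names are placeholders and get wiped.
    import subprocess

    BACKING, CACHE = "/dev/sdb", "/dev/sdc"

    subprocess.run(["make-bcache", "--block", "512", "-B", BACKING], check=True)
    subprocess.run(["make-bcache", "--block", "512", "-C", CACHE], check=True)
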
participants (5):

- James Harper
- Jason White
- Joel W Shea
- Russell Coker
- Toby Corkindale