state of ZFS and btrfs (was: RAID-1 synchronisation)

On 05/02/12 19:13, Daniel Pittman wrote:
On Sat, Feb 4, 2012 at 23:30, Russell Coker <russell@coker.com.au> wrote:
On Sun, 5 Feb 2012, Daniel Pittman <daniel@rimspace.net> wrote:
Yes, it does some good. For something like RAID-5, which has enough information to detect which device is wrong, you can find and correct problems. With RAID-1 you can only know that a device is returning bad data, not which one.
You mean RAID-6. RAID-5 is no better than RAID-1 when it comes to determining which one of the disks is returning corrupt data.
Sorry, yes, I did. I was thinking about that at the time I wrote it, then went back and edited. Obviously, poorly.
On Sun, 5 Feb 2012, Daniel Pittman <daniel@rimspace.net> wrote:
Yes. This is the substantial advantage that BTRFS and ZFS have over device-level RAID and LVM. The stronger checksums are useful because they can be checked inline with lower I/O cost, and the intimate knowledge of allocated space means you can check more quickly. Both very attractive features.
https://btrfs.wiki.kernel.org/
The above URL lists mirroring under "Additional features in development", so I don't think I can use this on serious servers for a while.
Honestly, I wouldn't touch btrfs for some time. Last I heard, overhead was still in the thirty to fifty percent range, which makes ZFS look nimble and efficient by comparison. ZFS looks significantly better, though it is still a little annoying to use on Linux, and Debian GNU/kFreeBSD is annoying enough that I never got around to testing it seriously.
I've been giving ZFS a bit of a trial run on one of my servers, because the block-based deduplication looked attractive, as did the checksumming and using an SSD as L2 cache.

Unfortunately I ended up giving up -- it *seemed* pretty good on the surface, but I managed to kill the server too often. I ended up in a state where attempting to 'rm -rf' a (large) directory would cause the machine's load to spike to over 100, and after 15+ minutes the hardware watchdog would reboot the machine because it had become totally unresponsive, even on the console. Prior to these crashes, there'd be a bunch of log messages about kernel memory allocation failures.

I'm pretty sure this was because the memory required for the dedup hash table is huge -- gigabytes per terabyte of storage. Now, if that can't fit in RAM, then ZFS should just page it in from disk, which is slow but shouldn't crash the machine. In practice, however, it did seem to kill it :(

I'm giving btrfs another go now; pity about losing the block-based dedup (which is great when you have a lot of virtual machine images), but the overhead just wasn't worth it for me.

One thing I'm already struggling to work out in btrfs is simply how much space a snapshot is consuming! Anyone?

-Toby
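For concreteness, the sort of setup being trialled there could be put together along these lines -- a minimal sketch, with a hypothetical pool name and device paths:

    # mirrored pool with an SSD attached as an L2ARC cache device
    zpool create tank mirror /dev/sda /dev/sdb cache /dev/sdc
    # turn on block-level deduplication for the pool
    zfs set dedup=on tank
    # 'zpool list' includes a DEDUP column showing the achieved ratio
    zpool list tank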

Hi Toby,

As I understand it, snapshots themselves use 0% disk space; of course, the storage required to keep the copy will be locked, but because of the way it does this, it's still a very small amount of space.

Cheers,
Mike

On Mon, 6 Feb 2012, Mike O'Connor wrote:
On 6/02/12 12:20 PM, Toby Corkindale wrote:
I'm giving btrfs another go now; pity about losing the block-based dedup (which is great when you have a lot of virtual machine images), but the overhead just wasn't worth it for me.
One thing I'm already struggling to find in btrfs is simply to work out how much space a snapshot is consuming! Anyone?
As I understand it, snapshots themselves use 0% disk space; of course, the storage required to keep the copy will be locked, but because of the way it does this, it's still a very small amount of space.
Bottom post and trim, please.

Take a btrfs filesystem with 4GB on it. Snapshot it (1). Allocate a 1GB and a 3GB file. Take a snapshot (2). Delete the 1GB file. How much space is allocated within btrfs, how much space is the current filesystem taking up, how much space did snapshot 2 take up over snapshot 1, and how much space would be freed by deleting snapshot 2 are all separate questions.

(I have a 2.5-year-old backuppc instance at home where I have similar questions. I have just filled up my disk, never having had to purge any old snapshots outside of the normal trimming algorithm I gave it. Now that I have filled it, I don't want to purge the most historical information out of it. I know there are some intermediate snapshots where I allocated large temporary files that I can safely purge from the backups. What is the minimal set of backups that can be purged to free up the space I need, allowing me to restart the backups without buying more disk?)

-- Tim Connors

On 06/02/12 13:20, Mike O'Connor wrote:
Hi Toby
As I understand it, snapshots themselves use 0% disk space; of course, the storage required to keep the copy will be locked, but because of the way it does this, it's still a very small amount of space.
Hi Mike,

Yes, that's how I understand the system to work too. The question is: as I modify the working filesystem, how can I tell how much space is being used by the snapshot to keep its original copies of everything?

-Toby
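One hedged pointer for anyone reading this later: btrfs quota groups, which arrived in kernels newer than those discussed in this thread, are one way to answer this -- the "exclusive" column is roughly the space that deleting a given snapshot would free. A sketch, assuming the filesystem is mounted at /mnt/vol:

    # start per-subvolume space accounting
    btrfs quota enable /mnt/vol
    # snapshots are subvolumes, so they show up in the accounting
    btrfs subvolume snapshot /mnt/vol /mnt/vol/snap1
    # lists referenced vs. exclusive bytes for each subvolume/snapshot
    btrfs qgroup show /mnt/vol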

Quoting Toby Corkindale (toby.corkindale@strategicdata.com.au):
I've been giving ZFS a bit of a trial run on one of my servers, because the block-based deduplication looked attractive, as did the checksumming and using an SSD as L2 cache.
Unfortunately I ended up giving up -- it *seemed* pretty good on the surface, but I managed to kill the server too often.
If I were to deploy ZFS for any important deployment, personally I'd use Nexenta for that: OpenSolaris kernel, almost entirely GNU userspace. It appears to be extremely reliable for that purpose.

-- 
Rick Moen <rick@linuxmafia.com>  McQ! (4x80)
"Do not be afraid to use exclamation points in your writing. They can sense fear." -- FakeAPStylebook

Quoting "Rick Moen" <rick@linuxmafia.com>:
If I were to deploy ZFS for any important deployment, personally I'd use Nexenta for that: OpenSolaris kernel, almost entirely GNU userspace. It appears to be extremely reliable for that purpose.
I am running ZFS on production machines under FreeBSD and have had no issues over the last 18 months.

Regards,
Peter

Quoting Peter Ross (Peter.Ross@bogen.in-berlin.de):
I am running ZFS on production machines under FreeBSD and have had no issues over the last 18 months.
I'm sure, but, even as fond as I am of FreeBSD, having almost entirely GNU userspace is a significant advantage from my own perspective.

On Monday 06 February 2012 13:24:46 Rick Moen wrote:
If I were to deploy ZFS for any important deployment, personally I'd use Nexenta for that
As an aside (apologies if I've mentioned this before), the Sequoia BlueGene/Q supercomputer currently being built for LLNL will use Lustre as its 50PB cluster filesystem, using ZFS on Linux as its back end and, needless to say, has some fairly high performance targets (512-1024 MB/s for streaming writes I believe). Hence the http://zfsonlinux.org/ project, which they are behind.

More info in this presentation from last April:
http://zfsonlinux.org/docs/LUG11_ZFS_on_Linux_for_Lustre.pdf

cheers,
Chris

-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Sat, Feb 11, 2012 at 12:42:06PM +1100, Chris Samuel wrote:
On Monday 06 February 2012 13:24:46 Rick Moen wrote:
As an aside (apologies if I've mentioned this before), the Sequoia BlueGene/Q supercomputer currently being built for LLNL will use Lustre as its 50PB cluster filesystem, using ZFS on Linux as its back end and, needless to say, has some fairly high performance targets (512-1024 MB/s for streaming writes I believe).

Minor correction - that's GB/s, not MB/s :-)
cheers, robin

Toby Corkindale wrote:
I'm giving btrfs another go now; pity about losing the block-based dedup (which is great when you have a lot of virtual machine images)
Um, cp --reflink=always host0.img host1.img. You don't get *de*duplication (yet), but if you explicitly tell btrfs that the new file should start off with the same blocks as the old file, they'll be non-duplicated to begin with. OK, granted, over time they will drift apart, and dedup would have helped if both VMs made the same changes to the same virtual blocks, but AFAICT --reflink solves the first 90% of the problem.
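A quick illustration of the effect, assuming a hypothetical btrfs mount at /mnt: each file reports its full apparent size, but overall allocation barely grows until the copies diverge.

    # instant copy-on-write clone; no data blocks are duplicated yet
    cp --reflink=always host0.img host1.img
    # per-file sizes look doubled...
    du -sh host0.img host1.img
    # ...but the filesystem's actual allocation has barely moved
    btrfs filesystem df /mnt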

On 06/02/12 14:20, Trent W. Buck wrote:
Um, cp --reflink=always host0.img host1.img.
You don't get *de*duplication (yet), but if you explicitly tell btrfs that the new file should start off with the same blocks as the old file, they'll be non-duplicated to begin with.
OK, granted, over time they will drift apart, and dedup would have helped if both VMs made the same changes to the same virtual blocks, but AFAICT --reflink solves the first 90% of the problem.
Ah, it might solve 90% of your problem, but it doesn't for most people, where a VM image is created from scratch via an ISO image of an installer, and then the VM has lots of patches and upgrades applied over time. But thanks for pointing it out, as it might be useful info for someone. Cheers, Toby

Quoting "Toby Corkindale" <toby.corkindale@strategicdata.com.au>:
Ah, it might solve 90% of your problem, but it doesn't for most people, where a VM image is created from scratch via an ISO image of an installer, and then the VM has lots of patches and upgrades applied over time.
You can make a fresh install and a snapshot that is subsequently used for cloning (so you avoid installing again and again). Because you create a clone from the same snapshot again and again, you do not duplicate the space. I rarely create a new image from scratch.

Regards,
Peter
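In ZFS terms that workflow is just a snapshot plus cheap clones; a minimal sketch, with hypothetical dataset names:

    # freeze the freshly-installed golden image
    zfs snapshot tank/golden@base
    # each clone starts out sharing every block with the snapshot;
    # only its subsequent writes consume new space
    zfs clone tank/golden@base tank/vm1
    zfs clone tank/golden@base tank/vm2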

On 06/02/12 14:54, Peter Ross wrote:
Quoting "Toby Corkindale"<toby.corkindale@strategicdata.com.au>:
On 06/02/12 14:20, Trent W. Buck wrote:
Toby Corkindale wrote:
I'm giving btrfs another go now; pity about losing the block-based dedup (which is great when you have a lot of virtual machine images)
Um, cp --reflink=yes host0.img host1.img.
You don't get *de*dupping (yet), but if you explicitly tell btrfs that the new file should start off with the same blocks as the old file, they'll be non-dupped to begin with.
OK, granted, over time they will drift apart and dedupped would have helped if both VMs made the same changes to the same virtual blocks, but AFAICT --reflink solves the first 90% of the problem.
Ah, it might solve 90% of your problem, but it doesn't for most people, where a VM image is created from scratch via an ISO image of an installer, and then the VM has lots of patches and upgrades applied over time.
You can make a fresh install and a snapshot that is subsequently used for cloning (so you avoid installing again and again).
Because you create a clone from the same snapshot again and again, you do not duplicate the space.
I rarely create a new image from scratch.
Yeah, but we're talking about virtual MACHINES here. They run. They change. Even if the base install is near enough to identical, after a year or so, once the VMs have been release-upgraded, you've diverged from that original snapshot entirely -- yet all the images still share a lot of identical code if they're the same ubuntu/debian/whatever version.

Although for those kinds of VMs, I tend to use Linux Containers instead anyway; it's more often Windows machines in VM images, which don't enjoy having their images repeatedly cloned at all. (Well, not if you don't want the licensing system bitching at you all the time.)

-Toby

Quoting "Toby Corkindale" <toby.corkindale@strategicdata.com.au>:
Yeah, but we're talking about virtual MACHINES here. They run. They change. Even if the base install is near enough to identical, after a year or so, once the VMs have been release-upgraded, you've diverged from that original snapshot entirely -- yet all the images still share a lot of identical code if they're the same ubuntu/debian/whatever version.
Although for those kinds of VMs, I tend to use Linux Containers instead anyway
I do it in similar fashion with jails here, and a copy-on-write system such as ZFS or btrfs fits very neatly into this.
it's more often Windows machines in VM images, which don't enjoy having their images repeatedly cloned at all. (Well, not if you don't want the licensing system bitching at you all the time)
Up to a point, SYSPREP may help. Fortunately I don't have much need for it these days :-)

I could use deduplication on ZFS, but the available storage is too big to consider it ;-)

But I see your point and agree that it can be handy.

Regards,
Peter

Toby Corkindale wrote:
Ah, it might solve 90% of your problem, but it doesn't for most people, where a VM image is created from scratch via an ISO image of an installer, and then the VM has lots of patches and upgrades applied over time.
Sigh, that's still the norm? Fair enough, I guess; it seems like a waste of time to me when the ISO-to-base-install phase is (near) identical each time.

Trent W. Buck <trentbuck@gmail.com> wrote:
Sigh, that's still the norm? Fair enough, I guess; it seems like a waste of time to me when the ISO to base install phase is (near) identical each time.
I think the norm is (or should be) to run a script that creates the image. I remember reading about just such a tool... so I just looked it up: oz-install. It doesn't seem to be in Debian yet.

On 06/02/12 15:56, Jason White wrote:
I think the norm is (or should be) to run a script that creates the image. I remember reading about just such a tool... so I just looked it up - oz-install. It doesn't seem to be in Debian yet.
There's also debootstrap, if you don't mind missing out on a tonne of things the regular installer normally sets up.
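For anyone who hasn't used it, a minimal invocation looks something like this (suite, target directory and mirror are placeholders):

    # install a minimal Debian base system into a directory or mounted image
    debootstrap squeeze /mnt/target http://ftp.debian.org/debian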

On Mon, 6 Feb 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
Ah, it might solve 90% of your problem, but it doesn't for most people, where a VM image is created from scratch via an ISO image of an installer, and then the VM has lots of patches and upgrades applied over time.
If you had two VMs that had been running for a while and repeatedly upgraded, then you could shut down node B, do a reflink copy of node A, and then rsync the files from node B to the new reflink copy. You would end up with an identical set of node B's files, but with reflinks for the files that are identical (i.e. everything that's packaged from the distribution).

-- 
My Main Blog      http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
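A rough sketch of that trick, assuming both nodes' filesystems are visible as directory trees on the btrfs host (paths hypothetical). The --inplace and --no-whole-file options make rsync rewrite only the blocks that actually differ, so files identical to node A's keep their shared extents:

    # instant clone of node A's tree, sharing all of its extents
    cp -a --reflink=always /srv/vm/nodeA /srv/vm/nodeB.new
    # overwrite in place with node B's real content
    rsync -a --inplace --no-whole-file --delete /srv/vm/nodeB/ /srv/vm/nodeB.new/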

On 06/02/12 20:42, Russell Coker wrote:
If you had two VMs that had been running for a while and repeatedly upgraded then you could shut down node B, do a reflink copy of node A, and then rsync the files from node B to the new reflink copy. Then you would end up with an identical set of node B files but with reflink for the files that are identical (IE everything that's packaged from the distribution).
Or I could get a filesystem editor and a calculator, go through the blocks one at a time, and manually cross-link them with a hex editor and a copy of the filesystem specification. Or I could buy a bigger hard drive and put up with it for now, until someone implements stable block-based deduplication in an open-source filesystem. :)

We already have it in the kernel for memory (as KSM), so I'm surprised we haven't seen something for ext4 already.

Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
We already have it in the kernel for memory (as KSM) so I'm surprised we haven't seen something for ext4 already.
It's in the Btrfs FAQ: patches have been posted as of last year, but further work has not been done. Understandably, they're concentrating at the moment on making the features already implemented more reliable rather than on adding new features (except for fsck, that is).

On Tuesday 07 February 2012 16:22:25 Jason White wrote:
Understandably, they're concentrating at the moment on making the features already implemented more reliable rather than on adding new features (except for fsck, that is).
Worth keeping in mind that there are two projects for fsck: one to do as much recovery in-kernel as possible, and another to do a traditional out-of-tree checker and fixer. The current one in the btrfs utilities only checks, but doesn't fix; they are very nervous about releasing one which makes things worse.

If you are paranoid, there is also an out-of-tree patch to check I/Os going down to disk, to confirm they won't leave the filesystem in an inconsistent state should power go out. More details (and git repo location) here:

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/15005

-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Tue, Feb 07, 2012 at 10:36:27AM +1100, Toby Corkindale wrote:
or I could buy a bigger hard drive and put up with it for now, until someone implements a stable block-based deduplication in an open-source filesystem.. :)
We already have it in the kernel for memory (as KSM) so I'm surprised we haven't seen something for ext4 already.
it's in ZFS. the catch is that it takes enormous amounts of memory -- about 1GB of RAM per TB of disk space, IIRC, to store the hashes for each block. I suspect the same catch would apply to other implementations.

that's mitigated somewhat by the fact that it also uses your L2ARC (cache) for de-duping, so if you have a large, fast SSD cache device on your zpool, you can get away with less RAM.

I don't run enough VMs on my home machine to bother with de-duping. I've got terabytes of disk and only 16GB of RAM. I do use compression on the ZFS filesystems, though. it's effectively free (in fact, for most workloads it's a performance boost, because it's faster to load and decompress fewer blocks than it is to load more uncompressed blocks).

craig

-- 
craig sanders <cas@taz.net.au>

BOFH excuse #258: That's easy to fix, but I can't be bothered.
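Both points are cheap to evaluate before committing to them -- a hedged sketch, pool name hypothetical:

    # compression is per-filesystem and nearly free (lzjb by default)
    zfs set compression=on tank
    # see how much space it is actually saving
    zfs get compressratio tank
    # simulate dedup without enabling it: prints a block histogram and an
    # estimated dedup ratio, from which the dedup table's RAM cost can be sized
    zdb -S tank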

On 11/02/12 13:45, Craig Sanders wrote:
it's in ZFS. the catch is that it takes enormous amounts of memory [...]
I think you missed the earlier messages in this thread, which started off talking about ZFS and dedup.
participants (11)
- Chris Samuel
- Craig Sanders
- Jason White
- Mike O'Connor
- Peter Ross
- Rick Moen
- Robin Humble
- Russell Coker
- Tim Connors
- Toby Corkindale
- Trent W. Buck