btrfs -- what happens when a checksum fails on a single disk?

btrfs -- if you have your data stored redundantly (i.e. a RAID mirror) then if one copy goes bad, it'll resync it from the other copy. That's all well and good. Can you explain the failure modes for when a block checksum fails and you only have data on a single disk? (I gather that there are both block and extent checksums, but correct me if I'm wrong.) Essentially I was wondering whether you'd lose an entire file (or extent of a file) if the checksum fails, or whether you could still access the rest of the file but with just one block missing in the event of a bad sector. Cheers, Toby

Quoting "Toby Corkindale" <toby.corkindale@strategicdata.com.au>:
Essentially I was wondering whether you'd lose an entire file (or extent of a file) if the checksum fails, or whether you could still access the rest of the file but with just one block missing in the event of a bad sector.
It shouldn't be any different from a "normal" device error - you get an error when you read that block. I don't know the source code, but I expect you hit this error on a sequential read; you could try to recover using random access, seeking to a block past the broken one. I personally never bothered to do forensics on corrupted filesystems or disks or whatever. If it isn't mirrored and isn't backed up, it isn't important enough. Regards Peter

On 14/02/12 12:25, Peter Ross wrote:
Quoting "Toby Corkindale"<toby.corkindale@strategicdata.com.au>:
Essentially I was wondering whether you'd lose an entire file (or extent of a file) if the checksum fails, or whether you could still access the rest of the file but with just one block missing in the event of a bad sector.
It shouldn't be any different from a "normal" device error - you get an error when you read that block.
I don't know the source code, but I expect you hit this error on a sequential read; you could try to recover using random access, seeking to a block past the broken one.
I personally never bothered to do forensics on corrupted filesystems or disks or whatever. If it isn't mirrored and isn't backed up, it isn't important enough.
In the commercial server world, sure, I agree. At home? Not so much. My server is fine, but my laptop or media PC? They only have one drive each. I back up important things onto the server from them, but I'd like to know what happens if an error occurs. If you just lose a few KB in a file then it's much easier to replace it and soldier on, rather than spend hours re-installing and restoring from backups.

Hi, On 14/02/2012, at 11:44 AM, Toby Corkindale wrote:
Can you explain the failure modes for when a block checksum fails when you have data on a single disk?
You would get a checksum failure reading that block.
Essentially I was wondering whether you'd lose an entire file (or extent of a file) if the checksum fails, or whether you could still access the rest of the file but with just one block missing in the event of a bad sector.
You wouldn't lose the entire file. btrfs is copy-on-write, so best case scenario, you could walk backwards through the tree to find a previous copy of the same block and read that, assuming the faulty block hasn't changed. If you manage to break the latest copy and the previous block is different, you may have to step back to previous data for that block. The "recover" tool in btrfs-progs is useful for extracting information in a read-only fashion from btrfs volumes. Cheers, Avi

Quoting "Avi Miller" <avi.miller@gmail.com>:
Hi,
On 14/02/2012, at 11:44 AM, Toby Corkindale wrote:
Can you explain the failure modes for when a block checksum fails when you have data on a single disk?
You would get a checksum failure reading that block.
Essentially I was wondering whether you'd lose an entire file (or extent of a file) if the checksum fails, or whether you could still access the rest of the file but with just one block missing in the event of a bad sector.
You wouldn't lose the entire file. btrfs is copy-on-write, so best case scenario, you could walk backwards through the tree to find a previous copy of the same block and read that, assuming the faulty block hasn't changed. If you manage to break the latest copy and the previous block is different, you may have to step back to previous data for that block. The "recover" tool in btrfs-progs is useful for extracting information in a read-only fashion from btrfs volumes.
That assumes that there is an older copy. The OP is thinking more of a movie or a song at home, I guess. They are written once; there will be no older copy. Even a snapshot does not help, because it will include the same faulty block. With ZFS you can keep multiple copies of directory trees, even on the same device; btrfs may have the same functionality. I do not have enough private data to fill a hard disk these days, so it could be an option for him too. Or slice the disks in two partitions of equal size and use them as a mirror. Regards Peter
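A rough sketch of that two-partition mirror idea, assuming hypothetical partitions /dev/sda1 and /dev/sda2 of equal size (this protects against bad sectors and silent corruption, not against losing the whole disk):

    # Mirror both data and metadata across the two partitions (btrfs RAID1 profile)
    mkfs.btrfs -d raid1 -m raid1 /dev/sda1 /dev/sda2
    mount /dev/sda1 /mnt/data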

Peter Ross <Peter.Ross@bogen.in-berlin.de> wrote:
I do not have enough private data to fill a hard disk these days so it could be an option for him too.
Or slice the disks in two partitions of equal size and use them as a mirror.
I just use rsync to ensure that important directories under /home are backed up. I also keep tar files of the git repository (maintained by etckeeper) that tracks /etc. If a disk block fails, it's time to get a new disk and restore all the files to it.
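For illustration, a minimal sketch of that kind of routine; the paths, host and user names are made up:

    # Mirror important home directories to the server; -a preserves permissions
    # and timestamps, --delete keeps the copy in sync with local deletions.
    rsync -a --delete /home/user/Documents/ server:/backups/user/Documents/
    # Archive the etckeeper-managed git repository that tracks /etc.
    tar -czf /var/backups/etc-git.tar.gz -C / etc/.git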

On Tuesday 14 February 2012 12:55:23 Peter Ross wrote:
With ZFS you can have multiple copies of directory trees, even on the same device. btrfs may have the same functionality.
By default, on a single device, metadata is duplicated and data is not. My experiments seem to show (for the ancient btrfs and kernel I'm running here at home) that duplication of data doesn't work with multiple devices. Worth asking on the btrfs list though. cheers, Chris
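To make those single-device defaults explicit, something like the following (a sketch; the device name is a placeholder):

    # Duplicate metadata (the single-device default), single copy of data.
    mkfs.btrfs -m dup -d single /dev/sdb1
    # After mounting, show which profiles are actually in use.
    btrfs filesystem df /mnt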

Hi, On 14/02/2012 12:55 PM, Peter Ross wrote:
I do not have enough private data to fill a hard disk these days so it could be an option for him too.
Or slice the disks in two partitions of equal size and use them as a mirror.
With BTRFS, you can mirror content or not depending on your requirement. You can even choose to mirror some files/dirs and not others. And to make it even more versatile, you can apply different RAID versions throughout as well.

The problem for me with BTRFS is that it is too new, and if you want the latest and greatest, then you always need the latest and greatest kernel -- something that doesn't sit well with me when I want systems that are as stable as possible. I'm sure that BTRFS's time will come, but it will likely be a while for me. I like ZFS, but the licensing issues worry me, as does the "Oracle ownership" situation.

It certainly seems like BTRFS is something very much to look forward to, but it probably shouldn't be used for any kind of critical data or systems for some time -- although Oracle has committed to using production BTRFS in OL very soon.

btw Avi did a good talk at linux.conf.au this year... but his comment about an older copy worries me here. It seems clear that with COW in use, the only way that would be useful with a single copy of the data block is if that data block has changed, and then you only have the option of reverting back to the older version of that block. Write it once, never change it, never have mirrors of the data, and you'll lose that block if the checksum shows corruption.

Kind Regards, AndrewM

Brian May wrote:
On 14 February 2012 15:59, Andrew McGlashan <andrew.mcglashan@affinityvision.com.au> wrote:
... you can apply different RAID versions throughout as well.
Just curious, what do you mean by that?
Like, / is RAID1 and /boot is RAID1 but /usr is RAID5, all within one big "blob" of btrfsness.

On 14/02/12 16:14, Trent W. Buck wrote:
Brian May wrote:
On 14 February 2012 15:59, Andrew McGlashan <andrew.mcglashan@affinityvision.com.au> wrote:
... you can apply different RAID versions throughout as well.
Just curious, what do you mean by that?
Like, / is RAID1 and /boot is RAID1 but /usr is RAID5, all within one big "blob" of btrfsness.
Citation needed... I can see how to set that in ZFS, but not btrfs. And even on ZFS, you can only control the number of copies per filesystem, not the type of RAID. And it's not guaranteed to store the copies on separate disks, although it'll try to. (For proper RAID you have to set it at zpool creation time instead.) -Toby
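For comparison, the ZFS side looks roughly like this (pool and dataset names are made up):

    # Real redundancy is fixed when the pool is created:
    zpool create tank mirror /dev/ada1 /dev/ada2
    # Per dataset you can only ask for extra copies of each block, not a RAID level:
    zfs set copies=2 tank/photos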

On 14/02/2012 4:25 PM, Toby Corkindale wrote:
On 14/02/12 16:14, Trent W. Buck wrote:
Brian May wrote:
On 14 February 2012 15:59, Andrew McGlashan <andrew.mcglashan@affinityvision.com.au> wrote:
... you can apply different RAID versions throughout as well.
Just curious, what do you mean by that?
Like, / is RAID1 and /boot is RAID1 but /usr is RAID5, all within one big "blob" of btrfsness.
Citation needed..
Watch the video from LCA2012 by Avi: http://mirror.internode.on.net/pub/linux.conf.au/2012/I_Cant_Believe_This_is... Cheers, AndrewM

On 14/02/12 17:13, Andrew McGlashan wrote:
On 14/02/2012 4:25 PM, Toby Corkindale wrote:
On 14/02/12 16:14, Trent W. Buck wrote:
Brian May wrote:
On 14 February 2012 15:59, Andrew McGlashan <andrew.mcglashan@affinityvision.com.au> wrote:
... you can apply different RAID versions throughout as well.
Just curious, what do you mean by that?
Like, / is RAID1 and /boot is RAID1 but /usr is RAID5, all within one big "blob" of btrfsness.
Citation needed..
Watch the video from LCA2012 by Avi:
http://mirror.internode.on.net/pub/linux.conf.au/2012/I_Cant_Believe_This_is...
Can you give me a rough time for the video to fast-forward to for the feature? And if the feature exists, why isn't it mentioned on the wiki page, man pages, etc?

On 14/02/2012, at 5:23 PM, Toby Corkindale wrote:
Like, / is RAID1 and /boot is RAID1 but /usr is RAID5, all within one big "blob" of btrfsness.
[..snip..]
And if the feature exists, why isn't it mentioned on the wiki page, man pages, etc?
The feature doesn't exist yet: it's one of the things to be pulled in by Chris after he delivers the btrfsck code. But yes, the plan is for btrfs to allow for multiple RAID levels within a single filesystem and to use the existing balancer code to be able to restripe between RAID levels, not just for an entire filesystem, but for subvolumes and files. Note that the restripe functionality was merged into Chris' tree about a week ago, so it's brand-new. There is a lot going into btrfs and I suspect more happening in the code than on the wiki. Cheers, Avi
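For the record, the restripe Avi describes eventually surfaces through the balance command's convert filters, roughly like this (a sketch for later btrfs-progs; at this point it applied to a whole filesystem, not per subvolume or file):

    # Add a second device, then convert data and metadata to RAID1 online.
    btrfs device add /dev/sdc1 /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt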

On 14/02/12 17:31, Avi Miller wrote:
On 14/02/2012, at 5:23 PM, Toby Corkindale wrote:
Like, / is RAID1 and /boot is RAID1 but /usr is RAID5, all within one big "blob" of btrfsness.
[..snip..]
And if the feature exists, why isn't it mentioned on the wiki page, man pages, etc?
The feature doesn't exist yet: it's one of the things to be pulled in by Chris after he delivers the btrfsck code. But yes, the plan is for btrfs to allow for multiple RAID levels within a single filesystem and to use the existing balancer code to be able to restripe between RAID levels, not just for an entire filesystem, but for subvolumes and files.
Ah OK. That would explain why I was unable to find the feature even on up to date kernels and git versions of the tools!
Note that the restripe functionality was merged into Chris' tree about a week ago, so it's brand-new. There is a lot going into btrfs and I suspect more happening in the code than on the wiki.
I have noticed that Debian and Ubuntu ship terribly old versions of the btrfs userspace tools -- I had to build my own just to get access to the "Scrub" feature which has been around for a while. That's a pity :(

On Tue, 14 Feb 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
I have noticed that Debian and Ubuntu ship terribly old versions of the btrfs userspace tools -- I had to build my own just to get access to the "Scrub" feature which has been around for a while. That's a pity :(
Debian/Unstable has a new enough version of btrfs-tools to support scrub; I've successfully scrubbed a BTRFS filesystem that had RAID-1 for data and metadata and which had some errors deliberately introduced on one partition. Debian/Stable isn't what you want to use for a filesystem that is at BTRFS's stage of development.
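The scrub workflow being described is essentially just (mount point is a placeholder):

    # Verify every checksum; with RAID-1 data and metadata, bad copies are
    # repaired from the good mirror. Then check the error counters.
    btrfs scrub start /mnt
    btrfs scrub status /mnt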

Avi Miller wrote:
Note that the restripe functionality was merged into Chris' tree about a week ago, so it's brand-new. There is a lot going into btrfs and I suspect more happening in the code than on the wiki.
& IIRC the wiki on kernel.org is still read-only after the break-in; there is a writable version... somewhere else, I forget. A few people I've spoken to have been running btrfs as a DKMS module, thereby allowing them to keep their otherwise-stable 2.6.32 (then) or 3.2 (now) or whatever kernel. This seems like a reasonable approach to me, except maybe if your /boot or / are on btrfs (cf. say, /srv).

On 15/02/2012, at 11:15 AM, Trent W. Buck wrote:
Avi Miller wrote:
Note that the restripe functionality was merged into Chris' tree about a week ago, so it's brand-new. There is a lot going into btrfs and I suspect more happening in the code than on the wiki.
& IIRC the wiki on kernel.org is still read-only after the breakin, there is a writable version... somewhere else, I forget.
As mentioned in a previous email, but I figure it's worth repeating, the writable wiki is at http://btrfs.ipv5.de Cheers, Avi

On 15/02/12 11:59, Avi Miller wrote:
On 15/02/2012, at 11:15 AM, Trent W. Buck wrote:
Avi Miller wrote:
Note that the restripe functionality was merged into Chris' tree about a week ago, so it's brand-new. There is a lot going into btrfs and I suspect more happening in the code than on the wiki.
& IIRC the wiki on kernel.org is still read-only after the breakin, there is a writable version... somewhere else, I forget.
As mentioned in a previous email, but I figure it's worth repeating, the writable wiki is at http://btrfs.ipv5.de
It's been mentioned four times this month, but people seem to have short memories :) It would be nice if there was an up to date roadmap on it.

On Tuesday 14 February 2012 17:31:56 Avi Miller wrote:
There is a lot going into btrfs and I suspect more happening in the code than on the wiki.
The best way to keep up with development is subscribing to linux-btrfs at Vger: http://vger.kernel.org/majordomo-info.html cheers, Chris

Chris Samuel <chris@csamuel.org> wrote:
The best way to keep up with development is subscribing to linux-btrfs at Vger:
Or read it at gmane.org if you don't want to add to your inbound mail.

On Tue, Feb 14, 2012 at 05:23:14PM +1100, Toby Corkindale wrote:
And if the feature exists, why isn't it mentioned on the wiki page, man pages, etc?
i can't remember where i came across it (possibly lwn.net), but I read somewhere that the main btrfs wiki isn't being updated (fallout from the kernel.org compromise last year). if i understood/remembered it correctly, updates are going to: http://btrfs.ipv5.de/ -- the main page there was last updated on Feb 9 and mentions kernel 3.2, while https://btrfs.wiki.kernel.org/ was last updated on 23 Aug 2011 and mentions kernel 3.0.

craig

ps: the only feature in btrfs that would be nice to have in ZFS is 'btrfs filesystem balance' to automatically rebalance sub-volumes and files when you add or remove drives. ZFS doesn't do that... new/updated files get spread across the extra drives in the pool after you add them, but there's no automatic way to re-balance them (it probably wouldn't be too hard to hack up a script to copy and rename all existing files if you really wanted to, but there would be issues with files that are already open). and another thing that would be nice for both would be the ability to move files between sub-volumes (in the same pool) as a fast move rather than a copy+delete.

On Tue, Feb 14, 2012 at 07:19:16PM +1100, Craig Sanders wrote:
On Tue, Feb 14, 2012 at 05:23:14PM +1100, Toby Corkindale wrote:
And if the feature exists, why isn't it mentioned on the wiki page, man pages, etc?
i can't remember where i came across it (possibly lwn.net), but I read
duh. it was posted here on Feb 5 by Chris Samuel. next time, search *before* posting, not *after*. craig

Quoting "Andrew McGlashan" <andrew.mcglashan@affinityvision.com.au>:
I like ZFS, but the licensing issues worry me, as does the "Oracle ownership" situation.
With FreeBSD and OpenIndiana and others, ZFS does not go away. It's probably like UFS -- that went through many, many alterations but you still find it in a lot of places, and principles of UFS went into other file systems. I guess it is the same with ZFS: there is btrfs, and there may be others in the future that borrow from it while the "original" develops as well.
btw Avi did a good talk at linux.conf.au this year.... but his comment about an older copy worries me here, it seems clear that with COW in use, the only way that would be useful with a single copy of the data block is if that data block has changed and then you only have the option of reverting back to the older version of that block. Write it once, never change it, never have mirrors of the data and you'll lose that block if the checksum shows corruption.
Which is exactly what happens on other file systems as well. The checksums at least give you a heads-up that something is wrong - on other systems you may just read the corrupted data and the START_MISSILE variable suddenly equals one. Regards Peter

On 14/02/12 15:59, Andrew McGlashan wrote:
With BTRFS, you can mirror content or not depending on your requirement. You can even choose to mirror some files/dirs and not others. And to make it even more versatile, you can apply different RAID versions throughout as well.
Are you really sure about that? The manual pages make it look like you can only set the RAID options at mkfs time, with there being no way to adjust this per file, dir, subvolume or whatever.

On Tue, 14 Feb 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
The manual pages make it look like you can only set the RAID options at mkfs time, with there being no way to adjust this per file, dir, subvolume or whatever.
Surely it will get such things as the filesystem develops. The Debian installation process supports installing to a degraded RAID-1 array and then adding a second disk after the installation is finished; it seems quite reasonable to expect the same functionality from BTRFS. There also need to be options to do things such as migrating from a disk with SMART errors to a new disk, to support systems with hot-swap disks but no hardware RAID (which I've used in production), and lots of other things. If they are going to do RAID seriously then they need operations comparable to all the things we are used to doing with software RAID, and ideally most of the things we are used to doing with hardware RAID (such as adding a disk to a RAID-5 set and making it a RAID-6).

On 14/02/12 15:59, Andrew McGlashan wrote:
Hi,
On 14/02/2012 12:55 PM, Peter Ross wrote:
I do not have enough private data to fill a hard disk these days so it could be an option for him too.
Or slice the disks in two partitions of equal size and use them as a mirror.
With BTRFS, you can mirror content or not depending on your requirement. You can even choose to mirror some files/dirs and not others. And to make it even more versatile, you can apply different RAID versions throughout as well.
The problem for me with BTRFS is that it is too new and if you want the latest and greatest, then you always need the latest and greatest kernel -- something that doesn't sit well with me when I want systems that are as stable as possible.
I'm sure that BTRFS time will come, but it will likely be a while for me. I like ZFS, but the licensing issues worry me, as does the "Oracle ownership" situation.
It certainly seems like BTRFS is something very much to look forward to and probably shouldn't be used for any kind of critical data or systems for some time -- although Oracle has committed to using production BTRFS in OL very soon.
btw Avi did a good talk at linux.conf.au this year.... but his comment about an older copy worries me here, it seems clear that with COW in use, the only way that would be useful with a single copy of the data block is if that data block has changed and then you only have the option of reverting back to the older version of that block. Write it once, never change it, never have mirrors of the data and you'll lose that block if the checksum shows corruption.
If, in this context, "block" means a sector of disk (4096 bytes these days), then all is not totally lost. If you find an error message that gives the actual sector number (absolute or relative) then you can read its contents with dd (using conv=noerror, skip=nnnn, bs=4096, count=1) and then write it back again (with seek=nnnn). There may be some bits flipped, and there is nothing you can do about that if it is not text, but the checksum will be cleared and the file as a whole is available (or more than that, if the block is in the directory structure).
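A hedged sketch of that dd trick, with a made-up sector number and device; getting skip/seek wrong here will corrupt data, so treat it as a last resort:

    # Read the suspect 4K block, ignoring the read error (zero-fills on failure).
    dd if=/dev/sda of=/tmp/block.bin bs=4096 skip=123456 count=1 conv=noerror,sync
    # Write it back to the same spot, which usually forces the drive to remap the sector.
    dd if=/tmp/block.bin of=/dev/sda bs=4096 seek=123456 count=1 conv=notrunc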

On Tue, 14 Feb 2012, Andrew McGlashan wrote:
Hi,
On 14/02/2012 12:55 PM, Peter Ross wrote:
I do not have enough private data to fill a hard disk these days so it could be an option for him too.
Or slice the disks in two partitions of equal size and use them as a mirror.
With BTRFS, you can mirror content or not depending on your requirement. You can even choose to mirror some files/dirs and not others. And to make it even more versatile, you can apply different RAID versions throughout as well.
The problem for me with BTRFS is that it is too new and if you want the latest and greatest, then you always need the latest and greatest kernel -- something that doesn't sit well with me when I want systems that are as stable as possible.
It's not ready yet. I installed 3.2 (can't get terribly much more modern by installing debian kernel packages), made a several hundred gig btrfs filesystem and chucked half of several hundred gig onto it. 3 oops with btrfs traces later, I decided I'd remake the FS as ext4.
I'm sure that BTRFS time will come, but it will likely be a while for me. I like ZFS, but the licensing issues worry me, as does the "Oracle ownership" situation.
Is btrfs much better in that regard, other than having a licence that is compatible with our kernel? (ZFS already was capital F-Free if you happened to run a compatible kernel. And is developed and owned by the same people). I also worry about the really bad fragmentation of files. Sounds like a fundamental problem if you want to have writes that are friendly to SSD, you're going to have reads that are very much unfriendly to spinning rust.
It certainly seems like BTRFS is something very much to look forward to and probably shouldn't be used for any kind of critical data or systems for some time -- although Oracle has committed to using production BTRFS in OL very soon.
Gee, given the filesystem pain we already have on our rhel and oel boxen, I'm sure I'd excite my colleagues by suggesting we stick with the New Shiny! defaults! -- Tim Connors

On 15/02/2012, at 8:00 PM, Tim Connors wrote:
It's not ready yet. I installed 3.2 (can't get terribly much more modern by installing debian kernel packages), made a several hundred gig btrfs filesystem and chucked half of several hundred gig onto it. 3 oops with btrfs traces later, I decided I'd remake the FS as ext4.
There have been a lot of stability fixes in 3.3 and 3.4. Are you reporting these oops' anywhere? Obviously this is very important to me, as Release Manager for Oracle Linux's UEK2 and btrfs. :) Our internal QA testing is seriously hammering btrfs at the moment, so I'd be interested to know if you can reproduce these issues. If you can, please email me directly with the steps to reproduce so that I can try and get Oracle QA to try it as well.
I also worry about the really bad fragmentation of files. Sounds like a fundamental problem if you want to have writes that are friendly to SSD, you're going to have reads that are very much unfriendly to spinning rust.
btrfs includes both scrubbing and balancing routines, and we recommend you scrub on a regular basis for any multi-device btrfs volume. It also does automatic and online defragmentation. btrfs also adjusts itself for SSD vs spinning rust, which you can tune via mount options (i.e. if it detects something as spinning when it's not or vice-versa).
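Those routines map onto commands roughly like these (a sketch; the mount point is a placeholder and flag availability depends on the btrfs-progs version):

    btrfs scrub start /mnt              # verify all checksums, repair from mirrors where possible
    btrfs balance start /mnt            # re-spread block groups across the devices
    btrfs filesystem defragment -r /mnt # online defragmentation
    mount -o remount,ssd /mnt           # force SSD behaviour if autodetection guesses wrong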
Gee, given the filesystem pain we already have on our rhel and oel boxen, I'm sure I'd excite my colleagues by suggesting we stick with the New Shiny! defaults!
While we will support btrfs in production, we don't expect customers to rush to convert all their boxes on the day of release. :) Hell, I still have customers on OL3 running Oracle 9i, so just getting them to OL5 would be a start. However, for those customers that do want the benefit of the latest filesystem functionality, we will support btrfs on OL5 and OL6 when we release UEK2.

Obviously the biggest benefit for enterprise customers initially is the filesystem snapshot functionality, especially when you use yum-plugin-fs-snapshot: btrfs will automatically create a snapshot of the filesystem prior to any package installation/upgrade, so in the event of a failure you can just reboot into a known good state. When you combine the snapshot capabilities of btrfs with our rebootless kernel patching via Ksplice, along with the addition of Clusterware and OCFS2, Oracle Linux really does provide some excellent enterprise-grade resilience that is unavailable from any other vendor.

Cheers, Avi

On Wed, 15 Feb 2012, Avi Miller wrote:
On 15/02/2012, at 8:00 PM, Tim Connors wrote:
It's not ready yet. I installed 3.2 (can't get terribly much more modern by installing debian kernel packages), made a several hundred gig btrfs filesystem and chucked half of several hundred gig onto it. 3 oops with btrfs traces later, I decided I'd remake the FS as ext4.
There have been a lot of stability fixes in 3.3 and 3.4. Are you reporting these oops' anywhere? Obviously this is very important to me, as Release Manager for Oracle Linux's UEK2 and btrfs. :) Our internal QA testing is seriously hammering btrfs at the moment, so I'd be interested to know if you can reproduce these issues. If you can, please email me directly with the steps to reproduce so that I can try and get Oracle QA to try it as well.
mv ~/ext3/ ~/btrfs/ :) (or I might have been reading files off them at the time. Or it might have been sitting idle. I can't actually remember!) Nothing particularly onerous. Certainly not making or using snapshots or doing anything other than accessing and using it as a simple filesystem. Only thing I can think of that was slightly non-mainstream was that it was over iscsi. Naturally, there's only so many times you can try that for your home directory. I don't have any large quantities of data I don't care about (otherwise I would have allocated a larger /dev/write-once-read-never device)! That's why I like stable filesystems and never will be the target market for experimental New Shiny filesystems. It's remarkable how well tested ext3 is. I let the other people be the guinea pigs. -- Tim Connors

On 15/02/2012, at 8:33 PM, Tim Connors wrote:
Only thing I can think of that was slightly non-mainstream was that it was over iscsi.
Interesting. I know we're doing a lot of testing on both local and FC-based single and multipath devices, but I'm not sure if we're doing any iSCSI testing. I'll check with QA and see if we can add that to the list. Obviously we're doing a lot of thrashing of btrfs with the xfstest tools, along with Oracle's own ORION tool and fio.
That's why I like stable filesystems and never will be the target market for experimental New Shiny filesystems. It's remarkable how well tested ext3 is. I let the other people be the guinea pigs.
Oh, sure. :)

On Wednesday 15 February 2012 20:33:48 Tim Connors wrote:
(or I might have been reading files off them at the time. Or it might have been sitting idle. I can't actually remember!)
Did you get to capture the oops by some chance? cheers! Chris

On Wednesday 15 February 2012 20:33:48 Tim Connors wrote:
mv ~/ext3/ ~/btrfs/ :)
Did the btrfs filesystem have compression enabled by some chance? There's a report on the linux-btrfs mailing list of premature ENOSPC issues with 3.2.x with compression enabled (either zlib or lzo). cheers, Chris

On Thu, 16 Feb 2012, Chris Samuel wrote:
On Wednesday 15 February 2012 20:33:48 Tim Connors wrote:
mv ~/ext3/ ~/btrfs/ :)
Did the btrfs filesystem have compression enabled by some chance ?
There's a report on the linux-btrfs mailing list of premature ENOSPC issues with 3.2.x with compression enabled (either zlib or lzo).
Default flags. I didn't find the oopses in my digging, other than the one generated when the iscsi disconnected (can't expect btrfs to send packets through the æther! As long as it doesn't take down the kernel, which I don't think it did (my memory of events is blurring)). -- Tim Connors

On Wednesday 15 February 2012 20:00:01 Tim Connors wrote:
It's not ready yet.
Correct, that's why it's marked as experimental in the kernel config.
I installed 3.2 (can't get terribly much more modern by installing debian kernel packages), made a several hundred gig btrfs filesystem and chucked half of several hundred gig onto it. 3 oops with btrfs traces later, I decided I'd remake the FS as ext4.
If you want to experiment with btrfs I sincerely ask you to try the latest 3.3 RC (there haven't been any btrfs changes since 3.3-rc2, but you might as well get the rest of the fixes that have gone in since) and to join the linux-btrfs list and report issues there so that they can get fixed (or, if already fixed, so you get an idea of when the fixes are likely to be merged). There's been at least one ENOSPC fix (which, yes, can result in oopses) since 3.2 was released - "Btrfs: fix enospc error caused by wrong checks of the chunk", which was commit 9e622d6bea0202e9fe267955362c01918562c09b. cheers! Chris

Tim Connors wrote:
I also worry about the really bad fragmentation of files. Sounds like a fundamental problem if you want to have writes that are friendly to SSD, you're going to have reads that are very much unfriendly to spinning rust.
If you are referring to -o ssd and -o ssd_spread, they are off by default. The former is enabled automatically iff the kernel detects a non-rotating disk[0]. The latter is a variant that "ought to" deal better with shittier FTLs (like, mtdblock / USB key shitty; as opposed to Intel SSDs). [0] not the case for a G.Skill Falcon II 64G, apparently.
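For reference, the two options being discussed, with placeholder device and mount point:

    # Normal SSD allocation behaviour (also auto-enabled on non-rotational devices):
    mount -o ssd /dev/sdb1 /mnt
    # More conservative allocation intended for cheap flash with weak FTLs:
    mount -o ssd_spread /dev/sdb1 /mnt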

On Wed, Feb 15, 2012 at 08:00:01PM +1100, Tim Connors wrote:
On Tue, 14 Feb 2012, Andrew McGlashan wrote:
I'm sure that BTRFS time will come, but it will likely be a while for me. I like ZFS, but the licensing issues worry me, as does the "Oracle ownership" situation.
Is btrfs much better in that regard, other than having a licence that is compatible with our kernel? (ZFS already was capital F-Free if you happened to run a compatible kernel. And is developed and owned by the same people).
btrfs is in the mainline kernel, so yeah, it's much better from a Linux licensing POV. OTOH, the only licensing issue with ZFS is that it is very unlikely to be part of the mainline kernel, because the CDDL is not compatible with the GPL; Oracle would have to relicense it under a BSD-style license for that to happen, which seems pretty unlikely. It's only a (minor) problem for distributors -- there's no problem with end-users building and installing ZFS on Linux, easy enough to do with dkms. (BTW, given that Debian already distributes ZFS-FUSE and has a need for ZFS support in the Debian/kFreeBSD port, it wouldn't surprise me at all to see debian-installer support for zfsonlinux in the foreseeable future. I don't think it's a big deal, as long as the installation of the ZFS modules isn't done by default but is just a convenience feature to make it easier for people who want to install it anyway.)
It certainly seems like BTRFS is something very much to look forward to and probably shouldn't be used for any kind of critical data or systems for some time -- although Oracle has committed to using production BTRFS in OL very soon.
Gee, given the filesystem pain we already have on our rhel and oel boxen, I'm sure I'd excite my colleagues by suggesting we stick with the New Shiny! defaults!
sysadmins *love* to be excited. especially sysadmins working on storage systems - that can get pretty boring so we need interesting events to brighten up our working day :) craig

Craig Sanders <cas@taz.net.au> wrote:
OTOH, the only licensing issue with ZFS is that it is very unlikely to be part of the mainline kernel, because the CDDL is not compatible with the GPL. Oracle would have to relicense it with a BSD style license for that to happen, which seems pretty unlikely.
They could just dual-licence it under GPLv2 and CDDL if they so wished. Presumably they have strong reasons not to do it, or it would probably have happened by now.

Quoting "Peter Ross" <Peter.Ross@bogen.in-berlin.de>:
With ZFS you can have multiple copies of directory trees, even on the same device. btrfs may have the same functionality.
The problem was discussed on the freebsd-fs mailing list, just today: http://lists.freebsd.org/pipermail/freebsd-fs/2012-February/013721.html
It appears that ZFS detects a mismatch between the data stored on the disk and the checksum of what should be there. With a single-disk setup like you have here, there is nothing more to do than to delete the file and restore it from a backup, if you have one. When the error occurred, what caused it, or whether the error is in the data or in the checksum, you'll probably never know.
If a particular filesystem is deemed to be particularly important, but there is just one disk, then setting the zfs filesystem attribute 'copies' to 2 or 3 will dramatically reduce the odds of data loss if there is minor media failure. The attribute needs to be set before the data is written. If the whole disk goes, then everything is still lost.
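That attribute is set per dataset, roughly like this (pool and dataset names are made up; it only affects blocks written after it is set):

    # Keep two copies of every block in this dataset, even on a single-disk pool.
    zfs set copies=2 tank/home
    zfs get copies tank/home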

On Tuesday 14 February 2012 12:36:42 Avi Miller wrote:
so best case scenario, you could walk backwards through the tree to find a previous copy of the same block and read that, assuming the faulty block hasn't changed
Surely with a COW filesystem you're only going to get an old copy of the block if it's been modified? cheers, Chris
participants (12)
- Allan Duncan
- Andrew McGlashan
- Avi Miller
- Brian May
- Chris Samuel
- Craig Sanders
- Jason White
- Peter Ross
- Russell Coker
- Tim Connors
- Toby Corkindale
- Trent W. Buck