Re: RAID-1 synchronisation

On Sun, 5 Feb 2012, Daniel Pittman <daniel@rimspace.net> wrote:
Yes, it does some good. For something like RAID-5, which has enough information to detect which device is wrong, you can find and correct problems with the system. With RAID-1 you can only know that a device is returning bad data.
You mean RAID-6. RAID-5 is no better than RAID-1 when it comes to determining which one of the disks is returning corrupt data.

On Sun, 5 Feb 2012, Daniel Pittman <daniel@rimspace.net> wrote:
Yes. This is the substantial advantage that BTRFS and ZFS have over device-level RAID and LVM. The stronger checksums are useful because they can be checked inline with lower I/O cost, and the intimate knowledge of allocated space means you can check more quickly. Both very attractive features.
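For illustration, btrfs exposes that checksum verification online through its scrub feature. A minimal sketch, assuming a btrfs filesystem mounted at /mnt/tmp (the mount point is illustrative):

# btrfs scrub start /mnt/tmp     (kick off a background read of all data and metadata, verifying checksums)
# btrfs scrub status /mnt/tmp    (report progress and any checksum errors found so far)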
https://btrfs.wiki.kernel.org/

The above URL lists mirroring under "Additional features in development", so I don't think I can use this on serious servers for a while.

https://btrfs.wiki.kernel.org/articles/u/s/i/Using_Btrfs_with_Multiple_Devic...

mkfs.btrfs in Debian/Unstable apparently supports RAID (the above URL has background information), so I will test it out.

On Sun, 5 Feb 2012, James Harper <james.harper@bendigoit.com.au> wrote:
One thing it does do for you is 'touch' unused blocks, and finding that those are bad now rather than later is better IMO. Also, verifying consistency and finding that you have a silent corruption problem early can only be a good thing. This is especially important for RAID5 without a battery-backed write cache, as it can detect the RAID5 write hole (http://en.wikipedia.org/wiki/RAID_5_write_hole). Maybe write-intent bitmaps get around this these days though?
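For reference, the kind of whole-array consistency check being discussed here is driven through sysfs on Linux software RAID. A rough sketch, assuming an array called md0 (the device name is illustrative):

# echo check > /sys/block/md0/md/sync_action    (start a read-and-compare pass over the whole array)
# cat /sys/block/md0/md/mismatch_cnt            (count of inconsistent sectors the pass found)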
The only difference between the RAID-5 write hole and the issue of mismatched RAID-1 devices is what happens on a device failure. It's a lot worse for RAID-5, so that's a good reason for doing such checks on a RAID-5 array, or for just using RAID-6. RAID-6 allows detecting and correcting the situation where one disk has corrupt data while all disks are working, and it allows detecting inconsistency when there has been a single disk failure. Just don't use RAID-5.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/

On Sunday 05 February 2012 18:30:35 Russell Coker wrote:
https://btrfs.wiki.kernel.org/
The above URL lists mirroring under "Additional features in development", so I don't think I can use this on serious servers for a while.
Be warned, the btrfs wiki (along with all the other kernel.org wikis) is still read-only since the kernel.org compromise (not that it was that up to date anyway).

The current writeable clone (announced 11th November on the linux-btrfs list by David Sterba) is here:

http://btrfs.ipv5.de

It's courtesy of Arne Jansen.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Sun, 5 Feb 2012, Chris Samuel <chris@csamuel.org> wrote:
The current writeable clone (announced 11th November on the linux-btrfs list by David Sterba) is here:
I've just created a BTRFS filesystem with the following command:

mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?

Where the raida and raidb devices are each 2G logical volumes.

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg0-raida 4.0G   56K  3.6G   1% /mnt/tmp
# dd if=/dev/urandom of=test bs=1024k count=512 ; df -h .
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 73.3964 s, 7.3 MB/s
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg0-raida 4.0G  678M  2.9G  19% /mnt/tmp

It seems a bit bogus to report the Size of the filesystem as 4G when there are 2*2G devices in a RAID-1.

I created a few more big random files and the filesystem ran out of space after 1.8G of data was stored, but it reported 3.1G used and 558M free. :(

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/
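One way to see the two devices hiding behind that surprising 4.0G figure is btrfs's own device listing. A sketch using the same logical volumes as above:

# btrfs filesystem show /dev/vg0/raida    (lists every device in the filesystem with its size and per-device usage)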

Quoting Russell Coker (russell@coker.com.au):
I've just created a BTRFS filesystem with the following command:
mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?
Where the raida and raidb devices are each 2G logical volumes.
Disposable test data, one hopes? A usable fsck utility still hasn't appeared in public.

--
Rick Moen                  "When referring to Spider-Man, 'Web head' can now
rick@linuxmafia.com         be written as 'webhead'."  -- FakeAPStylebook
McQ! (4x80)

On 06/02/12 06:14, Rick Moen wrote:
Disposable test data, one hopes? A usable fsck utility still hasn't appeared in public.
No, but I believe the 3.2 kernel does a lot more recovery than previous versions did.

Also be aware that there will be at least one more backwards-incompatible filesystem change, to fix the maximum number of hard links to a file in a directory, which is currently low enough to break a number of packages (such as backuppc):

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=633062

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
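That hard link limit is easy to probe experimentally. A rough sketch that counts how many links fit in one btrfs directory before ln starts failing (the file names are illustrative, and the limit depends on the length of the names used):

# touch target
# i=0; while ln target link$i 2>/dev/null; do i=$((i+1)); done; echo "gave up after $i links"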

Chris Samuel wrote:
Also be aware that there will be at least one more backwards incompatible filesystem change to fix the maximum number of hard links to a file in a directory
Fucking awesome, that annoyed me no end.
which is currently low enough to break a number of packages (such as backuppc).
And upgrading git in Debian! http://bugs.debian.org/645009

Hi,

On 06/02/2012, at 12:24 AM, Russell Coker wrote:
It seems a bit bogus to report the Size of the filesystem as 4G when there are 2*2G devices in a RAID-1.
Perhaps you should watch my presentation on btrfs from linux.conf.au 2012[1]?

df lies, which is why you should run:

# btrfs fi df /mount

instead. It'll show you properly what's being used by the btrfs filesystem.

Cheers,
Avi

[1] http://www.youtube.com/watch?v=hxWuaozpe2I
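For the 2*2G RAID-1 test above, that command reports each allocation type with its RAID profile, which is something plain df cannot express. The output looks roughly like this (numbers and exact formatting are illustrative, not copied from a real run):

Data, RAID1: total=1.00GB, used=678.00MB
System, RAID1: total=8.00MB, used=4.00KB
Metadata, RAID1: total=256.00MB, used=1.04MB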

Quoting "Avi Miller" <avi.miller@gmail.com>:
On 06/02/2012, at 12:24 AM, Russell Coker wrote:
It seems a bit bogus to report the Size of the filesystem as 4G when there are 2*2G devices in a RAID-1.
Perhaps you should watch my presentation on btrfs from linux.conf.au 2012[1]?
df lies, which is why you should run:
# btrfs fi df /mount
Instead. It'll show you properly what's being used by the btrfs filesystem.
Things are a bit confusing with "traditional" tools (such as df), mainly because many subvolumes can share the same btrfs instance - so each reports "x GB free", but that does not mean there are many times "x GB" free in total. Still, the basics should get reported properly by df & Co.

My brain just freezes at "many subvolumes share one btrfs instance"; "many ZFS datasets sharing the same zpool" sounds much more intuitive.

Regards
Peter
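To make the shared-instance point concrete, here is a sketch creating two subvolumes and asking df about each (paths illustrative):

# btrfs subvolume create /mnt/tmp/sv1
# btrfs subvolume create /mnt/tmp/sv2
# df -h /mnt/tmp/sv1 /mnt/tmp/sv2    (both report the same shared pool of free space, not independent allocations)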

On 06/02/12 10:36, Peter Ross wrote:
Still, the basics should get reported properly by df & Co.
It's not that easy; I believe btrfs can have per-chunk RAID settings, so the kernel doesn't really have a hope of guessing correctly.

That said, for RAID1 there were some fixes made to the reporting in 2.6.34, viz:

http://btrfs.ipv5.de/index.php?title=Gotchas

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
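Those per-chunk settings are also why the RAID profile can be changed on a mounted filesystem. A sketch using the balance convert filters from newer kernels and btrfs-progs (whether they are available depends on your versions):

# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/tmp    (rewrite data and metadata chunks with the RAID-1 profile)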

Quoting "Chris Samuel" <chris@csamuel.org>:
On 06/02/12 10:36, Peter Ross wrote:
Still, the basics should get reported properly by df & Co.
It's not that easy; I believe btrfs can have per-chunk RAID settings, so the kernel doesn't really have a hope of guessing correctly.
The chunks are fixed-size "lumps", usually ca. 1GB. As long as you "leave them alone", they count like the space of a normal block device. If you RAID-1 them, that halves the number of chunks.
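As a worked example of that halving, in shell arithmetic, for the two 2GB devices discussed earlier:

raw_gb=$((2 + 2))            # two 2GB devices pooled together
usable_gb=$((raw_gb / 2))    # RAID-1 stores every chunk twice
echo "raw=${raw_gb}G usable=${usable_gb}G"    # prints: raw=4G usable=2G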
That said for RAID1 there were some fixes made in reporting in 2.6.34, viz:
Running RAID-1 on two partitions/disks/etc of different sizes is not really good practice in the first place. But btrfs knows that there are chunks that cannot be duplicated, so why count them in the first place? I know that you can lump multiple disks of varying sizes together (and btrfs writes two copies of every chunk if you do RAID-10), but even then btrfs needs some foresight to organise the available space.

FreeBSD's ZFS had similar issues of integration into the "base system". They seem to be gone by now; at least I do not notice any these days. I believe the same will happen to btrfs as it matures.

Regards
Peter

On 06/02/12 00:24, Russell Coker wrote:
I created a few more big random files and the filesystem ran out of space after 1.8G of data was stored, but it reported 3.1G used and 558M free. :(
2.6.34 or later will show the raw space for the volume and the raw space used by your data (i.e. including the duplication) in df, which seems to match what you're seeing.

Which kernel are you using, BTW? If you're playing with btrfs you should be on at least 3.2, if not a 3.3 release candidate (it is labelled as EXPERIMENTAL in the kernel for a good reason).

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On Mon, 6 Feb 2012, Chris Samuel <chris@csamuel.org> wrote:
Which kernel are you using BTW? If you're playing with btrfs you should be on at least 3.2, if not a 3.3 release candidate (it is labelled as EXPERIMENTAL in the kernel for a good reason).
The latest 3.2 kernel image from Debian/Unstable.

On Mon, 6 Feb 2012, Chris Samuel <chris@csamuel.org> wrote:
Also be aware that there will be at least one more backwards incompatible filesystem change to fix the maximum number of hard links to a file in a directory, which is currently low enough to break a number of packages (such as backuppc).
How exactly will such changes be incompatible? Will they involve setting a flag so the older kernel knows not to mount it? Or will things just break?

On Mon, 6 Feb 2012, Matthew Cengia <mattcen@gmail.com> wrote:
Yep, my array did a check this weekend also. The first email I got was cron saying that the check had begun, and the next was logcheck reporting on the error count etc (both attached for reference).
http://etbe.coker.com.au/2012/02/06/reliability-raid/

Unfortunately mdadm doesn't send email about those conditions. I've filed a Debian bug report with a patch and linked to it from my latest blog post about RAID.
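For reference, the mail alerting mdadm does have (covering events like device failures, though not the check results above) hangs off a MAILADDR line in mdadm.conf. A sketch:

# grep MAILADDR /etc/mdadm/mdadm.conf
MAILADDR root
# mdadm --monitor --scan --oneshot --test    (send a test alert for each array, then exit)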
On Mon, 6 Feb 2012, Avi Miller <avi.miller@gmail.com> wrote:
Perhaps you should watch my presentation on btrfs from linux.conf.au 2012[1]?
I attended your presentation, but I must have been snoozing during the part about df.
df lies, which is why you should run:
# btrfs fi df /mount
Thanks, although it would be nice if they had a way of giving output comparable to df so that all the programs which expect df output can just work.

On Mon, 6 Feb 2012, Rick Moen <rick@linuxmafia.com> wrote:
Where the raida and raidb devices are each 2G logical volumes.
Disposable test data, one hopes? A usable fsck utility still hasn't appeared in public.
It's on a test/gaming machine, which incidentally has been running BTRFS on /home for quite a while without great problems.

I'm thinking of putting BTRFS on one of my DNS servers. If it goes down then I can survive with only two DNS servers, and one of those servers has recently had more downtime due to power issues than BTRFS is likely to cause.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/

On Monday 06 February 2012 12:56:52 Russell Coker wrote:
On Mon, 6 Feb 2012, Chris Samuel <chris@csamuel.org> wrote:
Also be aware that there will be at least one more backwards incompatible filesystem change to fix the maximum number of hard links to a file in a directory, which is currently low enough to break a number of packages (such as backuppc).
How exactly will such changes be incompatible?
Will they involve setting a flag so the older kernel knows not to mount it? Or will things just break?
In the past they've done it via a flag, I believe. With the free space cache they even managed to hit upon a way of doing it that older kernels would ignore, and the cache would then get rebuilt when mounting with a kernel that supported it.
On Mon, 6 Feb 2012, Avi Miller <avi.miller@gmail.com> wrote:
df lies, which is why you should run:
# btrfs fi df /mount
Thanks, although it would be nice if they had a way of giving output comparable to df so that all the programs which expect df output can just work.
I agree; perhaps that could be done by giving the most pessimistic outcome - say, assuming that all chunks on a filesystem with duplicated metadata and data will be written that way.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
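A crude sketch of that pessimistic calculation, assuming everything on the filesystem will be stored as RAID-1 (mount point illustrative):

# df -B1 /mnt/tmp | awk 'NR==2 { printf "pessimistic free: %.2f GB\n", $4 / 2 / 1024^3 }'

i.e. halve the raw available bytes that df reports, since every new chunk will be written twice.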

On 5 February 2012 22:25, Chris Samuel <chris@csamuel.org> wrote:
On Sunday 05 February 2012 18:30:35 Russell Coker wrote:
https://btrfs.wiki.kernel.org/
The above URL lists mirroring under "Additional features in development", so I don't think I can use this on serious servers for a while.
Be warned, the btrfs wiki (along with all the other kernel.org wikis) is still read-only since the kernel.org compromise (not that it was that up to date anyway).
The current writeable clone (announced 11th November on the linux-btrfs list by David Sterba) is here:
... There was a rather entertaining talk on btrfs at LCA 2012 from the "dark side" at:

http://mirror.linux.org.au/pub/linux.conf.au/2012/I_Cant_Believe_This_is_But...

It sounds like a rather amazing system, but going by the rate at which new features are appearing it is still one to be wary of committing valuable data to. They also said Oracle is planning on shipping it as part of the next official release of their "Unbreakable Linux".

In fact many fun talks are present at http://mirror.linux.org.au/pub/linux.conf.au/2012/

Andrew

On Sunday 05 February 2012 18:30:35 Russell Coker wrote:
mkfs.btrfs in Debian/Unstable apparently supports RAID (the above URL has background information), so I will test it out.
Oops - missed this, sorry!

BtrFS has supported RAID1 for both data and metadata since forever; in fact I believe your metadata is duplicated by default. It also has support for RAID10 (though I can't remember when that arrived). But support for RAID5 and RAID6 is still out of tree, and there are some questions about how that will finally be implemented.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
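For anyone wanting to try those profiles, the mkfs invocations look like this (device names illustrative; the RAID10 form needs at least four devices):

# mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
# mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde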
participants (8)
- Andrew Worsley
- Avi Miller
- Chris Samuel
- Jason White
- Peter Ross
- Rick Moen
- Russell Coker
- Trent W. Buck