
On Sat, Feb 11, 2012 at 12:54:55AM +1100, Russell Coker wrote:
On Sat, 11 Feb 2012, Robin Humble <robin.humble@anu.edu.au> wrote:
IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later.
serious problems (eg. array failure) can occur during raid rebuild if the raid code tries to read from a second unrecoverably bad disk.
Surely if there is a bad sector when doing a rebuild then it will only result in at most some corrupt data in one stripe. Surely no RAID implementation would be stupid enough to eject a second disk from a RAID-5 or a third disk from a RAID-6 because of a few errors!
it can always eject a 2nd (or 3rd) disk for the same reason as it ejected the first - typically rewriting the bad sector failed, or too many bad sectors too close together, or ... so kick it out.

the chances of it hitting an issue during rebuild are greater than during normal operation too, as it has to read the whole disk to reconstruct the new drive (not just the bit of the drive with data on it), and also has to read the p or q parity parts of each stripe (that it otherwise never reads).

again, (still IMHO :-) 'check' is mostly there to weed out the crap disks and reduce the likelihood of multi-failure scenarios.

there's some code in recent kernels for tracking the location of various bad sectors on some parts of a raid, in order to have eg. N+2 on most of the array, and N+1 on a few stripes and still be able to work. I forget what the name of this feature is.
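fwiw, kicking off a scrub by hand is just a sysfs poke - something like this (md0 only as an example, and the paths are from memory, so double-check them on your kernel):

    # start a scrub on one array (or loop over /sys/block/md*/md for all of them)
    echo check > /sys/block/md0/md/sync_action

    # watch progress
    cat /proc/mdstat
    cat /sys/block/md0/md/sync_completed

    # when it's done, see how many mismatched sectors it found
    cat /sys/block/md0/md/mismatch_cnt

IIRC the fedora/rhel raid-check cron job is basically a wrapper around those same sysfs files.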
rewrites happening ~daily in normal operation as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.

That would be only unrecoverable read errors though, wouldn't it?
yup. they are common.
Not sectors that quietly have bogus data.
correct, but those are very rare.
AFAIK the MD driver doesn't support reading the entire stripe for every read to detect quiet corruption.
correct. IIRC the md developers' view is that that sort of corruption is best detected by checksums at the fs level or at a (future?) scsi checksum protocol level. not sure I entirely agree with them, but hey.
the fs can also shutdown or go read-only if it finds something unreadable. whereas if it's in a raid5/6 you likely won't care or notice the problem, and if the raid code doesn't auto remap the sector for you then you can do a check/scrub or kick out the disk and dd over it at your leisure.

But if it's RAID-5 then the current state of play is that you won't notice it if one disk returns bogus data, and the RAID scrub of a RAID-5 will probably cause corruption to spread to another sector.
"one disk returns bogus data" is very rare, and just a 'check' won't change the data on the platters (unless it hits an unreadable sector) - it will just report the number of mismatches. but if you are in the rare 'silently corrupting disk' situation then yes, run a 'check' and it'll clock up a big mismatch count, which in the case of raid5/6 always means bad things.
On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...

That was an MD RAID-1 with 10M of random data dumped on one disk.
10M? I forget what unit the mismatch count is in. for raid6 I'm pretty sure it's 512 bytes, so maybe the above means ~5M? it's a lot though either way, so yeah - probably busted hw. the question becomes what is low/normal (I saw up to 768 before I stopped being worried, and 128 mismatches was common), and what is busted hw :-/
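for the record, if the unit really is 512-byte sectors then the arithmetic is just:

    # 10496 mismatched sectors * 512 bytes per sector
    echo $((10496 * 512))   # -> 5373952, i.e. a bit over 5M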
if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->
# Due to the fact that raid1/10 writes in the kernel are unbuffered,
# a raid1 array can have non-0 mismatch counts even when the
# array is healthy.
... <much deleted> or unallocated disk space.
yes, spurious mismatches are found in unallocated disk space, ie. free space blocks as far as the fs is concerned. most likely scenario - the fs started doing i/o to one disk of the pair, then changed its mind and did the pair of DMAs to another location on the platters instead - voila - mismatches. the 'mismatch' region is still in free unallocated space as far as the fs is concerned, and it just hasn't been overwritten with new fs blocks yet, at which time the mismatch will go away. no corruption, but md sees mismatches.

BTW, pretty sure we've been through all this a couple of years ago on this list :) the linux-raid list answers this question a lot too.
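if you want to see at a glance which arrays to worry about, something like this does the trick (raid1/10 mismatches are usually the benign kind described above, raid5/6 mismatches never are) - untested, paths from memory:

    for md in /sys/block/md*/md; do
        dev=$(basename $(dirname $md))
        echo "$dev: level=$(cat $md/level) mismatch_cnt=$(cat $md/mismatch_cnt)"
    done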
Since we can't tell if it's a problem or not we will just pretend that it's not a problem.
99.9% of the time it really is not a problem. if I ran a 'check' across 600 raid1s now, I bet 10%-50% of them would come back with 'mismatches' and they'd all be spurious.

I guess if you had a threshold you knew was corruption vs. normal, then you could write a script to look at the mismatch_cnt and send an email. what would that level be though? depends on so many things...

cheers,
robin
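p.s. a rough, untested sketch of the sort of mismatch-watching script I mean - the threshold (768 is just the biggest 'spurious' count I've personally seen) and the mail address are placeholders you'd have to tune for your own gear:

    #!/bin/bash
    # mail a warning if any md array's mismatch_cnt is above a threshold.
    # run it from cron some time after the weekly/monthly 'check' pass.
    THRESHOLD=768    # placeholder - pick your own
    MAILTO=root      # placeholder - pick your own
    for md in /sys/block/md*/md; do
        dev=$(basename $(dirname $md))
        cnt=$(cat $md/mismatch_cnt)
        level=$(cat $md/level)
        if [ "$cnt" -gt "$THRESHOLD" ]; then
            echo "$dev ($level): mismatch_cnt=$cnt" | \
                mail -s "md mismatch warning on $(hostname): $dev" $MAILTO
        fi
    done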