
On Sat, Feb 11, 2012 at 12:54:55AM +1100, Russell Coker wrote:
On Sat, 11 Feb 2012, Robin Humble <robin.humble@anu.edu.au> wrote:
IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later.
serious problems (eg. array failure) can occur during raid rebuild if the raid code tries to read from a second unrecoverably bad disk.
Surely if there is a bad sector when doing a rebuild then it will only result in at most some corrupt data in one stripe. Surely no RAID implementation would be stupid enough to eject a second disk from a RAID-5 or a third disk from a RAID-6 because of a few errors!
it can always eject a 2nd (or 3rd) disk for the same reason as it ejected the first - typically rewriting the bad sector failed, or too many bad sectors too close together, or ... so kick it out.

the chances of it hitting an issue during rebuild are greater than during normal operation too, as it has to read the whole disk to reconstruct the new drive (not just the bit of the drive with data on it), and also has to read the p or q parity parts of each stripe (that it otherwise never reads).

again, (still IMHO :-) 'check' is mostly there to weed out the crap disks and reduce the likelihood of multi-failure scenarios.

there's some code in recent kernels for tracking the location of various bad sectors on some parts of a raid, in order to have eg. N+2 on most of the array, and N+1 on a few stripes and still be able to work. I forget what the name of this feature is.
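fwiw, kicking off a scrub by hand is just a sysfs poke - something like this (md0 only as an example, and the paths are from memory, so double-check them on your kernel):

    # start a scrub on one array (or loop over /sys/block/md*/md for all of them)
    echo check > /sys/block/md0/md/sync_action

    # watch progress
    cat /proc/mdstat
    cat /sys/block/md0/md/sync_completed

    # when it's done, see how many mismatched sectors it found
    cat /sys/block/md0/md/mismatch_cnt

IIRC the fedora/rhel raid-check cron job is basically a wrapper around those same sysfs files.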
rewrites happening ~daily in normal operation as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.

That would be only unrecoverable read errors though, wouldn't it?
yup. they are common.
Not sectors that quietly have bogus data.
correct, but those are very rare.
AFAIK the MD driver doesn't support reading the entire stripe for every read to detect quiet corruption.
correct. IIRC the md developers' view is that that sort of corruption is best detected by checksums at the fs level or at a (future?) scsi checksum protocol level. not sure I entirely agree with them, but hey.
the fs can also shutdown or go read-only if it finds something unreadable. whereas if it's in a raid5/6 you likely won't care or notice the problem, and if the raid code doesn't auto remap the sector for you then you can do a check/scrub or kick out the disk and dd over it at your leisure.

But if it's RAID-5 then the current state of play is that you won't notice it if one disk returns bogus data, and the RAID scrub of a RAID-5 will probably cause corruption to spread to another sector.
"one disk returns bogus data" is very rare, and just a 'check' won't change the data on the platters (unless it hits an unreadable sector) - it will just report the number of mismatches. but if you are in the rare 'silently corrupting disk' situation then yes, run a 'check' and it'll clock up a big mismatch count, which in the case of raid5/6 always means bad things.
On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...

That was an MD RAID-1 with 10M of random data dumped on one disk.
10M? I forget what unit the mismatch count is in. for raid6 I'm pretty sure it's 512 bytes, so maybe the above means ~5M? it's a lot though either way, so yeah - probably busted hw. the question becomes what is low/normal (I saw up to 768 before I stopped being worried, and 128 mismatches was common), and what is busted hw :-/
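for the record, if the unit really is 512-byte sectors then the arithmetic is just:

    # 10496 mismatched sectors * 512 bytes per sector
    echo $((10496 * 512))   # -> 5373952, i.e. a bit over 5M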
if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->
# Due to the fact that raid1/10 writes in the kernel are unbuffered,
# a raid1 array can have non-0 mismatch counts even when the
# array is healthy.
... <much deleted> or unallocated disk space.
yes, spurious mismatches are found in unallocated disk space, ie. free space blocks as far as the fs is concerned. most likely scenario - the fs started doing i/o to one disk of the pair, then changed its mind and did the pair of DMAs to another location on the platters instead - voila - mismatches. the 'mismatch' region is still in free unallocated space as far as the fs is concerned, and it just hasn't been overwritten with new fs blocks yet, at which time the mismatch will go away. no corruption, but md sees mismatches.

BTW, pretty sure we've been through all this a couple of years ago on this list :) the linux-raid list answers this question a lot too.
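if you want to see at a glance which arrays to worry about, something like this does the trick (raid1/10 mismatches are usually the benign kind described above, raid5/6 mismatches never are) - untested, paths from memory:

    for md in /sys/block/md*/md; do
        dev=$(basename $(dirname $md))
        echo "$dev: level=$(cat $md/level) mismatch_cnt=$(cat $md/mismatch_cnt)"
    done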
Since we can't tell if it's a problem or not we will just pretend that it's not a problem.
99.9% of the time it really is not a problem. if I ran a 'check' across 600 raid1s now, I bet 10%-50% of them would come back with 'mismatches' and they'd all be spurious.

I guess if you had a threshold you knew was corruption vs. normal, then you could write a script to look at the mismatch_cnt and send an email. what would that level be though? depends on so many things...

cheers,
robin
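p.s. a rough, untested sketch of the sort of mismatch-watching script I mean - the threshold (768 is just the biggest 'spurious' count I've personally seen) and the mail address are placeholders you'd have to tune for your own gear:

    #!/bin/bash
    # mail a warning if any md array's mismatch_cnt is above a threshold.
    # run it from cron some time after the weekly/monthly 'check' pass.
    THRESHOLD=768    # placeholder - pick your own
    MAILTO=root      # placeholder - pick your own
    for md in /sys/block/md*/md; do
        dev=$(basename $(dirname $md))
        cnt=$(cat $md/mismatch_cnt)
        level=$(cat $md/level)
        if [ "$cnt" -gt "$THRESHOLD" ]; then
            echo "$dev ($level): mismatch_cnt=$cnt" | \
                mail -s "md mismatch warning on $(hostname): $dev" $MAILTO
        fi
    done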