
IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later. serious problems (eg. array failure) can occur during a raid rebuild if the raid code tries to read from a second unrecoverably bad disk.

we lose a few disks every time we do an md 'check' over our 104 md raid6's, but many more of the arrays do routine rewrites that fix up bad disk sectors and make things far safer in the long term. we also have rewrites happening ~daily in normal operation, as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.

in the home context, bad disk sectors and the ability of the md code to hide and remap these automatically is probably the best reason to make a home raid instead of just putting a single 'big enough' disk in something. if one sector goes bad in that single disk then it's pretty much restore-from-backup time, as one part of the fs will be forever unreadable until you find and write over the bad block. the fs can also shut down or go read-only if it finds something unreadable. whereas if it's in a raid5/6 you likely won't care or even notice the problem, and if the raid code doesn't auto-remap the sector for you then you can do a check/scrub, or kick out the disk and dd over it, at your leisure.

err, but having said that, I'm currently thinking of a single 3tb + a ~weekly 3tb backup disk to replace my htpc's 3 (all dying!) 1tb disks in raid5, as a single 3tb uses less power.
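fwiw, for anyone playing along at home, this is roughly the sort of thing I mean (a minimal sketch only -- /dev/md0 and /dev/sdc1 are made-up example names, substitute your own array and member disk):

  # kick off a scrub; progress shows up in /proc/mdstat
  echo check > /sys/block/md0/md/sync_action

  # once it's done, see how many sectors didn't agree across the members
  cat /sys/block/md0/md/mismatch_cnt

  # ask md to rewrite the inconsistent/unreadable stripes in place
  echo repair > /sys/block/md0/md/sync_action

  # or the heavy-handed version: kick the suspect disk out, dd over it
  # (the writes let the drive remap its bad sectors), then re-add it and
  # let md rebuild onto it
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  dd if=/dev/zero of=/dev/sdc1 bs=1M
  mdadm /dev/md0 --add /dev/sdc1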
On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
> 1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
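(btw that count is just the array's mismatch_cnt counter from sysfs -- if you want to poke at it, and at what raid level the array is, something like the below works. /dev/md1 is only an example path, yours may differ:)

  # raid level of the array (raid1, raid10, raid5, raid6, ...)
  cat /sys/block/md1/md/level

  # leftover mismatch count from the last check/repair pass
  cat /sys/block/md1/md/mismatch_cnt

  # or get the same info via mdadm
  mdadm --detail /dev/md1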
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...

if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->

  # Due to the fact that raid1/10 writes in the kernel are unbuffered,
  # a raid1 array can have non-0 mismatch counts even when the
  # array is healthy. These non-0 counts will only exist in
  # transient data areas where they don't pose a problem. However,
  # since we can't tell the difference between a non-0 count that
  # is just in transient data or a non-0 count that signifies a
  # real problem, simply don't check the mismatch_cnt on raid1
  # devices as it's providing far too many false positives. But by
  # leaving the raid1 device in the check list and performing the
  # check, we still catch and correct any bad sectors there might
  # be in the device.

cheers,
robin