
IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later. serious problems (eg. array failure) can occur during a raid rebuild if the raid code tries to read from a second unrecoverably bad disk.

we lose a few disks every time we do an md 'check' over our 104 md raid6's, but many more of the arrays do routine rewrites that fix up bad disk sectors and make things far safer in the long term. we also have rewrites happening ~daily in normal operation, as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.

in the home context, bad disk sectors and the ability of the md code to hide and remap these automatically is probably the best reason to make a home raid instead of just putting a single 'big enough' disk in something. if one sector goes bad in that single disk then it's pretty much restore-from-backup time, as one part of the fs will be forever unreadable until you find and write over the bad block. the fs can also shut down or go read-only if it finds something unreadable. whereas if it's in a raid5/6 you likely won't care or even notice the problem, and if the raid code doesn't auto-remap the sector for you then you can do a check/scrub, or kick out the disk and dd over it, at your leisure.

err, but having said that, I'm currently thinking of a single 3tb + a ~weekly 3tb backup disk to replace my htpc's 3 (all dying!) 1tb disks in raid5, as a single 3tb uses less power.
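fwiw, for anyone playing along at home, this is roughly the sort of thing I mean (a minimal sketch only -- /dev/md0 and /dev/sdc1 are made-up example names, substitute your own array and member disk):

  # kick off a scrub; progress shows up in /proc/mdstat
  echo check > /sys/block/md0/md/sync_action

  # once it's done, see how many sectors didn't agree across the members
  cat /sys/block/md0/md/mismatch_cnt

  # ask md to rewrite the inconsistent/unreadable stripes in place
  echo repair > /sys/block/md0/md/sync_action

  # or the heavy-handed version: kick the suspect disk out, dd over it
  # (the writes let the drive remap its bad sectors), then re-add it and
  # let md rebuild onto it
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  dd if=/dev/zero of=/dev/sdc1 bs=1M
  mdadm /dev/md0 --add /dev/sdc1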
On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
> 1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
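(btw that count is just the array's mismatch_cnt counter from sysfs -- if you want to poke at it, and at what raid level the array is, something like the below works. /dev/md1 is only an example path, yours may differ:)

  # raid level of the array (raid1, raid10, raid5, raid6, ...)
  cat /sys/block/md1/md/level

  # leftover mismatch count from the last check/repair pass
  cat /sys/block/md1/md/mismatch_cnt

  # or get the same info via mdadm
  mdadm --detail /dev/md1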
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...

if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->

  # Due to the fact that raid1/10 writes in the kernel are unbuffered,
  # a raid1 array can have non-0 mismatch counts even when the
  # array is healthy. These non-0 counts will only exist in
  # transient data areas where they don't pose a problem. However,
  # since we can't tell the difference between a non-0 count that
  # is just in transient data or a non-0 count that signifies a
  # real problem, simply don't check the mismatch_cnt on raid1
  # devices as it's providing far too many false positives. But by
  # leaving the raid1 device in the check list and performing the
  # check, we still catch and correct any bad sectors there might
  # be in the device.

cheers,
robin