
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 06:31:49PM +1000, Russell Coker wrote:
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
We have issues where the monthly mdadm raid check grinds the system to a halt.
do you find that these monthly cron jobs are actually useful? [...]
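As an aside on the original complaint about the check grinding the system to a halt: the kernel's md check/resync rate can be throttled through the standard /proc/sys/dev/raid knobs. A minimal sketch in Python (run as root; the limits chosen are purely illustrative, not values anyone in this thread recommended):

    #!/usr/bin/env python3
    # Throttle the md check/resync rate so the monthly check is less
    # intrusive.  speed_limit_min/max are in KiB/s per device; the
    # numbers below are illustrative only.
    LIMIT_MIN = "/proc/sys/dev/raid/speed_limit_min"
    LIMIT_MAX = "/proc/sys/dev/raid/speed_limit_max"

    def set_limit(path, kib_per_sec):
        with open(path, "w") as f:
            f.write(str(kib_per_sec))

    if __name__ == "__main__":
        set_limit(LIMIT_MIN, 1000)   # let the check crawl when other I/O is running
        set_limit(LIMIT_MAX, 20000)  # cap it even when the disks are otherwise idle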
deb http://www.coker.com.au squeeze misc
In the above Debian repository for i386 and amd64 I have a version of mdadm patched to send email when the disks in an array have different content. I am seeing lots of errors from all systems; it seems that the RAID code in the kernel reports 128 sectors (64K) of mismatched disk space for every error (all the reported numbers are multiples of 128).
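For anyone who wants something similar without the patched package: the stock kernel exposes the same count through sysfs, so a rough equivalent of that report (minus the email step) looks like the sketch below. This is only an illustration of the idea, not the actual patch.

    #!/usr/bin/env python3
    # Sketch: after a check, report any md array whose mismatch_cnt is
    # non-zero.  mismatch_cnt is in 512-byte sectors, and the kernel
    # accounts mismatches in 64K chunks, hence the multiples of 128
    # mentioned above.
    import glob, os

    for md in glob.glob("/sys/block/md*/md"):
        array = os.path.basename(os.path.dirname(md))
        with open(os.path.join(md, "mismatch_cnt")) as f:
            sectors = int(f.read().strip())
        if sectors:
            print("%s: %d mismatched sectors (%d KiB)"
                  % (array, sectors, sectors * 512 // 1024))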
if mdadm software raid is doing that, then to me it says "don't use mdadm raid" rather than "stress-test raid every month and hope for the best".
however, i've been using mdadm for years without seeing any sign of that (and yes, with the monthly mdadm raid checks enabled. i used to grumble about it slowing my system down but never made the decision to disable it).
You won't see such an obvious sign because you aren't running the version of mdadm that I patched to send email about it. Maybe logwatch/logcheck would inform you.
first question that occurs to me is: is there a bug in the raid code itself or is the bug in the raid checking code?
Other reports that I've seen from a reliable source say that you can get lots of errors while still having the files match the correct md5sums. This suggests that the problem is in the RAID checking code as the actual data returned is still correct.
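A minimal sketch of that kind of cross-check, assuming you keep an md5sum-style manifest of files on the array (the manifest path and format here are hypothetical): if the hashes still verify while mismatch_cnt is non-zero, the data being returned is intact and the problem is more likely in the checking code.

    #!/usr/bin/env python3
    # Verify files against a stored manifest of "md5sum  /path" lines
    # (same layout as md5sum -c input).
    import hashlib, sys

    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    bad = 0
    with open(sys.argv[1]) as manifest:
        for line in manifest:
            expected, path = line.rstrip("\n").split("  ", 1)
            if md5_of(path) != expected:
                print("MISMATCH: %s" % path)
                bad += 1
    sys.exit(1 if bad else 0)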
Also I suspect that the Squeeze kernel has a bug in regard to this. I'm still tracking it down.
i never really used squeeze for long on real hardware (as opposed to on VMs)...except in passing when sid was temporarily rather similar to what squeeze became. and i've always used later kernels - either custom-compiled or (more recently) by installing the later linux-image packages.
In my tests so far I haven't been able to reproduce such problems with Debian's 3.2.0 kernel.
If you have a RAID stripe that doesn't match then you really want it to be fixed, even if replacing a disk is not possible. Having two reads from the same address on a RAID-1 return different results is a bad thing. Having the data on a RAID-5 or RAID-6 array change in the process of recovering from a dead disk is also a bad thing.
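For what it's worth, the md layer can be told to rewrite inconsistent stripes rather than just count them, by writing "repair" instead of "check" to sync_action. A sketch using the standard sysfs interface (note that on RAID-1 the kernel simply propagates one copy over the others, so this fixes the inconsistency without knowing which copy was correct):

    #!/usr/bin/env python3
    # Kick off a repair pass on an md array: inconsistent stripes are
    # rewritten instead of merely counted.  Run as root.
    import sys

    def sync_action(array, action):
        # action is one of "check", "repair", "idle"
        with open("/sys/block/%s/md/sync_action" % array, "w") as f:
            f.write(action)

    if __name__ == "__main__":
        sync_action(sys.argv[1] if len(sys.argv) > 1 else "md0", "repair")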
true, but as above that's a "don't do that, then" situation. if you are getting symptoms like the above then either your hardware is bad or your kernel version is broken. in either case, don't do that. backup your data immediately and do something else that isn't going to lose your data.
Unless of course you have those things reported regularly without data loss.
ps: one of the reasons i love virtualisation is that it makes it so easy to experiment with this stuff and get an idea of whether it's worthwhile trying on real hardware. spinning up a few new vms is much less hassle than scrounging parts to build another test system.
Yes. It is unfortunate that the DRBD server reboot problem never appeared on any of my VM tests (not even when I knew what to look for) and only appeared in production.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/