
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 06:31:49PM +1000, Russell Coker wrote:
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
We have issues where the monthly mdadm raid check grinds the system to a halt.
do you find that these monthly cron jobs are actually useful? [...]
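As an aside on the original complaint about the check grinding the system to a halt: the kernel's md check/resync rate can be throttled through the standard /proc/sys/dev/raid knobs. A minimal sketch in Python (run as root; the limits chosen are purely illustrative, not values anyone in this thread recommended):

    #!/usr/bin/env python3
    # Throttle the md check/resync rate so the monthly check is less
    # intrusive.  speed_limit_min/max are in KiB/s per device; the
    # numbers below are illustrative only.
    LIMIT_MIN = "/proc/sys/dev/raid/speed_limit_min"
    LIMIT_MAX = "/proc/sys/dev/raid/speed_limit_max"

    def set_limit(path, kib_per_sec):
        with open(path, "w") as f:
            f.write(str(kib_per_sec))

    if __name__ == "__main__":
        set_limit(LIMIT_MIN, 1000)   # let the check crawl when other I/O is running
        set_limit(LIMIT_MAX, 20000)  # cap it even when the disks are otherwise idle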
deb http://www.coker.com.au squeeze misc
In the above Debian repository for i386 and amd64 I have a version of mdadm patched to send email when the disks in an array have different content. I am seeing lots of errors from all systems; it seems that the RAID code in the kernel reports 128 sectors (64K) of mismatched disk space for every error (all the reported numbers are multiples of 128).
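For anyone who wants something similar without the patched package: the stock kernel exposes the same count through sysfs, so a rough equivalent of that report (minus the email step) looks like the sketch below. This is only an illustration of the idea, not the actual patch.

    #!/usr/bin/env python3
    # Sketch: after a check, report any md array whose mismatch_cnt is
    # non-zero.  mismatch_cnt is in 512-byte sectors, and the kernel
    # accounts mismatches in 64K chunks, hence the multiples of 128
    # mentioned above.
    import glob, os

    for md in glob.glob("/sys/block/md*/md"):
        array = os.path.basename(os.path.dirname(md))
        with open(os.path.join(md, "mismatch_cnt")) as f:
            sectors = int(f.read().strip())
        if sectors:
            print("%s: %d mismatched sectors (%d KiB)"
                  % (array, sectors, sectors * 512 // 1024))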
if mdadm software raid is doing that, then to me it says "don't use mdadm raid" rather than "stress-test raid every month and hope for the best".
however, i've been using mdadm for years without seeing any sign of that (and yes, with the monthly mdadm raid checks enabled. i used to grumble about it slowing my system down but never made the decision to disable it).
You won't see such an obvious sign because you aren't running the version of mdadm that I patched to send email about it. Maybe logwatch/logcheck would inform you.
first question that occurs to me is: is there a bug in the raid code itself or is the bug in the raid checking code?
Other reports that I've seen from a reliable source say that you can get lots of errors while still having the files match the correct md5sums. This suggests that the problem is in the RAID checking code as the actual data returned is still correct.
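A minimal sketch of that kind of cross-check, assuming you keep an md5sum-style manifest of files on the array (the manifest path and format here are hypothetical): if the hashes still verify while mismatch_cnt is non-zero, the data being returned is intact and the problem is more likely in the checking code.

    #!/usr/bin/env python3
    # Verify files against a stored manifest of "md5sum  /path" lines
    # (same layout as md5sum -c input).
    import hashlib, sys

    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    bad = 0
    with open(sys.argv[1]) as manifest:
        for line in manifest:
            expected, path = line.rstrip("\n").split("  ", 1)
            if md5_of(path) != expected:
                print("MISMATCH: %s" % path)
                bad += 1
    sys.exit(1 if bad else 0)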
Also I suspect that the Squeeze kernel has a bug in regard to this. I'm still tracking it down.
i never really used squeeze for long on real hardware (as opposed to on VMs)...except in passing when sid was temporarily rather similar to what squeeze became. and i've always used later kernels - either custom-compiled or (more recently) by installing the later linux-image packages.
In my tests so far I haven't been able to reproduce such problems with Debian's 3.2.0 kernel.
If you have a RAID stripe that doesn't match then you really want it to be fixed, even if replacing a disk is not possible. Having two reads from the same address on a RAID-1 return different results is a bad thing. Having the data on a RAID-5 or RAID-6 array change in the process of recovering from a dead disk is also a bad thing.
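For what it's worth, the md layer can be told to rewrite inconsistent stripes rather than just count them, by writing "repair" instead of "check" to sync_action. A sketch using the standard sysfs interface (note that on RAID-1 the kernel simply propagates one copy over the others, so this fixes the inconsistency without knowing which copy was correct):

    #!/usr/bin/env python3
    # Kick off a repair pass on an md array: inconsistent stripes are
    # rewritten instead of merely counted.  Run as root.
    import sys

    def sync_action(array, action):
        # action is one of "check", "repair", "idle"
        with open("/sys/block/%s/md/sync_action" % array, "w") as f:
            f.write(action)

    if __name__ == "__main__":
        sync_action(sys.argv[1] if len(sys.argv) > 1 else "md0", "repair")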
true, but as above that's a "don't do that, then" situation. if you are getting symptoms like the above then either your hardware is bad or your kernel version is broken. in either case, don't do that. backup your data immediately and do something else that isn't going to lose your data.
Unless of course you have those things reported regularly without data loss.
ps: one of the reasons i love virtualisation is that it makes it so easy to experiment with this stuff and get an idea of whether it's worthwhile trying on real hardware. spinning up a few new vms is much less hassle than scrounging parts to build another test system.
Yes. It is unfortunate that the DRBD server reboot problem never appeared on any of my VM tests (not even when I knew what to look for) and only appeared in production.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/