
I run a bunch of servers with Linux software RAID-1. I use bitmaps on all of them because the ongoing overhead (*) of bitmaps is better than the occasional overhead of a full resync.
Recently one of my servers suddenly decided to do a complete RAID-1 resync for no apparent reason. Other servers with the same versions of all software (Debian/Squeeze with all updates) didn't do it. The server in question did crash a few times recently (**). Is a server crash likely to result in an entire RAID resync even when bitmaps are used?
Does anyone have any advice other than throwing the server in the bin?
I'd suggest netconsole, but that's not going to help if your ethX interface is crashing. Or are you already confident you are seeing all the messages at crash time?

Does /proc/mdstat indicate that your bitmap is there and hasn't disappeared? Is your SATA/SAS/whatever controller sharing an IRQ at all? (You don't say what vintage your servers are.)

One thing that would cause a resync is a disk being ejected from the array and then re-added. This could happen if one of your disk controllers, or one channel of your disk controller, hung, e.g. maybe it was being serviced by the hung CPU you mention in **. I've never had a disk fail on a Linux RAID before, though, so I don't know if re-adding is something that might happen automatically at next boot under any circumstance... it seems unlikely, and unwanted. Unless the disk just 'disappeared' instead of reporting failure... But really, any of this should be logged, if not at crash time then at next startup.

The fact that you have other servers with identical software that don't do this does seem to indicate a hardware failure. I've never been particularly comfortable that Linux RAID handles as many corner-case failures as well as some of the hardware RAID implementations do.

How expensive is the server, how expensive is the downtime, and how expensive is your time? And more importantly, how valuable is the data? These sorts of crashes would make me worry that my data is just not going to be there one morning or, worse, is getting silently corrupted, requiring dipping into backup archives to restore.

James
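
For the /proc/mdstat check, here is a rough Python sketch that reports, per array, whether a "bitmap:" line is present. It assumes the usual layout where each array starts with a line like "md0 : active raid1 ..." followed by indented detail lines; the function names bitmap_status and report are made up for this example, and the parsing is only as good as that assumption.

```python
import re

def bitmap_status(mdstat_text):
    """Map each md array name to True if a 'bitmap:' line follows it.

    Assumes the usual /proc/mdstat layout: an 'mdN : ...' header line,
    then indented detail lines, then a blank line before the next array.
    """
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            status[current] = False
        elif not line.strip():
            # A blank line ends the current array's block.
            current = None
        elif current and line.strip().startswith("bitmap:"):
            status[current] = True
    return status

def report(path="/proc/mdstat"):
    """Print one line per array; call this by hand on the server."""
    with open(path) as f:
        for name, has_bitmap in sorted(bitmap_status(f.read()).items()):
            print("%s: %s" % (name, "bitmap present" if has_bitmap else "NO bitmap"))
```

Calling report() on each server and comparing the output with the healthy machines would show quickly whether the bitmap has gone missing on the one that resynced.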