
On Sat, 4 Feb 2012, James Harper <james.harper@bendigoit.com.au> wrote:
Does anyone have any advice other than throwing the server in the bin?
I'd suggest netconsole but that's not going to help if your ethX interface is crashing, or are you already confident you are seeing all the messages at crash time?
When the server crashes it can't be pinged from either Ethernet port. That means that either both ports are entirely unusable or eth1 can't access the LAN and eth0 has no routing table (the server is in a DC in Germany and I can't ping eth0 from the LAN).
Does /proc/mdstat indicate that your bitmap is there and hasn't disappeared?
After further investigation I have discovered that I described the problem incorrectly. In future I will take more care about pasting data from the affected system so that anyone who wants to offer advice will know the correct situation even if I describe it badly.

It seems that "check = " is not the same as "recovery = ", which is what you see when a drive has failed and been added again.

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      2917680447 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 55.9% (1631568128/2917680447) finish=765.1min speed=28013K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk

It seems that this is from /etc/cron.d/mdadm having a checkarray command which runs on the 3rd of the month. My slowest server didn't complete that in a reasonable amount of time, while the other servers, which aren't disk IO bound, completed it before I noticed.

The question is whether the checkarray command does any good. I've run a lot of systems with Linux software RAID and don't recall ever seeing it do any good, while a multi-day cron job with performance implications is going to do some harm.

The concepts of BTRFS seem more appealing to me. If I had a BTRFS volume doing the RAID-1 and the two disks differed, BTRFS would use checksums to determine which one was correct. Also, with 2.7TB of disks and only 450G in use, a BTRFS check would be a lot faster as it wouldn't check empty space. I'm assuming that BTRFS is good enough for Xen block devices...
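For anyone who wants to tell "check" from "recovery" in a script, and to sanity-check the finish estimate, here is a minimal Python sketch that parses the progress line from the /proc/mdstat output above. The helper name parse_progress and the regular expression are my own illustration, not part of mdadm. It assumes the block counts are in 1K units, which is consistent with the speed and finish figures the kernel printed.

```python
import re

# Progress line copied from the /proc/mdstat output quoted above.
MDSTAT_LINE = (
    "[===========>.........]  check = 55.9% (1631568128/2917680447) "
    "finish=765.1min speed=28013K/sec"
)

# Hypothetical pattern for illustration; the action names listed are the
# ones I expect md to print, other kernels may differ.
PROGRESS_RE = re.compile(
    r"(?P<action>check|recovery|resync|reshape)\s*=\s*(?P<pct>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)


def parse_progress(line):
    """Return the action and recomputed progress figures, or None."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    done = int(m.group("done"))
    total = int(m.group("total"))
    speed = int(m.group("speed"))  # K/sec, assuming 1K blocks
    return {
        "action": m.group("action"),
        "percent": 100.0 * done / total,
        # Remaining blocks divided by the current speed, in minutes.
        "remaining_minutes": (total - done) / speed / 60.0,
        "reported_finish_minutes": float(m.group("finish")),
    }


info = parse_progress(MDSTAT_LINE)
print(info["action"], round(info["percent"], 1), round(info["remaining_minutes"], 1))
```

The recomputed remaining time comes out within a fraction of a minute of the reported finish value, so the estimate is simply the remaining blocks divided by the current speed. On a disk IO bound server that speed, and therefore the estimate, can change a lot while the check runs.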
Is your sata/sas/whatever controller sharing an irq at all? (you don't say what vintage your servers are).
The system was ordered new at the end of last year. It's got an i7-2600 CPU and I don't think it can be particularly old. /proc/interrupts indicates that no IRQ is shared, although I've never learned much about the new style of interrupts (which involves numbers >15).
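As a rough way to check for shared IRQs without reading the table by eye, here is a small Python sketch. The sample text is hypothetical and only follows the general shape of /proc/interrupts; it is not from the affected server. The function name shared_irqs is my own, and the column layout (one count per CPU, a controller type, then a comma-separated handler list) is an assumption that varies between kernel versions.

```python
# Hypothetical sample in the general shape of /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
 16:        310          0   IO-APIC-fasteoi   ehci_hcd:usb1, ahci
 41:      98211          0   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
"""


def shared_irqs(text):
    """Map each numbered IRQ with more than one handler to its handler list."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row has one column per CPU
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC and other non-numbered rows
        # ncpu counters, then the controller type, then the handler list.
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        handlers = [h.strip() for h in fields[ncpu + 1].split(",")]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


print(shared_irqs(SAMPLE))
```

On a real system the text would come from reading /proc/interrupts. An empty result would match what I saw, that no IRQ is shared.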
serviced by the hung cpu you mention in **. I've never had a disk fail on a Linux RAID before though so I don't know if re-adding is something that might happen automatically next boot under any circumstance... it seems unlikely though, and unwanted. Unless the disk just 'disappeared' instead of reporting failure... But really, any of this should be logged, if not at crash time, then at next startup.
I've had failures in production before and not had it automatically re-add the disk. One thing though is that Linux software RAID is very hesitant to remove disks. The last time I threw a disk in the bin it was after the system BIOS gave a boot warning about SMART failures and the kernel gave SATA errors at boot, but software RAID kept it in the RAID set!
The fact that you have other servers with identical software does seem to indicate a hardware failure. I've never been particularly comfortable that Linux RAID handles as many corner case failures as well as some of the hardware RAID implementations.
On the contrary, I KNOW that Linux software RAID is written by competent people. I'm more confident with the reliability of Linux RAID than with ANY hardware RAID.
How expensive is the server, how expensive is the downtime, and how expensive is your time? And more importantly, how valuable is the data? These sort of crashes would make me worry that my data is just not going to be there one morning, or worse, is getting silently corrupted requiring dipping into backup archives to restore.
The backups are adequate. I could get the server replaced, but now that I've got it working well I don't want to replace something I know with something I don't. I'm happy to live without TSO.

Thanks for your suggestions.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/