
The LUV server has just been down for a few hours.

It all started a couple of weeks ago when I wrote a Mon plugin for running lm-sensors to monitor system temperature (a rough sketch of such a plugin is below). When I installed that on every system that matters to me I noticed that the hardware that runs the LUV server was reporting a CPU temperature greater than the recommended maximum. It isn't necessarily a bad thing for a CPU to run above the recommended maximum (80°C in this case) for a while, as long as it stays well clear of the critical temperature (100°C in this case), but running 24/7 above the maximum temperature is obviously a bad thing.

I changed the configuration of BOINC (as described in a recent LUV lecture) to use less CPU time (example below) and the temperature went down to ~72°C. Then over about a week the temperature slowly increased while the system load stayed the same. This indicated some sort of system cooling failure that appeared to be getting worse. The man who owns the hardware reported a system fault to Hetzner (the hosting company). They put new thermal paste on the CPU and cleaned out some dust.

Then the system didn't boot. After some investigation (including some delay due to the Hetzner KVM not working) it turned out that I had made a mistake: I had not removed an /etc/fstab line for a filesystem I had removed months ago, and this made the system halt the boot and prompt for a sysadmin login. To prevent this happening in future I added the mount option x-systemd.automount to all filesystems apart from root (example line below). This means that systemd will use its automount facility and not delay system startup if a filesystem can't be mounted, so if this problem happens again I will be able to ssh in as root to fix it.

As the system hadn't been rebooted since early this year there were some kernel fixes to apply. I don't think there was anything critical for our use, but as we already had some downtime it made sense to get other things done. I upgraded the kernel and then the system wouldn't boot again. The physical hardware running the LUV server was the last system I run that was still using Xen. When I got it to boot again (I can't remember the exact problem there) I couldn't get it to run Xen, so I purged Xen and installed KVM (setup commands below). It took another 15 minutes to set up KVM, but that's probably less time than it would have taken to debug Xen, and it gave the added benefit that I'm no longer using Xen!

Now I think that everything is fine. Email me off-list if you notice any problems.

--
My Main Blog: http://etbe.coker.com.au/
My Documents Blog: http://doc.coker.com.au/
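For anyone curious, a minimal sketch of the kind of plugin involved might look like the following. This is not the actual plugin, just an illustration: it assumes the usual "sensors" output format from lm-sensors, and the 80°C threshold is an example. Mon treats a non-zero exit status from a monitor as a failure and reports the script's output.

#!/usr/bin/env python3
# Sketch of a Mon-style temperature monitor: run "sensors" and fail
# (exit non-zero) if any reading exceeds a threshold.
import re
import subprocess
import sys

THRESHOLD = 80.0  # degrees Celsius; illustrative, tune per CPU

out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
# Match lines like "Core 0:  +75.0°C  (high = +80.0°C, crit = +100.0°C)"
hot = []
for label, temp in re.findall(r"^([^\n:]+):\s*\+?(\d+\.\d+)°C", out, re.M):
    if float(temp) > THRESHOLD:
        hot.append("%s=%s" % (label.strip(), temp))
if hot:
    print("temperature over %.0fC: %s" % (THRESHOLD, " ".join(hot)))
    sys.exit(1)
print("all temperatures ok")
sys.exit(0)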
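On the BOINC setting: one way to cap CPU use is the client's override preferences file. The path below is Debian's default and the 50% figure is only an example, not necessarily what the LUV server uses:

In /var/lib/boinc-client/global_prefs_override.xml:

<global_preferences>
    <cpu_usage_limit>50</cpu_usage_limit>
</global_preferences>

Running "boinccmd --read_global_prefs_override" afterwards makes the running client pick up the change without a restart.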
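The fstab change is just an extra mount option. An illustrative line (the device and mount point here are made up for the example):

/dev/sdb1  /data  ext4  x-systemd.automount,noatime  0  2

With x-systemd.automount, systemd mounts the filesystem on first access rather than at boot, so a missing or broken filesystem no longer halts the boot at the emergency prompt.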
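For reference, a basic KVM setup on Debian is roughly the following. The package names are Debian's, and the virt-install options are an illustrative example of importing an existing guest disk, not the exact commands used on the LUV server:

apt-get install qemu-kvm libvirt-daemon-system virtinst
virt-install --name guest1 --ram 4096 --vcpus 2 \
    --disk path=/dev/vg0/guest1 --import --graphics none

The --import option skips the installer and boots straight from the existing disk image, which suits moving a guest that already has an OS installed.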

Russell Coker via luv-main wrote:
> The LUV server has just been down for a few hours.
>
> It all started a couple of weeks ago when I wrote a Mon plugin for running lm-sensors to monitor system temperature. When I installed that on every system that matters to me I noticed that the hardware that runs the LUV server was reporting a CPU temperature greater than the recommended maximum. It isn't necessarily a bad thing for a CPU to run above the recommended maximum (80°C in this case) for a while, as long as it stays well clear of the critical temperature (100°C in this case), but running 24/7 above the maximum temperature is obviously a bad thing.
I seem to remember a law which relates operating temperature to operating life and reliability (I can't remember the name). Anyway, I try to run my CPUs with as little rise above ambient (i.e. the case temperature) as possible.

As mentioned in a previous email, I replaced the large, heavy heatsink-and-fan combo (which uses expletive-deleted plastic clips) with an off-the-shelf water pump and case fan combo (a Corsair H55 liquid cooler, about $100). The water pump is quite light, sits very close to the CPU, and screws directly to the motherboard with decent screws and a base plate, ensuring a highly reliable thermal connection.

On an Asus M4A79XTD_EV motherboard with 16GB RAM and an AMD Phenom II X4 965 CPU (4 cores, 3.4GHz), I can drive the CPU to 100% on all 4 cores for an hour (doing video conversions :-) ) and not see the CPU temperature rise more than 20°C above a case temperature of 30°C (which obviously depends on room temperature). Best thing since sliced bread!

regards Rohan McLeod

On 10/09/16 08:32, Rohan McLeod via luv-main wrote:
> I seem to remember a law which relates operating temperature to operating life and reliability (I can't remember the name).
You might be thinking of the Arrhenius relationship, which some manufacturers use to model the life stress of semiconductor components.

Temperature can cause other failures too: affecting fluid bearings, drying out heat sink compound, and so on. But failures are also influenced by the thermal cycling range of the components, not just by the absolute temperature, since the cycling is what causes stress fractures in wire and solder bonds.

I have seen failures in (spinning) hard drives which were running in over-temperature conditions, but only at extremes (60°C+). A large-scale Google study showed that over normal temperature ranges (up to about 45°C), the failure rate can actually decrease as temperature rises. It would be interesting to see a similar study on SSDs.

http://static.googleusercontent.com/media/research.google.com/en//archive/di...

Glenn
--
sks-keyservers.net 0x6d656d65
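For reference, the usual acceleration-factor form of the Arrhenius relationship in semiconductor reliability models (E_a is the activation energy of the failure mechanism, k is Boltzmann's constant, and both temperatures are absolute, in Kelvin):

AF = \exp\left[ \frac{E_a}{k} \left( \frac{1}{T_{use}} - \frac{1}{T_{stress}} \right) \right]

In other words, expected life shrinks roughly exponentially as operating temperature rises, which is why a sustained increase of even a few degrees matters.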