
The LUV server has just been down for a few hours. It all started a couple of weeks ago when I wrote a Mon plugin that runs lm-sensors to monitor system temperature. When I installed it on every system that matters to me I noticed that the hardware that runs the LUV server was reporting a CPU temperature greater than the recommended maximum. It isn't necessarily a bad thing for a CPU to run above the recommended maximum (80C in this case) for a while as long as it is well clear of the critical temperature (100C in this case), but running 24*7 above the maximum temperature is obviously a bad thing.

I changed the configuration of BOINC (as described in a recent LUV lecture) to use less CPU time and the temperature went down to ~72C. Then over about a week the temperature slowly increased while the system load stayed the same. This indicated some sort of cooling failure that appeared to be getting worse.

The man who owns the hardware reported a system fault to Hetzner (the hosting company). They put new thermal paste on the CPU and cleaned out some dust. Then the system didn't boot. After some investigation (including some delay due to the Hetzner KVM console not working) it turned out that I had made a mistake in not removing an /etc/fstab line for a filesystem I had removed months ago, and this had made the system halt the boot and prompt for a sysadmin login. To prevent this happening in future I added the mount option x-systemd.automount to all filesystems apart from root. This means that systemd will use its automount facility and not delay system startup if a filesystem can't be mounted. If this problem happens again I will be able to ssh in as root to fix it.

As the system hadn't been rebooted since early this year there were some kernel fixes to apply. I don't think there was anything critical for our use but as we already had some downtime it made sense to get other things done. I upgraded the kernel and then the system wouldn't boot again.
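For anyone who wants to make the same /etc/fstab change (adding x-systemd.automount to every filesystem apart from root), here is a minimal Python sketch of one way it could be scripted. It is an illustration only, not necessarily how the change was made on the LUV server. The function name add_automount is hypothetical, and skipping swap entries is an assumption of this sketch, since swap has no mount point to automount.

```python
def add_automount(fstab_text: str, option: str = "x-systemd.automount") -> str:
    """Return fstab_text with `option` appended to every non-root mount entry.

    Hypothetical helper for illustration; it only transforms text and never
    touches /etc/fstab itself.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave blank lines, comments, and malformed entries untouched.
        if len(fields) < 4 or fields[0].startswith("#"):
            out.append(line)
            continue
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        # Skip root (as in the post), swap (an assumption of this sketch),
        # and entries that already carry the option.
        if mount_point == "/" or fs_type == "swap" or option in options:
            out.append(line)
            continue
        fields[3] = ",".join(options + [option])
        # Rewritten entries are re-joined with tabs, so their original
        # column spacing is not preserved.
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Print the proposed file to stdout so it can be reviewed by hand.
    with open("/etc/fstab") as f:
        print(add_automount(f.read()), end="")
```

Run as a script it only prints the modified text, so the result can be compared with the current /etc/fstab before anything is replaced. It is also worth checking for stale entries at the same time, as the automount option works around a missing filesystem but doesn't remove the bad line.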
The physical hardware running the LUV server was the last system I run that was still using Xen. When I got it to boot again (I can't remember the exact problem there) I couldn't get it to run Xen. I then purged Xen and installed KVM. It took another 15 minutes to set up KVM but that's probably less time than it would have taken to debug Xen, and it gave the added benefit that I'm no longer using Xen!

Now I think that everything is fine. Email me off-list if you notice any problems.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/