
The LUV server has just been down for a few hours.

It all started a couple of weeks ago when I wrote a Mon plugin for running lm-sensors to monitor system temperature (a rough sketch of such a plugin is below). When I installed that on every system that matters to me I noticed that the hardware that runs the LUV server was reporting a CPU temperature greater than the recommended maximum. It isn't necessarily a bad thing for a CPU to run above the recommended maximum (80°C in this case) for a while, as long as it stays well clear of the critical temperature (100°C in this case), but running 24/7 above the maximum temperature is obviously a bad thing.

I changed the configuration of BOINC (as described in a recent LUV lecture) to use less CPU time (example below) and the temperature went down to ~72°C. Then over about a week the temperature slowly increased while the system load stayed the same. This indicated some sort of system cooling failure that appeared to be getting worse. The man who owns the hardware reported a system fault to Hetzner (the hosting company). They put new thermal paste on the CPU and cleaned out some dust.

Then the system didn't boot. After some investigation (including some delay due to the Hetzner KVM not working) it turned out that I had made a mistake: I had not removed an /etc/fstab line for a filesystem I had removed months ago, and this made the system halt the boot and prompt for a sysadmin login. To prevent this happening in future I added the mount option x-systemd.automount to all filesystems apart from root (example line below). This means that systemd will use its automount facility and not delay system startup if a filesystem can't be mounted, so if this problem happens again I will be able to ssh in as root to fix it.

As the system hadn't been rebooted since early this year there were some kernel fixes to apply. I don't think there was anything critical for our use, but as we already had some downtime it made sense to get other things done. I upgraded the kernel and then the system wouldn't boot again. The physical hardware running the LUV server was the last system I run that was still using Xen. When I got it to boot again (I can't remember the exact problem there) I couldn't get it to run Xen, so I purged Xen and installed KVM (setup commands below). It took another 15 minutes to set up KVM, but that's probably less time than it would have taken to debug Xen, and it gave the added benefit that I'm no longer using Xen!

Now I think that everything is fine. Email me off-list if you notice any problems.

--
My Main Blog: http://etbe.coker.com.au/
My Documents Blog: http://doc.coker.com.au/
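For anyone curious, a minimal sketch of the kind of plugin involved might look like the following. This is not the actual plugin, just an illustration: it assumes the usual "sensors" output format from lm-sensors, and the 80°C threshold is an example. Mon treats a non-zero exit status from a monitor as a failure and reports the script's output.

#!/usr/bin/env python3
# Sketch of a Mon-style temperature monitor: run "sensors" and fail
# (exit non-zero) if any reading exceeds a threshold.
import re
import subprocess
import sys

THRESHOLD = 80.0  # degrees Celsius; illustrative, tune per CPU

out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
# Match lines like "Core 0:  +75.0°C  (high = +80.0°C, crit = +100.0°C)"
hot = []
for label, temp in re.findall(r"^([^\n:]+):\s*\+?(\d+\.\d+)°C", out, re.M):
    if float(temp) > THRESHOLD:
        hot.append("%s=%s" % (label.strip(), temp))
if hot:
    print("temperature over %.0fC: %s" % (THRESHOLD, " ".join(hot)))
    sys.exit(1)
print("all temperatures ok")
sys.exit(0)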
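On the BOINC setting: one way to cap CPU use is the client's override preferences file. The path below is Debian's default and the 50% figure is only an example, not necessarily what the LUV server uses:

In /var/lib/boinc-client/global_prefs_override.xml:

<global_preferences>
    <cpu_usage_limit>50</cpu_usage_limit>
</global_preferences>

Running "boinccmd --read_global_prefs_override" afterwards makes the running client pick up the change without a restart.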
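The fstab change is just an extra mount option. An illustrative line (the device and mount point here are made up for the example):

/dev/sdb1  /data  ext4  x-systemd.automount,noatime  0  2

With x-systemd.automount, systemd mounts the filesystem on first access rather than at boot, so a missing or broken filesystem no longer halts the boot at the emergency prompt.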
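For reference, a basic KVM setup on Debian is roughly the following. The package names are Debian's, and the virt-install options are an illustrative example of importing an existing guest disk, not the exact commands used on the LUV server:

apt-get install qemu-kvm libvirt-daemon-system virtinst
virt-install --name guest1 --ram 4096 --vcpus 2 \
    --disk path=/dev/vg0/guest1 --import --graphics none

The --import option skips the installer and boots straight from the existing disk image, which suits moving a guest that already has an OS installed.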

Russell Coker via luv-main wrote:
> The LUV server has just been down for a few hours.
>
> It all started a couple of weeks ago when I wrote a Mon plugin for running lm-sensors to monitor system temperature. When I installed that on every system that matters to me I noticed that the hardware that runs the LUV server was reporting a CPU temperature greater than the recommended maximum. It isn't necessarily a bad thing for a CPU to run above the recommended maximum (80°C in this case) for a while, as long as it stays well clear of the critical temperature (100°C in this case), but running 24/7 above the maximum temperature is obviously a bad thing.
I seem to remember a law which relates operating temperature to operating life and reliability (I can't remember the name). Anyway, I try to run my CPUs with as little rise above ambient (i.e. the case temperature) as possible.

As mentioned in a previous email, I replaced the large, heavy heatsink-and-fan combo (which uses expletive-deleted plastic clips) with an off-the-shelf water pump and case fan combo (a Corsair H55 liquid cooler, about $100). The water pump is quite light, sits very close to the CPU, and screws directly to the motherboard with decent screws and a base plate, ensuring a highly reliable thermal connection.

On an Asus M4A79XTD_EV motherboard with 16GB RAM and an AMD Phenom II X4 965 CPU (4 cores, 3.4GHz), I can drive the CPU to 100% on all 4 cores for an hour (doing video conversions :-) ) and not see the CPU temperature rise more than 20°C above a case temperature of 30°C (which obviously depends on room temperature). Best thing since sliced bread!

regards Rohan McLeod

On 10/09/16 08:32, Rohan McLeod via luv-main wrote:
> I seem to remember a law which relates operating temperature to operating life and reliability (I can't remember the name).
You might be thinking of the Arrhenius relationship, which some manufacturers use to model the life stress of semiconductor components.

Temperature can cause other failures too: affecting fluid bearings, drying out heat sink compound, and so on. But failures are also influenced by the thermal cycling range of the components, not just by the absolute temperature, since the cycling is what causes stress fractures in wire and solder bonds.

I have seen failures in (spinning) hard drives which were running in over-temperature conditions, but only at extremes (60°C+). A large-scale Google study showed that over normal temperature ranges (up to about 45°C), the failure rate can actually decrease as temperature rises. It would be interesting to see a similar study on SSDs.

http://static.googleusercontent.com/media/research.google.com/en//archive/di...

Glenn
--
sks-keyservers.net 0x6d656d65
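For reference, the usual acceleration-factor form of the Arrhenius relationship in semiconductor reliability models (E_a is the activation energy of the failure mechanism, k is Boltzmann's constant, and both temperatures are absolute, in Kelvin):

AF = \exp\left[ \frac{E_a}{k} \left( \frac{1}{T_{use}} - \frac{1}{T_{stress}} \right) \right]

In other words, expected life shrinks roughly exponentially as operating temperature rises, which is why a sustained increase of even a few degrees matters.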