Diagnosing System Crashes

I'm experiencing frequent system crashes at the moment and I've run out of clues for diagnosing the issue.

Crash is a full lock up, display is frozen yet coherent, no response from kbd or mouse. The caps lock light does not flash nor can it be toggled, and SysRq combos do not work. Network is also unavailable. Nothing relevant in logs.

Crashes are unpredictable, occurring at idle or under load. Uptime can range from 2 seconds to a week at best. Loading video content seems to be the most reliable way to trigger it though.

OS is Debian testing. System has previously been stable, and first known crash did not correlate to patching.

Principal system components are: AMD FX8350, Asus 990fx Sabertooth MB, Samsung 840 Evo SSD for OS, Corsair RM750 PSU

Investigations so far:

Memory
------
- Memtest86: ran for 24 hours, no faults reported
- Problem persists when running on either or both of the compatible DIMMs I have for this board

Misc. Drivers/HW
----------------
- Removed non critical HW: problem persists

GFX Cards/Drivers
-----------------
No integrated GFX with MB or CPU
- Ran entry level GFX cards with chipsets from AMD and Nvidia using open drivers for each: problem persists

Kernel/OS
---------
Configured kdump and can analyse dumps resulting from simulating a crash via 'echo c > /proc/sysrq-trigger' or "Alt-SysRq-c". No output results from the crashes I'm complaining about.

Problem persists when booting from live CDs from different distributions.

Any suggestions would be appreciated, I don't think I've missed any obvious tests. My next step is to take MB and CPU back to the shop (who thankfully are pretty cool for these type of claims).

Cheers,
Dave
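
A note on the kdump test described above: kdump can only capture a dump if the kernel actually panics and manages to jump into the capture kernel, so a hard hang that never panics would be expected to leave nothing behind. Before trusting a "no dump" result, it may still be worth confirming that the capture kernel is loaded on every boot. The following is a minimal Python sketch of such a check, assuming the usual Linux locations /sys/kernel/kexec_crash_loaded and /proc/cmdline; the function names are made up for illustration.

```python
from pathlib import Path


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the kernel reports a loaded kdump capture kernel.

    The flag file is assumed to contain "1" when a crash kernel is loaded.
    """
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:
        # File missing or unreadable: treat as "not loaded".
        return False


def crashkernel_reserved(cmdline_path="/proc/cmdline"):
    """Return True if the boot command line carries a crashkernel= argument."""
    try:
        args = Path(cmdline_path).read_text().split()
    except OSError:
        return False
    return any(arg.startswith("crashkernel=") for arg in args)
```

If either function returns False after a reboot, the absence of a dump says little about the crash itself.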

> I'm experiencing frequent system crashes at the moment and I've run out of clues for diagnosing the issue.
> Crash is a full lock up, display is frozen yet coherent, no response from kbd or mouse. The caps lock light does not flash nor can it be toggled, and SysRq combos do not work. Network is also unavailable. Nothing relevant in logs.
<snip>

If your perfectly fine system suddenly started crashing and no changes corresponded to the onset of crashing then it pretty much has to be hardware (or a linux virus [1]).

If it was a memory error then you'd expect the occasional hard lockup but more often random segfaults etc, so I'm thinking overheating, CPU, system board, or power supply.

You haven't mentioned heat. Are the CPU and system board temperature sensor readings acceptable? Heat is likely going to make any intermittent problem worse. Does the frequency of crashes vary with ambient temperature? Take the side off the case and point a large fan at it if you can and see if that has any impact.

Has the system ever been overclocked?

Power supply is next easiest thing to test if you can find a spare. Do the input voltages seem okay? Even if they do it's possible it could still be the power supply.

Good luck. These sorts of once-a-week problems are a real pain to troubleshoot. If this was a business machine it would be in the bin - the cost of a system board and cpu isn't worth the effort.

James

[1] I'm kidding of course. A linux virus would be extremely well written and would never cause a crash!

James Harper - Systems Technician, Bendigo IT
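
One way to act on the temperature question is to log sensor readings to disk continuously, so that the last values before a lock-up survive the reboot. The following is a minimal Python sketch, assuming the standard Linux hwmon sysfs layout (/sys/class/hwmon/hwmonN/tempM_input, in millidegrees Celsius). The function names and the log path are hypothetical, and raw values read this way are no better calibrated than uncorrected `sensors` output.

```python
import os
import time
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return a list of (chip, label, degrees_celsius) from hwmon sysfs files."""
    readings = []
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read or report garbage
            base = temp_file.name[: -len("_input")]
            label_file = chip_dir / (base + "_label")
            label = label_file.read_text().strip() if label_file.exists() else base
            readings.append((chip, label, millidegrees / 1000.0))
    return readings


def log_temps_once(log_path, root="/sys/class/hwmon"):
    """Append one timestamped line of readings and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    fields = " ".join(f"{chip}/{label}={temp:.1f}C"
                      for chip, label, temp in read_hwmon_temps(root))
    line = f"{stamp} {fields}".rstrip()
    with open(log_path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())  # a hard lock-up would lose unsynced writes
    return line
```

Calling `log_temps_once("/var/tmp/temps.log")` from a loop that sleeps a few seconds between calls would leave a trail to inspect after the next freeze. A flat trail right up to the crash would argue against overheating, though it would not rule out a sensor that simply reads wrong.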

Thanks James.

On 18/05/14 11:18, James Harper wrote:
> You haven't mentioned heat. Are the CPU and system board temperature sensor readings acceptable?

Yes they are. I'm not 100% confident in the measurements for this board, but they do seem to stay in step with the readings from the BIOS. AFAICT no evidence of overheating.

> Has the system ever been overclocked?

No.

> Power supply is next easiest thing to test if you can find a spare. Do the input voltages seem okay? Even if they do it's possible it could still be the power supply.

My sensors output is pretty wacky (haven't gotten around to a proper config), but the BIOS reports numbers that look OK. I have another PSU here and I'll give that a spin as a last resort before submitting to the shop.

Cheers,
Dave

On 18/05/14 14:05, Dave Hellewell wrote:
> > Power supply is next easiest thing to test if you can find a spare. Do the input voltages seem okay? Even if they do it's possible it could still be the power supply.
>
> My sensors output is pretty wacky (haven't gotten around to a proper config), but the BIOS reports numbers that look OK. I have another PSU here and I'll give that a spin as a last resort before submitting to the shop.

Swapped in a white box type PSU which I believe to be new. Crashing continued. Oh well...

-Dave

James Harper <James.Harper@bendigoit.com.au> writes:
> > Crash is a full lock up, display is frozen yet coherent, no response from kbd or mouse. The caps lock light does not flash nor can it be toggled, and SysRq combos do not work. Network is also unavailable.

I assume "network is also unavailable" means you tried pinging the host and got no response. Unlikely to be useful, but exfiltrating dmesg using netconsole might show you something useful in its last moments.

> If it was a memory error then you'd expect the occasional hard lockup but more often random segfaults etc, so I'm thinking overheating, CPU, system board, or power supply.

+1; I had problems like this with switched mode PSUs in a 1RU -- and swapping in a replacement unit didn't help, because it was (probably) crap power input from mains. Either swapping to a conventional PSU or putting it behind a UPS fixed it -- I can't remember which I did first.

On 19/05/14 10:56, Trent W. Buck wrote:
> I assume "network is also unavailable" means you tried pinging the host and got no response. Unlikely to be useful, but exfiltrate dmesg using netconsole might show you something useful in its last moments.

Yep. All IO on the host in question is lost. I'm not familiar with netconsole. Is the gamble that messages might appear on the console before being written to disk, assuming networking is alive?

-Dave
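
For anyone else unfamiliar with netconsole: it has the kernel send its log messages as UDP packets straight from the network driver, without going through syslog or the disk, so a message printed just before a hang has a chance of reaching another machine even if it never reaches the local logs. Whether anything gets out during a lock-up this severe is not guaranteed. The receiving side only needs something listening on the target UDP port. The following is a minimal Python sketch of such a receiver, assuming netconsole's default target port of 6666; the function names are made up for illustration, and tools such as netcat would do the same job.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_netconsole(sock, log_file, max_datagrams=None):
    """Write each received datagram to log_file with a timestamp and sender IP.

    Runs until max_datagrams have arrived, or forever if it is None.
    """
    received = 0
    while max_datagrams is None or received < max_datagrams:
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log_file.write(f"{stamp} {sender_ip} {text}\n")
        log_file.flush()  # keep the log current in case the listener is killed
        received += 1
    return received
```

On the second machine, `receive_netconsole(open_listener(), open("netconsole.log", "a"))` would run until interrupted. The crashing host then needs the netconsole module configured to point at that machine's address.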

On 18.05.14 10:52, Dave Hellewell wrote:
> I'm experiencing frequent system crashes at the moment and I've run out of clues for diagnosing the issue.
> Crash is a full lock up, display is frozen yet coherent, no response from kbd or mouse. The caps lock light does not flash nor can it be toggled, and SysRq combos do not work. Network is also unavailable. Nothing relevant in logs.

Have you tried running "find" for a core file? Dunno if one would be generated in your circumstances. They can be had on a segmentation fault, but we don't know what you have. If "ulimit -c" gives "0", then "ulimit -c unlimited" will enable them.

Erik
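
The core file check can be scripted as well. The following is a minimal Python sketch using the standard `resource` module to inspect the same limit that "ulimit -c" reports, plus a directory walk standing in for "find". The function names are hypothetical, and the filename pattern ("core" or "core.*") is an assumption, since the kernel.core_pattern setting or a service such as systemd-coredump can send cores elsewhere. Core files come from crashing user processes, so a whole-system freeze may well leave none.

```python
import os
import resource


def core_limit():
    """Return the (soft, hard) core file size limits for this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def enable_core_dumps():
    """Raise the soft core limit to the hard limit.

    This matches "ulimit -c unlimited" only when the hard limit is unlimited.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def find_core_files(root):
    """Return sorted paths under root whose filename is "core" or "core.*"."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "core" or name.startswith("core."):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

The limit set by `enable_core_dumps()` applies to the calling process and anything it starts afterwards, just as "ulimit -c" applies to the current shell and its children.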