Re: Diagnosing System Crashes

Dave Hellewell <dave.hellewell@gmail.com> said,
Power supply is next easiest thing to test if you can find a spare. Do the input voltages seem okay? Even if they do it's possible it could still be the power supply.
My sensors output is pretty wacky (haven't gotten around to a proper config), but the BIOS reports numbers that look OK. I have another PSU here and I'll give that a spin as a last resort before submitting to the shop.
Neither the BIOS nor any software voltage monitoring is any good for diagnosing PSU problems, as neither reads the voltage continuously. For the same reason most digital voltmeters are no good. What can happen (I have had it twice) is that the PSU voltages can spike down very quickly and cause the system untold problems. Such a spike will NOT be picked up by most voltmeters. I have successfully used analog meters to check this though (an AVO). All voltages need to be checked, including the 12 V line. In both my cases the system lockups were caused by the 12 V line dropping low (11.2 volts) for only around half a second, causing the hard disks to shut down. This caused an unrecoverable error from the kernel. The symptoms in my case were an almost complete hard lockup, but with the mouse pointer still working. Lindsay
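To illustrate the sampling point above, here is a minimal Python sketch of why a slowly polling monitor can miss a half-second dip. The 12.0 V nominal, the +/-5 % tolerance band, the polling intervals and the synthetic voltage trace are all assumptions made for illustration; only the 11.2 V level and the roughly half-second duration come from the message above.

```python
# Sketch: why a slow-polling monitor can miss a brief 12 V dip.
# Assumptions (not from the thread): a +/-5 % tolerance band on the 12 V rail,
# the polling intervals used below, and a purely synthetic voltage trace.

NOMINAL_V = 12.0
TOLERANCE = 0.05  # assumed +/-5 % band


def in_tolerance(volts, nominal=NOMINAL_V, tolerance=TOLERANCE):
    """Return True if volts lies within nominal +/- tolerance."""
    return abs(volts - nominal) <= nominal * tolerance


def rail_voltage(t_ms, dip_start_ms=5300, dip_len_ms=500, dip_volts=11.2):
    """Synthetic rail: nominal except for one dip of dip_len_ms starting at dip_start_ms."""
    if dip_start_ms <= t_ms < dip_start_ms + dip_len_ms:
        return dip_volts
    return NOMINAL_V


def count_out_of_tolerance(period_ms, duration_ms=10000):
    """Sample the synthetic rail every period_ms and count out-of-band readings."""
    return sum(
        not in_tolerance(rail_voltage(t_ms))
        for t_ms in range(0, duration_ms, period_ms)
    )


if __name__ == "__main__":
    print("11.2 V within band:", in_tolerance(11.2))
    print("bad readings when polling every 2 s:  ", count_out_of_tolerance(2000))
    print("bad readings when polling every 0.1 s:", count_out_of_tolerance(100))
```

With this particular trace the 2-second poller never lands inside the dip, while the 0.1-second poller does. A real monitor's sampling interval, and whether it lines up with a dip, will of course vary.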

One problem with that type of hdd hang is that switching vt can require disk access so it can be impossible to get the log data. Having a window open for viewing the syslog via /dev/xconsole or logging to another system can be helpful. On 18 May 2014 4:17:29 PM AEST, zlinw@mcmedia.com.au wrote:
In both my cases the system lockups were caused by the 12V line dropping low (11.2 volts) for only around half a second, causing the hard disks to shut down. This caused an unrecoverable error from the kernel. The symptoms in my case were an almost complete hard lockup but with the mouse pointer still working.

One problem with that type of hdd hang is that switching vt can require disk access so it can be impossible to get the log data. Having a window open for viewing the syslog via /dev/xconsole or logging to another system can be helpful.
Netconsole is also useful. But not in the OP's case of course. James
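For the "logging to another system" idea, netconsole sends kernel messages as UDP datagrams, so the receiving machine only needs something listening on the target port. Below is a minimal Python sketch of such a receiver. The port number 6666, the log file name and the function names are assumptions for illustration, and the crashing machine still has to be configured separately to send to this host.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed here as the netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def format_packet(data, sender, now=None):
    """Turn one received datagram into a timestamped, single-line log entry."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender} {text}"


def log_forever(sock, path="netconsole.log"):
    """Append every datagram to path, flushing each line so the last
    messages before a hang are already on the receiving machine's disk."""
    with open(path, "a", encoding="utf-8") as log:
        while True:
            data, (sender, _port) = sock.recvfrom(65535)
            log.write(format_packet(data, sender) + "\n")
            log.flush()
```

To use it, call `log_forever(open_listener())` on the machine that stays up. It runs until interrupted.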

All voltages need to be checked, including the 12 V line. In both my cases the system lockups were caused by the 12 V line dropping low (11.2 volts) for only around half a second, causing the hard disks to shut down. This caused an unrecoverable error from the kernel.
The symptoms in my case were an almost complete hard lockup, but with the mouse pointer still working.
If you have a clock showing seconds, is it still ticking? Even if power is OK, the hard disk may lock up for other reasons, as mentioned above. Have you tried running the system from a LiveCD with no HD connected at all? If loading video content causes the crash, this may suggest it occurs under high CPU load. I have seen CPUs lock up when throttling up and down under changing load. Can you set the system so CPU throttling is turned off and see what happens? That probably means it will run at a rather low frequency (a BIOS option?). Daniel.
-- dan062 <dan062@yahoo.com.au>

On 18/05/14 18:33, Dan062 wrote:
If you have a clock showing seconds, is it still ticking?
No.
Even if power is OK, the hard disk may lock up for other reasons, as mentioned above.
All IO is dead as a result of the crash.
Have you tried running the system from LiveCD with no HD connected at all?
Yes, still crashes :(
If loading video content causes the crash, this may suggest it occurs under high CPU load. I have seen CPUs lock up when throttling up and down under changing load. Can you set the system so CPU throttling is turned off and see what happens? That probably means it will run at a rather low frequency (a BIOS option?).
Loading video is a more or less reliable way to force a crash. However, system load does not appear to be the cause. I've had the system crash a number of times recently just reading email or sitting idle. I don't have any CPU frequency scaling tools installed, and I've been reluctant to mess with the BIOS settings and introduce more complexity into debugging. The system had been stable for some months since assembly with the present settings. Might be something to look into later though. Cheers, Dave
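Even without any CPU frequency scaling tools installed, the kernel may still be scaling the CPU on its own. A minimal Python sketch for checking that without touching the BIOS is below. It assumes a Linux kernel that exposes cpufreq information under /sys/devices/system/cpu. The function name is hypothetical, and on a system with no cpufreq driver loaded it simply returns nothing.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # assumed sysfs location


def read_cpufreq(cpu_root=CPU_ROOT):
    """Return {cpu_name: {field: value}} for every CPU that has a cpufreq directory."""
    fields = (
        "scaling_governor",
        "scaling_cur_freq",   # frequencies are reported in kHz
        "scaling_min_freq",
        "scaling_max_freq",
    )
    report = {}
    for cpufreq_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        entry = {}
        for field in fields:
            try:
                entry[field] = (cpufreq_dir / field).read_text().strip()
            except OSError:
                entry[field] = None  # file missing or unreadable
        report[cpufreq_dir.parent.name] = entry
    return report


if __name__ == "__main__":
    for cpu, info in read_cpufreq().items():
        print(cpu, info)
```

If the current frequency keeps moving between the minimum and maximum while the machine is idle, scaling is active regardless of which tools are installed.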

Dave Hellewell <dave.hellewell@gmail.com> wrote:
Loading video is a more or less reliable way to force a crash. However, system load does not appear to be the cause.
I once had hardware problems on a video card that caused the entire system to crash (because the video driver was in the kernel, as is now the norm). If crashes are becoming more frequent with no change in the installed software, this strongly suggests hardware failures.

On 19/05/14 14:46, Jason White wrote:
Dave Hellewell <dave.hellewell@gmail.com> wrote:
Loading video is a more or less reliable way to force a crash. However, system load does not appear to be the cause.
Hi, this may be a power usage issue, as PSUs are set to shut down if any part of the system "shorts" to ground. To analyse this, remove a part of the system, run the machine, and see if it still crashes; then repeat with each part of the system until the malfunctioning part is found. The offending part may even be a cable. Steve

On Mon, 19 May 2014, Jason White wrote:
Dave Hellewell <dave.hellewell@gmail.com> wrote:
Loading video is a more or less reliable way to force a crash. However, system load does not appear to be the cause.
I once had hardware problems on a video card that caused the entire system to crash (because the video driver was in the kernel, as is now the norm).
If crashes are becoming more frequent with no change in the installed software, this strongly suggests hardware failures.
Running a modern kernel (>3.10)? I swear they seem so much less reliable on all of my hardware (mostly laptops). One of mine crashes a few minutes into the job when I run a particular Perl script that makes the CPU warm (not hot - only 60degC). The machine runs perfectly happily, though, when openshot is transcoding a video using all 8 cores and the CPU is running at 95degC. Another laptop tends to last a day or two before the wifi light flashes rapidly, mimicking a fast version of the old flashing capslock kernel panic. It wedges so hard that not even alt-sysrq b does anything. No idea what's going wrong in any of the cases, other than blaming the one common element between them all - that fscking in-kernel video driver. I feel like I'm using Windows Millennium Edition all of a sudden. -- Tim Connors

Tim Connors <tconnors@rather.puzzling.org> writes:
No idea what's going wrong in any of the cases, other than blaming the one common element between them all - that fscking in-kernel video driver. I feel like I'm using Windows Millennium Edition all of a sudden.
Intel, or <anything else> ?

On Thu, 22 May 2014, Trent W. Buck wrote:
Tim Connors <tconnors@rather.puzzling.org> writes:
No idea what's going wrong in any of the cases, other than blaming the one common element between them all - that fscking in-kernel video driver. I feel like I'm using Windows Millennium Edition all of a sudden.
Intel, or <anything else> ?
Mostly Intel these days. I've got an elderly Radeon card in my fileserver, and it seems reliable now that I've beaten ZFS into submission. I do have one of those switcheroo NVIDIA thingies in my Haswell laptop, but it probably hasn't been the cause of the unreliability, because I don't use the NVIDIA part of it! -- Tim Connors

On Thu, 22 May 2014 05:51:38 PM Tim Connors wrote:
Mostly Intel these days.
Odd, for me that's been getting better with recent kernels with IVB. The real test will be when I upgrade my work Haswell laptop to 14.04. cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

In general, from my experience running a small PC build/repair place 10 years ago, I'd say the most likely suspect is your motherboard. Generally CPUs never fail, RAM is easy to test (you have), temperature problems are obvious or just cause throttling, power supply is common (you've tried). Video card is common (but yours is built in), hard drive causes errors in logs or console. Good luck :)

On Tue, Jun 3, 2014 at 7:39 PM, Noah O'Donoghue <noah.odonoghue@gmail.com> wrote:
Generally CPUs never fail ...
Annoyingly it was indeed the CPU. After returning the CPU and MB to the shop they were able to isolate the fault. At least the claim was painless. Even more annoying, my NAS, which is an old Dell workstation, died over the weekend. No POST at power up; I opened the box and saw a couple of leaky caps. <shrug> It's been running 24/7 for 5 years, and I bought it from Grays so it's probably 8+ years old. More annoying still, of the (desktop grade) high end components I purchased to build a replacement system, the MB is DOA. Seems I have angered the hardware gods. Bastards. -Dave

zlinw@mcmedia.com.au writes:
The symptoms in my case were an almost complete hard lockup, but with the mouse pointer still working.
I hear GPUs draw the pointer as a separate layer, to avoid having to redraw windows all the time as you mouse over them. I guess that bit of hardware survived the spike.
participants (11)
- Chris Samuel
- Dan062
- Dave Hellewell
- James Harper
- Jason White
- Noah O'Donoghue
- Russell Coker
- Steve Roylance
- Tim Connors
- trentbuck@gmail.com
- zlinw@mcmedia.com.au