
On 01.09.13 01:28, James Harper wrote:
I have a program that keeps crashing. gdb is next to useless as it gives a "generic error" enumerating threads most of the time, and even when it doesn't none of the backtraces make any sense (mostly 0's and random values). It's a heisenbug so the fault cannot be reproduced under valgrind.
Is there any way I can do a reasonableness test on the stack from C/C++? At a minimum I think all I would need to do is get the return address and check that is within a certain range and not null's etc.
The fact that it is crashing does indicate that that minimum isn't going to cut it. When the stack pointer is off, it's necessary to go back in time to find out which pair of pointer operations is mismatched. (Unless it's like the last case, below.)

What I've done as a first step on many an embedded platform is initialise all of the stack area to some constant value. A later dump then shows the stack's high water mark, which at least indicates whether it is simple stack overflow.

If the corruption is due to unbalanced pushes and pops, or incorrect stack pointer operations (stack frame construction/removal), then the top of stack is in the middle of a stack frame or similarly out of whack, leading to reading of the return address from the wrong location. If the subsequent execution of "randomly" selected code or data, or better yet the return address itself, immediately causes an exception and vectoring to a trap, the chances of stack preservation are maximal.

If you can recognise recent stack frames nearby, but not aligned on the stack pointer, then you may be able to deduce which functions executed immediately prior to the one which led to the stack corruption, thus narrowing down which function could be the culprit.

What does gdb report as the program counter value prior to the crash? Does that value or a traceable precursor exist near the current stack pointer address, e.g. just prior?
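To make the stack-painting step concrete, here is a minimal sketch in C. It assumes a stack that grows downward and that you know where the stack area starts and how big it is (on an embedded target that usually comes from the linker script). The names `STACK_FILL`, `stack_paint` and `stack_high_water` are made up for illustration, and the fill value is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill pattern (an assumption): any value unlikely to occur as real data. */
#define STACK_FILL 0xA5A5A5A5u

/* Fill the whole stack area with the pattern.
 * Call this once at startup, before the area is used as a stack. */
static void stack_paint(uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++)
        base[i] = STACK_FILL;
}

/* Return the high water mark in words, for a stack that grows
 * downward from base[words - 1] toward base[0]: count the words
 * at the low end that still hold the pattern, and subtract. */
static size_t stack_high_water(const uint32_t *base, size_t words)
{
    assert(base != NULL);
    size_t untouched = 0;
    while (untouched < words && base[untouched] == STACK_FILL)
        untouched++;
    return words - untouched;
}
```

If the result equals the full size of the area, the lowest word has been overwritten, which points towards simple overflow. The scan can be fooled if the program happens to store the fill value itself, so treat the number as a strong hint rather than proof.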
From your description, that doesn't appear to be the end of a stack frame. If you can shuffle up and down the stack, to find properly constructed stack frames, the return address in each should allow reconstructing enough stack trace to see which functions led to the crash.
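As a rough aid for that shuffling, one can scan a raw stack dump for words that fall inside the program's code range, since those are the candidate return addresses. The sketch below assumes you already have the dump as an array of words and know the bounds of the text segment (e.g. from the linker map); `find_code_pointers` and its parameters are hypothetical names, not part of gdb or any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Record the indices of dump words that lie in [text_lo, text_hi),
 * i.e. values that could be return addresses into the program's code.
 * Stores at most max_hits indices in hits[] and returns how many were stored. */
static size_t find_code_pointers(const uintptr_t *dump, size_t words,
                                 uintptr_t text_lo, uintptr_t text_hi,
                                 size_t *hits, size_t max_hits)
{
    size_t n = 0;
    for (size_t i = 0; i < words && n < max_hits; i++) {
        if (dump[i] >= text_lo && dump[i] < text_hi)
            hits[n++] = i;
    }
    return n;
}
```

Each hit is only a candidate: stale frames and function pointers held in locals also match. Looking up each value in the symbol table should show whether it lands just after a call instruction, which is what a genuine return address would do.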
This kind of intermittent bolt from the blue could, on the other hand, result from an interrupt handler failing to preserve the stack, or perhaps even from something not being declared "volatile" when it is visible to other code. That's not easy to catch.

Incidentally, have you tried compiling with a later version of gcc, just in case that fixes it? (Annoying if that fixed it after a fortnight of skull cracking.)

Not much help, is it? Unfortunately, I've always found fixing stack corruption to be an intensely diagnostic exercise, with few shortcuts.

Erik

--
Global temperature extremes in 94% of countries, but averages have stalled. - http://www.bbc.co.uk/news/science-environment-23172702