
I have a program that keeps crashing. gdb is next to useless: most of the time it gives a "generic error" while enumerating threads, and even when it doesn't, none of the backtraces make any sense (mostly 0s and random values). It's a heisenbug, so the fault cannot be reproduced under valgrind.

Is there any way I can do a reasonableness test on the stack from C/C++? At a minimum, I think all I would need to do is get the return address and check that it is within a certain range, not null, etc.

Thanks

James

On 01.09.13 01:28, James Harper wrote:
> I have a program that keeps crashing. gdb is next to useless as it gives a "generic error" enumerating threads most of the time, and even when it doesn't none of the backtraces make any sense (mostly 0's and random values). It's a heisenbug so the fault cannot be reproduced under valgrind.
> Is there any way I can do a reasonableness test on the stack from C/C++? At a minimum I think all I would need to do is get the return address and check that is within a certain range and not null's etc.
The fact that it is crashing does indicate that that minimum isn't going to cut it. When the stack pointer is off, it's necessary to go back in time to find out which pair of pointer operations is mismatched. (Unless it's like the last case, below.)

What I've done as a first step on many an embedded platform is initialise all of the stack area to some constant value. A later dump then shows the stack's high water mark, which at least indicates whether it is a simple stack overflow.

If the corruption is due to unbalanced pushes and pops, or incorrect stack pointer operations (stack frame construction/removal), then the top of stack is in the middle of a stack frame or similarly out of whack, so the return address is read from the wrong location. If the subsequent execution of "randomly" selected code or data, or better yet the return address itself, immediately causes an exception and vectoring to a trap, the chance of stack preservation is maximal.

If you can recognise recent stack frames nearby, but not aligned on the stack pointer, then you may be able to deduce which functions executed immediately prior to the one which led to the stack corruption, thus narrowing down which function could be the culprit.

What does gdb report as the program counter value prior to the crash? Does that value, or a traceable precursor, exist near the current stack pointer address, e.g. just prior?
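The stack-painting step described above can be tried on a hosted Linux system too. What follows is only a minimal sketch, not Erik's embedded code: it assumes glibc's `ucontext` functions are available and that the stack grows downward, and `STACK_SIZE`, `PAINT_BYTE`, `worker` and `measure_high_water` are hypothetical names chosen for illustration. It fills a private stack with a constant, runs a function on it, then scans up from the low end to see how much of the fill was overwritten.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE   (64 * 1024)  /* assumed size of the demonstration stack */
#define PAINT_BYTE   0xA5         /* arbitrary fill pattern */
#define SCRATCH_SIZE 8192         /* stack deliberately consumed by worker() */

static ucontext_t main_ctx, work_ctx;

/* Stand-in for the code under suspicion: it only consumes some stack. */
static void worker(void)
{
    volatile unsigned char scratch[SCRATCH_SIZE];
    for (size_t i = 0; i < sizeof scratch; i++)
        scratch[i] = 0x11;        /* any value other than PAINT_BYTE */
}

/* Returns the deepest stack usage seen, in bytes, or 0 on failure.
 * Assumes a downward-growing stack, so untouched paint sits at the low end. */
size_t measure_high_water(void)
{
    unsigned char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 0;
    memset(stack, PAINT_BYTE, STACK_SIZE);   /* paint the whole stack area */

    if (getcontext(&work_ctx) != 0) {
        free(stack);
        return 0;
    }
    work_ctx.uc_stack.ss_sp = stack;
    work_ctx.uc_stack.ss_size = STACK_SIZE;
    work_ctx.uc_link = &main_ctx;            /* resume here when worker returns */
    makecontext(&work_ctx, worker, 0);

    if (swapcontext(&main_ctx, &work_ctx) != 0) {
        free(stack);
        return 0;
    }

    /* Count paint bytes still intact, starting from the lowest address. */
    size_t untouched = 0;
    while (untouched < STACK_SIZE && stack[untouched] == PAINT_BYTE)
        untouched++;

    free(stack);
    return STACK_SIZE - untouched;
}
```

If `measure_high_water()` comes back equal to `STACK_SIZE`, the paint was consumed all the way down, which would point to a plain overflow rather than mismatched stack pointer operations.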
From your description, that doesn't appear to be the end of a stack frame. If you can shuffle up and down the stack, to find properly constructed stack frames, the return address in each should allow reconstructing enough stack trace to see which functions led to the crash.
This kind of intermittent bolt from the blue could, on the other hand, result from an interrupt handler failing to preserve the stack, maybe even due to something not being declared "volatile" when it is visible to other code. That's not easy to catch.

Incidentally, have you tried compiling with a later version of gcc, just in case that fixes it? (Annoying if that fixed it after a fortnight of skull cracking.)

Not much help, is it? Unfortunately, I've always found fixing stack corruption to be an intensely diagnostic exercise, with few shortcuts.

Erik

-- 
Global temperature extremes in 94% of countries, but averages have stalled. - http://www.bbc.co.uk/news/science-environment-23172702

On 01/09/13 11:28, James Harper wrote:
> Is there any way I can do a reasonableness test on the stack from C/C++? At a minimum I think all I would need to do is get the return address and check that is within a certain range and not null's etc.
You can make use of backtrace directly inside the program (http://linux.die.net/man/3/backtrace), producing a set of return addresses from the stack frames, which you can then sanity check (no need to convert to symbols).

Note that when there is optimization (particularly if frame pointers are left out), stack frames aren't always as easy to interpret, so it often helps to try to reproduce the problem with optimization turned off if possible.

It would also be interesting to turn on the gcc stack protection (-fstack-protector-all), if you think for example that the problem is being caused by a buffer overflow. This won't tell you where the problem is, but it might at least abort more gracefully closer to the point where the failure is occurring.

Glenn

-- 
sks-keyservers.net 0x6d656d65
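A rough sketch of the backtrace-based sanity check suggested above might look like the following. It assumes glibc's `backtrace()` from `<execinfo.h>` and the GNU linker symbols `__executable_start` and `etext` to bound the main program's text; `MAX_FRAMES`, `LOWEST_PLAUSIBLE_PC`, `pc_in_main_text` and `count_suspect_frames` are hypothetical names, and the thresholds are illustrative only. Frames that belong to shared libraries legitimately fall outside the main text range, so only the innermost frame is range-checked here, while the rest are checked for null or near-null values.

```c
#include <execinfo.h>
#include <stdint.h>

#define MAX_FRAMES          64     /* assumed upper bound on frames inspected */
#define LOWEST_PLAUSIBLE_PC 4096u  /* assumed: nothing executes in page zero */

/* Provided by the GNU linker (assumption: Linux with GNU ld or lld). */
extern char __executable_start[];
extern char etext[];

/* Nonzero if pc lies inside the main executable's text segment. */
int pc_in_main_text(const void *pc)
{
    uintptr_t p = (uintptr_t)pc;
    return p >= (uintptr_t)__executable_start && p < (uintptr_t)etext;
}

/* Returns the number of implausible return addresses on the current stack,
 * or -1 if no frames could be captured at all. */
int count_suspect_frames(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    if (n <= 0)
        return -1;

    int suspect = 0;

    /* Frame 0 is an address inside this function, so it must be in our text
     * (assuming this code is linked into the main executable). */
    if (!pc_in_main_text(frames[0]))
        suspect++;

    /* Outer frames may be in shared libraries; only reject null-like values. */
    for (int i = 1; i < n; i++) {
        if ((uintptr_t)frames[i] < LOWEST_PLAUSIBLE_PC)
            suspect++;
    }
    return suspect;
}
```

Calling `count_suspect_frames()` at a few checkpoints and logging whenever it returns non-zero could narrow down when the stack first goes bad. It only inspects frames the unwinder can still reach, so a clean result does not prove the stack is intact.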
participants (3): Erik Christiansen, Glenn McIntosh, James Harper