Tuesday, January 02, 2007

 

Kernel Crash w/Stack and Register Dump

Someone just came to me for help with a Linux 2.6 kernel crash. The exception handler caught a bad memory access and dumped a pseudo stack trace with registers before the system rebooted. The pseudo stack dump has each calling routine's name and offset, but not the automatic variables or arguments. The register dump has all the general registers. KGDB is not set up, and the tool guys will never set it up.

The first step is to find the code that crashed and disassemble it(1). This gives you the exact instruction(2). I suspect this is caused by a bad pointer reference. With luck, the pointer is in a register(3), or there's enough information in the registers to help pinpoint the problem. The disassembled code generally uses names such as a0, s2, and t3 to reference registers. Each CPU type has its own convention for mapping these names to the actual registers. Unfortunately, these conventions are generally not part of the CPU's H/W spec, because they are industry conventions; look for the ABI or calling-convention document for your CPU and toolchain instead.
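To make the name-to-register mapping concrete, here is a minimal sketch. It assumes the MIPS o32 ABI, which is a guess on my part based on names like a0, s2, and t3; your CPU and ABI may use a different table. The function name mips_o32_regnum is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: MIPS o32 ABI register names, indexed by hardware register
 * number ($0..$31). Other CPUs and ABIs use different tables. */
static const char *const mips_o32_names[32] = {
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0",   "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0",   "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8",   "t9", "k0", "k1", "gp", "sp", "fp", "ra",
};

/* Hypothetical helper: return the hardware register number (0..31)
 * for an ABI name, or -1 if the name is not recognized. */
int mips_o32_regnum(const char *name)
{
    if (name == NULL)
        return -1;
    if (strcmp(name, "s8") == 0)    /* s8 is another name for fp ($30) */
        return 30;
    for (int i = 0; i < 32; i++) {
        if (strcmp(name, mips_o32_names[i]) == 0)
            return i;
    }
    return -1;
}
```

With a table like this, an instruction that dereferences a0 points you at register $4 in the dump, s2 at $16 + 2 = $18, and so on.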

After you figure out which pointer is causing the problem, you have to either review the code or add auto-detecting code(4). I like the latter whenever possible.

(1) Disassemble the .o file with objdump's -S option to interleave the source code with the assembly. The object has to be built with debug info (-g) for the source lines to show up.
(2) The point where the code crashed may not be exactly where the PC is pointing, but it should be close, such as the previous instruction. On a RISC architecture, knowing at which pipeline stage an instruction actually executes tells you how far back to go. Because of the branch delay slot, it may even be the instruction right after the previous branch.
(3) On RISC processors, an address must be in a general register to be referenced. CISC processors allow referencing pointers from memory, but I believe the address is still loaded into some register before the core can access it. That step is not explicit, so you have to look at the CPU specification or ask your vendor.
(4) Auto-detecting code is anything from a simple ASSERT to complex data integrity checking routines. If performance is an issue, make this a compile-time option. Many developers don't like to release code with these checks because they can cause a crash when the system would normally not crash. Well, I say release the code with the checks, because if it's broken, why run a piece of code when you don't know how it's going to behave? More than likely, sweeping these issues under the rug will cause many unexplained problems in the field, and it will take much more effort to root-cause them. Don't kill yourself trying to convince the old-timers, though, because they've worked on HA systems since before you were born and it's how they've always done it.
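As a rough illustration of (4), here is a sketch of a check that can be compiled out, plus a small integrity routine. It is plain user-space C so it can be tried anywhere; every name in it (ENABLE_CHECKS, CHECK, struct buf, BUF_MAGIC, buf_is_valid, buf_put) is made up for the example. In kernel code you would more likely reach for the existing BUG_ON or WARN_ON macros than roll your own.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical build switch: compile with -DENABLE_CHECKS=0 to strip
 * the checks when performance matters. */
#ifndef ENABLE_CHECKS
#define ENABLE_CHECKS 1
#endif

#if ENABLE_CHECKS
#define CHECK(cond)                                              \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "CHECK failed: %s (%s:%d)\n",        \
                    #cond, __FILE__, __LINE__);                  \
            abort();                                             \
        }                                                        \
    } while (0)
#else
#define CHECK(cond) ((void)0)
#endif

#define BUF_MAGIC 0x42554621u   /* arbitrary marker value */

struct buf {
    unsigned int   magic;       /* must equal BUF_MAGIC */
    size_t         len;         /* bytes currently used */
    size_t         cap;         /* bytes available in data */
    unsigned char *data;
};

/* Integrity check: returns 1 if the structure looks sane, 0 otherwise. */
int buf_is_valid(const struct buf *b)
{
    return b != NULL && b->magic == BUF_MAGIC &&
           b->data != NULL && b->len <= b->cap;
}

/* Append one byte. The pointer is validated before it is dereferenced,
 * so a bad pointer is caught here instead of in a later crash.
 * Returns 0 on success, -1 if the buffer is full. */
int buf_put(struct buf *b, unsigned char c)
{
    CHECK(buf_is_valid(b));
    if (b->len == b->cap)
        return -1;
    b->data[b->len++] = c;
    return 0;
}
```

The point of the magic field is that a stale or corrupted pointer is very unlikely to land on memory that happens to contain the marker, so the check fails close to the real bug rather than several calls later.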
