Friday, January 26, 2007
Debugging Corruption
The difficulty with corruption is that the problem is not found until someone uses the corrupted data. The code that crashes or produces incorrect data is usually not the code that did the damage, so the problem gets assigned to the wrong developer.
Questions to ask
1. Check if your OS has ways for you to check your stack depth and validate that your stack has not crashed into your heap.
2. Check for buffer overflows of buffers on the stack.
3. Check for any inline assembly or assembly code. Does it properly save and restore the registers it uses?
4. Does disabling interrupts or bumping the thread priority affect the problem or make it go away?
5. Does the problem only happen after a subroutine call?
6. Is the stack getting corrupted, or is it a general-purpose register?
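If your OS gives you no way to check stack depth (question 1), one common fallback is to paint the stack with a known pattern and later count how much of the paint is left. Here is a minimal sketch of that idea. It works on a plain array standing in for the stack region, and the names stack_paint, stack_unused_words and the 0xa5a5a5a5 pattern are my own assumptions, not part of any particular OS. It also assumes a descending stack, so the lowest address is the last to be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill pattern; any value unlikely to appear by chance works. */
#define STACK_PAINT 0xa5a5a5a5u

/* Paint every word of the stack region before the task starts running. */
void stack_paint(uint32_t *region, size_t n_words)
{
    assert(region != NULL);
    for (size_t i = 0; i < n_words; i++)
        region[i] = STACK_PAINT;
}

/* Count the words at the low end of the region that still hold the paint.
 * Assumes a descending stack, so region[0] is the last word to be used.
 * A result of 0 means the stack reached (or ran past) its limit. */
size_t stack_unused_words(const uint32_t *region, size_t n_words)
{
    assert(region != NULL);
    size_t unused = 0;
    while (unused < n_words && region[unused] == STACK_PAINT)
        unused++;
    return unused;
}
```

Call stack_unused_words periodically or when the problem shows up. If the count drops to zero, or close to it, the stack is a prime suspect.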
Here's a strategy for heap corruption:
The first thing to do is put up boundaries, both physical and temporal. You want to catch the problem as soon as possible. The best result is to root cause the problem. The next best thing is to prove it's not your code and find someone else to hand the problem off to.
To put a physical boundary around your code, just reserve large buffers before and after the data that's getting corrupted. If it's random data, then put these large buffers everywhere. Mark these buffers with some data that's human-readable and unlikely to show up by chance. I suggest 0xdeadbeef, because it reads as English words, it's unlikely to be used as real data, and it's an odd number. The odd number is important because a number this big is usually only seen when it's a pointer, and pointers to word-aligned data are always even, so an odd value stands out as invalid.
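Here is a minimal sketch of what the physical boundary might look like in C. The struct layout, the guard size, and the names guarded_data and guards_fill are assumptions I made up for illustration. In real code the payload would be whatever data of yours is getting corrupted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORDS   64            /* hypothetical guard size, in 32-bit words */
#define GUARD_PATTERN 0xdeadbeefu   /* the marker value suggested above */

/* Hypothetical layout: the suspect data sits between two guard buffers. */
struct guarded_data {
    uint32_t guard_before[GUARD_WORDS];
    uint8_t  payload[128];          /* stands in for the data being corrupted */
    uint32_t guard_after[GUARD_WORDS];
};

/* Fill both guard buffers with the marker. Call once, before the data is used. */
void guards_fill(struct guarded_data *g)
{
    assert(g != NULL);
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        g->guard_before[i] = GUARD_PATTERN;
        g->guard_after[i]  = GUARD_PATTERN;
    }
}
```

The payload size here is a multiple of four so the compiler has no reason to insert padding, which means an overrun past the end of payload lands directly in guard_after.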
When the problem occurs, dump the buffers to see if they're corrupted. If the program crashed, you'll need to generate a core dump for post-mortem analysis with GDB.
To put a temporal boundary around your code, write routines which validate that these large buffers have not been corrupted, and place these checks wherever someone enters your module.
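A validation routine along these lines could look like the sketch below. The names guard_first_bad, guard_check_at and the GUARD_CHECK macro are hypothetical, and it assumes the guard buffers were filled with 0xdeadbeef beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define GUARD_PATTERN 0xdeadbeefu

/* Return the index of the first overwritten word, or -1 if the guard is intact. */
long guard_first_bad(const uint32_t *guard, size_t n_words)
{
    assert(guard != NULL);
    for (size_t i = 0; i < n_words; i++)
        if (guard[i] != GUARD_PATTERN)
            return (long)i;
    return -1;
}

/* Check one guard buffer. On damage, report where it was detected and what
 * value was found, then return 0. Returns 1 if the guard is intact. */
int guard_check_at(const uint32_t *guard, size_t n_words, const char *where)
{
    long bad = guard_first_bad(guard, n_words);
    if (bad >= 0) {
        fprintf(stderr, "guard corrupted at word %ld, detected in %s: 0x%08lx\n",
                bad, where, (unsigned long)guard[bad]);
        return 0;
    }
    return 1;
}

/* Convenience wrapper that records the name of the calling function. */
#define GUARD_CHECK(guard, n) guard_check_at((guard), (n), __func__)
```

Put a GUARD_CHECK at the top of every public function in your module. The first check that fails tells you the corruption happened some time between that call and the previous one that passed, which is the temporal boundary you're after.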
The above works well on a single-threaded application that has no exception handlers or ISRs. If you suspect an exception handler, then just put the validation routine in the exception handler. If you suspect another thread, try bumping the priority of the thread that your module runs in to be the highest in your process. If that doesn't work and you suspect an ISR, then try disabling interrupts.
Debugging is part logic, part luck, part art, and part intuition. My strategy for debugging is not to find the root cause directly, but to narrow it down. Think of ways to quickly eliminate as many of the likely possibilities as possible. Your source code control is also a valuable tool. If the problem is a regression, find where the problem started to occur and see which check-ins happened between that point and the previous version of the code that did not have the problem. Managers and SQA love to use that latter method.