Tuesday, January 02, 2007
Quickly Root-Cause Easy-to-Reproduce Problems
Once you figure out what is causing the problem, you need to detect the problem ASAIH (as soon as it happens). To do this, create a macro or routine that checks for that condition. Now sprinkle that check all over the place, or use the binary search method to narrow down where the problem first appears. This is so quick and easy I don't see why more people don't do it. I run into a lot of guys who think this method is some kind of hack and that real programmers need to read and understand the whole code first. They'll usually find 100 things they don't like about the existing code and make a bunch of "good" changes that may or may not solve the problem. This type of code review should be replaced by a group code review once a module has passed its unit testing, but most organizations skip this step.
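Here is a minimal sketch of what such a check might look like in C. Everything in it is made up for illustration: the `struct queue`, the `queue_is_sane` routine, the `CHECK_QUEUE` macro, and the `check_failures` counter are hypothetical names, and the "problem condition" is assumed to be a count that has gone out of range. Substitute whatever condition you are actually chasing.

```c
#include <stdio.h>

/* Hypothetical state under suspicion: a small queue whose count must
   stay within 0..QUEUE_CAPACITY. All names here are illustrative. */
#define QUEUE_CAPACITY 8

struct queue {
    int items[QUEUE_CAPACITY];
    int count;
};

/* Number of times the check has fired so far. */
static int check_failures = 0;

/* Returns 1 if the state looks fine, 0 if the problem condition is present. */
static int queue_is_sane(const struct queue *q)
{
    return q->count >= 0 && q->count <= QUEUE_CAPACITY;
}

/* Sprinkle this wherever you like. It reports the file and line of the
   call site, so the first message printed tells you how early the
   problem can be seen. */
#define CHECK_QUEUE(q)                                                  \
    do {                                                                \
        if (!queue_is_sane(q)) {                                        \
            check_failures++;                                           \
            fprintf(stderr, "CHECK_QUEUE failed at %s:%d (count=%d)\n", \
                    __FILE__, __LINE__, (q)->count);                    \
        }                                                               \
    } while (0)

/* Example of a code path with the check sprinkled in. */
static void queue_push(struct queue *q, int value)
{
    CHECK_QUEUE(q);                 /* already bad on entry? */
    if (q->count >= 0 && q->count < QUEUE_CAPACITY) {
        q->items[q->count] = value;
        q->count++;
    }
    CHECK_QUEUE(q);                 /* did this function break it? */
}
```

For the binary search variant, put one check roughly halfway along the suspect code path. If it fires, the problem happened in the first half; if not, it happened in the second half. Move the check into that half and repeat.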
If the crash goes away or changes behavior when you put your code in, then it's a random memory corruption issue or a race condition. Assuming it's not a race condition, see my post on Memory Corruption without MMU for hints on solving this problem. Of course, if the piece of memory is written only once and you have a debugger, you can use the hardware breakpoint feature to catch whoever is corrupting this memory. Also try turning up the warning level on your compiler or running some form of lint. The latter may take some time to configure to run properly.
See future post for race conditions.
			  
			
 
  

