Hunting Memory Bugs

Written by Ivan Skytte Jørgensen

Have you ever hunted a memory overwrite bug? If you have, you know how hard they can be to track down. Every programmer's nightmare is when such a bug suddenly creeps up in a large project.

The runtime libraries included with most compilers include a debug-version of the heap management functions, but they usually first detect the overwrite sometime after it has happened. Wouldn't it be nice if you could detect the overwrite when it happens?

In the following sections I will implement a complete debugging/tracing library which will significantly reduce the time you spend tracing heap management bugs. I will not tell you how to avoid heap errors only how to detect and find them.

Note: I primarily use Watcom C/C++, but the tools have also been tested with VAC++ 3.0. The sources may contain information on what you have to do and/or change if you use VAC++. All sources should be easy to adapt to other compilers (BC/2, EMX, Metaware, etc).

Catching memory overwrites
OS/2 implements protection of the memory. An example is that writing to an area containing the code for the program will trap. e.g.: This code will trap due to OS/2's memory protection. The area where the code is loaded is marked as read-only/execute-only.

OS/2 has an API to allocate and deallocate memory (DosAllocMem/DosFreeMem), but also an API to change the protection of a memory block: DosSetMem. The protection (or "attributes") can be set individually for each page. Each page is 4KB.

The most common form of memory overwrite is to allocate too few bytes, usually due to an off-by-one error, eg: char *p=(char)malloc(11); strcpy(p,"hello world"); This code allocates 11 bytes but puts a 12-byte string (remember the terminating NUL) into it. The program may work, and worse: the program may crash several hours later in a completely unrelated function.

By replacing the standard heap management functions with our own functions we can take full control of how the memory is allocated.

I will only use three attributes of memory protection features: unknown, committed and uncommitted. Now back to catching memory overwrites. If we could allocate exactly 11 bytes an ensure that accessing memory outside the 11 bytes would cause a trap it would be a lot easier to find where the memory overwrite happens.
 * unknown:If we have never allocated a page in any way the virtual addresses are unknown. Accessing such a page causes a trap.
 * committed:When a page is accessible (read/write) it is committed. Note: whether OS/2 really has reserved a physical piece of memory doesn't matter.
 * uncommitted (or "reserved"):It is also possible to only reserve range of virtual addresses without physical memory being reserved. Accessing such a page causes a trap.

This can be done without too much trouble. By reserving enough pages to hold the requested number of bytes + 1 page and only committing the requested number of bytes and returning a pointer into the area so that accesses beyond the requested number of bytes cases a trap, we can catch memory overwrites in seconds instead of hunting them for days.



Note that each allocation uses at least 8K address space and at least 4K memory. Since OS/2 DosAllocMem only allocates memory below 512MB you can only make 512MB/8K = 65536 allocations. It is even worse on some version of OS/2 (documentation says Warp3 and Warp4 server) where allocations reserves at least 64K address space (512M/64K = 8192 allocations). So if your program makes many allocations it may run out of address space. But that way you can also examine how well your program handles low memory conditions.

dbgheap1.cpp
The full source is in dbgheap1.cpp. dbgheap1.cpp implements replacements for new/delete/new[]/delete[]/malloc/free/realloc/calloc that use the mechanism described in the previous section.

Here are a few highlights: Each chunk of memory starts with a chunk_header. The chunk header contains information that we need when the chunk is to be DosFreeMem'ed. It also contains redundant information so we can check for overwrites of the chunk_header.

Next comes some (possibly zero) number of bytes that are unused. These are filled with FILL_BYTE (0xfe) so they can also be checked for overwrites.

Then comes the area where the 'user' memory is. It is positioned so that the end of that area is aligned with the end of a page.

Then comes the important thing: an uncommitted page. If the program tries to use more bytes than were allocated it will trap.

When the memory is to be freed, the freemem first converts the "user"-pointer (which points somewhere into a page) to a page-aligned pointer.

Then overwrite of the chunk_header is checked: if(chp->block_offset != chp->chunk_size-chp->block_size) { exit(0); } Redundancy helps debugging.

Then the "user"-pointer is checked. We only accept the exact same pointer that was returned from allocmem: if(p != pchunk+chp->block_offset) { exit(0); } Then the unused area that (hopefully) contains FILL_BYTE is checked: for(unsigned char *checkp=((unsigned char*)pchunk)+sizeof(chunk_header);    checkp<p;     checkp++) {        if(*checkp != FILL_BYTE) exit(0); } Then the standard heap management functions are replaced: new/delete, new[]/delete[] and malloc/free/realloc/calloc. The replacements just calls allocmem/freemem. (Except realloc which is a bit more complicated.)

The end of dbgheap1.cpp contains a test stub: int main { char *p = new char[12]; strcpy(p,"Hello world");                 //ok strcpy(p,"Hello my friend. What's your name?"); //error delete[] p;        return 0; } 12 bytes are allocated. Then a 12-byte string is copied to the allocated memory. This should work. Then a longer string is copied to the memory. This will cause the program to trap when strcpy crosses the page boundary and tries to write to the uncommitted page.

Gotcha!

Finding bugs with a debugger
Try compiling dbgheap1.cpp and run the program from within your debugger.

When I do this using WDW (Watcom's debugger) it says: "A task exception has occurred: general protection fault" and then it leaves me in an assembler view of the naughty instruction in strcpy. Knowing that strcpy does not contain the bug I then select to trace the call stack "Code|Calls". Hmmm-hmmm-hmmm. It seems that strcpy was called from line 143 in main in dbgheap1.cpp. What a surprise :-)

Finding bugs without the debugger
What if you did not run the program from within your debugger?

OS/2 roughly says: "A program has generated blah blah blah...". Select "show help" then it says: A program has generated a general protection fault at 0001041a DBGHEAP1.EXE 0001:0000041a" This is usually not very helpful. But not any more. The program has accessed the virtual address 0001041a. The code that did that was in DBGHEAP1.EXE, object 0001, offset 0000041a. Where is that?

When called within Watcom's IDE Watcom's linker by default generates a .MAP file (wcl386 needs /fm). (VAC++ needs ICC /Fm or ILINK /MAP). This map file is unsorted but is contains most of the information you need to pinpoint the error.

I have included a small program that extracts the interesting lines in Watcom's .MAP file and sorts them by object+address (code for VAC++s' MAP files are #ifdef'ed away). Try running: mapfilt dbgheap1.map map.tmp Now search for 0001:0000041a: The error obviously occurred in the function strcpy_. So now you know that the problem is a call to strcpy. I will use the MAPFILT program later.

Making a better bug-trap


The program dbgheap1.cpp catches most memory overwrites. But what about other bugs?

Classes of heap errors
There exists several bugs concerning heap management:
 * Double-free:
 * Sometimes the bug is that the program tries to free a memory block that has already been freed. The error can be the first free or the last free. If the error is the first free we need some sort of history to find out who did that.


 * Freeing uninitialized pointers:
 * Sometimes a program tries to free memory using a pointer that has not been assigned a value.


 * Accessing freed memory:
 * It is an error to access a memory block after it has freed. This can be difficult to detect since the memory block may have been used again.


 * Memory leaks:
 * Not freeing memory but just "forgetting it" is sometimes acceptable in very small programs (mapfilt.cpp is an example) but in programs that run for long periods of time this can be a major trouble. Imagine if a server program leaked 10 bytes for every file access. How long would it take before SWAPPER.DAT would have grown to 200MB and the server had to be shut down? Not very long.


 * Mix of heap management functions:
 * Memory allocated with heap management (malloc/free/realloc) cannot be deallocated with C++ heap management (new/delete) and vice versa. Moreover C++ single-object heap management (new/delete) cannot be mixed with C++ array heap management (new[]/delete[]). [Note: Some current compilers let you "get away" with this, but you should not rely on this behaviour, as it varies from compiler to compiler, and could change at any time. EDM]


 * Multiple heaps:
 * If you program consists of an .EXE and a .DLL, memory allocated in the DLL may be in a different heap that memory allocated in the .EXE. Some compilers support freeing what has been allocated in another module, but it is very bad practice, since this could change in the next version of compiler or by the way the modules were linked.

By extending the capabilities of the debug-heap we can catch all these hard-to-find bugs.

Tracing callers
As you may have noted in an earlier section trapping inside allocmem or freemem is not very informative. It would be a big improvement if we knew who called malloc/free/new/delete/etc.

This information is on the stack but it is extremely difficult to locate. It not only depends on the compiler but also whether or not debugging information is generated, which optimization level you are using and the version of the compiler.

This can only be done 100% error-free in assembler.

This opens another (but smaller) can of worms. Then we have to deal with name decoration, calling conventions and mangled names. Fortunately both Watcom and VisualAge contains information on most of these questions. callpeek.asm (callpeek.vac for VAC++) contains very simple implementations of new/delete/malloc/etc that simply grabs SS:[ESP] and jumps to/calls the real procedure dbg_new/dbg_delete/dbg_malloc/etc in dbgheap2.cpp

Tracing
In order to later find out why the debug-heap-management found an error (and to create nifty things later on) the heap operations must be logged. I decided to log to a named pipe \pipe\tracemon since gives extra flexibility. tracemon.cpp implements a named-pipe server that simply writes whatever it gets through the named pipe to a file and standard output.

dbgheap2.cpp logs several things:
 * TID:Multi-threaded heap errors can be a lot tougher to track down than single-threaded ones. Including the TID allows you to detect which thread allocated the memory and which thread freed the memory.
 * Heap:In order to detect multiple-heaps errors the heap has to be logged. The heap identification is simply the address of a static variable in dbgheap2. Since a .EXE and a .DLL which both have dbgheap2.cpp linked in statically will have separate static variables this is unique.
 * Operation:This can be malloc/free/new/delete/new[]/delete[].
 * Caller:This would also be nice.
 * Address range:This does not include the extra uncommitted page.
 * User pointer:The pointer that user-code sees/uses.
 * Size:The size of the allocation.

A few notes on dbgheap2.cpp
dbgheap2.cpp is derived from dbgheap1.cpp. These are the non-trivial changes:
 * The housekeeping structure chunk_header has been extended with two members: mgrclass and heap to track management class and heap.
 * A set of logging functions: log_operation, log_range, log_norange, log_userptr, log_operationfailure and log_operationsuccess.
 * heap_panic (located in callpeek.asm) is called when a heap error is detected. heap_panic issues an interrupt 3 which gets the attention of the debugger if the program has been run from within one, or causes OS/2 to display an error message and terminate the program.
 * freemem does not deallocate the memory any longer is just decommits it to ensure that the memory is not reused and guarantees that "access to freed memory"-errors cause a trap.

Testing
Create a program from callpeek.asm, dbgheap2.cpp and bugs.cpp. For instance:

Watcom: wcl386 /s /fm bugs.cpp callpeek.asm dbgheap2.cpp VAC++: alp callpeek.vac icc /Fm /B"NOE" bugs.cpp callpeek.obj dbgheap2.cpp Watcom needs to have stack checks turned off (/s), since malloc is called early in the initialization when the stack has not been initialized. VAC++ linker (/B"/NOE") needs the /NOEXTDICTIONARY otherwise it complains about duplicated symbols (malloc/free/etc).

Note: you can change the #ifdefs in bugs.cpp to test each bug type. Start tracemon: start /f tracemon Launch your debugger.

Start bugs.EXE

Just press 'GO'

TRAP!

Which kind of trap occurs and where the debugger leave you depends on the bug.

"Double-free", "free uninitialized pointers", "Mix of heap management" and "multiple heaps" are detected by dbgheap2.cpp and leaves you inside heap_panic. Switch to tracemon.EXE (which you remembered to start, didn't you?) and read the last message. Then let the debugger display the call stack.

"accessing deleted memory" and "memory overwrite" (past the end of the block) causes an "access violation" trap where the bug is.

"memory overwrite" (before the block) is detected when freemem is called.

Finding memory leaks
Detecting memory leaks can only be done after the program has stopped. The algorithm is simple: just find allocations that were not freed.

leakfind.cpp does exactly this.

Try changing bugs.cpp to use the "memory leak" portion. Compile and link bugs.EXE, restart tracemon and run bugs.EXE.

After bugs.EXE has finished running the log should contain something like this (except the header): The leaks are easy to spot in this small log but for large logs they are not.

Run leakfind: leakfind tracemon.log and leakfind says: The first three allocation are made by Watcom's runtime-library and there is nothing we can do about them. The last one is more interesting. The caller is 0001003f. Knowing that Watcom puts code into object 0001 and that object 0001 is loaded at 00010000 helps a lot. So the problem is at 0001:0000003f

Using mapfilt.exe bugs.map bugs.lst gives us this: The large gap between mail_ and dbg_new_ is consumed by static functions in dbgheap2.cpp. Watcom's linker by default does not generate map entries for static functions. But 0001:0000003f is suspiciously near main_. Let's have a look at main in bugs.cpp: Of course! p2 is not freed! What a surprise!

Statistics
The log generated by dbgheap2.cpp can be used not only for debugging but also performance tuning. heapstat.cpp generates a simple statistic. "heapstat tracemon.log" gives a result like this: Heapstat.cpp tells you for some preselected values how many allocations there were of that size and maximum number of allocation that were at any time. It also shows size and count for individual sizes.

This information can be used for performance tuning. Example 1: if there are extremely many allocations of a particular (small) size, it may be worthwhile to program a special-purpose allocator for that size. Example 2: If almost all allocations are larger than 16384 bytes it might be faster to code your own allocator that uses DosAllocMem directly.

Conclusion


Good knowledge of what OS/2 can do combined with a few simple tools can be very powerful. In fact, the tools described in this article can do more than most (all?) commercial heap-debug tools. And it's free!

What is missing? The capability to read debug-info to better pinpoint who called malloc/new/delete/etc, automating starting/stopping tracemon. Ooh yes and a sluggish GUI interface, limiting the tool to 1 or 2 compilers, a $600 bill and a 3-inch thick manual.

This is left as an exercise to the reader 

That's all folks.