Debugging on OS/2

Debugging
This is a small summary for beginners on how to debug a program (experts, in turn, should read it and then send me corrections and additions). Since this is a complex issue, I just try to explain the first steps of locating errors which correspond to a "critical" signal like SIGSEGV: this indicates a problem so serious that the operating system notices it and forces the application to quit. More subtle problems are not caught this way. Even non-programmers should be able to handle this, but one should be roughly familiar with the operating system being used.

Tools
Ideally you should have the source code for the application and all involved libraries. However, I assume that standard libraries don't crash easily but give some "reasonable" feedback, so I concentrate on bugs within the non-standard parts. An essential prerequisite is a debugger, a tool which allows you to run an application step by step and examine its internal data. On many systems the GNU debugger (gdb) is present; however, it may be inferior to commercial ones (like dbx, ladebug, etc.).

Two basic kinds of debugging applications are available:
 * Source code checkers. The most famous example is lint. On the "freeware side of life" check out splint (formerly "lclint")!
 * Others work with the existing executable, either at runtime or even in a post-mortem analysis. Examples are malloc debugging libraries and profiling tools.

Procedure
I try to outline the procedure to be followed and specify some details within brackets [] for a fictitious example written in C, which assumes that gcc (the GNU C compiler) and gdb are present to create and debug an executable "foo.exe". Note that the short versions of the gdb commands are used here; these are less descriptive than the full command strings, but I'm too lazy to write them out here ... The next step beyond this procedure would be examining further details (the stack itself, arguments to the current function, invalid data, etc.).
 * 1) Prepare a set of debuggable binaries (i.e. libraries and executables).
 * This involves specifying a flag to the compiler (and linker) [-g] upon compiling and linking. Otherwise the compiler/linker may not store the valuable information which maps source code to object code and vice versa.
 * gcc -o foo.exe -O0 -g foo.c
 * 2) You should disable optimization when compiling [-O0 is used for many compilers]!
 * Note that this may have apparently subtle effects:
 * a) Bugs may come and go due to internal flaws of the compiler when changing the optimization level.
 * b) Bugs may also be spurious due to common optimization procedures: e.g. if your code contains a statement which would trigger an error at runtime, it might just disappear when turning on optimization (e.g. by removal of unused code). A read through an invalid pointer whose result is never used is a candidate for that phenomenon.
 * c) gcc has a shortcoming in its design: completely disabling optimization (-O0) limits its capability to detect potential problems in the source, so some warnings won't be issued. Read the gcc manual!
 * 3) Run the debugger on the executable [gdb foo.exe]
 * 4) Optionally set a breakpoint on exit [b exit]
 * The debugger will stop execution when the location which has been assigned to a breakpoint (it might relate to a line in the source code, a function, etc.) is reached.
 * Then the command prompt is presented again.
 * Check below for more information about exit. Stopping there is not always necessary, but it is required when looking for X protocol errors.
 * 5) Actually run the program [r].
 * Take care here if debugging X protocol errors.
 * In case you need to specify command line arguments you can do this before running the code, either by an explicit call [set args bar] or "implicitly" [r bar].
 * 6) Now try to reproduce the bug in the simplest fashion, i.e. minimize the number of "actions" like keystrokes and mouse clicks, use the simplest input data file, etc.
 * Remember exactly (even better: write down) what you do here!
 * 7) The debugger command line will tell you technically what has happened (e.g. a segmentation fault, SIGSEGV, has been caught). Now the important thing is to locate it in detail.
 * Produce a stack trace [where]. This will show the location within the program where the crash happened.
 * Attach a copy of this output to your bug report!
 * 8) Given that the crash has happened in a function for which you have the source code (I assume so), you can now list the source code, i.e. the "bogus" line [l].
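
To make the optimization caveat from the procedure above concrete, here is a minimal sketch of a fragment that a fictitious foo.c might contain. The function names are made up for illustration. The value loaded through p is never used, so an optimizing compiler is free to drop the load; called with an invalid pointer, the first function would typically crash in a -O0 build but may run through in an optimized one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reads *p but never uses the value. */
int read_and_discard(const int *p)
{
    int unused = *p;   /* faults here if p is invalid and the load is kept */

    (void)unused;      /* silence "unused variable" warnings */
    return 0;
}

/* Variant with an explicit precondition check. The assert is not removed
 * by optimization; it only disappears when NDEBUG is defined. */
int read_checked(const int *p)
{
    assert(p != NULL);
    return *p;
}
```

Calling read_and_discard(NULL) from main and then following the procedure [r, where, l] should point at the marked line in a build made with -O0 -g.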

Advanced Issues
Though I cannot explain further procedures in this recipe-like style, I want to give some "well-known" hints & ideas and address some standard tasks.

Resolving X Errors
An X protocol error happens when some improper request was sent to the X server. It can happen when programming Xlib directly, or if the toolkit in use issues such a broken request. So if you're not actually programming at a low level or working on an X11-based GUI library, this error won't be your fault. This section is also useful if you're a newbie to debugging X11 programs in general; it is not limited to X protocol errors.

Within the debugging procedure outlined above you should set a breakpoint on exit. Check whether the application uses the X Intrinsics (Xt) library (e.g. run ldd foo.exe in a shell, outside gdb). If it's listed there, run the application with the -sync option [r -sync]. If not (pure Xlib), set the global variable _Xdebug from within the debugger or even within the source code near the beginning of main. Note that this may also change the "location" and/or "appearance" of the bug or even cause it to disappear!

Alternatively one may trigger on the interfaces which the X11 libraries call if there's a problem which they can detect: you may try setting breakpoints on the Xlib calls _XDefaultError, _XError, _XIOError and the Intrinsics library calls XtError, XtWarning. If things go wrong in system/libc calls from within those libraries, these interfaces won't be called, so you have to rely on your libc only. If your breakpoint on exit doesn't help when an X11 application crashes, check out the alternatives (abort, _exit) mentioned under Well-known Tricks.

An important issue while debugging X applications is that one can very easily lock up the display so that mouse and keyboard no longer accept input. To avoid this there are some workarounds:
 * Run the debugger on a console (or PM window/fullscreen session if on OS/2).
 * Run the application in Xvfb (X Virtual Framebuffer), Xnest or on another (local) display (e.g. ":1"). While the first one is a good idea only for code which may be examined non-interactively (e.g. checks of memory handling, profiling, etc.), the latter two alternatives are not only substitutes but almost better than running on the current X server. Xnest is especially helpful if you need to take care of data like Atoms being stored in the X server.
 * Unfortunately both tools are not always available.

Further details on debugging X11 apps can be found in these valuable writings:
 * www.openmotif.org/tnt/#Debug_Breakpoints
 * www.rahul.net/kenton/perrors.html

Well−known Tricks
Everyone should know those tricks...
 * Use a malloc debugging library like dmalloc, dbmalloc or efence. Probably the majority of all bugs are due to improper memory handling. These libraries might also help to detect memory leaks.
 * Discover more warning flags of your compiler! Beyond some basic problems a verbose compiler might tell you about more subtle problems.
 * Ensure the compiler is in an ANSI conforming mode and not just K&R or something else.
 * Try compiling/building on a different operating system/architecture with a different compiler.
 * Unless disabled (see "man ulimit"), programs often produce a core dump when they crash. The resulting data file is an image of the memory at the time the app was stopped. You can perform a post-mortem crash analysis with this file: run the debugger with the executable and the core dump as arguments [gdb foo.exe core] and proceed as explained above.
 * exit is a function which is usually called upon program termination. A lot of internal cleanup is done then. Setting a breakpoint there might also help when you're looking for memory leaks. When debugging C++ apps which crash upon exit, look for destructors being called there.
 * Sometimes an application crashes but you can't easily detect where, and your breakpoint on exit doesn't help. Then you should check the other "legal" libc procedures a program may call on exit, including abort and _exit.
 * main has a similar meaning as exit: a lot of things are performed which the simple-minded programmer might not be aware of. As opposed to exit, these things happen before the program's main is called. So your program might crash due to all kinds of improper initialization (e.g. variable assignments) or within any of the C++ constructors being called.
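
To illustrate the cleanup work hidden behind exit, here is a minimal sketch using the standard atexit interface. The handler and counter names are made up for this example. Handlers registered this way run inside exit, after main has returned, so a crash in one of them appears to happen "after the end" of the program; a breakpoint on exit [b exit] stops just before they run.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int handlers_registered = 0;   /* hypothetical bookkeeping */

/* Called from within exit(), i.e. after main has already returned.
 * Note that abort and _exit bypass handlers registered with atexit. */
static void cleanup_handler(void)
{
    assert(handlers_registered > 0);  /* can only run if it was registered */
    fputs("cleanup_handler: called from within exit\n", stderr);
}

/* Registers the handler; returns 0 on success, like atexit itself. */
int register_cleanup(void)
{
    int rc = atexit(cleanup_handler);

    if (rc == 0)
        handlers_registered++;
    return rc;
}

/* Number of handlers registered so far through register_cleanup. */
int registered_count(void)
{
    return handlers_registered;
}
```

A program which calls register_cleanup() early in main and then returns normally should print the handler's message on stderr during exit.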

Compile Time Problems
You may even have problems getting some code to compile. In this section I will again list a couple of ideas that have helped me at least once to get things resolved. Some of them may depend on the compiler being used, however.
 * Macros obscure the effective code. Try gcc -E to get the output as it's being passed from the preprocessor (cpp) to the compiler.
 * The former command does not tell you which macros are actually being set. Try gcc -E -dD and man cc for that.
 * Compilers may choke on source code which is not in the native text format (DOS vs. un*x) or which includes some special characters. While the former indicates a broken legacy compiler, the latter is quite reasonable behaviour...
 * Often macros shield declarations and definitions in the system headers. If you're able to locate, in the system headers, the declaration/definition of a symbol which is missing in your compilation, check the man page of your compiler. (Obviously the direct approach is reading the headers, but the macros used may be deeply nested ...)
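
As a small illustration of how macros obscure the effective code, consider the following sketch (all names are made up for this example). Running the file through gcc -E shows the expanded expressions given in the comments, which makes the missing parentheses obvious.

```c
#include <assert.h>

#define SQUARE_BAD(x)  x * x            /* missing parentheses */
#define SQUARE_GOOD(x) ((x) * (x))

/* Expands to: 1 + 2 * 1 + 2, which evaluates to 5. */
int square_bad_of_1_plus_2(void)
{
    return SQUARE_BAD(1 + 2);
}

/* Expands to: ((1 + 2) * (1 + 2)), which evaluates to 9. */
int square_good_of_1_plus_2(void)
{
    return SQUARE_GOOD(1 + 2);
}

/* Self-check documenting the difference between the two macros. */
void check_square_macros(void)
{
    assert(square_bad_of_1_plus_2() == 5);
    assert(square_good_of_1_plus_2() == 9);
}
```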

Portability Issues
Nowadays portability is very important. Writing clean code saves you a lot of time and also increases the chance of getting it built easily on 64-bit machines as well. So if you run across problems within an application, the cause may not be source code which is totally broken, but code which has only been written & tested in a single environment. From this point of view, portability is crucial for reducing the amount of useless debugging.

In the following I briefly mention some famous portability issues.

Language Level

 * Never use pre-ANSI interfaces, e.g. those declared in headers such as memory.h!
 * Don't use compiler or preprocessor extensions! Famous examples are the GNU extensions __FUNCTION__ and typeof.
 * Check whether compiler switches cause a different behaviour of the executable. Examples are the compilers on machines based on the Alpha processor (AXP), which require -ieee or -mieee to get applications to work which rely on IEEE conformance and proper handling of numerical exceptions.

"Bitness"

 * The famous byte order issue: most widely used are LITTLE_ENDIAN ("1234", on i386) and BIG_ENDIAN ("4321").
 * Do not assume char to be signed or unsigned!
 * Do not assume sizeof(int) equals sizeof(long)!
 * Do not assume sizeof(void *) equals sizeof(int)! I.e. don't assign pointer values to int.
 * Do not assume that the result type of sizeof is int!
 * Its type (an unsigned integer type) is size_t, which is defined in standard headers (stddef.h, among others).
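
The assumptions listed above can be checked at runtime instead of being guessed. The following is a minimal sketch; the function names are made up for illustration and it relies on standard C interfaces only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 otherwise
 * (big-endian or some mixed byte order). */
int is_little_endian(void)
{
    unsigned int probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);   /* look at the byte at the lowest address */
    return first == 1;
}

/* Returns 1 if plain char is signed with this compiler and these flags. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* sizeof yields a size_t, so the result is kept in a size_t, not an int. */
size_t pointer_size(void)
{
    size_t n = sizeof(void *);

    assert(n >= sizeof(char));
    return n;
}
```

Comparing pointer_size() with sizeof(int) and sizeof(long) on the target machine shows quickly whether code which stores pointers in integers can work there at all.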


 * Always use full prototypes and make your function calls match the specified signature, with explicit type casts where needed.
 * The former rule is especially important when using varargs interfaces (variable number of arguments), e.g. those based on va_start from stdarg.h or those from other libraries like the XtVa* interfaces of the X Intrinsics library. Usually they need a NULL pointer to indicate the end of their argument list. Implicit conversion from 0 won't work, so you have to use NULL (though even a strangely defined NULL may cause trouble).
 * Even at the beginning of the 200x decade, using C9x features is not a good idea: the full set of features is rarely implemented, and even more rarely will you actually find such a compiler on your target machine ...
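
The NULL terminator rule for varargs interfaces can be sketched as follows. The function total_length is hypothetical and only serves as an example; the point is the explicit cast of the terminator in the call, which guarantees that a real pointer (and not an int 0) is passed through the variable argument list.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical varargs function: sums the lengths of all strings passed.
 * The argument list must end with a null pointer of type char *. */
size_t total_length(const char *first, ...)
{
    va_list ap;
    size_t total = 0;
    const char *s;

    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, char *))
        total += strlen(s);
    va_end(ap);
    return total;
}

/* Example call: the cast makes the terminator a pointer even if NULL
 * happens to be defined as a plain 0 on this platform. */
size_t example_call(void)
{
    size_t n = total_length("ab", "cde", (char *)NULL);

    assert(n == 5);
    return n;
}
```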

Bug Reporting
If you're going to write a proper bug report you need to consider a couple of things:

First, state exactly which version of the code you are using: give the exact version of the distribution or the CVS checkout date. You should specify all libraries which are involved. Run ldd foo to see all shared libraries linked to the executable. (Note that ldd is not a standard tool; it may have a different name on other platforms or may not even exist on your installation!) If the error happens while compiling/linking, always give the full command line, perhaps even the complete output of the "make" command.

In addition you need to fully specify your system. "uname -a" should give you the details about your operating system as well as the basic hardware (CPU architecture).

Happy debugging!

Debugging on OS/2
Previously I wrote some notes about basic debugging issues. Here I want to briefly mention some OS/2 and EMX/gcc specifics. Peculiarities of debugging PM apps are not covered here.

Tools and Helpers
For good reasons I only refer to "no cost" products here...
 * There are some debuggers available for free:
 * gdb (pmgdb) (for a.out objects; part of EMX) GNU software; available for most platforms
 * sd386 (OS/2 software sites) (for OMF objects) - from IBM
 * Resolving segmentation faults (SIGSEGV):
 * Look out for debugging implementations of malloc like dmalloc.
 * dbmalloc: available from OS/2 software sites. It's rather old now (>= 7 years), but it still builds and the supplied test examples work.
 * If you're going for a quick check it is sufficient just to link against this library:
 * gcc -Zmt -Zexe -o foo foo.o -lbar -ldbmalloc
 * But often you will prefer more helpful output than what you get this way; then add an #include of the library's header to all your source files (actually you don't need to do so for all of them, but it is a good idea ...) and rebuild.
 * dmalloc: I'm offering a build for OS/2 EMX. Usage is quite similar to dbmalloc.
 * libefence: a well-known one on un*x; it isn't available on OS/2.
 * ccmalloc: a nice one for i86 Linux; it isn't available on OS/2.


 * Operating System/2 API Trace (os2trace.zip on OS/2 software sites):
 * Enables, customizes, controls and summarizes the tracing of OS/2 APIs imported by a 16-bit or 32-bit executable file without affecting its source code or requiring recompiling or relinking.
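
As a rough idea of what the malloc debugging libraries listed above do internally, here is a deliberately tiny sketch of an allocation counter. All names (dbg_malloc, dbg_free, dbg_live_blocks) are made up for this example; real libraries like dbmalloc or dmalloc do far more, such as checking for overwritten block boundaries and recording the source location of each call.

```c
#include <assert.h>
#include <stdlib.h>

static long live_blocks = 0;   /* allocations which have not been freed yet */

/* Wrapper around malloc which counts successful allocations. */
void *dbg_malloc(size_t size)
{
    void *p = malloc(size);

    if (p != NULL)
        live_blocks++;
    return p;
}

/* Wrapper around free which counts releases. */
void dbg_free(void *p)
{
    if (p == NULL)
        return;                  /* free(NULL) is a no-op */
    assert(live_blocks > 0);     /* more frees than allocations: a bug */
    live_blocks--;
    free(p);
}

/* A value other than 0 at program end hints at a memory leak. */
long dbg_live_blocks(void)
{
    return live_blocks;
}
```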

Debugging Basics
Though called "basic" the following methods may not be obvious even if you already have some experience with debugging in general.
 * To set a hard-coded breakpoint (see SIGTRAP) within your application you may use this macro:
 * #define BREAKPOINT __asm__("int3");
 * Core dumps are images of the current process (mainly the memory) written to disk. On EMX they're only available when using a.out objects, and can afterwards be debugged using GDB. Core dumps are even available upon request, see _core.
 * Don't mix OS/2's and EMX's memory handling!
 * Breakpoint on exit:
 * If an executable is linked against emxlibcm.dll (or emxlibcs.dll; I will use the term "emxlibc*.dll" in the following), exit and other libc functions (also those related to exit, like abort and _exit) are not known symbols to gdb. Unfortunately this is usually the case when building X11 apps (see the XFree86 OS/2 FAQ). exit is located in emxlibc*.dll, which was linked from OMF objects using LINK386. gdb is unable to resolve symbols by name from DLLs of this kind. Therefore it doesn't know the address of exit and cannot set a breakpoint on it. Instead, look up its offset in \emx\etc\emxlibc*.map, e.g.

0001:0000DB78 exit      exit
 * Use set show-dlls to make gdb stop upon accessing the DLL for the first time and set a specific breakpoint on that specific DLL by using dll-break emxlibcm. From that output, e.g.

[Load DLL: E:\PROGRAM\EMX\DLL\EMXLIBCM.DLL] [.text: 0x1db80000 - 0x1dba9a80] [.data: 0x187b0000 - 0x187b6060] [.bss: 0x187b6060 - 0x187b97c0]
 * extract the .text (i.e. executable code) base address. Then set the breakpoint on the address calculated as the sum of base and offset:

b *(0x1db80000+0xdb78)
 * The EMX docs claim that "due to a bug in OS/2 the breakpoint will apply to all programs using emxlibc*.dll". I couldn't verify this with recent versions of OS/2...
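
The address calculation for the breakpoint on exit is a plain sum of the .text base reported by gdb and the offset from the map file. As a minimal sketch, using a hypothetical helper function and the example values shown above:

```c
#include <assert.h>

/* Hypothetical helper: absolute address of a symbol, given the .text base
 * address of the loaded DLL and the symbol's offset from the .map file. */
unsigned long breakpoint_address(unsigned long text_base,
                                 unsigned long map_offset)
{
    unsigned long addr = text_base + map_offset;

    assert(addr >= text_base);   /* guard against wrap-around */
    return addr;
}
```

With the values from the example, breakpoint_address(0x1db80000UL, 0xdb78UL) yields 0x1db8db78, which is the address the gdb command b *(0x1db80000+0xdb78) refers to.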


 * Resolving X Errors
 * See debugging.html for some general comments on this issue.
 * Setting breakpoints in the X11 libs instead of libc (emxlibc*.dll) requires linking a set of debuggable X11 libraries statically, and so usually doesn't help much with XFree86 OS/2/EMX.

Misc Hints
This is intended to be a collection of ideas for when you run out of them while trying to resolve a problem. Here I could really use more input from other developers!
 * Subtle segmentation faults are sometimes triggered by running out of stack space.
 * An endless recursion is a good candidate for this. Or the stack size was set to too small a value (see the -Zstack option of gcc). Since this cannot be fixed on OS/2 at runtime (but only when linking the application), just make it big enough for any possible situation. OTOH setting it too big may end up with more subtle errors, like sys1059 when trying to start an application...


 * Another candidate is the use of stale pointers: references to memory/variables of storage class auto. Hard to debug, since few tools will tell you when you try to free such a pointer.
 * If problems happen with input being read (binary data like images, or text files as well), check whether the file (socket, pipe) is being read in the correct mode (see fopen, -Zbin-files). You may have to write explicit code to read in text files which may be in either DOS or un*x format.
 * Make sure a program which accesses DLLs (via import libs or dynamically) loads the correct versions of those libraries. You may use ldd foo.exe to see the shared libs of that executable (check the Porting FAQ for that command).
 * Applications which depend on signals (including their use as a timer, for data acquisition, animations) and which don't work properly might suffer from using the wrong signal model (see the Porting FAQ).
 * Sometimes one forgets that fork doesn't work in an executable based on OMF objects/linked with link386.
 * If you believe you have discovered a bug in your current EMX/gcc, try the alternative versions.
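
For the hint about text files in either DOS or un*x format, the "explicit code" can be as small as the following sketch. The function name chomp_line is made up for this example; the assumption is that the file was opened in binary mode (e.g. fopen(name, "rb")) and is read line by line, e.g. with fgets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Removes a trailing "\n", "\r\n" or lone "\r" in place and returns the
 * new length, so DOS and un*x lines look the same afterwards. */
size_t chomp_line(char *line)
{
    size_t len;

    assert(line != NULL);
    len = strlen(line);
    if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';
    if (len > 0 && line[len - 1] == '\r')
        line[--len] = '\0';
    return len;
}
```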

Other Resources
Here I collect references to other information resources which have not been mentioned so far.
 * IBM OS/2 Debugging Handbook (INF format)
 * More info about traps, the native OS/2 error mechanism, can be found in
 * The Control Program Guide & Reference, under Exception Management
 * an article from Frank Meilinger and
 * the short summary about traps from Steven Grim.
 * except3.zip
 * Contains sample code using exception handling for debugging purposes.