Dump Analysis 101
By Gary Jarman
Introduction
This document was written by
- Gary Jarman,
- P O Box 74202,
- Turffontein,
- 2140,
- South Africa.
Welcome to Dump Analysis 101. This document is intended to introduce you to the world of 16 bit application dumps. If you use this document, please send me a postcard with your opinions. There is no catch involved; I would just like to know if anyone out there has used this file and, if they have, what they thought of it.
I am hoping that this will be a learning experience for both you and me. If you see any errors, or can add to what I have written, please let me know.
In the meantime I am starting work on the 32 bit application dump document (Dump Analysis 102), to be followed by a document dealing with system hangs and application loops (Dump Analysis 103).
Enjoy your reading, Gary, 13 August 1995.
Changes between versions
Rev. 1
The main correction here is that in the original, I had the SP pointing to the next available space on the stack. This is incorrect. The SP always points to the last parameter pushed onto the stack. I have corrected all diagrams and all the references that I could find.
Stacks
The key to reading dumps is the ability to read stacks. Every slot displayed in the dump (use .p to display all slots in your current dump) will have a stack. Each slot corresponds to a thread that was running in the system at the time the dump was taken. Stacks are used to store local variables and to pass parameters when functions are called. They are also used by the system to save information on where to return control to when a function is completed.
How is a stack allocated by OS/2?
Each thread that is created has a stack. The stack is created in a segment, as a sparse object. This means that the operating system allocates the area of memory but does not commit the entire area allocated. Only the last page is committed (OS/2 handles memory as pages, and a page is 4KB in size), while the page above that is committed as a guard page. OS/2 then uses the guard page technique to make memory available as it is required. When an operation exceeds the boundaries of the committed page and goes into the guard page, a guard page fault occurs. OS/2 has a default guard page fault handler that commits what was the guard page and makes the next available page the guard page. This technique helps prevent excessive swapping, because memory that is not actually being used (even though it has been allocated to a process) is not committed and so does not use up physical memory. This method of allocating memory for a stack is used for all stacks except that of the first thread created by the process.
The fact that OS/2 commits the LAST page in the segment (the one with the highest addresses) first also explains why the earlier functions have higher value addresses on the stack than the last accessed functions.
A segment can be a maximum of 64KB, so that is also the maximum size a stack may be. This is the most that a 16 bit offset can address (in the 32 bit system this can be up to 4GB).
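The guard page technique described above can be modelled in a few lines of C. This is only a sketch to illustrate the idea, not the way OS/2 implements it: the names (stack_seg, stack_seg_touch and so on) are made up for this example, the 64KB segment is modelled as 16 page states, and I have assumed that the guard page is always the page just below the lowest page already in use.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 0x1000u   /* 4KB page, as stated in the text          */
#define SEG_PAGES 16u       /* 64KB segment divided into 4KB pages      */

/* Hypothetical page states for this model only. */
enum page_state { PAGE_RESERVED, PAGE_GUARD, PAGE_IN_USE };

struct stack_seg {
    enum page_state page[SEG_PAGES];   /* page[15] holds the highest addresses */
};

/* Initial layout: the last page is usable, its neighbour is the guard page. */
static void stack_seg_init(struct stack_seg *s)
{
    for (size_t i = 0; i < SEG_PAGES; i++)
        s->page[i] = PAGE_RESERVED;
    s->page[SEG_PAGES - 1] = PAGE_IN_USE;
    s->page[SEG_PAGES - 2] = PAGE_GUARD;
}

/* Model a stack access at 'offset' within the segment.
 * Returns 1 if the access succeeds, 0 if it lands beyond the guard page. */
static int stack_seg_touch(struct stack_seg *s, unsigned offset)
{
    size_t idx = offset / PAGE_SIZE;
    assert(idx < SEG_PAGES);

    if (s->page[idx] == PAGE_IN_USE)
        return 1;
    if (s->page[idx] != PAGE_GUARD)
        return 0;

    /* Guard page fault: the guard page becomes usable and the next
     * page down (if there is one) becomes the new guard page. */
    s->page[idx] = PAGE_IN_USE;
    if (idx > 0)
        s->page[idx - 1] = PAGE_GUARD;
    return 1;
}

/* Count the pages that are actually in use. */
static unsigned stack_seg_in_use(const struct stack_seg *s)
{
    unsigned n = 0;
    for (size_t i = 0; i < SEG_PAGES; i++)
        if (s->page[i] == PAGE_IN_USE)
            n++;
    return n;
}
```

In this model a stack that never grows past its first 4KB keeps a single page in use, which is the saving in physical memory that the technique is meant to give.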
How is a stack populated?
As mentioned earlier stacks are used to store local variables as well as to pass data between functions. This section attempts to explain how this is done.
The example will assume that the first function to be kicked off is function_one, it performs some processing and then makes a call to function_two, passing it two parameters. The example also shows how the space for the local variables of function_two is reserved. For function_one I have given the required lines of example C code and their corresponding assembler instructions. For each instruction there is a diagram that represents the stack after that instruction is executed.
The example assumes that when we enter function_one the SS:BP is 1F:E95E; the SP would also be E95E.
Example C source code:
VOID function_one (MPARAM ....)
  { USHORT usVar1,
           usVar2,
           usVar3,
           usVar4,
           usVar5;
     .
     .
     .
     function_two(usVar2, usVar4);
     .
     .
     .
   }
Example assembler code:
bd0f:e318 ....  enter a,0
bd0f:e320 ....
.
.
bd0f:e44c ....  push word ptr [bp - 08]
bd0f:e44f ....  push word ptr [bp - 04]
bd0f:e452 ....  call bd36:0b5c
bd0f:e457 ....  add sp,+04
.
.
To view the stack contents as we enter function_one refer to Fig. 1.
Function_one has 5 local variables, all of type USHORT, and space on the stack must be reserved for these variables as the system enters function_one. To do this the first instruction is an ENTER instruction. This lowers the SP by the value of its first operand (in this case hex A, which is decimal 10) to give us an SP of E954 (E95E-(5*sizeof(USHORT))). The ENTER instruction is the equivalent of the following instructions:
- PUSH BP
- MOV BP,SP
- SUB SP,nn
Compare Fig. 1 to Fig 2 to see the changes after executing the enter instruction. This is the same method used for entering all functions.
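The three steps that make up an ENTER can be sketched in C as follows. This is a simplified model and not real OS/2 code: the cpu16 structure, push16 and enter16 are names invented for this example, and the second operand of ENTER (the nesting level) is assumed to be 0, as it is in the listings in this document.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the stack registers and one 64KB stack segment. */
struct cpu16 {
    uint16_t bp;
    uint16_t sp;
    uint8_t  stack[0x10000];   /* contents of the segment addressed by SS */
};

/* PUSH: decrement SP by two, then store the word (low byte first). */
static void push16(struct cpu16 *c, uint16_t value)
{
    assert(c->sp >= 2);
    c->sp = (uint16_t)(c->sp - 2);
    c->stack[c->sp]     = (uint8_t)(value & 0xFF);
    c->stack[c->sp + 1] = (uint8_t)(value >> 8);
}

/* ENTER nn,0 written out as its three equivalent instructions. */
static void enter16(struct cpu16 *c, uint16_t locals_size)
{
    push16(c, c->bp);                          /* PUSH BP    */
    c->bp = c->sp;                             /* MOV  BP,SP */
    c->sp = (uint16_t)(c->sp - locals_size);   /* SUB  SP,nn */
}
```

With made-up starting values of BP = 1010 and SP = 1000, calling enter16 with a size of hex A stores the old BP at offset 0FFE, sets BP to 0FFE and leaves SP at 0FF4.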
We have done some processing in function_one and now we want to call function_two, passing it two variables, also of type USHORT.
Refer to Fig. 3 to see stack contents after the first parameter is put on the stack, and Fig. 4 for the second parameter being put on the stack.
Take note that in this example I have shown the assembler code as putting parameter 2 on the stack and then parameter 1. The order depends on the calling convention used by the compiler the program was compiled with.
This example also assumes that the five variables were loaded to the stack in the order that they were defined. This is not always the case; refer to Example of compiler generated files.
The parameters are put onto the stack using the PUSH instruction. Data is always put onto a stack using PUSH (and removed from the stack using POP). The PUSH instruction first decrements the SP by two and then puts the data on the stack.
When the first push was executed the SP would change from E954 to E952. The second USHORT would be pushed the same way and the SP would point to E950. The next instruction would be a CALL instruction (since we are now going to go to function_two). When the call instruction is handled by the hardware, the IP is pushed onto the stack and, if it is a far call, the CS is also pushed onto the stack. Note that at the stage the CS:IP are pushed onto the stack they are pointing to the next instruction (in function_one) to be executed following the call.
The fact that the address we are calling is displayed as both a selector and offset indicates that we are making a far call. A far call occurs when control is transferred to a piece of code contained in a selector different to that of the calling code. If we were calling a piece of code that was contained in the selector bd0f (ie. the selector we are currently running in) it would be referred to as a near call. A near call would look as follows: bd0f:e457 .... call e56a, and the address of the entry point for the called function would then be bd0f:e56a. Also, with a near call only the offset of the next instruction is pushed onto the stack.
Refer to Fig. 5 to see the stack contents once we have left function_one but have not yet executed any instructions in function_two.
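The effect of the two PUSH instructions and the CALL on the SP can be followed with a small C model. This is a sketch only: stack_mem, sp, push_word, call_far and call_near are names made up for this example, and the parameter values used when trying it out are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t  stack_mem[0x10000];   /* hypothetical image of the stack segment */
static uint16_t sp;                   /* model of the SP register                */

/* PUSH: decrement SP by two, then store the word (low byte first). */
static void push_word(uint16_t value)
{
    assert(sp >= 2);
    sp = (uint16_t)(sp - 2);
    stack_mem[sp]     = (uint8_t)(value & 0xFF);
    stack_mem[sp + 1] = (uint8_t)(value >> 8);
}

/* Read a word back from the stack image. */
static uint16_t read_word(uint16_t offset)
{
    return (uint16_t)(stack_mem[offset] |
                      (stack_mem[(uint16_t)(offset + 1)] << 8));
}

/* Far CALL: the CS is pushed first, then the IP of the next instruction.
 * The jump itself is not modelled, only the effect on the stack. */
static void call_far(uint16_t ret_cs, uint16_t ret_ip)
{
    push_word(ret_cs);
    push_word(ret_ip);
}

/* Near CALL: only the IP of the next instruction is pushed. */
static void call_near(uint16_t ret_ip)
{
    push_word(ret_ip);
}
```

Starting with SP = E954, two calls to push_word move the SP to E952 and then E950, as in the text. A call_far with the return address bd0f:e457 then lowers the SP by a further four bytes, with the IP stored at the lower address and the CS two bytes above it. A call_near lowers the SP by only two bytes.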
The first instruction that we would encounter in function_two is an ENTER. This pushes the BP onto the stack, moves the current value of SP to BP, and then reserves space for the local variables. Before the ENTER is executed the SS:BP is 1F:E59E and the SP is E58C. The ENTER instruction puts the value of BP (E59E) on the stack (in so doing the SP is decremented by two, giving E58A) and then subtracts the value of its first operand (the length of the local variables, hex 16) from the SP, thus giving us E574.
Example assembler code (at entry of function two):
  .
  .
  .
bd36:0b5c ....  enter 16,0
  .
  .
  .
Refer to Fig. 6 to see the stack contents when the enter command has been executed.
Stack Registers
From the above we can see that there are three registers used with the stack:
- SS - stack selector, this points to the segment containing the stack.
- (E)BP - Base Pointer, points to the base of the current stack frame.
- (E)SP - Stack Pointer, points to the last item pushed onto the stack. A PUSH instruction first decrements the SP and then stores the data at the new position.
Stack frames
The stack is broken down into stack frames. A stack frame can be thought of as a unit containing all of the data required by a function. A frame consists of the parameters passed, the return CS:IP, the return BP and the local variables.
                                  ^
        │                     │   │                             ^
        │                     │   │                             │
        │  local data         │   │  <───── higher value        │
        │                     │   │         addresses           │
        │                     │   │        (eg. 1f:e206)        │
        ├─────────────────────┤   │                             │
        │                     │   │                             │
        │  parameters         │   │                             │
        │                     │   │                             │
        │                     │   │                             │
        │                     │   │                             │
        ├─────────────────────┤   │                             │
        │  return CS          │   │                             │
        ├─────────────────────┤   │                 Unwinding a │
        │  return IP          │   │                 stack takes │
        ├─────────────────────┤   │                 you in this │
    ┌──>│  return BP          │───┘                 direction.  │
    │   ├─────────────────────┤                                 │
    │   │                     │                                 │
    │   │                     │                                 │
    │   │                     │                                 │
points  │                     │                                 │
to  │   │  local data         │                                 │
previous│                     │                                 │
stack   │                     │                                 │
frame   │                     │                                 │
    │   ├─────────────────────┤ ────┐                           │
    │   │                     │     │                           │
    │   │  parameters         │     │                           │
    │   ├─────────────────────┤     │                           │
    │   │  return CS          │     │                           │
    │   ├─────────────────────┤     │                           │
    │   │  return IP          │     │                           │
    │   ├─────────────────────┤     ├── stack frame             │
    └───│  return BP          │     │                           │
        ├─────────────────────┤     │                           │
        │                     │     │                           │
        │                     │     │                           │
        │                     │     │                           │
        │                     │     │                           │
        │  local data         │     │                           │
        │                     │     │                           │
        │                     │     │                           │
        │                     │     │                           │
        ├─────────────────────┤ ────┘  <─── lower value         │
                                            addresses
                                           (eg. 1f:054a)
Be careful of how you picture the growth of a stack. As mentioned in How is a stack allocated by OS/2?, OS/2 starts committing pages from the end of the segment allocated for the stack. This means that the stack frame that was allocated first has the highest address (in terms of numerical value), while the last allocated frame has the lowest numerical address but is at the top of the stack.
From the illustration above there are a few things that can be used as guidelines for dump analysis.
- When unwinding a stack, the base pointers you are using should always get higher in numerical value (ie. you are working from the lowest value to the highest). If this trend is reversed, the chances are you have a corrupted stack.
- If an assembler instruction references BP - ??, 90% of the time it will be referring to local variables of that function. BP + ?? will be referring to parameters passed to that function.
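The second guideline can be written down as a small C helper that says what a BP relative operand is most likely to refer to. This is a rule of thumb only, in line with the "90% of the time" above. The names bp_ref and classify_bp_ref are invented for this example, and the layout assumed is the one described in this document: the return BP at BP+0, followed by a 2 byte (near) or 4 byte (far) return address, followed by the parameters.

```c
#include <assert.h>

/* What a [bp + disp] operand most likely refers to (hypothetical names). */
enum bp_ref { REF_LOCAL, REF_SAVED_BP, REF_RETURN_ADDR, REF_PARAM };

/* disp is the signed displacement from BP in bytes.
 * far_call is 1 if the function was reached by a far call, 0 for a near call. */
static enum bp_ref classify_bp_ref(int disp, int far_call)
{
    int ret_size;

    assert(far_call == 0 || far_call == 1);
    ret_size = far_call ? 4 : 2;     /* CS:IP for a far call, IP only for near */

    if (disp < 0)
        return REF_LOCAL;            /* below BP: local variables      */
    if (disp < 2)
        return REF_SAVED_BP;         /* at BP: the caller's BP         */
    if (disp < 2 + ret_size)
        return REF_RETURN_ADDR;      /* just above BP: return address  */
    return REF_PARAM;                /* further up: parameters passed  */
}
```

For example, the operand of push word ptr [bp - 08] in the listing for function_one has a displacement of -8 and so is classed as a local variable, while [bp + 06] in a function reached by a far call is classed as its first parameter.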
Unwinding a stack
This is probably the most important thing you will need to know in order to analyse a dump. The stack can tell you the route that the application took to get to its sudden, unexpected death. If you have read the sections How is a stack populated? and Stack frames then you should have a fairly good idea of the structure of a stack. If you understand this then it is pretty simple to understand how to unwind the stack.
When you first look at the dump, go to SS:BP. This will point you to the middle of the stack frame that was active when you trapped. What it points to is the BP that was used by the function that called the failing function. The 4 bytes after that BP (or 2 bytes if it was a near call; refer to Near and far calls) will be the address that the function would have returned to, if only it had had the chance.
Note the current BP contents and the failing CS:IP. Get the BP, and CS:IP pointed to by SS:BP and write those down. Use the BP that you have just got off the stack and together with the SS it will point you to the middle of the next frame, from where you can get the BP used by the function that called the function that called the function that died (and so it carries on).
You have reached the bottom of the stack when the return BP is 0.
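The unwinding procedure can be sketched in C as a loop that follows the chain of saved BPs through a copy of the stack segment. This is an illustration under some assumptions and not a PMDF feature: frame16, peek16 and unwind_far are names made up for this example, every function is assumed to have been reached by a far call (so the return IP is at BP+2 and the return CS at BP+4), and the check that each BP is higher than the last is the corruption guideline from Stack frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One unwound frame (hypothetical structure for this example). */
struct frame16 {
    uint16_t bp;       /* the BP that identifies this frame           */
    uint16_t ret_ip;   /* return offset, read from SS:BP+2            */
    uint16_t ret_cs;   /* return selector, read from SS:BP+4          */
};

/* Read a 16 bit word (low byte first) from the stack segment image. */
static uint16_t peek16(const uint8_t *seg, uint16_t off)
{
    return (uint16_t)(seg[off] | (seg[(uint16_t)(off + 1)] << 8));
}

/* Follow the BP chain starting at 'bp' until a saved BP of 0 is found.
 * Returns the number of frames written to 'out', or -1 if a saved BP
 * is not higher than the current one (probably a corrupted stack). */
static int unwind_far(const uint8_t *seg, uint16_t bp,
                      struct frame16 *out, int max_frames)
{
    int n = 0;

    assert(seg != NULL && out != NULL);
    while (bp != 0 && n < max_frames) {
        uint16_t saved_bp = peek16(seg, bp);

        out[n].bp     = bp;
        out[n].ret_ip = peek16(seg, (uint16_t)(bp + 2));
        out[n].ret_cs = peek16(seg, (uint16_t)(bp + 4));
        n++;

        if (saved_bp != 0 && saved_bp <= bp)
            return -1;               /* BPs must get higher as we unwind */
        bp = saved_bp;
    }
    return n;
}
```

Each entry that comes back is one line of the list you would otherwise write down by hand: the BP of the frame and the CS:IP that its function would have returned to.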
Why do you want to do this? When you are looking at a dump, it is often a parameter that was passed that caused the problem. So you need to go back at least to the second last function to see what parameters it passed. In some cases you'll have to go back further to work out how the parameter you suspect actually came to contain the value it did.
Refer to Fig. 7 for an example of a stack and how the frames link together, and Simplified look at unwinding the stack for more information.
When you are looking through a function other than the one that failed, always specify a value rather than a register name. Using the register name will cause PMDF to use the value in that register at the time of the abend. This is fine for the SS which is the same for all functions, but the BP will change with each function.
Approach to looking at dumps
There is no fixed approach to looking at a dump. Each dump has to be looked at in its own way. There are however a few guidelines that I can give to help track down the problem.
What PMDF gives us
When you first load a dump PMDF gives you a display showing the contents of all the registers of the thread that was active at the time of the dump (in the case of a trap D this will be the thread that abended). It also gives you the assembler instruction that failed, and why it failed.
You can redisplay the register contents, and the assembler instruction using the .r command. Refer to Initial display and output of .r.
Above the register information will be a line that gives you various handles associated with the thread and its process. On the far left hand side is the slot number. If you were to do a .p, PMDF would display all the slots that were running in the system at the time it trapped (a slot corresponds to a thread).
On the far right of the top line is a name; this is the name of the dll/exe that was running the thread.
Make a note of the slot number, it could be used later on.
What have we learnt so far?
- The name of the dll/exe that failed.
- The assembler instruction that failed.
- Why that assembler instruction failed.