The Case of the Invisible Corpse

Preface

The following is a story about debugging in the OS/2 environment. It is meant to be a training piece, which can be used to give some insight and ideas to those new to the field. The reader is assumed to have some knowledge of OS/2 and Intel 386 architecture. The focus of the article is on the Trap0002, which is also known as the "Non Maskable Interrupt" or NMI.

Many of us write about interesting or obscure problems in the form of an annotated debug log, where we show the results of entering commands to the "kernel debugger" and then follow leads based on the results of those commands, until we form an opinion as to what the problem is.

This time though, I thought I would relate my debugging experiences in a different way. A way that imparts some of the true excitement and sense of adventure that I feel about my work. I did include a debug log at the end of the narrative, so those with debugging experience could see the same things I did when trying to solve the problem.

The story is told from the point-of-view of a private investigator assigned to a murder case. I hope you enjoy it.

"The Case of the Invisible Corpse"

I love a mystery.

The kind of mystery you read on a wintry evening, curled up in your favorite leather armchair, next to a roaring fire.

Anyway, you get the idea. Like most mystery buffs, I try to figure out the ending before getting to the climax and I often see myself as the investigator in the story, trying my best to "crack the case".

Having been an OS/2 debugger for a while now, I have had the opportunity to investigate many different types of "mysteries". I recently had the chance to work on a problem involving a Trap0002. I've worked on Trap0002 cases before and, while they are always interesting, they are often misunderstood.

I was assigned to a case in Germany recently where an OS/2 system was experiencing an observable slowdown during the startup process of one of the servers on a four server network. By observable, I mean that the system exhibited all the signs of a "hard hang". In this kind of hang the desktop will not respond to any keystrokes, nor will the mouse pointer move. However, if the operator waited for a short time, the system would magically come back to life as though nothing happened. I did not know these facts beforehand and would discover them myself later in the investigation. Therefore, when I arrived on-site, I was completely "cold" and had no preconceived ideas about what the problem was.

When I was shown the scene of the crime, I saw an OS/2 system in a hard hang. I was told that a debug kernel had been installed on the hanging system, so I immediately connected my ThinkPad to the "victim" (via serial ports and a null modem cable) and proceeded to interrogate the best witness I had, the debugger.

The first thing my witness told me was that our victim had been struck down by a Trap000D. Traps are some of the ways crimes can be committed in the world of the Intel chip. A trap indicates that a "rule" or "law of the land" has been broken and something has to be done about it. That "something" is usually to find the direct cause of the Trap and who the culprit is so that the problem can be corrected.

In most cases, a Trap000D means the chip encountered an instruction that, if allowed to execute, would violate the integrity of another process [1]. Trap000D's are not always fatal to the operating system but are, most of the time, fatal to a process. When the operating system is "killed" by a Trap000D, it will be a trap in ring 0 (the operating system or device driver) code.

As my investigation continued I discovered that, indeed, the Trap was in ring 0. My witness told me that the system was in a routine that causes a string of data (a message) to be written to the debug terminal. In fact, the only reason the system was trying to display this message at all was because the debugger was running. This was an extremely odd situation. To complicate matters further, no evidence could be found as to the reason for the trap. The "instruction", indicated by the debugger as the immediate cause, cannot and will not cause a trap.

This means that what I saw while examining the scene of the crime was something like an invisible corpse. It was like finding a murder victim in a dark alley, hearing a noise which catches your attention, and upon turning back, finding the body "gone", vanished. This was more than just a disappearing corpse, however. In this case the body was already gone while I was looking at it. It was there, and not there, at the same time.

In any investigation, there is a standard list of routine questions to ask, not only of the witnesses, but also of yourself. These are things related to "means, motive and opportunity". These questions are most relevant when interrogating suspects. Motivation, however, also applies to the victim. For example, why did the victim venture down the dark alley at so late an hour? When looking for motivation after a trap, the investigator should try to determine how the system got itself into the present state. When a ring 0 trap occurs, it is reasonable to find out what was happening at ring 3 that caused the system to be in ring 0 at the time of the incident. Operating systems, by their very nature, tend to be simply service programs. Most of the time, when code is executing at ring 0, it is at the request of an application program. Therefore, when looking for motivation, it is wise to see what, if any, request was made at ring 3.

I proceeded to look for a ring 3 witness, a context for an application program, that might supply a reason for our victim being in the wrong place at the wrong time. I was in store for yet another startling discovery. There was NO ring 3 context. It simply was not there. "How can this be?" I asked myself. Am I sure this is an OS/2 System? The more I probed, the more panic stricken I became. No ring 3 context. No request made to the operating system. No way to determine motivation.

After all these years of doing this kind of work, I still hate an audience. People standing around, looking over your shoulder, hoping to see some magic incantation typed on the screen by the "wizard" from the lab, that can be used later to solve all sorts of problems in the future. What most people don't realize is that debugging is nothing more than educated guesswork. Anyway, with a room full of people looking over my shoulder, I began to feel very insecure and unsure of myself. There was something important happening here and I didn't see it. What was I missing?

All I had so far was an invisible corpse and a missing ring 3 context. It was time to get some real facts that could help me begin this case. Going back to basics, I asked my witness to tell me about anyone suspicious he may have seen just before the murder. He told me that a program named WKSTAHLP had asked him for something immediately prior to our witness finding the corpse. I was finally getting somewhere. My first fragment of a lead. I wrote this information down and placed Mr. Wkstahlp at the top of my suspect list. I assumed I would have to do some extensive background checking of this individual before the investigation was completed. I was wrong.

Trying to locate my first, and only, suspect so far was not going to be easy. Mr. Wkstahlp was last seen in ring 3 but was now among the missing. I had to find him fast, if for no other reason than to give my audience something to marvel at. I decided to try to locate my suspect by following his trail through the system. This is called a "surveillance" in the PI business. To perform a surveillance an investigator or "operative" will follow an individual around hoping to gather some useful, or even incriminating, bit of evidence that can later be used at the trial. In the Intel world, one surveillance technique is to examine the "stack" and follow the suspect in reverse. That is, we cannot follow where he is going, but we can follow where he came from. This was the technique I would use. All I had to do now was follow his trail.

While looking for the stack, I was reminded that there was no ring 3 context. Things that should have been there were not. It was time for a reality check. There was a way to tell where the application made the request of the Kernel. In order to keep track of the many individuals moving in and out of the operating system, the Kernel will log their activity in a stack it keeps for each of them. I would simply ask the kernel to show me these log entries, which should give me a clue as to what Mr. Wkstahlp was up to.

I wasn't going crazy after all. The kernel showed me that there was a request made by our chief suspect. It was now up to me to find out what that request was. I was determined to find out why there was no ring 3. I needed to find a door. A door that, when opened, would let me into Mr. Wkstahlp's life. A door that would show me all about him, including his most intimate secrets. That door came in the form of the "Local Descriptor Table Register" or LDTR. The LDTR can be thought of as the main entrance to a large building. While looking for the ring 3 context, I was looking at a building whose door had been sealed over. This was a very unusual phenomenon indeed. I discovered that the LDTR had been changed to zero, which effectively sealed the ring 3 building like a tomb.

Everyone knows that private investigators cannot break any laws while on a case and the good ones never do (right!). But in the world of Intel, we can do pretty much anything we want to. So I got out my "skeleton key" and attempted to locate the door. Seldom do things work out the way you think they should, so it was with much apprehension that I tried my key on what I thought was the door. It opened on the first try. I was astounded. Things were beginning to make sense. I was in. I could now see all there was to know about my prime suspect, the program named WKSTAHLP.

I began asking my suspect all the routine questions, the most important of which was "Where were you at the time of the murder?" He told me he had just made a request of the kernel and was waiting until the kernel got around to servicing it. This was the electronic equivalent of ordering a pizza and waiting for the delivery man. It seemed like he was telling the truth. Anyway, all I needed to corroborate his story was to check back with the kernel. I continued questioning him, but I knew I was getting nowhere. Mr. Wkstahlp knew nothing about the crime. In fact, he was not even aware that a crime had been committed.

Checking the kernel's records, I verified that Wkstahlp had been telling the truth. Left with no other suspects, I decided to go back to the scene of the crime. One thing was still gnawing at me though. Why had the door to the ring 3 building been sealed and, more importantly, who sealed it?

Back at the crime scene I examined my "invisible corpse" again, trying to decide if there really was a crime here. Using a technique called the "single step", I decided to let the computer help me to understand why we were all standing around looking at this corpse in the first place. The "single step" or "t" command will let the system execute one instruction and stop. If that instruction is causing a Trap000D, however, it will trap again upon execution and look exactly as it had before. So I took a chance and entered a "t". I was not surprised when the computer actually did execute the very instruction that it said had caused the trap. I was not surprised because I could not see any reason we had trapped in the first place, and now the computer didn't seem to know the reason either (this is why I named this case "...the Invisible Corpse"). We had a crime and a victim one minute and not the next.

Was it possible that I had interfered with things? I mean, in my search for the ring 3 context, had I changed something by using my skeleton key to find, and open, the door to the ring 3 building? I had changed the LDT register back to a meaningful value. Though I could not see the reason, I concluded (erroneously) that this was why the corpse had disappeared. I thought the task at hand was to find out why the LDT register had become zero in the first place. I mean, someone thought there had been a murder. If not, the system would not have stopped and said there was one. So, deciding I had enough for now and that a good night's sleep was in order, I let the system go (the "g" command) and started to pack my gear away. It was then I received another surprise.

The system ran for what seemed like two or three seconds when the debugger displayed a message on the screen indicating the presence of another problem.

"NWD WARNING: a failsafe timeout was detected at cs:eip = 520:54e", was what the debugger told me (the "cs:eip=" part of the message is a clue to the name of the culprit). This meant that a certain kind of Trap0002 had occurred, but no damage had been done and the system recovered. The system continued to run and there were no more messages. I sat there for a while looking dumbly at the screen. Nothing happened.

Technical Note:

There are many different kinds of traps that can occur in OS/2, but the trap0002 is very rare. We are all aware that OS/2 is an "interrupt driven" system. This means that once the system starts following a series of instructions, it will continue to do so until it is interrupted. It's sort of like sitting in your favorite chair watching TV when the phone rings. Someone is trying to get your attention. If you answer the phone, you've been interrupted and must do something before you can get back to your TV show. You can, however, ignore the telephone and let your answering machine take the call. In this case, someone wants your attention, but you'll get around to it after your show ends. Well, the operating system can do the same thing. If it does not want to be bothered, it will let its answering machine take the interrupt (by the way, its answering machine is called the "Programmable Interrupt Controller", or PIC). Then it can listen to its messages later. However, if something wants the system's attention so badly that there will be serious trouble if it doesn't get it right now, then the Trap0002, or NMI, is the method used. The NMI is the equivalent of someone throwing a rock through your window. I mean, unless you're dead, this will get your attention, TV or no TV.

The way the kernel ignores its phone calls is by using the "cli" instruction. This stands for "Clear Interrupt Flag". When this is executed, all maskable interrupts to the system will be held by the PIC (answering machine) until the "Set Interrupt Flag", or "sti", instruction is executed. There is one interrupt that the "cli" will not turn off. This is the "Non Maskable Interrupt", or "NMI". This is the rock through the window. Interrupt number 2. Called Trap0002 by OS/2. The NMI is used to report very serious problems to the system. For example, if a memory parity error is found by the hardware, the NMI is reported to the system. A "Channel Check" can also trigger an NMI. There is, however, a programming logic error that can also cause an NMI. Let's say, for example, that you are a program that thinks your work is very important. So important that once a particular task is started, it cannot be interrupted until the task is completed. You can use the "cli" instruction to ensure that you will not have any interruptions. If, however, you do not allow interrupts to be serviced for too long a period of time, a rock will come flying through one of the kernel's windows in the form of an NMI. The thrower of the rock will be the "Watch Dog Timer". This is the Watch Dog's only role in the system. When this timer "pops", the debugger reports it like this: "NWD WARNING: a failsafe timeout was detected at..."
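For readers who like to see the answering machine and the rock in motion, here is a toy model in Python. It is only a sketch: the class name, the tick counter and the threshold of three ticks are all invented for illustration and do not reflect how the real watchdog hardware measures time. It shows maskable interrupts being held while interrupts are off, and the watchdog NMI getting through anyway.

```python
from collections import deque


class ToyCpu:
    """Toy model of cli/sti, the PIC "answering machine" and a watchdog NMI."""

    def __init__(self, watchdog_limit):
        self.interrupts_enabled = True
        self.held = deque()        # maskable interrupts held while interrupts are off
        self.serviced = []         # everything delivered to the system, in order
        self.masked_ticks = 0      # consecutive ticks spent with interrupts off
        self.watchdog_limit = watchdog_limit  # hypothetical threshold, in ticks

    def cli(self):
        self.interrupts_enabled = False

    def sti(self):
        self.interrupts_enabled = True
        self.masked_ticks = 0
        while self.held:           # play back the "messages" in arrival order
            self.serviced.append(self.held.popleft())

    def raise_irq(self, name):
        if self.interrupts_enabled:
            self.serviced.append(name)
        else:
            self.held.append(name)

    def tick(self, where):
        """Advance one time unit; 'where' stands in for the current cs:eip."""
        if self.interrupts_enabled:
            return
        self.masked_ticks += 1
        if self.masked_ticks > self.watchdog_limit:
            # The NMI is delivered even though interrupts are masked.
            self.serviced.append(f"NMI (watchdog) at {where}")
            self.masked_ticks = 0


def run_demo():
    cpu = ToyCpu(watchdog_limit=3)
    cpu.cli()
    cpu.raise_irq("keyboard")      # held, not delivered
    for _ in range(4):             # a driver spins too long with interrupts off
        cpu.tick("520:54e")
    cpu.sti()                      # the held interrupt is finally delivered
    return cpu.serviced


print(run_demo())
```

In this model the watchdog report appears before the keyboard interrupt, even though the keyboard interrupt was raised first.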

End Technical Note.

The "NWD WARNING" message was telling me that some renegade program had tried to stop the system from servicing interrupts for too long a period of time, but had been caught in the act by the system's faithful watch dog, before any damage could be done. While staring at the screen, I was trying to decide what, if anything, this might have to do with our invisible corpse.

At this point, I had to wait for our criminal to strike again. There was nothing more I could do, since the system was now running. The only thing I was really sure of was that I had a mystery to solve.

Upon arrival the next morning, at the scene of the crime, I discovered that our culprit had indeed been back. Connecting the debug terminal again to the failing system, I saw the same problem I had seen the day before. The Trap000D had occurred again. There was one bit of good news though: I had lost my audience. Having decided that this guy from the States probably didn't know any more about the problem than they did, and having lost interest in all the magic incantations on the screen that look like Egyptian hieroglyphics to most people, they had all disappeared the night before. I was alone. I liked it that way.

Having made all the usual inquiries of the system to be sure I was looking at the "exact" same problem, I told the system to re-execute the trapping instruction. Sure enough, it executed. Just like before, my invisible corpse was back.

There is a technique used by investigators to shed light on the circumstances involved in the commission of a crime. It's called "recreating the scene". When we recreate the scene in the world of Intel, we actually get the opportunity to see the crime as though it were happening for the first time, right before our eyes. Therefore, having decided there was nothing else I could get from my witnesses at the moment, I decided to recreate the scene. This took the form of rebooting the system.

Why reboot you ask? Well one of the not very technical but pretty accurate ways to tell when the system was last booted, is to check the time stamp on the swapper. This only works if the swapper hasn't grown since the boot (I said it wasn't very technical). The time was just after I had left last night. It appeared this failure occurred shortly after a boot up.

Technical Note:

The correct way to tell when the system was stopped by the trap is to check the "Global Info Segment", or InfoSeg, using the debugger.

End Technical Note.

This time, with the debug terminal already connected, I was about to see the crime committed while I watched. I booted the machine. A few minutes after the desktop came up, the terminal displayed: "NWD WARNING: a failsafe timeout was detected at cs:eip = 27e8:16a8". This was immediately followed by our Trap000D in ring 0. The watchdog had caught a different device driver this time. I could tell from the values in the cs:eip part of the message that I was dealing with two different individuals who were each trying to steal off with the system's resources. This was the clue I needed to "crack the case".

I still didn't know why the trap occurred, but first things first. After giving the watchdog a pat on the head, I placed all the programs in a lineup (".lmo" command) and tried to identify who the villains were. I used pieces of torn clothing (cs=27e8 & cs=520) that our watchdog grabbed from the fleeing felons to make a positive ID. It turned out that the two culprits were IBMFE.OS2 and CLOCK01.SYS. Now, catching the clock driver lurking around in the back yard was a little like catching the mayor of the city. I mean, no one is more above suspicion than the clock driver. I decided to focus my attention on IBMFE. I brought him down from the lineup and asked him, point blank, why he was trying to steal the system's resources.

Like many good citizens who make mistakes, he thought he was doing the right thing. He thought that by keeping interrupts off while he waited for an important piece of information to be updated, he was actually helping the system. After all he was responsible for running the network card, and we all know how important that is. I decided to let him off the hook. There was no evidence, so far, that implicated him in our murder (well, almost murder). So I turned my attention back to the victim lying at my feet. Again, I thought there is something I'm missing. Something right in front of me that I can't see. And it had to do with why I could not find the ring 3 context.

Just as before, the LDT register (remember the door to the building) contained zeros. Thinking about how this could be, I remembered that a ring 0 program could actually load a value into the LDTR. Well since Mr. Ibmfe was no longer a suspect, I was faced with the possibility that some other "privileged" program was fooling around with the LDTR. Should I start the process of examining all the device drivers in the system? A task that could take days, or even weeks. No, there had to be something else. Then it hit me. The Trap0002 is a very special character. What if while the system was processing an NMI (remember trap0002 and NMI are the same), another NMI came in? This was what I could not see.

Most traps on the system cause the system to interrupt. In fact the name of the interrupt is also the name of the trap. Trap0000 is an interrupt zero, or int0, trap0001 is an int1, etc. Most traps cause the system to go through a "Trap Gate" in order to report a trap. Like having to go to police headquarters to report a crime. Trap0002 however is different. When an NMI occurs, the system goes through a "Task Gate" to report a trap. This is because NMI's used to be considered catastrophic errors, and nothing in the system could be trusted when one of them occurred. We know this is not true when our good old watchdog finds someone creeping around in the back yard. But, other kinds of Trap0002's truly are system killers. For this reason, when the system reports an NMI, it creates a completely different environment for itself to try to analyze the problem. This is like going in through the back door of police headquarters to speak directly with the detectives. Most of the system's registers are loaded automatically with different values so that the system is completely isolated for the short time it needs to see if the problem is indeed fatal, or if it just has to pull the watchdog off another errant device driver.

O.K. fellow mystery enthusiasts, you have more than enough information to solve the mystery. Is the corpse real or not?

The corpse is real. The trap000d in ring 0 is real and would normally be fatal except for one thing. Except for the location of the trap, nothing else could be done with it. The reason is that, since the system had gone through a task gate when the first NMI occurred, the second NMI, which also tried to go through the gate, was not allowed through. One of the most obscure reasons for a trap000d, or "General Protection Fault", is that the processor will not switch to a TSS ("Task State Segment") that is marked busy, which is the case when the system is already running from that TSS. Remember when I said that when the NMI occurs the system tries to isolate itself from everything else in the system because it does not know whom to trust? This isolation is performed by going through a task gate. When passing through this gate, one of the many registers loaded is the "Task Register", or TR. It's the task register that allows the changing of all the other registers when passing through a task gate, including the LDTR. The reason for the trap000d was that when the second NMI occurred, before the system could dispose of the first, it could not get back through the gate. It was already through it, servicing the first NMI. The hardware would not allow passage through the gate because the TSS was the same as the one that was in use at the time.
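The debugger actually left a fingerprint that supports this reading. The trap line in the log at the end of this article carries the error code 1e91, and the register dump there shows tr=1e90. The short Python sketch below splits the error code using the standard x86 selector error code layout (bit 0 external event, bit 1 IDT, bit 2 table indicator, bits 3 to 15 index). The function name and the dictionary keys are my own, chosen only for illustration.

```python
def decode_selector_error_code(code):
    """Split an x86 selector-format error code into its fields."""
    return {
        "external": bool(code & 0x1),   # bit 0 (EXT): caused by an external event
        "idt": bool(code & 0x2),        # bit 1 (IDT): index refers to the IDT
        "ldt": bool(code & 0x4),        # bit 2 (TI): LDT if set, GDT if clear
        "index": (code >> 3) & 0x1FFF,  # bits 3..15: descriptor index
        "selector": code & 0xFFFC,      # index and TI bits, low two bits cleared
    }


TR_AT_TRAP = 0x1E90  # the tr= value from the register dump in the addendum

fields = decode_selector_error_code(0x1E91)
print(fields)
print("error code names the TSS in TR:", fields["selector"] == TR_AT_TRAP)
```

The decoded fields agree with the debugger's own words, "External, GDT", and the selector that remains is the very TSS the system was already running from.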

You should have also figured out the final part of our mystery. What about ring 3? Why no ring 3 context? Well, when examining the Trap0002 Task State Segment, I discovered that the LDTR was loaded from there with zeros. This was because the isolated part of the system that analyzes NMIs had no need of ring 3. Therefore no LDTR.

A few more things. The reason why the system recovered from the trap000d was because nothing serious really happened. The system was going along happily trying to analyze the first NMI when a rock came crashing through the window (the second NMI). All the debugger did was report the trap000d, and when I told it to continue, it just went back to whatever it was doing. There is another interesting point. When the debug kernel was replaced by the retail kernel, the system did not experience the Trap000D. Apparently, the retail kernel was able to handle the multiple Watchdog NMIs quickly enough to avoid the whole issue. The only way an observer would know something unusual was happening during bootup was if they happened to be moving the mouse around on the desktop. The mouse cursor on the screen would stop moving for two or three seconds, even though the observer was still moving the mouse on the mouse pad.

Feeling very pleased with myself for having solved another case, I began to pack up my laptop, cables and the like when a thought started to gnaw its way into my mind. A thought that I knew I wasn't going to shake for a while.

Was the second suspect, CLOCK01, that was caught by the watchdog, the second, or the third NMI?

I love a mystery.

References

Pentium Processor User's Manual, Volume 1: Pentium Processor Data Book - Intel Corp. Order Number 241428
Pentium Processor User's Manual, Volume 3: Architecture and Programming Manual - Intel Corp. Order Number 241430
The Design of OS/2 by H.M. Deitel and M.S. Kogan

Footnotes

[1] In OS/2 a process can be thought of as a "program". Programs run at Ring 3, the lowest privilege level.
[2] Ring 0 is the highest privilege level. OS/2 itself runs at this level, as well as "trusted code" like device drivers, installable file systems, etc.

Technical Addendum

The following text is an annotated debugger log. It contains the actual output from the debugger. I have enhanced certain areas for readability and assume readers of this addendum have had some experience with the debug kernel.

Trap000D in ring 0. This is the invisible corpse, and was the first thing I saw.


Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
eip=00003d36 esp=00000364 ebp=0000036c iopl=0 rf -- -- nv up di pl nz na po cy
cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
1000:00003d36 1f             pop     ds

At first look, this gives the appearance of a fatal ring 0 trap. At this point, all you fellow mystery enthusiasts are probably thinking of ways to solve the case yourselves. Well you're just going to have to bear with me while the story unfolds.

As I said, this looks like a fatal problem. The next thing I did was an "ln" command in hopes that we would be lucky enough to have some symbols.


 ##ln
 1000:00003d1c os2krnl:DOSCODE:DPRINTF + 1a
 1000:00003d38 h_dputs - 2

So far so good. It appears that we are in the kernel in a routine named DPRINTF. If you are familiar with the debug kernel you might be aware that this is a routine that is responsible for the display of a string of data on the debug terminal, which is the very terminal we are using to examine the victim. Like I said, "I love a good mystery".

Why did we trap? Not because of ss:sp addressability for the "pop" instruction.

 ##dw ss:sp l8
 1e98:00000364  0128 3d45 8adc 0128 7436 03d0 3da5 0002

There must be another reason. I then began to look for motivation.

Where were we in ring 3?


 ##.r
 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000245e ebx=00000000 ecx=00000000 edx=00000a93 esi=00000d11 edi=0000328d
 eip=0000063a esp=00002436 ebp=00002462 iopl=2 -- -- -- nv up ei pl nz na pe nc
 cs=000f ss=0017 ds=0017 es=0017 fs=150b gs=0000  cr2=9b797000  cr3=0026b000
 000f:0000063a                Invalid selector

There is no ring 3 context (this will become more apparent in a short while)...

Who was the last person seen by the operating system before the crime? Or, to put it another way, what was the name of the current ring 3 application?

 ##.p *
  Slot  Pid  Ppid Csid Ord  Sta Pri  pTSD     pPTDA    pTCB     Disp SG Name
 *0043# 0029 0000 0029 0001 run 081e 9b793000 9bbd9558 9bb59220 0c0c 13 wkstahlp

Our witness told us about Mr. Wkstahlp and the request made of the system.

I started my surveillance by examining the stack at ring 3...

##dl 17
Invalid selector: 0000:00000000

Oops... no ring 3 context.

When looking for the ring 3 request, it is customary to examine the code immediately before the instruction pointed to by the ".r" command. However, in this case this is not possible since the system cannot validate the CS selector. This usually means that the code has been swapped out. So why don't we look at the ring 0 stack and see if we can find our return address. Remember, we are looking for a request made by the application which brought us to our current trap (motivation).

To find out where we entered ring 0, look at the Task State Segment, or TSS. OS/2 uses the TSS pointed to by selector x'10' for the processing of application programs. Looking into the TSS at:

##dw 10:0 l8
0010:00000000  0000 0000 53f4 0000 0030 0000 0000 0000

We see that at offset +4 the address 0000 0030 0000 53f4 is present: the dword at offset +4 is 000053f4 and the word at offset +8 is 0030, which together give the address 30:53f4. This is the bottom of the ring 0 stack that was, or will be, used by the current ring 3 context. By displaying this stack, we can determine where our application made its request of the operating system.
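Reading an address out of a word dump by eye is easy to get wrong, so here is a small Python sketch of the same arithmetic. It assumes the 32-bit TSS layout described in the Intel manuals (ring 0 stack pointer as a dword at offset +4, ring 0 stack selector as a word at offset +8). The helper names are my own and are not debugger commands.

```python
def parse_dw_line(line):
    """Parse one line of 'dw' output into (selector, offset, list of 16-bit words)."""
    address, *words = line.split()
    selector, offset = (int(part, 16) for part in address.split(":"))
    return selector, offset, [int(word, 16) for word in words]


def ring0_stack_from_tss(words):
    """Return (SS0, ESP0) from the first eight words of a 32-bit TSS."""
    esp0 = words[2] | (words[3] << 16)  # dword at offset +4, low word first
    ss0 = words[4]                      # word at offset +8
    return ss0, esp0


def ring0_stack_address(line):
    _, _, words = parse_dw_line(line)
    ss0, esp0 = ring0_stack_from_tss(words)
    return f"{ss0:x}:{esp0:x}"


print(ring0_stack_address(
    "0010:00000000  0000 0000 53f4 0000 0030 0000 0000 0000"))  # prints 30:53f4
```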


 ##dw 30:53f4-80
 0030:00005374  0000 0000 0362 0488 0000 0000 0000 0003
 0030:00005384  328d 0000 0d11 0000 2462 0000 53a4 0000
 0030:00005394  0000 0000 0a93 0000 0000 0000 245e 0000
 0030:000053a4  0000 0000 150b 0000 0017 0000 0017 0000
 0030:000053b4  5a9b 0024 0000 76a2 2206 0000 063a 0000  <== Notice the
 0030:000053c4  000f 0000 0000 0000 0003 ffff 0362 0488  <== return
                                                             address
                                                             000f:063a
 0030:000053d4  8053 245e 0017 0000 0000 0003 2460 0017
 0030:000053e4  0000 0000 0003 0017 2436 0000 0017 0000

Once we see our ring 3 return address on the ring 0 stack, we know that our application made a request to the operating system.
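The position of that return address is not an accident. The sketch below is a worked check in Python, under two assumptions: that the application entered ring 0 with a far call through the 32-bit call gate that appears later in this log (DWC=9), and that the processor built the usual inter-privilege call frame (caller's SS, caller's ESP, the copied parameter dwords, then CS and EIP). The function names are mine. Only the last four dump lines are needed.

```python
DUMP = """
0030:000053b4  5a9b 0024 0000 76a2 2206 0000 063a 0000
0030:000053c4  000f 0000 0000 0000 0003 ffff 0362 0488
0030:000053d4  8053 245e 0017 0000 0000 0003 2460 0017
0030:000053e4  0000 0000 0003 0017 2436 0000 0017 0000
"""


def load_words(dump):
    """Map offset -> 16-bit word for every word in a 'dw' style dump."""
    memory = {}
    for line in dump.strip().splitlines():
        address, *words = line.split()
        offset = int(address.split(":")[1], 16)
        for i, word in enumerate(words):
            memory[offset + 2 * i] = int(word, 16)
    return memory


def dword(memory, offset):
    """Read a little-endian dword from two adjacent words."""
    return memory[offset] | (memory[offset + 2] << 16)


def caller_frame(memory, stack_top, param_dwords):
    """Walk down from the top of the ring 0 stack to the caller's CS:EIP."""
    ss = dword(memory, stack_top - 4) & 0xFFFF
    esp = dword(memory, stack_top - 8)
    cs_offset = stack_top - 8 - 4 * param_dwords - 4  # skip the copied parameters
    cs = dword(memory, cs_offset) & 0xFFFF
    eip = dword(memory, cs_offset - 4)
    return {"cs": cs, "eip": eip, "ss": ss, "esp": esp}


frame = caller_frame(load_words(DUMP), stack_top=0x53F4, param_dwords=9)
print({name: f"{value:04x}" for name, value in frame.items()})
```

Under those assumptions the frame comes out as cs=000f, eip=063a, ss=0017 and esp=2436, the same values the ".r" command reported for ring 3.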

Checking on things that I knew should be there, but were not, I checked the selector in the LDT that points to the LDT itself.

##dl 7
Invalid selector: 0000:00000000

It was not there.

Is there an LDT at all? Well let's look in the "Global Descriptor Table" and find out.

##dl 28
GDT
0028  LDT     Bas=8c167000 Lim=0000ffff DPL=0 P

At this point I decided to do something I should have done earlier. I hadn't thought of it before because the absence of an LDT is not something that you will encounter very often. I used the "regterse" command to show me all the system's registers.


 ##y regterse

 ##r
 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d36 esp=00000364 ebp=0000036c iopl=0 rf -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 gdtr=9c7e5000 4fff  idtr=ffe1c1f0 03ff  tr=1e90 ldtr=0000 cr0=pg et ts -- mp pm
 dr0=00000000 --e1-  dr1=00000000 --e1-  dr2=00000000 --e1-  dr3=00000000 --e1-
 tr6=00000 v=0 d=00 u=00 w=00 c=w  tr7=00000 ht=0 rep=0  dr6=-- -- --  dr7=-- --
 1000:00003d36 1f             pop     ds

There it was, the reason I couldn't see ring 3. The LDTR was zero. The door was sealed shut.

By using selector 0028 as my skeleton key, I opened the door. In OS/2, selector 28 is almost always used as the selector for the LDT.
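Why a zero LDTR hides ring 3 becomes clearer when the selectors in this log are pulled apart. A selector is an index, a table indicator bit (bit 2) and a requested privilege level (bits 0 and 1). The Python sketch below decodes the ones we have met so far. The function name is my own.

```python
def decode_selector(selector):
    """Split a segment selector into index, table and requested privilege level."""
    return {
        "index": selector >> 3,
        "table": "LDT" if selector & 0x4 else "GDT",
        "rpl": selector & 0x3,
    }


for sel in (0x0028, 0x000F, 0x0017, 0x0007):
    info = decode_selector(sel)
    print(f"{sel:04x}: {info['table']} index {info['index']}, RPL {info['rpl']}")
```

Selector 0028 lives in the GDT, so it can always be found. The application's selectors 000f, 0017 and 0007 all have the table indicator set, so they can only be resolved through whatever LDT the LDTR names, and with the LDTR at zero there is no LDT to look in.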


 ##rldtr 0028

 ##r
 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d36 esp=00000364 ebp=0000036c iopl=0 rf -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 gdtr=9c7e5000 4fff  idtr=ffe1c1f0 03ff  tr=1e90 ldtr=0028 cr0=pg et ts -- mp pm
 dr0=00000000 --e1-  dr1=00000000 --e1-  dr2=00000000 --e1-  dr3=00000000 --e1-
 tr6=00000 v=0 d=00 u=00 w=00 c=w  tr7=00000 ht=0 rep=0  dr6=-- -- --  dr7=-- --
 1000:00003d36 1f             pop     ds

 ##dl
 0007  Data    Bas=8c167000 Lim=0000ffff DPL=3 P  RO
 000f  Code    Bas=00010000 Lim=00001097 DPL=3 P  RE    A
 0017  Data    Bas=00020000 Lim=0000328f DPL=3 P  RW    A
 001f  Data    Bas=00030000 Lim=00000d33 DPL=3 P  RW    A
 0027  Data    Bas=00040000 Lim=00000fff DPL=3 P  RW
 002e  Data    Bas=00050000 Lim=00000fff DPL=2 P  RW    A
 0036  Data    Bas=00060000 Lim=00000fff DPL=2 P  RW
 003e  Data    Bas=00070000 Lim=00000fff DPL=2 P  RW
 0046  Data    Bas=00080000 Lim=00000fff DPL=2 P  RW
    ETC...
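
Why did a zero LDTR hide all of ring 3 while selector 0028 still worked? A minimal sketch, assuming the standard x86 selector layout (bits 0-1 are the requested privilege level, bit 2 picks the table, the remaining bits are the descriptor index). The function name decode_selector is my own and is not a debugger command.

```python
def decode_selector(selector):
    """Split a 16-bit x86 segment selector into its three fields."""
    return {
        "index": selector >> 3,                       # descriptor slot in the table
        "table": "LDT" if selector & 0x4 else "GDT",  # TI bit (bit 2)
        "rpl": selector & 0x3,                        # requested privilege level
    }

# Selectors taken from the listings above.
for sel in (0x0028, 0x000F, 0x0017, 0x002E):
    f = decode_selector(sel)
    print(f"{sel:04x}: index={f['index']:#x} table={f['table']} rpl={f['rpl']}")
```

Selector 0028 lives in the GDT, so it can be resolved without an LDT. Selectors such as 000f and 0017 have the table bit set, so they can only be resolved once the LDTR points somewhere.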

With the entire ring 3 context now visible, I proceeded to question the suspect.

 ##dl 0f       <== the code segment is now visible.
 000f  Code    Bas=00010000 Lim=00001097 DPL=3 P  RE    A

Interrogating WKSTAHLP, I discovered that all he did was a DOSFSCTL, and he was waiting for a reply.


 ##.r
 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000245e ebx=00000000 ecx=00000000 edx=00000a93 esi=00000d11 edi=0000328d
 eip=0000063a esp=00002436 ebp=00002462 iopl=2 -- -- -- nv up ei pl nz na pe nc
 cs=000f ss=0017 ds=0017 es=0017 fs=150b gs=0000  cr2=9b797000  cr3=0026b000
 000f:0000063a 8946fa         mov     word ptr [bp-06],ax           ss:245c=0c70

 ##u F:63a-10 63a
 000f:0000062a 687e00        push    007e
 000f:0000062d 6aff          push    -01
 000f:0000062f 6a03          push    +03
 000f:00000631 6a00          push    +00
 000f:00000633 6a00          push    +00
 000f:00000635 9a0000131b    call    1b13:0000    <== ordering a pizza..
 000f:0000063a 8946fa        mov     word ptr [bp-06],ax

 ##dl 1b13
 GDT
 1b13  CallG32 Sel:Off=0158:00005a96     DPL=3 P  DWC=9

 ##ln %158:5a96
 %fff37a96 os2krnl:DOSHIGH3CODE:DOSFSCTL

No leads here. I had to cross him off my suspect list.

Back at the crime scene.


 ##r
 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d36 esp=00000364 ebp=0000036c iopl=0 rf -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 1000:00003d36 1f             pop     ds

Single step.


 ##t
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d37 esp=00000366 ebp=0000036c iopl=0 -- -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0128 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 1000:00003d37 c3             ret

The trap is gone.

Giving up for now, I entered the Go command and this is what I found:

 ##g
 NWD WARNING: a failsafe timeout was detected at cs:eip = 520:54e

As you can see, this is a warning. It indicates a potential problem. Oh, I forgot to tell you, this is a "Watchdog" timer pop, one of the causes of the Trap0002 or, if you prefer, the dreaded "Non Maskable Interrupt" or NMI.

Upon arrival, the next morning, I saw the same trap again.


 ##r
 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d36 esp=00000364 ebp=0000036c iopl=0 rf -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 1000:00003d36 1f             pop     ds

With the debug terminal already connected, I rebooted the system. Shortly after the painting of the desktop the debugger displayed:


 NWD WARNING: a failsafe timeout was detected at cs:eip = 27e8:16a8
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d36 esp=00000364 ebp=0000036c iopl=0 -- -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 1000:00003d36 1f             pop     ds

This was our trap again, but this time a message from the Watchdog timer appeared immediately before it, with a different cs:eip value. This was a different driver than the one I saw before.

The "line up" is in the form of the ".lmo" command. This command tells you about every program in the system, device drivers and DLLs included. This is just an excerpt so you can see how to find the program you are looking for.


 ##.lmo
 hmte=0c7d pmte=%fdf5a558 mflags=00903150 c:\ibmlan\services\wkstahlp.exe
 obj   vsize    vbase    flags   ipagemap cpagemap hob  sel
 0001 00001098 00010000 80001075 00000001 00000002 0c7c 000f r-x disc shr prel alias
 0002 00002480 00020000 80001053 00000003 00000001 0000 0017 rw- disc prel alias
 hmte=0c54 pmte=%fdef2910 mflags=00903150 c:\ibmlan\services\wksta.exe
 obj   vsize    vbase    flags   ipagemap cpagemap hob  sel
 0001 00000684 00010000 80001035 00000001 00000001 0c53 000f r-x disc shr alias
 0002 0000054e 00020000 80001035 00000002 00000001 0c55 0017 r-x disc shr alias
 0003 00002910 00030000 80001075 00000003 00000003 0c56 001f r-x disc shr prel alias
     .
     .
 hmte=0390 pmte=%fdee1908 mflags=0808f1c9 c:\ibmcom\macs\ibmfe.os2
 seg  sect psiz vsiz hob  sel  flags
 0001 0001 1f02 ff1d 0000 27e0 8d41 data prel rel
 0002 0011 340a 3410 0000 27e8 ad60 code shr prel rel
     .
     .
 hmte=009a pmte=%fde43fa4 mflags=0008e1c9 c:\clock01.sys
 seg    sect   psiz   vsiz  hob   sel     flags
 0001 0084 005c 0098 0000 0518 8c49 data iter prel
 0002 00b2 0914 0914 0000 0520 8d60 code shr prel rel
 0003 0549 03fb 03fb 0000 0528 8d60 code shr prel rel

After obtaining the list of programs, you then use the "cs" values to search through the list. This will show that "520" belongs to Clock01.sys, and "27e8" belongs to Ibmfe.os2.
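
The search just described is a table lookup. A minimal sketch, assuming the two driver entries from the excerpt above have been transcribed by hand; the MODULES table and the owner_of helper are hypothetical names of my own, and a real session would have to cover the full .lmo output.

```python
# Selector-to-module table transcribed from the .lmo excerpt above
# (only the two device drivers of interest are listed here).
MODULES = {
    r"c:\ibmcom\macs\ibmfe.os2": [0x27E0, 0x27E8],
    r"c:\clock01.sys": [0x0518, 0x0520, 0x0528],
}

def owner_of(cs):
    """Return the module whose segment list contains selector cs, or None."""
    for name, selectors in MODULES.items():
        if cs in selectors:
            return name
    return None

# The two cs values reported by the watchdog warnings.
for cs in (0x520, 0x27E8):
    print(f"cs={cs:04x} -> {owner_of(cs)}")
```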


 ##di 2
 0002  TaskG   Sel:Off=1e90:00000000     DPL=0 P

 ##dw 1e90:0
 1e90:00000000  0010 0000 0000 0000 0000 0000 0000 0000
 1e90:00000010  0000 0000 0000 0000 0000 0000 b000 0026
 1e90:00000020  376c fff5 0000 0000 0000 0000 0000 0000
 1e90:00000030  0000 0000 0000 0000 0400 0000 0000 0000
 1e90:00000040  0000 0000 0000 0000 0168 0000 0170 0000
 1e90:00000050  1e98 0000 0168 0000 0000 0000 0000 0000
 1e90:00000060  0000 0000 0000 ffff
 Past end of segment: 1e90:00000068

Formatted with the "DT" command, the TSS looks like this:


 ##dt 1e90:0
 eax=00000010 ebx=9b765f90 ecx=00000000 edx=00000000 esi=ffe1b6e7 edi=00004d68
 eip=fff537f6 esp=00000400 ebp=00000000 iopl=0 -- -- -- nv up di pl nz na po nc
 cs=0170 ss=1e98 ds=0168 es=0168 fs=0000 gs=0030  cr3=0026b000
 ss0=0000  esp0=00000000  ss1=0000  esp1=00000000  ss2=0000  esp2=00000000
 ldtr=0000  link=0010  tflags=0000  i/o map=ffff
 ports trapped: 0-ffff

Notice, the LDTR is zero.
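
For those who want to find the LDTR in the raw dump themselves, here is a minimal sketch that pulls a few fields out of the "dw" words by offset. It assumes the standard 32-bit TSS layout (link at 00, CR3 at 1c, CS at 4c, SS at 50, LDTR at 60, I/O map base at 66). The helper names word and dword are mine, and only fields that the "dt" display also reports are extracted.

```python
# 16-bit words copied from the "dw 1e90:0" dump above, in address order.
DUMP = """
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 b000 0026
376c fff5 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0400 0000 0000 0000
0000 0000 0000 0000 0168 0000 0170 0000
1e98 0000 0168 0000 0000 0000 0000 0000
0000 0000 0000 ffff
"""
WORDS = [int(w, 16) for w in DUMP.split()]

def word(offset):
    """16-bit value at a byte offset into the TSS."""
    return WORDS[offset // 2]

def dword(offset):
    """32-bit little-endian value built from two consecutive words."""
    return word(offset) | (word(offset + 2) << 16)

tss = {
    "link": word(0x00),   # back link to the previous task
    "cr3": dword(0x1C),   # page directory base
    "cs": word(0x4C),
    "ss": word(0x50),
    "ldtr": word(0x60),   # the field of interest
    "iomap": word(0x66),  # I/O permission map base
}
for name, value in tss.items():
    print(f"{name}={value:04x}")
```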

One thing not mentioned in the narrative is the "error code". The chip saves an error code when it raises certain exceptions, the General Protection Fault among them. The debugger displays the error code in the first line of the trap display. If we look at the first trap message:


 Trap 13 (0DH) - General Protection Fault 1e91, External, GDT
 eax=0000000a ebx=ffe21ac2 ecx=00000000 edx=ffe2196c esi=ffe48adc edi=ffe48b1e
 eip=00003d36 esp=00000364 ebp=0000036c iopl=0 rf -- -- nv up di pl nz na po cy
 cs=1000 ss=1e98 ds=0400 es=0168 fs=0000 gs=0000  cr2=9b797000  cr3=0026b000
 1000:00003d36 1f             pop     ds

The "error code" is the value immediately following the word "Fault" in the Trap message; here it is "1e91". If we think of 1e91 as a selector (change the last two bits to 0), we get selector "1e90". This selector is referenced in the Trap0002 Task Gate as the selector for the TSS (as we have seen before). The system was giving me a clue all along.

 ##di 2
 0002  TaskG   Sel:Off=1e90:00000000     DPL=0 P
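
The words "External, GDT" in the trap message come from the same error code. A minimal sketch, assuming the selector error code layout documented by Intel (bit 0 is the external-event flag, bit 1 means the index refers to the IDT, bit 2 selects LDT over GDT, and the remaining bits form the selector). The function name decode_error_code is my own. It masks off all three flag bits, which for 1e91 gives the same 1e90 as clearing the last two.

```python
def decode_error_code(code):
    """Break a selector-style exception error code into its parts."""
    if code & 0x2:
        table = "IDT"
    elif code & 0x4:
        table = "LDT"
    else:
        table = "GDT"
    return {
        "external": bool(code & 0x1),  # fault raised by an external event
        "table": table,
        "selector": code & 0xFFF8,     # flag bits cleared
    }

info = decode_error_code(0x1E91)
print(f"selector={info['selector']:04x} table={info['table']} external={info['external']}")
```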

One of the many causes of Trap000D, as described in the "Intel486 Microprocessor Family Programmer's Reference Manual", is "Switching to a busy task". The Trap000D I've called the Invisible Corpse is this type of Trap000D.

 ##dl 1e98
 GDT
 1e98  TSS32   Bas=ffdfba8e Lim=00000067 DPL=0 P  NB

The Reasons for an NMI

The hardware (when it's working properly) will place values in ports 0461, 0061 and 0092 when an NMI occurs.

As of this writing, the kernel does not save the values from the ports that must be interrogated after an NMI occurs. The kernel will tell you if it had a Watchdog timer pop, as we have seen in the accompanying narrative as well as in this addendum, but it does not save the port values. Therefore, if you are lucky enough to be working on an NMI problem, you will want to set a breakpoint somewhere after the new Trap0002 TSS is loaded and before the ports are read. You can find the Trap0002 handler by doing the following:

(These are debugger commands by the way.)

 di 2                 <== this shows you the trap0002 task gate.
 dt <value>           <== "value" is the "cs" value in the task gate.

Look for the formatted EIP value. You can then set a breakpoint at this value so that you may interrogate the ports yourself.

There are four reasons for an NMI (there are a few others but I will not cover them here). Far and away the main cause of NMIs is the watchdog timer. The Watchdog is hooked to IRQ0, which is the system timer. Every time IRQ0 is asserted, the WD timer is reset to the value in port 71. This value counts down, and when it reaches zero it causes an NMI. The reason for this is that it seems prudent to warn the user that some piece of trusted code is holding all system resources for so long that it is impairing the ability of the rest of the system to function. To determine if this is your problem, you must check the bit settings in port 461, if the machine is a 3xx pc.

Ok, so you've read this far. I assume you want to know the other three reasons for NMIs. Here they are:

Reason Number Two - Memory parity error. Check port x'0061' for the 7 bit to be on. This is the high order bit; it would look like an x'80' if no other bits were on.

Reason Number Three - Channel check. The "6" bit will be on if there has been a channel check. That is, you will see an x'70', x'40' or x'60' when you use the "in" debugger command.

Reason Number Four - EISA bus timeout. This is also reported in port 0461. These bits can be machine specific. Check the hardware manual for the system you are dealing with.
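
The two port 0061 checks above are simple bit tests. A minimal sketch, assuming you have already read the port with the "in" debugger command and have the byte in hand; the function name nmi_reasons_from_port_61 is hypothetical. Port 0461 is left out on purpose because its bits are machine specific.

```python
PARITY_ERROR = 0x80   # the "7" bit of port 0061
CHANNEL_CHECK = 0x40  # the "6" bit of port 0061

def nmi_reasons_from_port_61(value):
    """List the NMI causes flagged in a byte read from port 0061."""
    reasons = []
    if value & PARITY_ERROR:
        reasons.append("memory parity error")
    if value & CHANNEL_CHECK:
        reasons.append("channel check")
    return reasons

# The sample values mentioned in the text.
for value in (0x80, 0x70, 0x40, 0x60):
    print(f"x'{value:02x}' -> {nmi_reasons_from_port_61(value)}")
```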

If you think you have an NMI problem, give me a call and I will be more than happy to discuss it with you. I can be contacted via E-Mail at: DSposato1@aol.com

Remember, I love a mystery..
