EDM/2

Inside the OS/2 Kernel

Written by David C. Zimmerli

  [Note: Due to the nature of the material in this article, the width of the preformatted text likely exceeds normal browser dimensions. I apologize for this, but for this particular article, it was unavoidable, given the current capabilities of HTML. Ed.]

I. Introduction

In this article, I aim to take OS/2 users and developers on a figurative "journey to the center of the earth"-- an expedition into the little-seen but fundamental workings of the system kernel.

The main tool used in researching this article was the Kernel Debugger (KDB), helpfully provided by IBM Corporation, and the accompanying 4-volume "OS/2 Debugging Handbook". My previous article, "Adventures in Kernel Debugging" (EDM/2 Nov. 1996, http://www.edm2.com/0410/kdb.html), gave an introduction to the setup and use of KDB. To get the most out of the present article, you should have an understanding of that material as well as a working knowledge of the Intel 80x86 architecture and memory management features.

For a detailed look at the kernel, it is of course necessary to settle on a specific version of OS/2, since the kernel code has been modified and upgraded along with the other system components in the various versions and fixpack levels. I have chosen OS/2 2.1 for Windows, build 6.514, as the least common denominator of the systems that are likely to be still running. Since most users will be running newer versions, the memory locations, file sizes and so forth will be different from those shown here. But the major data structures and code components have not changed appreciably.

II. Components of the kernel

The main bulk of the kernel code is in the file OS2KRNL, which on the example system is about 730K (retail) or 1Mb (debug version). The supporting cast consists of several smaller files: OS2LDR (28K retail, 37K debug), and the base device drivers SCREEN01.SYS, PRINT01.SYS, KBD01.SYS, and CLOCK01.SYS (3-30K each). Since the source code for the base device drivers is provided on the IBM DDK Developer Connection CD-Roms, we will not delve into them here.

The system DLLs DOSCALL1.DLL, KBDCALLS.DLL, QUECALLS.DLL, SESMGR.DLL, OS2CHAR.DLL and so forth belong neither to the Presentation Manager nor, strictly speaking, to the kernel. Much of this "code" is simply forwarder entries to the Dos* kernel APIs. The SESMGR code does deserve some comment, but I will postpone it to a future article in order to keep this one to a manageable length.

The file OS2KRNL is a standard "LX" format 32-bit segmented-executable file, and as such can be examined with an EXE-file formatting utility. But it is easier and more informative to use the KDB "lg" command, which reads the symbol files supplied by IBM. This reveals the following segments:

  
  ##lg os2krnl
  os2krnl:
  0400:00000000 DOSGROUP
  1100:00000000 DOSCODE
  0120:00000000 DBGCODE
  0128:00000000 DBGDATA
  0030:00000000 TASKAREA
  0138:00000000 DOSGDTDATA
  0140:00000000 DOSINITDATA
  0148:00000000 DOSINITR3CODE
  %00110000 DOSMVDMINSTDATA
  %00120000 DOSSWAPINSTDATA
  %ffeff000 DGROUP
  0150:00000000 DOSHIGH2CODE
  0158:00000000 DOSHIGH3CODE
  0160:00000000 DOSHIGH4CODE
  %fff3f000 DOSHIGH32CODE
  ##
Of these segments, the 16-bit code segments are: DOSINITR3CODE, DOSCODE, DOSHIGH2CODE, DOSHIGH3CODE, DOSHIGH4CODE, and DBGCODE. The 16-bit data segments are DOSGROUP, DOSINITDATA and DBGDATA. (The DBGCODE and DBGDATA segments exist only to support KDB and are not found in the retail kernel.) The 32-bit code segment is DOSHIGH32CODE and the 32-bit data segment is DGROUP.

The "DOS" segments pertain, naturally, not to DOS but to the Control Program, the protected-mode successor to DOS. The "HIGH" segments are those which will be loaded high, that is, in physical memory above the 1Mb line.

Four segments require special comment. TASKAREA is an expand-down data segment which simply maps the current PTDA (Per-Task Data Area), as detailed in the Debugging Handbook. DOSGDTDATA maps the system GDT, which contains call gates for kernel API calls, and entries for segments used by some of the 16-bit kernel code. The symbol file has names for most of these segments, which give clues as to their functions (type "ls dosgdtdata" at the KDB prompt). DOSMVDMINSTDATA and DOSSWAPINSTDATA, as their names imply, support Multiple VDMs (Virtual DOS Machines), discussed in section VII.
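The selector values in the listing, and in the register dumps later in this article, can be unpacked with the standard x86 selector layout: descriptor index in bits 15-3, table indicator in bit 2, requested privilege level in bits 1-0. The following is a minimal sketch; the function name decode_selector is my own and is not part of KDB or the kernel.

```python
def decode_selector(sel):
    """Split a 16-bit x86 protected-mode selector into its three fields."""
    return {
        "index": sel >> 3,                       # slot in the descriptor table
        "table": "LDT" if sel & 0x4 else "GDT",  # TI bit
        "rpl": sel & 0x3,                        # requested privilege level
    }

# Selectors taken from the "lg os2krnl" listing above.
for name, sel in [("TASKAREA", 0x0030), ("DOSGDTDATA", 0x0138), ("DOSHIGH2CODE", 0x0150)]:
    fields = decode_selector(sel)
    print(f"{name:14s} {sel:04x} -> {fields['table']} index {fields['index']:#x}, RPL {fields['rpl']}")
```

The same function applied to the cs=005b value seen in the application register dumps gives a GDT selector with RPL 3, as one would expect for user-mode code.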

The following diagrams give an idea of the relative sizes of these segments. (The 16- and 32-bit segments are not drawn to the same scale, however.) The shaded areas represent pieces that are discarded after system initialization is complete.

The 16-bit code segments appear to be hand-coded in assembler, whereas the 32-bit code segment is largely compiled C code. Here is a synopsis of the major source modules making up each code segment.

III. The OS/2 start-up sequence

In this section we will take a close look at the kernel's bootstrap mechanism. This type of analysis is often useful in diagnosing hardware failures or file corruption which can cause OS/2 to fail to boot properly. For some more details on this phase of the system, see also the section "Remote IPL/ Bootable IFS" in the "Installable File Systems" (IFS.INF) file on Hobbes.

As every English schoolchild knows, when an IBM-compatible computer is turned on or rebooted, the appropriate boot sector is read into memory at location 07c0:0. Execution begins in real mode at this address, which is just below the 32K line. (This location was chosen so that the early IBM PCs could theoretically operate, though of course not run OS/2, with as little as 32K of RAM.)

The boot sector begins with a 3-byte JMP instruction, followed by a data structure of drive parameters which is common to both DOS and OS/2. This data structure is documented in a number of books on DOS and PC internals and will not be further discussed here. (See, for example, Ray Duncan, Advanced MS-DOS Programming, 2d. ed. 1988, Microsoft Press, p. 180.) The boot sector loads a small (1K) file called OS2BOOT at 0800:0, and OS2BOOT in turn loads OS2LDR at 2000:0. Of course, this file loading is done with real mode BIOS calls, since no file system is available yet.
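The load addresses mentioned so far are real mode segment:offset pairs, which translate to physical addresses as segment times 16 plus offset. A small sketch of that arithmetic follows; the helper name is mine.

```python
def real_mode_linear(segment, offset):
    """Physical address of segment:offset in real mode (segment * 16 + offset)."""
    return (segment << 4) + offset

# Load addresses quoted in the text above.
for label, seg in [("boot sector", 0x07C0), ("OS2BOOT", 0x0800), ("OS2LDR", 0x2000)]:
    addr = real_mode_linear(seg, 0)
    print(f"{label:12s} {seg:04x}:0 -> {addr:#07x} ({addr // 1024}K)")
```

The boot sector thus sits at the 31K mark, OS2BOOT at exactly 32K, and OS2LDR at 128K.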

The OS2LDR file is one of the most obscure and least documented parts of the kernel. There is no symbol file for this code, nor can one step through it in the Kernel Debugger, since the Kernel Debugger resides in the debug version of OS2KRNL, which has not been loaded yet. However, the debug version of OS2LDR generates a wealth of data on the debug terminal, and by analyzing this output along with code disassemblies, one can get a good idea of this module's activities.

We start with chip tests and basic system initialization-- querying available memory and installed drives, testing the CPU clock speed, and storing available video modes. Along the way, we get some cryptic progress indications on the debug terminal:

  IODel 000a
The I/O delay-- the time to wait between IN and OUT instructions-- has been set to 10 based on the speed of this processor.
  Int12 st 00000000 end 0009f7ff
  Int1588 st 00100000 end 03ffffff
The BIOS reports conventional (real mode) memory ending at 9f7ff hex, just short of the 640K line, and extended memory running from the 1Mb line up to 03ffffff hex, for 64Mb of RAM in all.
  CPUUsable = 00000001
  CPUWeAre = 00000001
The CPU is an 80486 (0 = 386, 1 = 486, etc.)
  SLFrm len a342
The length of the OS2LDR segment, including stack at the end, is 0a342 hex.
  cgvi
We are calling the "get video modes" routine.
  cldr
And now we come to the activity for which the loader is named: loading OS2KRNL, or, if this file is not found, OS2KRNLI, the install kernel.

The loader routine first gives us an inventory of the OS2KRNL segments:

  ob     flags    oi-flags   paddr/sel    glp     laddr/fladdr     msz/vsz       Object name
  01  rw--sfTLaA  00005063  004000/0400  0001  ffe00000/ffe00000  009000/00c5b3  DOSGROUP
  02  r-x-sfTLa-  00001065  011000/1100  000a  ffe0d000/ffe0d000  00c000/00bfb0  DOSCODE
  03  r-x-sf-LaA  00005025  01d000/0120  0016  ffe19000/ffe19000  00b000/00aeea  DBGCODE
  04  rw--sf-LaA  00005023  028000/0128  0021  ffe24000/ffe24000  009000/0085c0  DBGDATA
  05  rw--sN-LaA  0000d0a3  031000/0130  002a  ffe2d000/ffe2d000  010000/010000  stack
  06  rw--sN-LaA  0000d023  041000/0138  003a  ffe3d000/ffe3d000  002000/001e50  DOSGDTDATA
  07  rw--sf-LaA  00005023  043000/0140  003c  ffe3f000/ffe3f000  002000/004b4e  DOSINITDATA
  08  r-x-sf-LaA  00005025  048000/0148  003e  ffe44000/ffe44000  002000/001fe8  DOSINITR3CODE
  09  rw-BPf-h--  00002213  100000/0000  0040  ffefc000/00110000  001000/0001ac  DOSMVDMINSTDATA
  0a  rw-BPf-h--  00002013  101000/0000  0041  ffefd000/00120000  002000/001948  DOSSWAPINSTDATA
  0b  rw-Bsf-h-A  00006033  103000/0000  0043  ffeff000/ffeff000  012000/015326  DGROUP
  0c  r-x-sf-ha-  00001035  119000/0150  0055  fff15000/fff15000  010000/00fdcc  DOSHIGH2CODE
  0d  r-x-sf-ha-  00001035  129000/0158  0065  fff25000/fff25000  00a000/009a08  DOSHIGH3CODE
  0e  r-x-sf-ha-  00001035  133000/0160  006f  fff2f000/fff2f000  010000/00f304  DOSHIGH4CODE
  0f  r-xBsf-h--  00002035  143000/0000  007f  fff3f000/fff3f000  081000/080628  DOSHIGH32CODE
The rightmost column in the above table contains my own annotations, correlating the objects listed with the segments discussed in the previous section.

As an aside, the term "object" here means the equivalent in a 32-bit executable module of a "segment" in a 16-bit module. An object is more complex than a segment, since it consists of pages which can be read in and swapped to disk independently of one another, and this may be why a new term was felt to be necessary. But for most purposes, an object is simply a segment which is not limited to 64k in size. I will use the two terms more or less interchangeably.

At any rate, the table above gives attributes, physical and linear addresses and selectors, and sizes for each segment in OS2KRNL. The actual reading in of the file proceeds by first loading the "high" segments into low memory, and then moving them to their proper places using the BIOS Int 15h/87h "Move extended memory block" function. (Remember, we're still running in real mode here!) On the second pass, the low memory segments are loaded.

Wrapping up, OS2LDR gives us a map of physical memory:

  pa=00000000 sz=00001000 va=00000000 sel=0000 fl=2000 of=00000003 ow=0000  Real mode IVT
  pa=00001000 sz=00002300 va=ffef9000 sel=0100 fl=2014 of=00001004 ow=ff6d  OS2LDR 32-bit int dispatch
  pa=00004000 sz=0000c5b3 va=ffe00000 sel=0400 fl=2144 of=00005063 ow=ffaa  DOSGROUP
  pa=00011000 sz=0000bfb0 va=ffe0d000 sel=1100 fl=2244 of=00001065 ow=ffaa  DOSCODE
  pa=0001d000 sz=0000aeea va=ffe19000 sel=0120 fl=2344 of=00005025 ow=ffaa  DBGCODE
  pa=00028000 sz=000085c0 va=ffe24000 sel=0128 fl=2444 of=00005023 ow=ffaa  DBGDATA
  pa=00031000 sz=00010000 va=ffe2d000 sel=0130 fl=2544 of=0000d0a3 ow=ffaa  stack
  pa=00041000 sz=00001e50 va=ffe3d000 sel=0138 fl=2644 of=0000d023 ow=ffaa  DOSGDTDATA
  pa=00043000 sz=00004b4e va=ffe3f000 sel=0140 fl=2744 of=00005023 ow=ffaa  DOSINITDATA
  pa=00048000 sz=00001fe8 va=ffe44000 sel=0148 fl=2844 of=00005025 ow=ffaa  DOSINITR3CODE
  pa=0004a000 sz=00000ac8 va=00000000 sel=4a00 fl=2001 of=00000000 ow=0000  OS2DUMP
  pa=0004b000 sz=00049000 va=00000000 sel=0000 fl=2002 of=00000000 ow=0000  unused
  pa=00094000 sz=0000a762 va=ffeee000 sel=0000 fl=2054 of=00001003 ow=ffab  OS2LDR (relocated)
  pa=0009f000 sz=00000800 va=00000000 sel=0000 fl=2002 of=00000000 ow=0000  unused
  pa=0009f800 sz=00000800 va=ffeed800 sel=0000 fl=2004 of=00000000 ow=ff37  romdata
  pa=000a0000 sz=00060000 va=00000000 sel=0000 fl=0001 of=00000000 ow=0000  video/BIOS area
  pa=00100000 sz=000001ac va=ffefc000 sel=0000 fl=0944 of=00002213 ow=ffaa  DOSMVDMINSTDATA
  pa=00101000 sz=00001948 va=ffefd000 sel=0000 fl=0a44 of=00002013 ow=ffaa  DOSSWAPINSTDATA
  pa=00103000 sz=00015326 va=ffeff000 sel=0000 fl=0b44 of=00006033 ow=ffaa  DGROUP
  pa=00119000 sz=0000fdcc va=fff15000 sel=0150 fl=0c44 of=00001035 ow=ffaa  DOSHIGH2CODE
  pa=00129000 sz=00009a08 va=fff25000 sel=0158 fl=0d44 of=00001035 ow=ffaa  DOSHIGH3CODE
  pa=00133000 sz=0000f304 va=fff2f000 sel=0160 fl=0e44 of=00001035 ow=ffaa  DOSHIGH4CODE
  pa=00143000 sz=00080628 va=fff3f000 sel=0000 fl=0f44 of=00002035 ow=ffaa  DOSHIGH32CODE
  pa=001c4000 sz=00e3c000 va=00000000 sel=0000 fl=0002 of=00000000 ow=0000  unused
  pa=01000000 sz=00000000 va=00000000 sel=0000 fl=0001 of=00000000 ow=0000  unused
  pa=01000000 sz=03000000 va=00000000 sel=0000 fl=0002 of=00000000 ow=0000  unused
  pa=04000000 sz=00000000 va=00000000 sel=0000 fl=4000 of=00000000 ow=0000  limit of physical memory
Here I have again added annotations in the rightmost column. The second-to-last column (ow=) is the "System Object Id", for which a key can be found in section 4.6 of the Debugging Handbook, Volume IV.
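One pattern worth noticing in this map is that every OS2KRNL object appears to sit at the same distance from its linear address. The sketch below checks this for a sample of the pa/va pairs copied from the listing. This is only an observation about this particular listing, not a documented rule, and the OS2LDR and romdata entries do not follow it.

```python
# (physical address, linear address) pairs copied from the OS2LDR memory map above.
KERNEL_OBJECTS = {
    "DOSGROUP":        (0x00004000, 0xFFE00000),
    "DOSCODE":         (0x00011000, 0xFFE0D000),
    "DOSINITR3CODE":   (0x00048000, 0xFFE44000),
    "DOSMVDMINSTDATA": (0x00100000, 0xFFEFC000),
    "DGROUP":          (0x00103000, 0xFFEFF000),
    "DOSHIGH32CODE":   (0x00143000, 0xFFF3F000),
}

def linear_bias(pa, va):
    """Difference between linear and physical address, as a 32-bit value."""
    return (va - pa) & 0xFFFFFFFF

biases = {name: linear_bias(pa, va) for name, (pa, va) in KERNEL_OBJECTS.items()}
for name, bias in biases.items():
    print(f"{name:16s} va - pa = {bias:08x}")
```

For all six objects the difference comes out as ffdfc000 hex, for the low-memory and the high-memory segments alike.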

We set the 8259 PIC chips so that IRQs 0 through 7 map to interrupts 50h-57h, and IRQs 8 through 0fh map to interrupts 70h-77h:

  rPIC
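The vector assignment just described can be written as a short function. This is a sketch of the mapping only, not of the PIC programming sequence, and the function name is hypothetical.

```python
def irq_to_vector(irq):
    """Interrupt vector for a hardware IRQ under the mapping described above."""
    if not 0 <= irq <= 15:
        raise ValueError("IRQ must be in the range 0..15")
    # First PIC (IRQ 0-7) is based at 50h, second PIC (IRQ 8-15) at 70h.
    return 0x50 + irq if irq < 8 else 0x70 + (irq - 8)

for irq in (0, 1, 8, 14):
    print(f"IRQ {irq:2d} -> int {irq_to_vector(irq):02x}h")
```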
And we jump to syiInitializeOS2 in the DOSCODE segment of OS2KRNL:
  j syi
Now that we are in the kernel proper, we can step through most of the rest of the initialization code with the Kernel Debugger. A special KDB facility enables us to press and hold "R", "P", or <space> at the debug terminal and interrupt the startup code either before the switch to protected mode, after the switch to protected mode but before loader and pager initialization, or after loader and pager initialization, respectively. (Be sure to set the keyboard repeat rate on the debug terminal to a high value or else the keystroke may be missed by the COM port polling routines.)

The bulk of the system init part of DOSCODE consists of logic to parse CONFIG.SYS. There is also a component called the "system init file system" (abbreviated sifs), used to read in CONFIG.SYS as well as various BASEDEVs and other files needed before a full-fledged file system can be set up. Each major piece of the kernel (the loader, pager, scheduler, and so on) has an initialization routine which is called at this point, and we "return" to the syiProcess routine in the DOSINITR3CODE segment.

syiProcess then loads and initializes the regular (non-base) installable device drivers, loads the system DLLs, and starts the shell. This is the first place where we have available the ADD drivers used by the fully initialized system to access the hard disk. Since one of the first routines called by syiProcess is in the inicp.asm module (initialize codepage), it is here that we can get the infamous "Cannot find COUNTRY.SYS" error message, even when there is no problem with the COUNTRY.SYS file, if the base device drivers have not installed properly.

IV. The file access code

I want to take a brief look at the sequence of calls connecting a Dos* file I/O call with the hardware access code. For more details, please consult the "Storage Device Driver Reference" found in the DDK and the IFS.INF file on Hobbes.

In the first scenario, we call DOS32READ with the handle of a file on a FAT partition of a SCSI hard drive. This is an entry point in DOSCALL1 which soon passes through a call gate to the 32-bit kernel and invokes FS32IREAD. For FAT access, FS32IREAD then calls the 16-bit routine h_DOS_Read in DOSHIGH4CODE, which, after ascertaining that the requested data is not in a previously read buffer, formulates a "request list" and sends it to the OS2DASD.DMD device driver. The request list is constructed in the _BufReadExecute and _ReqListExecute routines of DOSHIGH32CODE, and consists of a single request with the extended command code 1Eh for Read.

The request specifies the start block, number of blocks to read, and addresses of buffers to hold the data. OS2DASD.DMD then calls the appropriate *.ADD device driver-- for the example system, AHA152X.ADD-- to access the physical hardware.

The second scenario is the same except that the file we are reading resides on an HPFS partition. In this case, FS32IREAD bypasses the legacy 16-bit FAT routines in DOSHIGH4CODE, and calls instead the FS_READ entry point of HPFS.IFS. The HPFS file system then takes care of buffering data and interfacing to the OS2DASD.DMD module.

The third scenario involves reading a file on a floppy disk. This is the same as the first scenario as far as the kernel is concerned; however, the OS2DASD.DMD code will pass the request down to IBM1FLPY.ADD (or IBM2FLPY.ADD for a Micro Channel machine), rather than to AHA152X.ADD.

Finally, to read data from a SCSI CD-Rom, FS32IREAD calls FS_READ on the CD-Rom file system driver CDFS.IFS. CDFS will again perform buffering services, then send a request list to OS2CDROM.DMD, which will invoke the appropriate BASEDEV-- in the example system, LMS206.ADD for a Philips CD-Rom drive.
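The four scenarios can be summarized as a lookup table keyed by file system and device. The sketch below only restates the call chains given in the text, starting at FS32IREAD; the dictionary layout and the route function are my own devices for illustration, and the driver names at the end of each chain are those of the example system.

```python
# Call chains as described in the four scenarios above, from FS32IREAD downward.
ROUTES = {
    ("FAT", "SCSI hard disk"):  ["FS32IREAD", "h_DOS_Read (DOSHIGH4CODE)", "OS2DASD.DMD", "AHA152X.ADD"],
    ("HPFS", "SCSI hard disk"): ["FS32IREAD", "FS_READ (HPFS.IFS)", "OS2DASD.DMD", "AHA152X.ADD"],
    ("FAT", "floppy"):          ["FS32IREAD", "h_DOS_Read (DOSHIGH4CODE)", "OS2DASD.DMD", "IBM1FLPY.ADD"],
    ("CDFS", "SCSI CD-ROM"):    ["FS32IREAD", "FS_READ (CDFS.IFS)", "OS2CDROM.DMD", "LMS206.ADD"],
}

def route(file_system, device):
    """Return the chain of components a read request passes through."""
    return ROUTES[(file_system, device)]

for key in ROUTES:
    print(f"{key[0]:4s} on {key[1]:14s}: " + " -> ".join(route(*key)))
```

Laid out this way, it is easy to see that the file system decides the second link of the chain, while the device decides the last two.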

V. The context-switching mechanism

A context switch generally takes place at the trailing edge of a kernel API call. Before returning from kernel mode to user mode, the system will call a special routine called KMExitKmodeEvents. This routine examines the global variable Resched, which indicates whether other threads of sufficient priority are ready to run. If Resched is non-zero, the next stop is the _tkSchedNext routine.

_tkSchedNext invokes the scheduler apparatus (_SCHGetNextRunner) to decide which of the ready threads will be the next to receive a timeslice. Some aspects of the scheduler, with its thicket of states and transitions, priority queues, sleep queues, and so forth, are documented in the Debugging Handbook. For now we simply note that _SCHGetNextRunner returns, in the EAX register, a pointer to the TCB of the new, or incoming thread. This pointer then becomes the single argument to the _PGSwitchContext routine.

The _PGSwitchContext code occupies 559 bytes in DOSHIGH32CODE, and it is worth close study. We cannot step through this code in the Kernel Debugger, since the page tables and system structures are in a transitional state which the debugger cannot make sense of. But by examining the disassembly we can understand its operation and gain significant insight into the OS/2 process architecture.

The path we take depends to some extent on whether we are switching to a different process or merely to a different thread within the same process. If it is a process switch, we must rewrite the portion of the page tables corresponding to user memory (typically up to about 256 Mb) to show the new physical addresses. A process switch also requires pointing the LDTR to a new value, since the LDT tiling can be different for different processes.

For either a process or thread switch, we must remap the TASKAREA segment (selector 0030 hex), since this selector addresses the current TCB and TSD as well as the PTDA. We must also update various system global variables: _pPTDACur, _TKSSBase, _TKTCBBias, _pTCBCur, _pTSDCur, and the ring 0 and ring 2 stack pointers.
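The distinction between a process switch and a thread switch can be put in the form of a small decision function. This is a sketch of the work items listed above, not a transcription of _PGSwitchContext; the function name, the (process id, thread id) pairs and the order of the steps are assumptions made for illustration.

```python
def switch_steps(current, incoming):
    """List the work described in the text for a switch between two threads.

    current and incoming are (process id, thread id) pairs.
    """
    steps = []
    if current[0] != incoming[0]:
        # Only needed when the incoming thread belongs to another process.
        steps.append("rewrite the user-memory portion of the page tables")
        steps.append("point LDTR at the new process's LDT")
    # Needed for process and thread switches alike.
    steps.append("remap the TASKAREA segment")
    steps.append("update _pPTDACur, _TKSSBase, _TKTCBBias, _pTCBCur, _pTSDCur and the ring 0/ring 2 stack pointers")
    return steps

print(len(switch_steps((12, 1), (12, 2))), "steps for a thread switch")
print(len(switch_steps((12, 1), (15, 1))), "steps for a process switch")
```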

For some more details on the context switching mechanism, see page 339 of the Debugging Handbook, Volume I.

Of course, it is possible for an application not to make any kernel calls for a long period of time. Perhaps the program is solving a differential equation, doing a complex string search, or otherwise minding its own business without needing to do any I/O or use any kernel services. Will the KMExitKmodeEvents routine then be bypassed, and must all other threads bide their time while waiting for such a program to finish?

The answer, as we would expect, is no, thanks to the 8254 PIT, or Programmable Interval Timer, chip on the PC motherboard. At boot-up, counter 0 of the 8254 chip is set to operate in mode 2 (rate generator mode) to cause an interrupt on IRQ0 approximately 18.2 times per second. Like other IRQs, this one is handled by intIRQRouter in DOSHIGH32CODE, and upon receiving IRQ0, intIRQRouter calls KMExitKmodeEvents, as above. This forces the application to undergo the same scheduling scrutiny that it would if it made a kernel call directly.
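The figure of 18.2 interrupts per second follows from the 8254's nominal input clock of 1,193,182 Hz on PC-compatible machines, if we assume the counter is loaded with the full 16-bit count of 65536. That divisor is my assumption; the text above gives only the resulting rate.

```python
PIT_INPUT_HZ = 1_193_182  # nominal 8254 input clock on PC-compatibles

def pit_rate(divisor):
    """Interrupts per second produced by counter 0 for a given divisor."""
    return PIT_INPUT_HZ / divisor

rate = pit_rate(65536)  # assumed divisor: the maximum 16-bit count
print(f"{rate:.2f} interrupts per second, one every {1000 / rate:.1f} ms")
```

On these assumptions a compute-bound thread is interrupted, and the scheduler consulted, roughly every 55 milliseconds.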

VI. Memory management

When an application calls DosAllocMem, the system creates a "memory object" by reserving a contiguous range of the process's private virtual address space. The system allocates page table entries for the object, and since each page table entry controls 4Kb of memory, the object will actually have a size equal to the requested size rounded up to the next multiple of 4Kb. The object will begin on a 64Kb boundary to allow it to be addressed by an LDT selector, so each call to DosAllocMem consumes at least 64Kb of virtual address space.
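The two granularities at work here, 4Kb for the object's size and 64Kb for its placement, can be illustrated with a little arithmetic. In the sketch below the names are my own, the address_space figure assumes that the next object simply starts at the next 64Kb boundary, and tiled_selector applies the conventional OS/2 LDT tiling rule, which is not derived from anything shown in this article.

```python
PAGE = 0x1000    # 4Kb covered by one page table entry
TILE = 0x10000   # 64Kb alignment of memory objects

def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return (value + unit - 1) // unit * unit

def object_footprint(requested):
    size = round_up(requested, PAGE)
    return {
        "size": size,                               # rounded object size
        "pages": size // PAGE,                      # page table entries needed
        "address_space": round_up(requested, TILE), # assumed address space consumed
    }

def tiled_selector(linear):
    """LDT selector covering the 64Kb tile at a linear address (conventional rule)."""
    return ((linear >> 16) << 3) | 0x7

print(object_footprint(1))
print(object_footprint(0x20000))
print(f"{tiled_selector(0x120000):04x}")
```

A one-byte request therefore still costs one page table entry and a full 64Kb tile of address space, while a 128Kb request costs 32 entries and two tiles.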

However, no physical memory or disk swap space is allocated by the DosAllocMem code. The mechanism used is called "lazy commit": when an attempt is made to read or write to the area of virtual memory in question, a page fault will be generated (trap 0eh), and the handling routines in DOSHIGH32CODE will then allocate physical memory and set the "present" bit in the corresponding page table entry.

A simple experiment shows the "before" and "after" state of the page table resulting from a call to DosAllocMem. Here is a program about to make the call with a request to allocate 00020000h, or 128 Kb:

  eax=0006eb03 ebx=000a0000 ecx=0006eb88 edx=0006ebb0 esi=00000000 edi=00019010
  eip=000120f7 esp=0006eb7c ebp=0006ebd4 iopl=2 rf -- -- nv up ei pl nz na pe nc
  cs=005b ss=0053 ds=0053 es=0053 fs=150b gs=0000  cr2=00093ffe  cr3=001f6000
  005b:000120f7 e8385a011a     call    DOS32ALLOCMEM (1a027b34)
  ##d ss:esp l 20
  0053:0006eb7c 88 eb 06 00 00 00 02 00-13 00 00 00 03 00 00 00 .k..............
  0053:0006eb8c 74 34 01 00 d0 eb 06 00-08 00 00 00 00 00 00 00 t4..Pk..........
  ##
And here are the page table entries for %120000 and %130000:
  ##dp %120000
   linaddr   frame   pteframe  state res Dc Au CD WT Us rW Pn state
  %00120000* 02ec1  frame=02ec1  2    0  D  A        U  W  P  resident
  ##dp %130000
   linaddr   frame   pteframe  state res Dc Au CD WT Us rW Pn state
  %00130000* 02ec1  frame=02ec1  2    0  D  A        U  W  P  resident
  ##
We type "p", and then examine the stack and page tables again:
  ##p
  eax=00000000 ebx=000a0000 ecx=0006eb88 edx=0006ebb0 esi=00000000 edi=00019010
  eip=000120fc esp=0006eb7c ebp=0006ebd4 iopl=2 -- -- -- nv up ei pl nz na pe nc
  cs=005b ss=0053 ds=0053 es=0053 fs=150b gs=0000  cr2=00093ffe  cr3=001f6000
  005b:000120fc 83c40c         add     esp,+0c
  ##d %6eb88 l 20
  %0006eb88 00 00 12 00 74 34 01 00-d0 eb 06 00 08 00 00 00 ....t4..Pk......
  %0006eb98 00 00 00 00 00 00 00 00-00 00 00 00 e0 5d 01 00 ............`]..
  ##dp %120000
   linaddr   frame   pteframe  state res Dc Au CD WT Us rW Pn state
  %00120000* 02ec1  frame=02ec1  2    0  D  A        U  W  P  resident
  %00120000         vp id=01608  0    0  c  u        U  W  n  pageable
  %00121000         vp id=01609  0    0  c  u        U  W  n  pageable
  %00122000         vp id=0160a  0    0  c  u        U  W  n  pageable
  %00123000         vp id=0160b  0    0  c  u        U  W  n  pageable
  %00124000         vp id=0160c  0    0  c  u        U  W  n  pageable
  %00125000         vp id=0160d  0    0  c  u        U  W  n  pageable
  %00126000         vp id=01635  0    0  c  u        U  W  n  pageable
  %00127000         vp id=01636  0    0  c  u        U  W  n  pageable
  %00128000         vp id=01637  0    0  c  u        U  W  n  pageable
  %00129000         vp id=01638  0    0  c  u        U  W  n  pageable
  %0012a000         vp id=01639  0    0  c  u        U  W  n  pageable
  %0012b000         vp id=0163a  0    0  c  u        U  W  n  pageable
  %0012c000         vp id=0163b  0    0  c  u        U  W  n  pageable
  %0012d000         vp id=0163c  0    0  c  u        U  W  n  pageable
  %0012e000         vp id=0163d  0    0  c  u        U  W  n  pageable
  %0012f000         vp id=0163e  0    0  c  u        U  W  n  pageable
  ##dp %130000
   linaddr   frame   pteframe  state res Dc Au CD WT Us rW Pn state
  %00130000* 02ec1  frame=02ec1  2    0  D  A        U  W  P  resident
  %00130000         vp id=0163f  0    0  c  u        U  W  n  pageable
  %00131000         vp id=01640  0    0  c  u        U  W  n  pageable
  %00132000         vp id=01641  0    0  c  u        U  W  n  pageable
  %00133000         vp id=01642  0    0  c  u        U  W  n  pageable
  %00134000         vp id=01643  0    0  c  u        U  W  n  pageable
  %00135000         vp id=01644  0    0  c  u        U  W  n  pageable
  %00136000         vp id=01645  0    0  c  u        U  W  n  pageable
  %00137000         vp id=01646  0    0  c  u        U  W  n  pageable
  %00138000         vp id=01647  0    0  c  u        U  W  n  pageable
  %00139000         vp id=01648  0    0  c  u        U  W  n  pageable
  %0013a000         vp id=01649  0    0  c  u        U  W  n  pageable
  %0013b000         vp id=0164a  0    0  c  u        U  W  n  pageable
  %0013c000         vp id=0164b  0    0  c  u        U  W  n  pageable
  %0013d000         vp id=0164c  0    0  c  u        U  W  n  pageable
  %0013e000         vp id=0164d  0    0  c  u        U  W  n  pageable
  %0013f000         vp id=0164e  0    0  c  u        U  W  n  pageable
  ##
The kernel has allocated 128Kb worth of page table entries beginning at linear address %120000.
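The raw bytes in the two stack dumps can be decoded to confirm this reading. The sketch below unpacks the three arguments sitting at ss:esp before the call, and the doubleword stored at %6eb88 after it. The PAG_* values are the allocation flag bits from the OS/2 toolkit headers; the variable names are mine.

```python
import struct

# First 12 bytes at ss:esp before the call, copied from the dump above.
before = bytes.fromhex("88eb0600" "00000200" "13000000")
ppb, size, flags = struct.unpack("<3I", before)  # three little-endian dwords

# Allocation flag bits as defined in the OS/2 toolkit headers.
PAG_READ, PAG_WRITE, PAG_COMMIT = 0x01, 0x02, 0x10

# Dword at %6eb88 after the call: the base address returned through ppb.
(base,) = struct.unpack("<I", bytes.fromhex("00001200"))

pages = size // 0x1000
last_page = base + size - 0x1000
print(f"ppb={ppb:08x} size={size:08x} flags={flags:02x}")
print(f"base={base:08x} pages={pages} last page={last_page:08x}")
```

The first argument is the address 0006eb88, which is why we dump that location after the call; the object base stored there is 00120000, and the 32 pages run up to %0013f000, matching the last line of the page table display.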

The main worker routine in the kernel which does this is _VMAllocMem, which calls the routines _VMReserve, _PGAlloc, and _SELAlloc.

We may also want to see what happens when the program actually tries to access the memory. The KDB command "vsp e" will intercept page faults before they are processed, and this can be used in conjunction with the "zs" (change default command) facility to collect statistics on the page fault mechanism and its effect on system performance. For tracing purposes, it is easier just to put a breakpoint at the start of the lengthy _PGPageFault routine which handles this exception.

VII. The DOS emulation kernel

The DOS emulation component of the system is not mentioned at all in the Debugging Handbook and tends to be ignored by developers because it exists only for compatibility with older programs. However, it occupies over 25% of the code in OS2KRNL and is worth examining if only as an illustration of the versatility of the x86 architecture.

There are essentially three parts to DOS emulation in OS/2: the MVDM manager, the DOS emulator proper and the x86 emulator. A fourth part, the virtual device drivers necessary to run many DOS programs, exists outside the kernel but makes use of the Virtual DevHelp API calls implemented in the MVDM manager.

To get an idea of the issues involved in tracing a DOS application in the Kernel Debugger, let's look at a simple "Hello, world" program written in assembly. We open a DOS Window, whereupon the kernel gives us a VDM with a copy of the "stub virtual DOS kernel"-- the file C:\OS2\MDOS\DOSKRNL-- loaded into low memory to provide int 21h services. We then start the program HELLO.EXE. Here is the complete disassembly:

  --u ac2:0 l 7
  0ac2:00000000 b8c30a         mov     ax,0ac3
  0ac2:00000003 8ed8           mov     ds,ax
  0ac2:00000005 b409           mov     ah,09        ; output a string at ds:dx
  0ac2:00000007 ba0000         mov     dx,0000
  0ac2:0000000a cd21           int     21
  0ac2:0000000c b44c           mov     ah,4c
  0ac2:0000000e cd21           int     21
  --d ac3:0 l 10
  0ac3:00000000 48 65 6c 6c 6f 2c 20 77-6f 72 6c 64 0d 0a 24 00 Hello, world..$.
Notice that KDB uses a double-dash prompt ("--") instead of the usual "##" to indicate that we are running in V86 mode. We were able to break into this code by patching in a CCh opcode at the entry point of HELLO.EXE, running the program, then editing memory to restore the original opcode.
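The layout of this tiny program can be checked from the dumps. Int 21h function 09h prints the string at ds:dx up to, but not including, a "$" terminator, and the data segment 0ac3 begins exactly one 16-byte paragraph after the code segment 0ac2. The sketch below verifies both points using the bytes shown above; the variable names are mine.

```python
# Bytes at 0ac3:0, copied from the "d ac3:0 l 10" dump above.
dump = bytes.fromhex("48656c6c6f2c20776f726c640d0a2400")

# Function 09h stops at the first '$'.
message = dump[:dump.index(b"$")]

# Real mode (and V86 mode) addresses: segment * 16 + offset.
code_start = 0x0AC2 << 4
data_start = 0x0AC3 << 4
code_length = 0x0E + 2  # last instruction (int 21) is at offset 0eh and is 2 bytes long

print(message)
print(f"code at {code_start:#x}, {code_length} bytes; data at {data_start:#x}")
```

The code occupies 16 bytes, so the string follows it immediately in memory.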

After typing "t" a few times, we arrive at a DOS system call:

  --t
  eax=000009c3 ebx=00000000 ecx=000000ff edx=00000000 esi=00000000 edi=00000100
  eip=0000000a esp=00000100 ebp=0000091c iopl=3 -- vm -- nv up ei pl zr na pe nc
  cs=0ac2 ss=0ac4 ds=0ac3 es=0ab2 fs=0000 gs=0000  cr2=01390000  cr3=001f6000
  0ac2:0000000a cd21           int     21
  --
But we will not be able to "t" into this call. The instruction is about to cause a General Protection Exception (Trap 0D), even though IOPL is 3, apparently because the IDT entry for int 21h is invalid or contains a null pointer:
  --di 21
  0021  TrapG   Sel:Off=0000:00000000     DPL=3 P
So we put a breakpoint at trap0d in the 32-bit kernel, and continue:
  --br e trap0d
  --g
  Debug register hit
  eax=000009c3 ebx=00000000 ecx=000000ff edx=00000000 esi=00000000 edi=00000100
  eip=fff491bc esp=00006708 ebp=0000091c iopl=3 rf -- -- nv up ei pl zr na pe nc
  cs=0170 ss=0030 ds=0000 es=0000 fs=0000 gs=0000  cr2=01390000  cr3=001f6000
  os2krnl:DOSHIGH32CODE:trap0d:
  0170:fff491bc 6a0d           push    +0d              ;br0
  ##
This code will call em86opINTnn to simulate the software interrupt, and we will soon "iretd" back into V86 mode. The call will then be handled by DOSKRNL in low memory.

We will leave for another day the rest of the saga, for DOSKRNL must still make BIOS calls, which will again cause GP exceptions and be routed to VDDs such as VBIOS.SYS and VVGA.SYS. These will cooperate with SESMGR and the PDDs to finally display the greeting on the screen.

Some additional clues about the workings of OS/2 DOS emulation can be found in The Design of OS/2, 1992 Addison-Wesley, by H. M. Deitel and M. S. Kogan, pp. 290-300. These are only clues, however, as the correlation is not exact between the text descriptions offered there and what is observable with KDB.

VIII. The shut-down routines

Since all good things must come to an end, the Control Program API includes the DosShutdown routine. The worker code is at the symbol w_Shutdown in DOSHIGH4CODE.

This function disables the installed file system drivers by overwriting all their entry points with the address of the ShutdownBlock routine in DOSHIGH2CODE. Any thread thereafter attempting to call an FSD will be blocked. A few routines, however, remain intact for use by the shutdown code: FS_COMMIT, FS_DOPAGEIO, FS_FSCTL, FS_FLUSHBUF, and FS_SHUTDOWN. Also, for file system drivers used by the swapper, certain key entry points are first preserved at the locations FS_SDCHGFILEPTR, FS_SDFSINFO, FS_SDREAD, and FS_SDWRITE. This enables the paging routines to continue to perform shutdown chores, while all other threads are locked out.

We then iterate through all installed device drivers, sending "shutdown" (command code 1Ch) request packets to each. Each driver is called twice, with parameters of 0 and 1 for begin shutdown and end shutdown, respectively. Likewise, for each installed file system driver, the function FS_SHUTDOWN is called twice with start and end shutdown flags. In between these calls, the routines shutdown$FlushAllSFTs and h_FSD_FlushBuf stabilize RAM-cached portions of the file systems.
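The technique of disabling a file system driver by patching its entry points can be modelled in a few lines. This is only a sketch of the idea: the dictionary stands in for the driver's entry point table, the stand-in shutdown_block raises an exception where the real ShutdownBlock routine blocks the calling thread, and the preserved swapper entry points are left out.

```python
# Entry points that the text says remain intact during shutdown.
KEPT = {"FS_COMMIT", "FS_DOPAGEIO", "FS_FSCTL", "FS_FLUSHBUF", "FS_SHUTDOWN"}

def shutdown_block(*args):
    """Stand-in for ShutdownBlock; the real routine blocks the caller instead."""
    raise RuntimeError("file system is shutting down")

def disable_fsd(entry_points):
    """Return a copy of the table with all but the KEPT entries redirected."""
    return {
        name: (handler if name in KEPT else shutdown_block)
        for name, handler in entry_points.items()
    }

def fs_read(*args):      # hypothetical handlers for the demonstration
    return "read"

def fs_commit(*args):
    return "commit"

table = disable_fsd({"FS_READ": fs_read, "FS_COMMIT": fs_commit})
print(table["FS_COMMIT"]())                  # still reaches the driver
print(table["FS_READ"] is shutdown_block)    # redirected
```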

IX. Conclusion

At a time when IBM's support for OS/2 seems to grow less enthusiastic every day, it becomes increasingly important for users and developers to understand the internals of the system on their own. This knowledge can help in developing drivers and applications, building independent help desks, and even in coding patches to the system if necessary. With active support efforts from the outside community, OS/2 can and will continue to thrive. I hope that the present article has contributed in some measure to understanding the foundation of this imposing edifice.

  eax=00000000 ebx=7b22002b ecx=80010013 edx=7ba0fc0c esi=7ba093c0 edi=ffffffff
  eip=fff46572 esp=00006688 ebp=00006688 iopl=2 -- -- -- nv up ei pl zr na pe nc
  cs=0170 ss=0030 ds=0168 es=0168 fs=0000 gs=0000  cr2=111f3302  cr3=001f6000
  0170:fff46572 833d7c1ff1ff00 cmp     dword ptr [_ptcbPriQReady (fff11f7c)],+00
                                                             ds:fff11f7c=00000000
  ##
Copyright © 1998, David C. Zimmerli. All rights reserved.