Inside the OS/2 Kernel
Written by David C. Zimmerli
[Note: Due to the nature of the material in this
article, the width of the preformatted text likely exceeds normal browser
dimensions. I apologize for this, but for this particular article, it was
unavoidable, given the current capabilities of HTML. Ed.]
In this article, I aim to take OS/2 users and developers on a figurative "journey to the center of the earth"-- an expedition into the little-seen but fundamental workings of the system kernel.
The main tool used in researching this article was the Kernel Debugger (KDB), helpfully provided by IBM Corporation, and the accompanying 4-volume "OS/2 Debugging Handbook". My previous article, "Adventures in Kernel Debugging" (EDM/2 Nov. 1996, http://www.edm2.com/0410/kdb.html), gave an introduction to the setup and use of KDB. To get the most out of the present article, you should have an understanding of that material as well as a working knowledge of the Intel 80x86 architecture and memory management features.
For a detailed look at the kernel, it is of course necessary to settle on a specific version of OS/2, since the kernel code has been modified and upgraded along with the other system components in the various versions and fixpack levels. I have chosen OS/2 2.1 for Windows, build 6.514, as the least common denominator of the systems that are likely to be still running. Since most users will be running newer versions, the memory locations, file sizes and so forth will be different from those shown here. But the major data structures and code components have not changed appreciably.
II. Components of the kernel
The main bulk of the kernel code is in the file OS2KRNL, which on the example system is about 730K (retail) or 1Mb (debug version). The supporting cast consists of several smaller files: OS2LDR (28K retail, 37K debug), and the base device drivers SCREEN01.SYS, PRINT01.SYS, KBD01.SYS, and CLOCK01.SYS (3-30K each). Since the source code for the base device drivers is provided on the IBM DDK Developer Connection CD-Roms, we will not delve into them here.
The system DLLs DOSCALL1.DLL, KBDCALLS.DLL, QUECALLS.DLL, SESMGR.DLL, OS2CHAR.DLL and so forth are not part of the Presentation Manager but are not strictly part of the kernel either. Much of this "code" is simply forwarder entries to the Dos* kernel APIs. The SESMGR code does deserve some comment, but I will postpone it to a future article in order to keep this one to a manageable length.
The file OS2KRNL is a standard "LX" format 32-bit segmented-executable
file, and as such can be examined with an EXE-file formatting utility. But
it is easier and more informative to use the KDB "lg" command, which reads
the symbol files supplied by IBM. This reveals the following segments:
The "DOS" segments pertain, naturally, not to DOS but to the Control Program, the protected-mode successor to DOS. The "HIGH" segments are those which will be loaded high, that is, in physical memory above the 1Mb line.
Four segments require special comment. TASKAREA is an expand-down data segment which simply maps the current PTDA (Per-Task Data Area), as detailed in the Debugging Handbook. DOSGDTDATA maps the system GDT, which contains call gates for kernel API calls, and entries for segments used by some of the 16-bit kernel code. The symbol file has names for most of these segments, which give clues as to their functions (type "ls dosgdtdata" at the KDB prompt). DOSMVDMINSTDATA and DOSSWAPINSTDATA, as their names imply, support Multiple VDMs (Virtual DOS Machines), discussed in section V.
The following diagrams give an idea of the relative sizes of these segments. (The 16- and 32-bit segments are not drawn to the same scale, however.) The shaded areas represent pieces that are discarded after system initialization is complete.
The 16-bit code segments appear to be hand-coded in assembler, whereas the 32-bit code segment is largely compiled C code. Here is a synopsis of the major source modules making up each code segment.
III. The OS/2 start-up sequence
In this section we will take a close look at the kernel's bootstrap mechanism. This type of analysis is often useful in diagnosing hardware failures or file corruption which can cause OS/2 to fail to boot properly. For some more details on this phase of the system, see also the section "Remote IPL/ Bootable IFS" in the "Installable File Systems" (IFS.INF) file on Hobbes.
As every English schoolchild knows, when an IBM-compatible computer is turned on or rebooted, the appropriate boot sector is read into memory at location 07c0:0. Execution begins in real mode at this address, which is just below the 32K line. (This location was chosen so that the early IBM PCs could theoretically operate, though of course not run OS/2, with as little as 32K of RAM.)
The boot sector begins with a 3-byte JMP instruction, followed by a data structure of drive parameters which is common to both DOS and OS/2. This data structure is documented in a number of books on DOS and PC internals and will not be further discussed here. (See, for example, Ray Duncan, Advanced MS-DOS Programming, 2d. ed. 1988, Microsoft Press, p. 180.) The boot sector loads a small (1K) file called OS2BOOT at 0800:0, and OS2BOOT in turn loads OS2LDR at 2000:0. Of course, this file loading is done with real mode BIOS calls, since no file system is available yet.
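The load addresses quoted above are real-mode segment:offset pairs, and it can help to see them as flat physical addresses. The following is a minimal sketch, not anything taken from the boot code itself: the helper names are hypothetical, the 512-byte boot sector size is the standard PC value, and the OS2LDR size is the approximate retail figure given earlier in this article.

```python
KB = 1024

def linear(segment, offset=0):
    """Physical address of a real-mode segment:offset pair (segment * 16 + offset)."""
    return (segment << 4) + offset

# (name, load segment, approximate size in bytes); sizes are illustrative
IMAGES = [
    ("boot sector", 0x07C0, 512),
    ("OS2BOOT", 0x0800, 1 * KB),
    ("OS2LDR", 0x2000, 28 * KB),
]

def layout(images):
    """Return (name, first byte, last byte) for each image, as physical addresses."""
    rows = []
    for name, segment, size in images:
        start = linear(segment)
        rows.append((name, start, start + size - 1))
    return rows

for name, start, end in layout(IMAGES):
    print(f"{name:12s} {start:05X}-{end:05X}")

# Distance from the boot sector's load address to the 32K line
print(32 * KB - linear(0x07C0))
```

With these figures the boot sector sits 1K below the 32K line, OS2BOOT starts right at 32K, and OS2LDR begins at the 128K mark, so none of the three images overlap.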
The OS2LDR file is one of the most obscure and least documented parts of the kernel. There is no symbol file for this code, nor can one step through it in the Kernel Debugger, since the Kernel Debugger resides in the debug version of OS2KRNL, which has not been loaded yet. However, the debug version of OS2LDR generates a wealth of data on the debug terminal, and by analyzing this output along with code disassemblies, one can get a good idea of this module's activities.
We start with chip tests and basic system initialization-- querying
available memory and installed drives, testing the CPU clock speed, and
storing available video modes. Along the way, we get some cryptic
progress indications on the debug terminal:
The loader routine first gives us an inventory of the OS2KRNL segments:
As an aside, the term "object" here means the equivalent in a 32-bit executable module of a "segment" in a 16-bit module. An object is more complex than a segment, since it consists of pages which can be read in and swapped to disk independently of one another, and this may be why a new term was felt to be necessary. But for most purposes, an object is simply a segment which is not limited to 64k in size. I will use the two terms more or less interchangeably.
At any rate, the table above gives attributes, physical and linear addresses and selectors, and sizes for each segment in OS2KRNL. The actual reading in of the file proceeds by first loading the "high" segments into low memory, and then moving them to their proper places using the BIOS Int 15h/87h "Move extended memory block" function. (Remember, we're still running in real mode here!) On the second pass, the low memory segments are loaded.
Wrapping up, OS2LDR gives us a map of physical memory:
We set the 8259 PIC chips so that IRQs 0 through 7 map to interrupts
50h-57h, and IRQs 8 through 0fh map to interrupts 70h-77h:
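The vector mapping just described can be modeled in a few lines. This is only a sketch: the two base values come from the text, while the initialization sequence is the textbook ICW1-ICW4 programming of a cascaded, edge-triggered 8259A pair on an ISA machine, which I assume is close to, but have not verified against, what the kernel actually writes. The function names are hypothetical.

```python
MASTER_BASE = 0x50   # vectors for IRQ 0-7, as stated in the text
SLAVE_BASE = 0x70    # vectors for IRQ 8-15, as stated in the text

def irq_to_vector(irq):
    """Map a hardware IRQ number (0-15) to its interrupt vector."""
    if not 0 <= irq <= 15:
        raise ValueError("IRQ must be in the range 0-15")
    if irq < 8:
        return MASTER_BASE + irq
    return SLAVE_BASE + (irq - 8)

def pic_init_sequence(master_base=MASTER_BASE, slave_base=SLAVE_BASE):
    """Textbook (port, value) writes for a cascaded 8259A pair.

    Assumption: edge-triggered ISA wiring with the slave on IRQ2.
    """
    return [
        (0x20, 0x11),          # master ICW1: edge-triggered, cascade, ICW4 follows
        (0x21, master_base),   # master ICW2: vector base
        (0x21, 0x04),          # master ICW3: slave attached to IRQ2
        (0x21, 0x01),          # master ICW4: 8086 mode
        (0xA0, 0x11),          # slave ICW1
        (0xA1, slave_base),    # slave ICW2: vector base
        (0xA1, 0x02),          # slave ICW3: cascade identity 2
        (0xA1, 0x01),          # slave ICW4: 8086 mode
    ]

print([hex(irq_to_vector(irq)) for irq in (0, 7, 8, 15)])
```

Only the ICW2 bytes differ from the familiar DOS setup, which places IRQs 0 through 7 at interrupts 08h-0fh.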
The bulk of the system init part of DOSCODE consists of logic to parse CONFIG.SYS. There is also a component called the "system init file system" (abbreviated sifs), used to read in CONFIG.SYS as well as various BASEDEVs and other files needed before a full-fledged file system can be set up. Each major piece of the kernel (the loader, pager, scheduler, and so on) has an initialization routine which is called at this point, and we "return" to the syiProcess routine in the DOSINITR3CODE segment.
syiProcess then loads and initializes the regular (non-base) installable device drivers, loads the system DLLs, and starts the shell. This is the first place where we have available the ADD drivers used by the fully initialized system to access the hard disk. Since one of the first routines called by syiProcess is in the inicp.asm module (initialize codepage), it is here that we can get the infamous "Cannot find COUNTRY.SYS" error message, even when there is no problem with the COUNTRY.SYS file, if the base device drivers have not installed properly.
IV. The file access code
I want to take a brief look at the sequence of calls connecting a Dos* file I/O call with the hardware access code. For more details, please consult the "Storage Device Driver Reference" found in the DDK and the IFS.INF file on Hobbes.
In the first scenario, we call DOS32READ with the handle of a file on a FAT partition of a SCSI hard drive. This is an entry point in DOSCALL1 which soon passes through a call gate to the 32-bit kernel and invokes FS32IREAD. For FAT access, FS32IREAD then calls the 16-bit routine h_DOS_Read in DOSHIGH4CODE, which, after ascertaining that the requested data is not in a previously read buffer, formulates a "request list" and sends it to the OS2DASD.DMD device driver. The request list is constructed in the _BufReadExecute and _ReqListExecute routines of DOSHIGH32CODE, and consists of a single request with the extended command code 1Eh for Read.
The request specifies the start block, number of blocks to read, and addresses of buffers to hold the data. OS2DASD.DMD then calls the appropriate *.ADD device driver-- for the example system, AHA152X.ADD-- to access the physical hardware.
The second scenario is the same except that the file we are reading resides on an HPFS partition. In this case, FS32IREAD bypasses the legacy 16-bit FAT routines in DOSHIGH4CODE, and calls instead the FS_READ entry point of HPFS.IFS. The HPFS file system then takes care of buffering data and interfacing to the OS2DASD.DMD module.
The third scenario involves reading a file on a floppy disk. This is the same as the first scenario as far as the kernel is concerned; however, the OS2DASD.DMD code will pass the request down to IBM1FLPY.ADD (or IBM2FLPY.ADD for a Micro Channel machine), rather than to AHA152X.ADD.
Finally, to read data from a SCSI CD-Rom, FS32IREAD calls FS_READ on the CD-Rom file system driver CDFS.IFS. CDFS will again perform buffering services, then send a request list to OS2CDROM.DMD, which will invoke the appropriate BASEDEV-- in the example system, LMS206.ADD for a Philips CD-Rom drive.
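The four scenarios above differ only in which modules sit between FS32IREAD and the hardware, so they can be summarized as a lookup table. This is purely a restatement of the text as data, using driver names from the example system. The keys and the read_path helper are hypothetical labels of my own, and I assume the HPFS partition lives on the same SCSI hard drive as in the first scenario.

```python
# Entry path shared by every scenario: the DOSCALL1 entry point, then the kernel
COMMON = ["DOS32READ", "FS32IREAD"]

# (file system, medium) -> modules traversed after FS32IREAD
ROUTES = {
    ("FAT", "scsi-disk"): ["h_DOS_Read (kernel FAT code)", "OS2DASD.DMD", "AHA152X.ADD"],
    ("HPFS", "scsi-disk"): ["HPFS.IFS FS_READ", "OS2DASD.DMD", "AHA152X.ADD"],
    ("FAT", "floppy"): ["h_DOS_Read (kernel FAT code)", "OS2DASD.DMD", "IBM1FLPY.ADD"],
    ("CDFS", "scsi-cdrom"): ["CDFS.IFS FS_READ", "OS2CDROM.DMD", "LMS206.ADD"],
}

def read_path(file_system, medium):
    """Return the full chain of modules for a read in the given scenario."""
    return COMMON + ROUTES[(file_system, medium)]

for key in ROUTES:
    print(key, " -> ".join(read_path(*key)))
```

Reading the table by column shows the layering: the file system layer is either kernel-resident (FAT) or an IFS, the device manager is a .DMD, and the bottom layer is always an .ADD.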
V. The context-switching mechanism
A context switch generally takes place at the trailing edge of a kernel API call. Before returning from kernel mode to user mode, the system will call a special routine called KMExitKmodeEvents. This routine examines the global variable Resched, which indicates whether other threads of sufficient priority are ready to run. If Resched is non-zero, the next stop is the _tkSchedNext routine.
_tkSchedNext invokes the scheduler apparatus (_SCHGetNextRunner) to decide which of the ready threads will be the next to receive a timeslice. Some aspects of the scheduler, with its thicket of states and transitions, priority queues, sleep queues, and so forth, are documented in the Debugging Handbook. For now we simply note that _SCHGetNextRunner returns, in the EAX register, a pointer to the TCB of the new, or incoming, thread. This pointer then becomes the single argument to the _PGSwitchContext routine.
The _PGSwitchContext code occupies 559 bytes in DOSHIGH32CODE, and it is worth close study. We cannot step through this code in the Kernel Debugger, since the page tables and system structures are in a transitional state which the debugger cannot make sense of. But by examining the disassembly we can understand its operation and gain significant insight into the OS/2 process architecture.
The path we take depends to some extent on whether we are switching to a different process or merely to a different thread within the same process. If it is a process switch, we must rewrite the portion of the page tables corresponding to user memory (typically up to about 256 Mb) to show the new physical addresses. A process switch also requires pointing the LDTR to a new value, since the LDT tiling can be different for different processes.
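To get a feel for how much page table material a process switch may have to touch, here is a rough upper-bound calculation. It assumes the standard 32-bit x86 paging layout (4 KB pages, 4-byte entries, 1024 entries per page table); the 256 Mb figure is the one quoted above, and the function name is hypothetical. A real switch would only need to deal with entries that are actually in use.

```python
PAGE_SIZE = 4 * 1024      # bytes mapped by one page table entry
PTES_PER_TABLE = 1024     # entries in one page table (one 4 KB page of PTEs)
PTE_BYTES = 4             # size of one page table entry

def page_table_cost(region_bytes):
    """Return (PTE count, page table count, bytes of PTEs) needed to map a region."""
    ptes = -(-region_bytes // PAGE_SIZE)       # ceiling division
    tables = -(-ptes // PTES_PER_TABLE)
    return ptes, tables, ptes * PTE_BYTES

ptes, tables, pte_bytes = page_table_cost(256 * 1024 * 1024)
print(ptes, tables, pte_bytes // 1024)   # 65536 64 256
```

So a fully populated 256 Mb user region corresponds to 64 page tables, or 256 Kb of page table entries.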
For either a process or thread switch, we must remap the TASKAREA segment (selector 30), since this selector addresses the current TCB and TSD as well as the PTDA. We must also update various system global variables: _pPTDACur, _TKSSBase, _TKTCBBias, _pTCBCur, _pTSDCur, and the ring 0 and ring 2 stack pointers.
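The distinction between the two kinds of switch can be summarized in a toy model. This is only an outline of the steps named in the last two paragraphs, not a reconstruction of _PGSwitchContext: the Thread class and switch_steps function are hypothetical, and the order of the steps is my own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thread:
    tid: int   # thread identifier
    pid: int   # owning process identifier

def switch_steps(outgoing, incoming):
    """List the work needed to switch from one thread to another."""
    steps = []
    if outgoing.pid != incoming.pid:
        # Process switch only: the user address space and LDT change
        steps.append("rewrite page table entries for user memory")
        steps.append("point LDTR at the incoming process's LDT")
    # Needed for thread and process switches alike
    steps.append("remap TASKAREA to the incoming PTDA, TCB and TSD")
    steps.append("update _pPTDACur, _TKSSBase, _TKTCBBias, _pTCBCur, _pTSDCur")
    steps.append("update ring 0 and ring 2 stack pointers")
    return steps

same_process = switch_steps(Thread(1, 10), Thread(2, 10))
other_process = switch_steps(Thread(1, 10), Thread(7, 11))
print(len(same_process), len(other_process))   # 3 5
```

The thread-only path skips the two most expensive steps, which is one reason multithreading within a process is cheaper than running separate processes.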
For some more details on the context switching mechanism, see page 339 of the Debugging Handbook, Volume I.
Of course, it is possible for an application not to make any kernel calls for a long period of time. Perhaps the program is solving a differential equation, doing a complex string search, or otherwise minding its own business without needing to do any I/O or use any kernel services. Will the KMExitKmodeEvents routine then be bypassed, and must all other threads bide their time while waiting for such a program to finish?
The answer, as we would expect, is no, thanks to the 8254 PIT, or Programmable Interval Timer, chip on the PC motherboard. At boot-up, counter 0 of the 8254 chip is set to operate in mode 2 (rate generator mode) to cause an interrupt on IRQ0 approximately 18.2 times per second. Like other IRQs, this one is handled by intIRQRouter in DOSHIGH32CODE, and upon receiving IRQ0, intIRQRouter calls KMExitKmodeEvents, as above. This forces the application to undergo the same scheduling scrutiny that it would if it made a kernel call directly.
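The 18.2 figure falls out of the timer hardware arithmetic. The sketch below assumes the standard PC timer input clock of 1,193,182 Hz and the maximum divisor of 65536 (programmed as a count of 0), which is the combination that yields 18.2 interrupts per second. The control word is built from the documented 8254 bit fields, and the function names are hypothetical.

```python
PIT_INPUT_HZ = 1_193_182   # standard PC timer input clock
MAX_DIVISOR = 65536        # a programmed count of 0 means 65536

def tick_rate(divisor=MAX_DIVISOR):
    """Interrupts per second generated by counter 0 for a given divisor."""
    return PIT_INPUT_HZ / divisor

def control_word(counter, access, mode, bcd=0):
    """Build an 8254 control word.

    counter: 0-2, access: 3 = low byte then high byte, mode: 0-5, bcd: 0 = binary.
    """
    return (counter << 6) | (access << 4) | (mode << 1) | bcd

rate = tick_rate()
print(round(rate, 4))              # ticks per second
print(round(1000 / rate, 2))       # milliseconds between ticks
print(hex(control_word(0, 3, 2)))  # 0x34: counter 0, mode 2 (rate generator)
```

At this rate a compute-bound thread is interrupted roughly every 55 milliseconds, which bounds how long it can run without the scheduler getting a look in.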
VI. Memory management
When an application calls DosAllocMem, the system creates a "memory object" by reserving a contiguous range of the process's private virtual address space. The system allocates page table entries for the object, and since each page table entry controls 4Kb of memory, the object will actually have a size equal to the requested size rounded up to the next multiple of 4Kb. The object will begin on a 64Kb boundary to allow it to be addressed by an LDT selector, so each call to DosAllocMem consumes at least 64Kb of virtual address space.
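The two granularities at work here, 4Kb for the object itself and 64Kb for its starting address, can be illustrated with a little arithmetic. This is a sketch with hypothetical names; it assumes that the next object starts at the next 64Kb boundary, so that address space is effectively consumed in 64Kb units.

```python
PAGE = 0x1000    # 4Kb: granularity of page table entries
TILE = 0x10000   # 64Kb: alignment of object start addresses

def round_up(value, granule):
    """Round value up to the next multiple of granule (a power of two)."""
    return (value + granule - 1) & ~(granule - 1)

def alloc_footprint(requested):
    """Return (object size, page count, address space consumed) for a request."""
    object_size = round_up(requested, PAGE)
    address_space = round_up(object_size, TILE)   # assumption: next object is tile-aligned
    return object_size, object_size // PAGE, address_space

for request in (1, 5000, 0x20000):
    size, pages, space = alloc_footprint(request)
    print(f"{request:#x}: object {size:#x}, {pages} pages, address space {space:#x}")
```

A one-byte request thus yields a one-page object but still uses up a full 64Kb of address space, which is why allocating many small objects with DosAllocMem is wasteful.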
However, no physical memory or disk swap space is allocated by the DosAllocMem code. The mechanism used is called "lazy commit": when an attempt is made to read or write to the area of virtual memory in question, a page fault will be generated (trap 0eh), and the handling routines in DOSHIGH32CODE will then allocate physical memory and set the "present" bit in the corresponding page table entry.
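The lazy commit idea can be shown with a toy model in which an object tracks which of its pages have the "present" bit set. This is a conceptual sketch only: the LazyObject class is hypothetical and ignores everything the real page fault handler has to do (finding a free frame, zeroing it, swapping, and so on).

```python
PAGE = 0x1000

class LazyObject:
    """A memory object whose pages become present only on first access."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.present = set()   # page indexes whose "present" bit is set
        self.faults = 0        # number of simulated trap 0Eh events

    def access(self, address):
        """Simulate a read or write; fault in the page if it is not present."""
        if not self.base <= address < self.base + self.size:
            raise ValueError("address outside the object")
        page = (address - self.base) // PAGE
        if page not in self.present:
            self.faults += 1         # page fault: handler assigns physical memory
            self.present.add(page)   # and marks the page table entry present
        return page

obj = LazyObject(base=0x10000, size=0x20000)
obj.access(0x10004)
obj.access(0x10008)    # same page, no new fault
obj.access(0x11000)    # next page, second fault
print(obj.faults, sorted(obj.present))   # 2 [0, 1]
```

Only the pages that are actually touched ever cost physical memory; the remaining 30 pages of this 128 Kb object stay not-present.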
A simple experiment shows the "before" and "after" state of the page
table resulting from a call to DosAllocMem. Here is a program about to
make the call with a request to allocate 00020000h, or 128 Kb:
The main worker routine in the kernel which does this is _VMAllocMem, which calls the routines _VMReserve, _PGAlloc, and _SELAlloc.
We may also want to see what happens when the program actually tries to access the memory. The KDB command "vsp e" will intercept page faults before they are processed, and this can be used in conjunction with the "zs" (change default command) facility to collect statistics on the page fault mechanism and its effect on system performance. For tracing purposes, it is easier just to put a breakpoint at the start of the lengthy _PGPageFault routine which handles this exception.
VII. The DOS emulation kernel
The DOS emulation component of the system is not mentioned at all in the Debugging Handbook and tends to be ignored by developers because it exists only for compatibility with older programs. However, it occupies over 25% of the code in OS2KRNL and is worth examining if only as an illustration of the versatility of the x86 architecture.
There are essentially three parts to DOS emulation in OS/2: the MVDM manager, the DOS emulator proper and the x86 emulator. A fourth part, the virtual device drivers necessary to run many DOS programs, exists outside the kernel but makes use of the Virtual DevHelp API calls implemented in the MVDM manager.
To get an idea of the issues involved in tracing a DOS application in
the Kernel Debugger, let's look at a simple "Hello, world" program written
in assembly. We open a DOS Window, whereupon the kernel gives us a VDM
with a copy of the "stub virtual DOS kernel"-- the file
C:\OS2\MDOS\DOSKRNL-- loaded into low memory to provide int 21h services.
We then start the program HELLO.EXE. Here is the complete disassembly:
After typing "t" a few times, we arrive at a DOS system call:
We will leave for another day the rest of the saga, for DOSKRNL must still make BIOS calls, which will again cause GP exceptions and be routed to VDDs such as VBIOS.SYS and VVGA.SYS. These will cooperate with SESMGR and the PDDs to finally display the greeting on the screen.
Some additional clues about the workings of OS/2 DOS emulation can be found in The Design of OS/2, © 1992 Addison-Wesley, by H. M. Deitel and M. S. Kogan, pp. 290-300. These are only clues, however, as the correlation is not exact between the text descriptions offered there and what is observable with KDB.
VIII. The shut-down routines
Since all good things must come to an end, the Control Program API includes the DosShutdown routine. The worker code is at the symbol w_Shutdown in DOSHIGH4CODE.
This function disables the installed file system drivers by overwriting all their entry points with the address of the ShutdownBlock routine in DOSHIGH2CODE. Any thread thereafter attempting to call an FSD will be blocked. A few routines, however, remain intact for use by the shutdown code: FS_COMMIT, FS_DOPAGEIO, FS_FSCTL, FS_FLUSHBUF, and FS_SHUTDOWN. Also, for file system drivers used by the swapper, certain key entry points are first preserved at the locations FS_SDCHGFILEPTR, FS_SDFSINFO, FS_SDREAD, and FS_SDWRITE. This enables the paging routines to continue to perform shutdown chores, while all other threads are locked out.
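The entry-point patching just described can be mimicked with a dictionary of function pointers. This is a toy model, not the kernel's data structure: the function and table names are hypothetical, the stand-in for ShutdownBlock returns a marker instead of blocking the caller, and the pairing of each FS_SD* location with an ordinary entry point (FS_READ with FS_SDREAD, and so on) is my guess from the names.

```python
# Entry points left intact for the shutdown code, as listed in the text
KEPT_FOR_SHUTDOWN = {"FS_COMMIT", "FS_DOPAGEIO", "FS_FSCTL", "FS_FLUSHBUF", "FS_SHUTDOWN"}

# Assumed pairing of original entry points with their preserved copies
SWAPPER_COPIES = {
    "FS_CHGFILEPTR": "FS_SDCHGFILEPTR",
    "FS_FSINFO": "FS_SDFSINFO",
    "FS_READ": "FS_SDREAD",
    "FS_WRITE": "FS_SDWRITE",
}

def shutdown_block(*_args):
    """Stand-in for ShutdownBlock; the real routine blocks the calling thread."""
    return "blocked"

def disable_fsd(entry_points, used_by_swapper=False):
    """Return (patched entry table, preserved swapper entry points)."""
    saved = {}
    if used_by_swapper:
        for name, saved_name in SWAPPER_COPIES.items():
            if name in entry_points:
                saved[saved_name] = entry_points[name]
    patched = {
        name: (fn if name in KEPT_FOR_SHUTDOWN else shutdown_block)
        for name, fn in entry_points.items()
    }
    return patched, saved

def fs_read():
    return "read"

def fs_commit():
    return "commit"

table = {"FS_READ": fs_read, "FS_COMMIT": fs_commit}
patched, saved = disable_fsd(table, used_by_swapper=True)
print(patched["FS_READ"](), patched["FS_COMMIT"](), saved["FS_SDREAD"]())
```

Ordinary callers reaching FS_READ through the patched table are stopped, while the pager can still reach the original routine through the preserved copy.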
We then iterate through all installed device drivers, sending "shutdown" (command code 1Ch) request packets to each. Each driver is called twice, with parameters of 0 and 1 for begin shutdown and end shutdown, respectively. Likewise, for each installed file system driver, the function FS_SHUTDOWN is called twice with start and end shutdown flags. In between these calls, the routines shutdown$FlushAllSFTs and h_FSD_FlushBuf stabilize RAM-cached portions of the file systems.
At a time when IBM's support for OS/2 seems to grow less enthusiastic
every day, it becomes increasingly important for users and developers to
understand the internals of the system on their own. This knowledge can
help in developing drivers and applications, building independent help
desks, and even in coding patches to the system if necessary. With active
support efforts from the outside community, OS/2 can and will continue to
thrive. I hope that the present article has contributed in some measure
to understanding the foundation of this imposing edifice.