OS/2's Symmetrical Multiprocessing Demystified

Who needs SMP?

Any OS/2™ software development shop can benefit by using SMP (Symmetric Multi-Processing/Processor). It is an excellent testing platform for OS/2 applications that do not make use of any custom device drivers or kernel level code sequences.

Many OS/2 applications are multithreaded and make use of semaphores. This means that the potential for timing related problems is roughly proportional to the number of threads and semaphores used. In any serious program, dozens of threads and semaphores are employed to coordinate data manipulation and direct the processing flow in an asynchronous manner. The sad fact is that it is easy to misuse the various flavours of process and thread management and still have an application that appears to "work like a charm". The reality is more like a game of chess. The rules are simple enough; however, the game can get really complicated really fast. It is a good bet that not every semaphore call checks its error return and takes the appropriate action or corrects the problem. Need some examples? How about ERROR_INTERRUPT, ERROR_TIMEOUT, ERROR_TOO_MANY_SEM_REQUESTS, ERROR_SEM_OWNER_DIED - just to name a few. Need more? How about this sequence:

DosEnterCritSec
 (any API call)
DosExitCritSec

You cannot make any Application Programming Interface (API) calls while inside a critical section and be certain that a deadlock will not occur. The deadlock occurs in a multithreaded application when another thread was inside the kernel and owned the gates to DOSCALL1 and, while executing, was pre-empted in favour of the thread that calls DosEnterCritSec. Any API call inside the critical section requests entry into DOSCALL1 and cannot obtain the mutex semaphore. The thread owning DOSCALL1 cannot run due to the exclusion required by a critical section. The result is a frozen application. This can be difficult to detect on a uniprocessor (UP) system. It would be prudent to scan your code to ensure that such code sequences are not present.

There are many more such scenarios in which code appears to work on a uniprocessor system and fails on a multiprocessor system. An SMP is your ultimate test and should be used in combination with many other programs in testing the application prior to release. If you test with an SMP you will get the added benefit of being SMP compliant. Additionally, you can use the SMP to do your builds. Simply split the makefile into 4 or more makefiles so that at least 4 OBJs are building simultaneously (for a two processor system). In general, build twice as many OBJs at a time as you have processors. Ensure that the link is only done after all of the OBJs are built.
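The build advice above can be sketched as follows. This is an illustration only: compile_object and link are hypothetical placeholders for your real compiler and linker invocations, and a thread pool stands in for running several makefiles side by side.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_object(source):
    """Hypothetical stand-in for invoking the compiler on one source file."""
    return source.rsplit(".", 1)[0] + ".obj"


def link(objects):
    """Hypothetical stand-in for the link step."""
    return "app.exe <- " + " ".join(sorted(objects))


def build(sources, processors):
    # Rule of thumb from the text: two compiles in flight per processor.
    workers = 2 * processors
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces every compile to finish before we continue.
        objects = list(pool.map(compile_object, sources))
    # Link only after all of the OBJs have been built.
    return link(objects)


print(build(["main.c", "ui.c", "db.c", "net.c"], processors=2))
```

The important property is the barrier between the two phases: the link step never starts while any compile is still running.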

You do not need to get a fully loaded or very expensive SMP. Any bargain basement dual processor 486 compatible or low end Pentium or clone will do. Even consider renting one. The fact is, if just one of your big customers has a problem with your application, it could cost you more to duplicate, find and fix the problem than the cost of a low-end SMP. Also, it is much easier to fix a problem in your own shop before the release than it is when a key customer is breathing down your neck requesting status updates.

Finally, one of the most common complaints about OS/2's SMP is that many programs do not work. If you fix your old programs to work on SMP and ensure that new ones being developed will work, you will help exploit OS/2's biggest advantage over Windows NT™ - namely SMP.

SMP - A Historical perspective

The first SMP version of OS/2 constructed was 2.11. This contained many advanced design concepts. For instance, the SMP is in reality a hybrid between symmetric, distributed, and asymmetric processing concepts. Taking the best of each produced super-scalar results in early prototypes. Running 5 programs on 4 processors yielded a 404% improvement over a single processor. This was accomplished using one compute-bound program, one database program, one graphically intensive program that visually constructed a building, one multi-media program of a tennis match, and one graphic processor utilization program. On a uniprocessor, a complete build of the OS/2 kernel would take 2 hours and 2 minutes whereas a dual processor of the same CPU speed only needed 46 minutes. That was in 1993. This was largely due to its interrupt routing strategy, kernel semaphore usage strategy and the high performance processor scheduling and load balancing logic. Typically, if SMP yields a 300% improvement for 4 processors, it is considered acceptable scalability.
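As a quick check on the kernel build figures quoted above, the arithmetic can be written out. This sketch uses only the two build times from the text; the function name speedup is hypothetical, and "efficiency" here simply means speedup divided by the number of processors.

```python
def speedup(single_minutes, multi_minutes):
    """Ratio of uniprocessor time to multiprocessor time."""
    return single_minutes / multi_minutes


up_build = 2 * 60 + 2    # 2 hours and 2 minutes on a uniprocessor
smp_build = 46           # minutes on a dual processor of the same CPU speed
processors = 2

s = speedup(up_build, smp_build)
efficiency = s / processors
print(f"speedup {s:.2f}x, efficiency {efficiency:.0%} per processor")
```

A speedup greater than the number of processors means that each processor did more useful work than the lone processor did, which fits the explanation that interrupt routing, semaphore strategy and load balancing removed overhead rather than merely sharing it.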

Does SMP give benefits to all programs?

Yes. Even a single application that is single threaded can get benefits on SMP. The OS/2 scheduler performs "house cleaning" functions during its idle cycles. For instance, the virtual memory management system employs an intermediate memory swapfile concept in which a minimum number of 4K pages are maintained on an idle list. The list is modified as the application executes, based on page access flags. Pages that are accessed once and then not accessed for a period of time are removed from the idle list and, when they are not discardable, queued to the swapfile. The thread that does this work is referred to as the ager thread. When the idle list hits a minimum threshold, the ager thread runs at the highest priority in the system (Class = Time Critical, Level = 31). On a uniprocessor system this pre-empts the running program; on an SMP system running a single application that is single threaded, the ager thread runs on the idle processor instead.

Also, the scheduler and memory management functions work jointly to enhance the "working set" of the running application in memory overload conditions by swapping out the Thread Swappable Data (TSD). This occurs when there are more tasks to run than will fit in available memory. Granted, in the case of a single, single-threaded program, this is not likely. However, the operating system has a number of TSDs and in very low memory configurations (e.g. 8MB), the swapping logic is not choosy as to which TSDs get thrown out of memory. Any TSD not active for a period of time is vulnerable to swapping. The reverse is also true. When the scheduler (SMP or UP) detects idle cycles, it anticipates activity for the TSDs and reads them into memory if there is room for them.

In conclusion, a single application that is single threaded will benefit from running on SMP in memory overload conditions. It must also be pointed out that running many applications in memory overload conditions gets compounding benefits, because any free processor is available to run the memory maintenance "house cleaning" threads. This results in fewer pre-emptions of the running applications and better overall performance.

What modifications are needed for SMP?

If you have a device driver that performs direct memory writes using Ring 0 or Ring 2 code, then you must check to see if you are on an SMP machine and use the SMP spin locks. Information on how to check for SMP and use the various spin locks is well covered by Ivan Skytte Jørgensen in a previous article.

In the early days of SMP, spin locks were provided as a "quick fix" for programs that worked on a UP and failed on an SMP. However, do not use these for any purpose other than to protect Ring 0 or Ring 2 code that makes direct memory writes. Use the normal semaphores; they are already SMP safe. The critical sections are also SMP safe. The first SMP (Version 2.11) was designed to detect a critical section entry and query the other processors (only the active processors in the Processor Control Block (PCB)). If other threads of the same process were found running, it would stop the processors running them, transit those threads to the CRT state chain, reschedule the affected processors, and prevent other threads from the same process from being scheduled until the critical section exited. If you have any of the DosAcquireSpinLock APIs in your SMP code, remove them and replace them with semaphores or critical sections, unless they protect Ring 0 or Ring 2 code that makes direct memory writes.

"Home brewed" spin locks for the SMP are excellent educational tools. However, they should not be used in production systems. DosAcquireSpinLock marks an entry in the PCB, and this is used in scheduling the processors. It protects the process from pre-emption, whereas home brewed spin locks cannot give such protection. Priority pre-emption is maintained across the processors; however, entry into the kernel means that the thread cannot be pre-empted until it begins to execute Ring 3 code. The last thing you want is to be pre-empted while doing direct memory writes from an I/O device driver.

It should go without saying that programs relying on priority will not work on SMP. Adding a mutex semaphore to the threads is a safe bet. This may be overkill and the performance may suffer, so use an event semaphore if more than one thread can be active at one time. This is not rocket science. Simply remove the DosSetPriority calls and have what was the low priority thread wait on the semaphore of what was the high priority thread. However, in a case like the Pulse program, adding a semaphore will not work. So, go over your application's data flow and semaphore usage before you start your modifications.
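The change described above can be sketched as follows, again using Python's threading module as a stand-in for the OS/2 event semaphore calls (an assumption for portability). The names producer, consumer and data_ready are hypothetical. The former low priority thread waits on an event that the former high priority thread posts, so the ordering no longer depends on how the scheduler assigns processors.

```python
import threading

results = []
data_ready = threading.Event()   # stands in for an event semaphore


def producer():
    """Formerly the 'high priority' thread: does its work, then posts."""
    results.append("produced")
    data_ready.set()             # post the event


def consumer():
    """Formerly the 'low priority' thread: waits instead of relying on priority."""
    if not data_ready.wait(timeout=5.0):   # check the result of the wait
        results.append("timed out")
        return
    results.append("consumed")


# Start the consumer first to show that ordering no longer depends on
# which thread the scheduler happens to run first.
c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()
print(results)
```

On a uniprocessor, priority alone could guarantee that the producer ran first; with two processors both threads can run at once, which is why the explicit wait is needed.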

What is different between UP and SMP?

Obviously, the motherboard has modifications that may allow for additional processors or additional processor boards. Beyond that, there is the Advanced Programmable Interrupt Controller (APIC) that is used to route the interrupts to the various processors. The UP also has an APIC, except that all interrupts default to the first and only processor. There are a number of proprietary interrupt and processor strategies available. However, OS/2 SMP only works on the Intel-based APIC multiprocessor support design.

OS/2 SMP was developed from the same code base as the UP version. A PCB is dynamically generated at boot time for each configured processor. There are also some new functions in the resident kernel to manage the processors. However, the difference in memory requirements between UP and SMP is negligible. SMP has a performance monitor that shows CPU utilization of the various processors and allows you to switch off any or all but the first processor.

How the SMP and UP scheduling works

The SMP scheduler employs the same "house cleaning" functions as the UP scheduler. It also uses the dynamic time-slicing logic covered by US Patent #5,386,561, formally titled "Method of Integrated System Load Control Through Dynamic Time-Slicing in a Virtual Storage Environment".

Scheduling for tasks of equal priority is based upon the system clock, which fires every 31.25 milliseconds. Therefore, if 32 such programs are compute bound (e.g. calculating the value of pi to the last digit), each would get one 31.25 millisecond time slice per second. Scheduling for tasks in the case of a pre-emption can begin in as little as one clock cycle. When the pre-emption occurs while the currently executing program is executing in the application's code, a timer "theft" occurs, and pre-emption begins in as little as 10-25 nanoseconds. Pre-emption can complete in under 25 microseconds on a fast machine under ideal conditions. This makes OS/2's real time capabilities second to none in the desktop environment. However, when the currently executing thread is performing a system call (i.e. an API call), pre-emption times can be longer. To guard against very long pre-emption times, OS/2 has a wide range of internal yields that allow it to be pre-empted while inside of a system call. For example, a huge memory allocation requires the virtual memory management system to carve out entries from the 512MB page tables associated with the process. Although the action of marking the 4K linear address pages is non-pre-emptable, bail-out points are strategically located within the logic to allow a higher priority program to take control while storing the intermediate results on the pre-empted program's stack. Therefore, minimum and maximum pre-emption times are more deterministic, which is characteristic of frame based real-time operating systems meeting hard and soft deadlines.
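The time-slice arithmetic above can be checked with a short sketch. It assumes a fixed 31.25 millisecond slice and pure round-robin among equal-priority, compute-bound threads; it deliberately ignores dynamic time-slicing, priority boosts and pre-emption. The function names are hypothetical.

```python
from fractions import Fraction

TICK_MS = Fraction(125, 4)   # 31.25 ms, kept exact to avoid rounding error


def slices_per_second(tick_ms=TICK_MS):
    """Number of clock ticks, and therefore time slices, in one second."""
    return Fraction(1000) / tick_ms


def slices_per_thread_per_second(threads, tick_ms=TICK_MS):
    """Round-robin share for equal-priority, compute-bound threads."""
    return slices_per_second(tick_ms) / threads


print(slices_per_second())
print(slices_per_thread_per_second(32))
print(float(slices_per_thread_per_second(32) * TICK_MS))
```

Under the same assumptions, halving the number of competing threads to 16 gives each thread two slices, or 62.5 milliseconds of processor time, per second.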

The UP and SMP schedulers share an extensive priority boost matrix which is designed to solve the classic problem of satisfying two diametrically opposing goals, namely, achieving maximum throughput and attaining acceptable user response times. OS/2 is able to get near maximum throughput and foster excellent user response time for the foreground application. The priority boosts are applied in a cumulative manner. Once the boosts are applied, an immediate check is made to compare the priority of the recently boosted thread to that of the current runner. If it is higher, a pre-emption is immediately begun.

Priority inversions inside the kernel are automatically resolved for mutually exclusive system resources. This prevents a deadlock condition in which a high priority program needs a system resource that is owned by a lower priority thread. On both UP and SMP this is done in a highly optimized fashion.

Both employ a rich 14 state chain internal management system (Version 1.3 had only 10) to efficiently transit threads from various blocking states into execution. This decomposes into a run state, a prioritized ready-to-run queue, and 12 blocking states. All except the run state are doubly-linked lists. For instance, if a thread sleeps using DosSleep, it is removed from the run state chain and placed into the blocked state. Upon expiration of its associated timer, an interrupt occurs and the thread is then entered into the ready-to-run state chain in the appropriate priority order. The blocking states are always entered at the top of the state chain, whereas entry into the ready-to-run list is always done so that the highest priority thread in the system is at the top of the ready-to-run list. When a thread becomes ready to run and it is higher in priority than the currently executing thread, a pre-emption occurs. The scheduler removes the thread from the top of the ready-to-run list and loads the thread's stack and environment. The thread's execution entry point always targets the system pseudo-action requests first. This could be any one of 32 possible entry points into the program. For instance, if a thread had a signal posted to it while in the blocked state, execution would begin at the first instruction of the signal handler. After the signal handler completes its instruction sequence, control would return to the scheduler, which would exhaust the remaining system pseudo-action requests in a predefined order. Once all have completed, execution would begin at the next instruction in the normal program flow. Both the UP and SMP use this architecture; the only exception is that the SMP has one run state for each configured processor.
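A toy model of the ready-to-run list and the pre-emption check is sketched below. It is not the OS/2 implementation: the text describes doubly-linked lists, whereas this sketch swaps in a binary heap for brevity, models a single run state, and omits the blocking states and pseudo-action requests. ReadyQueue, schedule and the thread names are hypothetical.

```python
import heapq
import itertools


class ReadyQueue:
    """Toy ready-to-run list: highest priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps arrival order

    def make_ready(self, name, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), name))

    def top_priority(self):
        return -self._heap[0][0] if self._heap else None

    def pop(self):
        neg_priority, _, name = heapq.heappop(self._heap)
        return name, -neg_priority


def schedule(running, ready):
    """Return the (name, priority) that should run next, pre-empting if needed."""
    top = ready.top_priority()
    if top is not None and (running is None or top > running[1]):
        if running is not None:
            ready.make_ready(*running)   # pre-empted thread goes back to ready
        return ready.pop()
    return running


ready = ReadyQueue()
ready.make_ready("logger", 5)
running = schedule(None, ready)          # logger starts running
ready.make_ready("timer_wakeup", 20)     # e.g. a DosSleep timer expired
running = schedule(running, ready)       # higher priority, so it pre-empts
print(running)
```

An SMP variant of this model would keep one running slot per configured processor and apply the same comparison against the lowest priority thread currently running.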

OS/2 SMP always schedules the first configured processor last. This is due to the fact that all interrupts are fielded by the first configured processor. It is also why the first configured processor cannot be taken offline using the CPU performance monitor. This is referred to as distributed processing.

In maintaining priority pre-emption across processors, the first configured processor runs the lowest priority thread and is last in consideration for pre-emption.

Only one thread may enter the kernel at a time. If a thread on another processor attempts to execute an API, it spins until the original thread exits the kernel to begin executing Ring 3 code. However, any of the configured processors can execute in the kernel. This is generally referred to as asymmetric processing.
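The single kernel gate can be illustrated with a small simulation. This is a sketch under stated assumptions, not kernel code: Python threads stand in for processors, a non-blocking lock acquire in a loop stands in for spinning, and api_call, kernel_gate and the timing values are hypothetical.

```python
import threading
import time

kernel_gate = threading.Lock()   # stands in for the single kernel entry gate
inside = 0                       # threads currently "in the kernel"
max_inside = 0                   # highest value of inside ever observed


def api_call():
    """Model of a system call: spin for the gate, work, then leave the kernel."""
    global inside, max_inside
    while not kernel_gate.acquire(blocking=False):
        time.sleep(0)            # spin; yield so other Python threads progress
    try:
        inside += 1
        max_inside = max(max_inside, inside)
        time.sleep(0.001)        # pretend to do kernel work
        inside -= 1
    finally:
        kernel_gate.release()    # back in "Ring 3": the next spinner may enter


threads = [threading.Thread(target=api_call) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(max_inside)
```

Any of the four simulated processors may win the gate, but the counter shows that they never hold it at the same time, which is the behaviour the paragraph above describes.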

Conclusion

The OS/2 SMP architecture combines symmetric, asymmetric and distributed processing concepts. This was used to overcome the SMP's disadvantage of allowing only one thread in the kernel at a time (except for the semaphores, which are multithreaded and can execute on all processors concurrently). That disadvantage is analogous to running a race on crutches with a broken leg.

However, the design of OS/2's scheduling policies and deterministic pre-emption logic make OS/2 SMP the most advanced architecture available today and in the foreseeable future.

The outstanding performance of the SMP is due to its unique design and to the entire OS/2 community. Support SMP by making your application SMP safe. Ever onward OS/2!

OS/2 is a trademark of IBM Corporation and Windows NT is a trademark of Microsoft Corporation.