SMP - Symmetrical Multiprocessing

Written by Ivan Skytte Jørgensen

Intended audience: programmers familiar with threads.

What is SMP - and why?
Basically SMP is just putting more than one CPU into a computer system. An SMP system is just like every other system, except that it has more than one CPU. It still uses the same kind of RAM, disk, keyboard and so on as single-processor systems. It obviously has a different motherboard, but that's it.

The reasons for using SMP are also straightforward: speed and cost. Suppose that you have to put together a system that offers speed equivalent to a 200MHz Pentium II. You can get that speed by either using a 200MHz Pentium II or by using an SMP motherboard supporting 4 CPUs and putting four 100MHz Pentiums onto it.

Not only is the SMP solution a bit faster than the single-processor solution, but it is also cheaper. SMP makes sense. SMP also makes it possible to obtain higher performance than any single-processor system can deliver: if the fastest available single processor delivers X MIPS, an SMP system with N such processors can deliver close to N*X MIPS.

It is just like having to move 20 people. You can use a turbo-charged Ferrari or 4 VW Beetles. Not only will the 4 VW Beetles be cheaper, they will also get the job done faster.

Although, just as a Ferrari is more fun than a Beetle, a 266MHz Pentium II is more fun than a 486 :-)

OS/2 and SMP
OS/2 has supported SMP for quite a while in special editions of OS/2. The first version was "OS/2 2.11 for Symmetrical Multiprocessing". A Warp 3 SMP was being developed (not announced) but dropped. Then last year "OS/2 Warp Server Version 4 Advanced, Symmetrical Multiprocessing" came out. There are rumours that the next version of the off-the-shelf OS/2 will contain support for SMP.

The differences between the single-processor OS/2 and the SMP version are very small: 2 APIs for controlling the CPUs, an extra index for DosQuerySysInfo and 4 APIs for dealing with spinlocks. I know that the "SMP addendum" mentions a lot of other things, but they are server things, not SMP things.

"OS/2 2.11 SMP" supports up to 16 processors, "OS/2 Warp Server Advanced, SMP" supports up to 64 processors.

Detecting SMP
Detection of SMP is quite simple: the SMP versions of OS/2 support the QSV_NUMPROCESSORS index in DosQuerySysInfo.

Examples
In order not only to support SMP (= not crash under it) but to actually _use_ it, you have to use threads. SMP works best when the threads do not have to communicate a lot.

All the examples have been compiled with Watcom C/C++ 11.0 and tested under OS/2 2.11 SMP, Warp 3, Warp 4 and Warp Server 4 Advanced SMP. They have been tested on a motherboard with two 100MHz Pentiums.

A non-real-life example to prove a point
Let's create a small program that uses SMP to get a job done faster. The program has some work to do (a billion things, to be exact). First the number of processors is detected. Then the main thread creates an equal number of threads and assigns each of them a piece of the work. Finally the main thread waits for the threads to finish.

Running time:

    OS/2 2.11 SMP (both CPUs online)     : 10.19 seconds
    OS/2 2.11 SMP with only 1 CPU online : 21.03 seconds

As you can see, SMP does work.

A real-life example
Some time ago I made a program that generated a large true-colour bitmap (1280x1024) which had to be "softened" so the edges in the bitmap would not be so hard. The softening was time-consuming, so I changed the program a bit and gained almost a 100% performance increase. The original algorithm was something like this (don't laugh, I am not an expert on graphics):

    for(int x=0; x<w; x++) {
        for(int y=0; y<h; y++) {
            unsigned colorsum=0;
            int colors=0;
            if(x>0   && y>0  ) { colorsum+=src.queryPixel(x-1,y-1); colors++; }
            if(         y>0  ) { colorsum+=src.queryPixel(x,  y-1); colors++; }
            if(x<w-1 && y>0  ) { colorsum+=src.queryPixel(x+1,y-1); colors++; }
            if(x>0           ) { colorsum+=src.queryPixel(x-1,y  ); colors++; }
                               { colorsum+=src.queryPixel(x,  y  ); colors++; }
            if(x<w-1         ) { colorsum+=src.queryPixel(x+1,y  ); colors++; }
            if(x>0   && y<h-1) { colorsum+=src.queryPixel(x-1,y+1); colors++; }
            if(         y<h-1) { colorsum+=src.queryPixel(x,  y+1); colors++; }
            if(x<w-1 && y<h-1) { colorsum+=src.queryPixel(x+1,y+1); colors++; }
            unsigned averagecolor=colorsum/colors;
            int colordifference = src.queryPixel(x,y) - averagecolor;
            dst.setPixel(x,y, src.queryPixel(x,y) - colordifference);
        }
    }

As you can see, the inner-most loop contains an awful lot of code that has to be executed for _every_ pixel in the bitmap. The program took 16 seconds to run.

The softening could be done in parallel, and since I had OS/2 SMP, I changed the algorithm into something like this:

    void softenrect(const PixelMap &src, PixelMap &dst,
                    int lx, int rx, int ty, int by)
    {
        for(int x=lx; x<rx; x++)
            for(int y=ty; y<by; y++) {
                //soften the pixel at (x,y) as before
                ...
            }
    }

    void soften(const PixelMap &src, PixelMap &dst)
    {
        int threads = GetNumberOfCPUs();
        if(threads>16) threads=16;
        Thread t[16];
        int h=src.queryHeight();
        int y=0;
        int chunk=h/threads;
        for(int i=0; i<threads; i++) {
            //start each thread giving it a piece of the bitmap to work on
            if(i<threads-1)
                t[i].Start(src,dst,0,src.queryWidth(), y, y+chunk);
            else
                //due to round-off errors the last thread is special
                t[i].Start(src,dst,0,src.queryWidth(), y, h);
            y+=chunk;
        }
        for(int i=0; i<threads; i++)
            t[i].Wait();
    }

Many other algorithms can be parallelized the same way. Take quicksort:

    void quicksort(int *e, int N)
    {
        if(N<=1) return;
        if(N==2) {
            if(e[0]>e[1]) Swap(e[0],e[1]);
            return;
        }
        int s = Partition(e,N);
        quicksort(e,s);
        quicksort(e+s+1,N-s-1);
    }

Notice the last two steps, sorting one half and then the other. That could be done in parallel. Doing this on N processors would make the sorting almost N times faster.

Performance of SMP-unaware programs
Until now I have used my own programs as examples and shown that performance can be proportional to the number of CPUs. What about programs that know nothing of SMP? If the program uses multiple threads that can work in parallel it will run faster. If it is single-threaded it will run at the same speed as usual.

But, since I am not satisfied with that, let's modify them a bit!

A single-threaded program will not run faster on an SMP-system. But two single-threaded programs will.

[ If you don't like big ugly hacks you may skip this section:-) ]

I am using Watcom C/C++ and its make, wmake. Wmake basically reads the makefile and then calls the compiler (wcc386/wpp386) one after another. Each compilation could run in parallel with the other compilations. AFAIK GNU make can do that.



I created alias programs for wmake, wcc386, wpp386 and wlink. The smp-wmake acted as a named-pipe server, spawning the real compilers. smp-wcc386 and smp-wpp386 just told smp-wmake to (asynchronously) compile and exited with a return code of 0 (success) to wmake. smp-wlink connected to smp-wmake, waited for all the compilations to finish (linking unfinished .OBJs is not a good idea) and then started the original wlink.



Then I tried rebuilding a large project and measured the time.

    Warp 3, original wmake    : 103 seconds
    Warp 3, smp-aware wmake   :  97 seconds
    2.11 SMP, original wmake  : 104 seconds
    2.11 SMP, smp-aware wmake :  70 seconds

With one CPU the performance gain is small (6 seconds). The gain is there because while one compiler is busy doing I/O (reading header files, writing .OBJ files, taking page faults) the CPU is free for other compilations.

Under 2.11 SMP time can be saved. However, the rebuild is not twice as fast. The utilization of the extra CPU is: (104-70)/(104/2) = 65%. The reason for the extra CPU not being used 100% is that wmake and the linker are single-threaded, and OS/2's cache, memory manager etc. are probably also single-threaded.

Ooops - gotcha!
A few programs cannot run correctly when multiple CPUs are in use. That is why OS/2 SMP includes a special version of the MARKEXE utility that can mark executables as "single-CPU only".

There are 3 reasons for those programs crashing under OS/2 SMP (I think you can now guess what I think of programs that crash under SMP :-):
 * 1) The programs have not protected their variables using mutex semaphores. They would also eventually crash on non-SMP systems. It is just more likely that the race-conditions show up on an SMP-system. Blame the programmer.
 * 2) The programs assume that while a high-priority thread is running, a lower-priority thread cannot run. This is true on single-processor systems due to OS/2's priority-first/round-robin scheduling. It is not true on SMP systems, where two threads of different priority can run at the same time on different CPUs. Blame the programmer.
 * 3) The programs assume that certain assembler instructions are thread-safe. They are on single-processor systems but not on SMP systems. Blame the programmer.

Thread-unsafe is SMP-unsafe
If a program is not even thread-safe, it will not be SMP-safe.

High priority does not mean mutual exclusion
On single-processor systems giving a thread higher priority means that lower-priority threads will not be given timeslices as long as the high-priority thread is not blocked (on a file/semaphore/whatever). Programs that assume this will fail on SMP-systems.

Read-modify-write is bad
Some programs and libraries assume that certain assembler instructions are atomic. Some of them are not on an SMP system. The problematic instructions are all read-modify-write instructions, such as INC, DEC, ADD, SUB, XCHG, ... Fortunately, all of these instructions can have a LOCK prefix. Here is an example:

    int x=0;

    thread1 {
        for(int i=0; i<5000; i++)
            x++;
    }

    thread2 {
        for(int j=0; j<5000; j++)
            x++;
    }

After the two threads have run, what is the value of x? On single-processor systems x will be 10000. On SMP systems some of the increments will have been lost, so x will be, for instance, 6483.

huh?
An SMP-aware program can easily find itself spending most of its time in the kernel due to synchronization. Often it is possible to restructure the program so it does not need as much synchronization - this will also increase performance on single-processor systems. Sometimes this is not possible.

Issuing a DosRequestMutexSem can involve executing 5000 instructions in the kernel even when the mutex semaphore is available. There is obviously room for improvement here.

Spinlocks implement mutual exclusion by polling a variable until its value changes. If the spinlock is available most of the time, requesting and releasing it usually only involves executing around 50 instructions.

OS/2 SMP spinlocks
OS/2 SMP exports APIs for creating, acquiring, releasing and freeing spinlocks.

Note: These APIs are only exported by the SMP versions of OS/2, so using them will prevent your program from running on non-SMP versions of OS/2 (undefined dynalink).

To use them you must:
 * 1) define INCL_DOSSPINLOCK
 * 2) include <os2.h>

The APIs are reasonably straightforward:

    APIRET DosCreateSpinLock(PHSPINLOCK pHandle);
    APIRET DosAcquireSpinLock(HSPINLOCK Handle);
    APIRET DosReleaseSpinLock(HSPINLOCK Handle);
    APIRET DosFreeSpinLock(HSPINLOCK Handle);

Using them is also straightforward:

    HSPINLOCK hspin;

    thread1 {
        //acquire spinlock
        DosAcquireSpinLock(hspin);
        //do something that only takes a few microseconds
        ...
        //release spinlock
        DosReleaseSpinLock(hspin);
    }

    void main {
        //create spinlock
        DosCreateSpinLock(&hspin);
        //do something
        ...
        //destroy spinlock
        DosFreeSpinLock(hspin);
    }

The OS/2-supplied spinlocks have some disadvantages:
 * They can only be owned one at a time.
 * DosAcquireSpinLock disables interrupts and DosReleaseSpinLock enables them again.
 * They cannot have nested ownership.

So OS/2 spinlocks are not that interesting except in very special applications.

Homebrewed spinlocks
We can do better than the OS/2-supplied spinlocks. In the following sections I will show you how.

Before I begin
For the following examples I need access to a few low-level instructions: interlocked increment, interlocked decrement and interlocked exchange. Since I use Watcom C/C++, I will use one of its features, "#pragma aux", to implement them:

    extern "C" void LockedIncrement(unsigned *p);
    #pragma aux LockedIncrement = \
        ".386p                    " \
        "lock inc dword ptr [eax] " \
        parm [eax] modify [];

    extern "C" void LockedDecrement(unsigned *p);
    #pragma aux LockedDecrement = \
        ".386p                    " \
        "lock dec dword ptr [eax] " \
        parm [eax] modify [];

    extern "C" unsigned LockedExchange(unsigned *p, unsigned v);
    #pragma aux LockedExchange = \
        ".386p                    " \
        "lock xchg ecx,[eax]      " \
        parm [eax] [ecx] modify [] value [ecx];

Those of you who do not use Watcom C/C++ can use this assembler file:

    .386p
    _TEXT segment para public use32 'CODE'
    assume cs:_TEXT
    assume ds:nothing

    LockedIncrement proc near
        mov  eax,[esp+4]
        lock inc dword ptr [eax]
        ret
    LockedIncrement endp

    LockedDecrement proc near
        mov  eax,[esp+4]
        lock dec dword ptr [eax]
        ret
    LockedDecrement endp

    LockedExchange proc near
        mov  ecx,[esp+4]
        mov  eax,[esp+8]
        lock xchg eax,[ecx]
        ret
    LockedExchange endp

    public LockedIncrement
    public LockedDecrement
    public LockedExchange

    _TEXT ends
    end

and use this header file:

    void APIENTRY LockedIncrement(unsigned *p);
    void APIENTRY LockedDecrement(unsigned *p);
    unsigned APIENTRY LockedExchange(unsigned *p, unsigned v);

I also need access to the native event semaphores, and I think typing those terribly long OS/2 API names is tedious, so assume:

    class EventSemaphore {
    public:
        EventSemaphore();
        ~EventSemaphore();
        int Post();
        int Wait(long timeout=-1);
        int Reset();
    };

Implementation 1: A true spinlock
This implementation is a true spinlock, i.e. it spins until the variable clears.

    unsigned owned=0;

    void request()
    {
        while(LockedExchange(&owned,1)!=0)
            ;
    }

    void release()
    {
        owned = 0;
    }

Obtaining the spinlock involves repeatedly exchanging the contents of "owned" with 1 and looking at the old value. If the old value was 0, the spinlock was not owned, and since we have put 1 into it, we now own the spinlock. If the old value was not 0, we just try again.

Releasing the spinlock is simple: just set the value to 0 (not owned).

This implementation has the following positive properties:
 * When the spinlock is requested and it is not already owned by a thread, very few instructions are executed, say 5-10 instructions.
 * When a spinlock owned by a thread is released, one of the waiting threads obtains the spinlock in a matter of nanoseconds.

Unfortunately, this implementation has the following negative property:
 * As long as the spinlock is owned, other waiting threads are spinning and using valuable CPU time doing nothing useful.

Back to the drawing board.

Implementation 2: a friendly spinlock
If the spinlock is already owned, it makes sense to give up the CPU.

    unsigned owned=0;

    void request()
    {
        while(LockedExchange(&owned,1)!=0)
            DosSleep(1);
    }

    void release()
    {
        owned = 0;
    }

This implementation is friendlier than implementation 1, but it still uses valuable CPU time spinning, even if the spinlock remains owned for the next 2 years.

Back to the drawing board.

Implementation 3: An even friendlier spinlock
Code first, document later :-)

    unsigned owned=0;
    unsigned waiters=0;
    EventSemaphore try_again;

    void request()
    {
        LockedIncrement(&waiters);
        for(;;) {
            if(LockedExchange(&owned,1)==0) {
                LockedDecrement(&waiters);
                return;
            }
            try_again.Wait();
            try_again.Reset();
        }
    }

    void release()
    {
        owned = 0;
        if(waiters!=0)
            try_again.Post();
    }
Requesting the spinlock: first the waiters variable is incremented (this must be done with LockedIncrement since other threads may be modifying it at the same time). Then the function tries to obtain the spinlock with a LockedExchange. If that succeeds, waiters is decremented and the function returns. If it does not succeed, the function blocks on the event semaphore, then resets the event semaphore and goes back to trying to obtain the spinlock.

Releasing the spinlock: first ownership is released by setting owned to 0. Then, if there are any waiters (they may be blocked on the event semaphore), the event semaphore is posted to wake them up.

This implementation has the following properties:
 * Requesting an unowned spinlock is very fast (10-20 instructions)
 * Waiting threads do not use valuable CPU time

Unfortunately, if the spinlock is going to be released within a few microseconds, waiting threads take the penalty of an OS/2 API call. It would be better if they just held on to the CPU for those few microseconds doing nothing.

Back to the drawing board.

Implementation 4: a fast and friendly spinlock
This final implementation is a slight modification of implementation 3. The request function starts like implementation 1 and then reverts to implementation 3.

    #define spin_retries 100

    unsigned owned=0;
    unsigned waiters=0;
    EventSemaphore try_again;

    void request()
    {
        for(unsigned spin=0; spin<spin_retries; spin++) {
            if(LockedExchange(&owned,1)==0)
                return;
        }
        LockedIncrement(&waiters);
        for(;;) {
            if(LockedExchange(&owned,1)==0) {
                LockedDecrement(&waiters);
                return;
            }
            try_again.Wait();
            try_again.Reset();
        }
    }

    void release()
    {
        owned = 0;
        if(waiters!=0)
            try_again.Post();
    }

The only thing that has changed from implementation 3 is the request function. It starts by spinning, trying to obtain the spinlock a fixed number of times (spin_retries). spin_retries has to be selected carefully: the time spent spinning should be less than it takes to do a full thread/process/context switch, and it should be long enough to avoid reverting to implementation 3 when the spinlock is going to be released within a few microseconds.

This implementation has the following positive properties:
 * Requesting an unowned spinlock takes only a few instructions (5-10 instructions)
 * Requesting an owned spinlock may not need an expensive API call
 * Releasing a spinlock that no one is waiting for is fast (5-10 instructions)
 * Valuable CPU time is not spent spinning forever

It has only one drawback: nested ownership is not possible. Requesting a spinlock that you already own causes a deadlock. Adding nested ownership is left as an exercise for the reader.

The road to SMP power
The road to SMP power consists of three steps:
 * 1) Multithreading. Single-threaded programs do not gain from SMP (unless they are communicating with another process).
 * 2) SMP-safety. Make sure that you do not rely on SMP-unsafe instructions or on how the scheduler deals with thread priorities.
 * 3) SMP-utilization. CPU-bound tasks that take more than 30 milliseconds and can be divided into sub-tasks should be divided.

That is all.