SMP - Symmetrical Multiprocessing

From EDM2
Jump to: navigation, search

Written by Ivan Skytte Jørgensen

Intended audience: programmers familiar with threads.

What is SMP - and why?

Basically SMP is just putting more than one CPU into a computer system. An SMP system is just like every other system, except that it has more than one CPU. It still uses the same kind of RAM, disk, keyboard and so on as single-processor systems. It obviously has a different motherboard, but that's it.

The reasons for using SMP are also straightforward: speed and costs. Suppose that you have to put together a system that offers speed equivalent to a 200MHz Pentium II. You can get that speed by either using a 200Mhz Pentium II or by using an SMP motherboard supporting 4 CPUs and putting 4 100Mhz Pentium onto it.

Not only is the SMP solution a bit faster than the single-processor solution but it is also cheaper. SMP makes sence. It is also possible to obtain a higher performance on an SMP system than what is possible with a single-processor system. At any time if a single-processor system can deliver X mips, an SMP system can deliver N*X mips.

It is just like having to move 20 people. You can use a turbo-charged Ferrari or 4 VW beetles. Not only will the 4 VW beetles be cheaper they will also get the job done faster.

Although just like a Ferrari is more fun than a beetle, a 266MHz Pentium II is more fun than 486 :-)

OS/2 and SMP

OS/2 has supported SMP for quite a while in special editions of OS/2. The first version was "OS/2 2.11 for Symmetrical Multiprocessing". A Warp 3 SMP was being developed (not announced) but dropped. Then last year "OS/2 Warp Server Version 4 Advanced, Symmetrical Multiprocessing" came out. There are rumours that the next version of the off-the-shelf OS/2 will contain support for SMP.

The diffences between the single-processor OS/2 and the SMP version are very small: 2 APIs for controlling the CPUs, an extra index for DosQuerySysInfo() and 4 APIs for dealing with spinlocks. I know that the "SMP addendum" mentions a lot of other things, but they are server things, not SMP things.

"OS/2 2.11 SMP" supports up to 16 processors, "OS/2 Warp server, advanced, SMP" supports up to 64 processors.

Detecting SMP

Detection of SMP is quite simple. OS/2 SMP supports QSV_NUMPROCESSORS index in DosQuerySysInfo().

#define INCL_DOSMISC
#include <os2.h>

#ifndef QSV_NUMPROCESSORS
/*The QSV_NUMPROCESSORS is only defined in the SMP toolkit,
 *and since the SMP toolkit is always at least 6 months behind
 *the normal toolkit you will probably not want use it.
 */
#define QSV_NUMPROCESSORS 26
#endif

int GetNumberOfCPUs() {
  ULONG CPUs;
  APIRET rc;
  CPUs = 0;
  rc = DosQuerySysInfo(QSV_NUMPROCESSORS,
                       QSV_NUMPROCESSORS,
                       &CPUs,
                       sizeof(CPUs));
  /*We have to guard against running under a non-SMP OS/2
   *that does not support index 26, and that index 26 (as
   *far as I remember) has previously been used for
   *something else
   */

  if(rc!=0 || CPUs<1 || CPUs>64)
    CPUs = 1;
  return CPUs; 
}

Examples

In order to not only support (= not crash under) SMP, but _use_ SMP you have to use threads. SMP works best when the threads do not have to communicate a lot.

All the examples has been compiled with Watcom C/C++ 11.0 and tested under OS/2 2.11 SMP, Warp 3, Warp 4 and Warp 4 advanced server SMP. They have been tested on a motherboard with 2 100MHz Pentiums.

A non-real-life example to prove a point

Let's create a small program that uses SMP to get a job done faster:

#define INCL_DOSPROCESS
#include <os2.h>
#include <stdio.h>
#include <time.h>
#include "getcpus.h"
void SpendTime(int howmuch) {
  //Use some CPU time
  // - and beg that your compiler does not optimize this aways
  int x=0;
  for(int i=0; i<howmuch; i++) {
    x += 2;
  }
}

void APIENTRY MyThread(ULONG howmuch) {
  SpendTime((int)howmuch);
  DosExit(EXIT_THREAD,0);
}

int main(void) {
  TID tid[64]; //array to hold the TIDs of the threads
  int threads=GetNumberOfCPUs(); //how many threads to create
  int howmuch=1000000000;
  int t;

  clock_t starttime=clock();

  //create threads
  for(t=0; t<threads; t++) {
    DosCreateThread(&tid[t],
                    MyThread,
                    (ULONG)(howmuch/threads),
                    CREATE_READY|STACK_COMMITTED,
                    8192
                   );
  }

  //wait for the threads to finish
  for(t=0; t<threads; t++) {
    DosWaitThread(&tid[t],DCWW_WAIT);
  }

  clock_t endtime=clock();

  printf("Running time: %f seconds\n", ((double)(endtime-starttime))/CLK_TCK);

  return 0;
}

The program has some work to do (a billion things to be exact). First the number of processors is detected. Then the main thread creates an equal number of threads and assigns them a piece of the work. Then the main thread waits for the threads to finish.

Running time:

OS/2 2.11 SMP (both CPUs online)     : 10.19
OS/2 2.11 SMP with only 1 CPU online : 21.03

As you can see, SMP does work.

A real-life example

Some time ago I made a program that generated a large true-colour bitmap (1280x1024) which had to be "softened" so the edges in the bitmap would not be so hard. The softening was time-consuming so I changed the program a bit and gained almost a 100% performance increase. The original algoritm was something like this (Don't laugh, I am not an expert on graphics):

 for(int x0; x<w; x++) {
   for(int y=0; y<h; y++) {
     unsigned colorsum=0;
     unsigned colors=0;
     if(x>0   && y>0  ) { colorsum+=src.queryPixel(x-1,y-1); colors++; }
     if(         y>0  ) { colorsum+=src.queryPixel(x,y-1);   colors++; }
     if(x<w-1 && y>0  ) { colorsum+=src.queryPixel(x+1,y-1); colors++; }
     if(x>0           ) { colorsum+=src.queryPixel(x-1,y);   colors++; }
                        { colorsum+=src.queryPixel(x,y);     colors++; }
     if(x<w-1         ) { colorsum+=src.queryPixel(x+1,y);   colors++; }
     if(x>0   && y<h-1) { colorsum+=src.queryPixel(x-1,y+1); colors++; }
     if(         y<h-1) { colorsum+=src.queryPixel(x,y+1);   colors++; }
     if(x<w-1 && y<h-1) { colorsum+=src.queryPixel(x+1,y+1); colors++; }
 
     unsigned averagecolor=colorsum/colors;
     int colordifference = src.queryPixel(x,y) - averagecolor;
     dst.setPixel(x,y, src.queryPixel(x,y) - colordifference);
   }
 }

As you can see the inner-most loop contains an awful lot of code that has to be executed for _every_ pixel in the bitmap. The program took 16 seconds to run.

The softening could be done in parallel and since I had OS/2 SMP, I changed the algorithm into something like this:

 void softenrect(const PixelMap &src,
                 PixelMap &dst,
                 int lx, int rx, int ty, int by)
 {
   for(int x=lx; x<rx; x++) {
     for(int y=ty; y<by; y++) {
       <old code goes here>
     }
   }
 }
 
<thread startup and other stuff>
 
 void soften(const PixelMap &src, PixelMap &dst) {
   int threads = GetNumberOfCPUs();
   if(threads>16) threads=16;
   Thread t[16];
   int h=src.queryHeight();
   int y=0;
   int chunk=h/threads;
   for(int i=0; i<threads; i++) {
     //start each thread giving it a piece of the bitmap to work on
     if(i<threads-1)
       t[i].Start(src,dst,0,src.queryWidth(), y, y+chunk);
     else //due to round-off errors the last thread is special
       t[i].Start(src,dst,0,src.queryWidth(), y, h);
   }
   for(i=0; i<threads; i++) {
     t[i].wait();
   }
 }

On my system with 2 CPUs the modified algorithm would start two threads to get the job done. The running time of the modofied program was 9 seconds. So the running time was almost halved.

Yet another non-real-life example.

Any task that can be divided into subtasks that can be run in parallel can benefit from SMP. Another example is quicksort. The original QuickSort looks something like this:

 void quicksort(int e[], int N) {
   if(N<2) {
   else if(N==2) {
     if(e[0]>e[1]) Swap(e[0],e[1]);
   }
   int s = Partition(e,N);
   quicksort(e,s-1);
   quicksort(e+s+1,N-s-1);
 }

Notice the last two steps, sorting one half and then the other. That could be done in parallel. Doing this on N processors would make the sorting almost N times faster.

Performance of SMP-unaware programs

Until now I have used my own programs as examples and shown that performance can be proportional to the number of CPUs. What about programs that know nothing of SMP? If the program uses multiple threads that can work in parallel it will run faster. If it is single-threaded it will run at the same speed as usual.

But, since I am not satisfied with that, let's modify them a bit!

A single-threaded program will not run faster on an SMP-system. But two single-threaded programs will.

[ If you don't like big ugly hacks you may skip this section:-) ]

I am using Watcom C/C++ and its make, wmake. Wmake basically reads the makefile and then one after one calls the compiler (wcc386/wpp386). Each compilation can run in parallel with other compilations. AFAIK GNU make can do that.

SMP-orgmake.gif

I created alias-programs for wmake, wcc386 and wpp386. The smp-wmake acted as named-pipe server, spawning the real compilers. smp-wcc386 and smp-wpp386 just told smp-wmake to (asynchronously) compile and exited with a return code of 0 (success) to wmake. smp-wlink connected to smp-wmake and waited for all the compilations to finish (linking unfinished .OBJs is not a good idea) and then started the original wlink.

SMP-smpmake.gif

Then I tried rebuiling a large project and measured the time.

Warp 3, original wmake    : 103 seconds
Warp 3, smp-aware wmake   :  97 seconds
2.11 SMP, original wmake  : 104 seconds
2.11 SMP, smp-aware wmake :  70 seconds

With one CPU the performance gain is small (6 seconds). The gain is there because while one compiler is busy doing I/O (reading headerfiles, writing .OBJ, making page faults) the CPU is free for other compilations.

Under 2.11 SMP time can be saved. However, the rebuild is not twice as fast. The utilization of the extra CPU is: (104-70)/(104/2) = 65%. The reason for the extra CPU not being used 100% is that wmake and the linker are single-threaded, and OS/2's cache, memory manager etc. are probably also single-threaded.

Ooops - gotcha!

A few programs cannot run using multiple CPUs. That is why OS/2 SMP includes a special version of the MARKEXE utility that can mark executables as "single-CPU only".

There are 3 reasons for those programs crashing under OS/2 SMP:

  1. The programs have not protected their variables using mutex semaphores. They would also eventually crash on non-SMP systems. It is just more likely that the race-conditions show up on an SMP-system. Blame the programmer.
  2. The programs assume that while a high-priority thread is running a lower-priority thread cannot run. This is true on single-processors systems due to OS/2's priority-first/round-robin scheduling. This is not true on SMP systems where two threads of different priority can run at the same time on different CPUs. Blame the programmer.
  3. The programs assume that certain assembler-instructions are thread-safe. They are in single-processors systems but not in SMP-systems. Blame the programmer.

(I think that you now can guess what I think of programs that crash under SMP:-)

Thread-unsafe is SMP-unsafe

If a program is not even thread-safe, it will not be SMP-safe.

High priority does not mean mutual exclusion

On single-processor systems giving a thread higher priority means that lower-priority threads will not be given timeslices as long as the high-priority thread is not blocked (on a file/semaphore/whatever). Programs that assume this will fail on SMP-systems.

Read-modify-write is bad

Some programs and libraries assume that certain assembler instructions are atomic. Some of them are not on an SMP-system. The problematic instructions are all read-modify-write instructions, such as INC, DEC, ADD, SUB, XCHG, ... Fortunately, all of these instructions can have a LOCK prefix. Here is an example:

 int x=0;
 thread1() {
   for(int i=0; i<5000; i++)
     x++;
 }
 thread2() {
   for(int j=0; j<5000; j++)
     x++;
 }

After the two threads has run, what is the value of x? On single-processor systems x will be 10000. On SMP systems some of the increments have been lost, so x will be, for instance, 6483.

Spinlocks

huh?

An SMP-aware program can easily find itself spending most of the time in the kernel due to synchronization. Often it is possible to restructure the program so it will not need as much synchronization - this will also increase performance on single-processor systems. Sometimes this is not possible.

Issuing a DosRequestMutex() can involve executing 5000 instruction in the kernel even when the mutex semaphore is available. There is obviously room for improvement here.

Spinlocks implement mutual exclusion by polling a variable to change its value. If the spinlock is available most of the time, requesting and releasing a spinlock usually only involves executing 50 instructions.

OS/2 SMP spinlocks

OS/2 SMP exports APIs for creating, acquiring, releasing and freeing spinlocks.

Note: These APIs are only exported in the SMP versions of OS/2, so using them will prevent your program from runing on non-SMP version of OS/2 (undefined dynalink).

The APIs are reasonably straight-forward:

 APIRET DosCreateSpinLock(PHSPINLOCK pHandle);
 APIRET DosAcquireSpinLock(HSPINLOCK Handle);
 APIRET DosReleaseSpinLock(HSPINLOCK Handle);
 APIRET DosFreeSpinLock(HSPINLOCK Handle);

Using them is also straight-forward:

 #define INCL_DOSSPINLOCK
 #include <os2.h>
 HSPINLOCK hspin;
 
 thread1() {
   //acquire spinlock
   DosAcquireSpinLock(hspin);
   //do something that only takes a few microseconds
   ...
   //release spinlock
   DosReleaseSpinLock(hspin);
 }
 
 void main() {
   //create spinlock
   DosCreateSpinLock(&hspin);
   //do something
   ...
   //destroy spinlock
   DosFreeSpinLock(hspin);
 }

The OS/2-supplied spinlocks have some disadvantages:

  • They can only be owned one at a time.
  • DosAqcuireSpinLock() disables interrupts and DosReleaseSpinLock() enables them again.
  • they cannot have nested ownership.

So OS/2 spinlocks are not that interesting except in very special applications.

Homebrewed spinlocks

We can do better than the OS/2-supplied spinlocks. In the following sections I will show you how.

Before I begin

For the following examples I need access to a few low-level instructions: interlocked increment, interlocked decrement and interlocked exchange. Since I use Watcom C/C++, I will use one of its features: "#pragma aux" to implement them:

 extern "C" void LockedIncrement(unsigned *p);
 #pragma aux LockedIncrement = \
 ".386p                      " \
 "lock inc dword ptr [eax]   " \
 parm [eax] modify [];
 
 extern "C" void LockedDecrement(unsigned *p);
 #pragma aux LockedDecrement = \
 ".386p                       "\
 "lock dec dword ptr [eax]    "\
 parm [eax] modify [];
 
 extern "C" unsigned LockedExchange(unsigned *p, unsigned v);
 #pragma aux LockedExchange = \
 ".386p                      "\
 "lock xchg ecx,[eax]        "\
 parm [eax] [ecx] modify [] value [ecx];

Those of you who do not use Watcom C/C++ can use this assembler file:

 .386p
 fm_TEXT segment para public use32 'CODE'
   assume cs:_TEXT
   assume ds:nothing
 LockedIncrement proc    near
   mov     eax,[esp+4]
   lock inc dword ptr [eax]
   ret
 LockedIncrement endp
 
 LockedDecrement proc    near
   mov eax,[esp+4]
   lock dec dword ptr [eax]
   ret
 LockedDecrement endp
 
 LockedExchange  proc    near
   mov ecx,[esp+4]
   mov eax,[esp+8]
   lock xchg eax,[ecx]
   ret
 LockedExchange  endp
   public LockedIncrement
   public LockedDecrement
   public LockedExchange
 
 fm_TEXT    ends
 
            end

and use this header file:

 void APIENTRY LockedIncrement(unsigned *p);
 void APIENTRY LockedDecrement(unsigned *p);
 unsigned APIENTRY LockedExchange(unsigned *p, unsigned v);

I also need access to the native event semaphores and I think typing those terribly long OS/2 API names is tedious, so assume:

 class EventSemaphore {
 public:
   EventSemaphore();
   ~EventSemaphore();
   int Post();
   int Wait(long timeout=-1);
   int Reset();
 };

Implementation 1: A true spinlock

This implementation is a true spinlock, ie. it spins until the variable clears.

 unsigned owned=0;
 
 void request() {
   while(LockedExchanged(&owned,1)!=0)
     ;
 }
 
 void release() {
   owned = 0;
 }

Obtaining the spinlock involves repeatedly exchanging the contents of "owned" with 1 and looking at the old value. If the old value was 0 then the spinlock was not owned and since we have put 1 in it we own the spinlock. If the old value was not 0 we just try again.

Releasing the spinlock is simple: Just set the value to 0 (not owned) This implementation has the following positive properties:

  • When the spinlock is requested and it is not already owned by a thread, very few instructions are executed, say 5-10 instructions.
  • When a spinlock owned by a thread is released one of the waiting threads obtains the spinlock in a matter of nanoseconds.

Unfortunately, this implementation has the following negative property:

  • As long as the spinlock is owned other waiting threads are spinning and using valuable CPU time doing nothing useful.

Back to the drawing board.

Implementation 2: a friendly spinlock

If the spinlock is already owned, it makes sense to give up the CPU.

 unsigned owned=0;
 
 void request() {
   while(LockedExchanged(&owned,1)!=0)
     DosSleep(1);
     ;
 }
 
 void release() {
   owned = 0;
 }

This implementation is friendlier than implementation 1, but is still uses valuable CPU time spinning, even if the spinlock is owned for the next 2 years.

Back to the drawing board.

Implementation 3: An even friendlier spinlock

Code first, document later :-)

 unsigned owned=0;
 unsigned waiters=0;
 EventSemaphore try_again;
 
 void request() {
   LockedIncrement(&waiters);
   for(;;) {
     if(LockedExchange(&toggle,1)==0) {
       LockedDecrement(&waiters);
       return;
     }
     try_again.Wait();
     try_again.Reset();
   }
 }
 
 void release() {
   owned = 0;
   if(waiters!=0)
     try_again.Post();
 }
Requesting the spinlock.
First the waiters variable is incremented (must be done with LockedIncrement() since other threads may be modifying it at the same time). Then it tries to obtain the spinlock with a LockedExchange(). If that succeeds waiters is decremented and the function returns. If it does not succeed it blocks on the event semaphore, then resets the event semaphore and goes back to trying to obtain the spinlock.
Releasing the spinlock.
First ownership is released by setting owned to 0. Then if there are any waiters (they may be blocked on the event semaphore) the event semaphore is posted to wake them up.

This implementation has the following properties:

  • Requesting a unowned spinlock is very fast (10-20 instructions)
  • Waiting threads do not use valuable CPU time

Unfortunately, if the spinlock is going to be released within a few microseconds, waiting threads take the penalty of an OS/2 API call. It would be better if they just held on to the CPU for those few microseconds doing nothing.

Back to the drawing board.

Implementation 4: a fast and friendly spinlock

This final implementation is a slight modification af implementation 3. The request() function starts as implementation 1 and then reverts to implementation 3.

 #define spin_retries 100
 unsigned owned=0;
 unsigned waiters=0;
 EventSemaphore try_again;
 
 void request() {
   for(unsigned spin=0; spin<spin_retries; spin++) {
     if(LockedExchange(&toggle,1)==0)
       return;
   }
   LockedIncrement(&waiters);
   for(;;) {
     if(LockedExchange(&toggle,1)==0) {
       LockedDecrement(&waiters);
       return;
     }
     try_again.Wait();
     try_again.Reset();
   }
 }
 
 void release() {
   owned = 0;
   if(waiters!=0)
     try_again.Post();
 }

The only thing that has changed from implementation 3 is the request function. The function starts spinning trying to obtain the spinlock. It does this a fixed number of times (spin_retries). spin_retries has to be selected carefully. The time spent spinning should be less than it takes to do a full thread/process/context-switch and is should be long enough to prevent reverting to implementation 3 if the spinlock is going to be released within a few microseconds.

This implementation has the following positive properties:

  • requesting an unowned spinlock takes only a few instructions (5-10 instructions)
  • requesting an owned spinlock may not need an expensive API call
  • releasing a spinlock that no one is waiting for is fast (5-10 instructions)
  • valuable CPU time is not spent spinning forever

It has only one drawback: nested ownership is not possible. Requesting a spinlock that you already own causes a deadlock. This is left as an exercise for the reader.

The road to SMP power

The road to SMP power consists of three steps:

  1. Multithreading. Single-threaded programs do not gain from SMP. (Unless they are communicating with another process)
  2. SMP-safe. Make sure that you do not rely on SMP-unsafe instructions or how the scheduler deals with thread priorities.
  3. SMP-utilization. CPU-bound tasks that take more than 30 milliseconds and can be divided into sub-tasks should be divided.

That is all.