Exception Management - or coping with bugs

From EDM2

By Roger Orr

Introduction

One of the hardest parts of writing robust programs is dealing with the unexpected. There is nothing worse than a mission critical program suddenly halting because of a division by zero error in some obscure statistics gathering routine, and leaving your users fuming!

The first line of defence is of course to add parameter checking - ensure numbers are 'realistic' and pointers are 'plausible' before attempting to use them. This will deal with most problems, but unfortunately it is extremely expensive (in programmer time, in testing, and in execution time) to exhaustively check every possible cause of error before it occurs. In addition bugs may well occur even in the checking code itself!

However, even if this technique is used there is an additional class of unexpected events - the user (or another program) killing the program via Ctrl+C or DosKillProcess. In a multi-programming environment it is nice to have some control over program exit to ensure shared resources are left in a consistent state.

What further protection against these events is possible? One answer is to make use of the OS/2 exception management subsystem. This provides a structured way to add special handling for various possible unexpected errors, such as divide by zero, Ctrl+C or attempting to access an invalid address.

Those familiar with OS/2 version 1 will recall that the first of the above could be handled by code using DosSetVec, the second by making use of DosSetSigHandler - and the third couldn't be handled. [Note: programming DosSetSigHandler was discussed in an earlier article - Pointers issue 12 (Jan/Feb 91)]

So what's new in version 2?

One of the goals of OS/2 version 2 was to provide an environment which was less specifically targeted to the Intel x86 processor. The most obvious part of the base OS/2 which has been affected is the memory model - no more 64 KB segments, just 4 GB of linearly addressable memory! However, one of the other areas to be reworked in version 2 is the handling of exceptions and signals, to provide a more generalised mechanism.

Under OS/2 version 2 each thread in a process can have one or more procedures registered as "exception handlers", which means that they are called when an exception occurs and can take corrective action.

The procedures are called in turn, starting with the most recently registered. Each procedure has three possible actions: it can return 0 (XCPT_CONTINUE_SEARCH) to continue the search down the chain; it can return 0xffffffff (XCPT_CONTINUE_EXECUTION) to resume execution of the program, after having removed the cause of the fault; or it can jump somewhere else - for example by using the 'C' language longjmp procedure. If every handler in the chain returns 0 then OS/2 takes some default action - typically to abort the process.
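The search order described above can be modelled in a few lines of portable C. This is only a sketch of the idea, not the OS/2 implementation: the names reg_record, register_handler, dispatch and the two sample handlers are all invented for illustration, and the handlers take just an exception number rather than the real four parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two return values described in the text. */
#define CONTINUE_SEARCH    0UL
#define CONTINUE_EXECUTION 0xFFFFFFFFUL

typedef unsigned long (*handler_fn)(unsigned long exception_num);

/* Hypothetical registration record: a link to the previously
   registered record plus the handler to call. */
struct reg_record {
    struct reg_record *prev;
    handler_fn         handler;
};

static struct reg_record *chain_head = NULL;  /* most recent registration */
int handlers_called = 0;                      /* handlers run by last dispatch */

/* Push a record on the front of the chain. */
void register_handler(struct reg_record *rec)
{
    assert(rec != NULL && rec->handler != NULL);
    rec->prev = chain_head;
    chain_head = rec;
}

/* Offer the exception to each handler, newest first.
   Returns 1 if a handler asked to resume execution,
   0 if all of them passed (the default action would then apply). */
int dispatch(unsigned long exception_num)
{
    struct reg_record *rec;

    handlers_called = 0;
    for (rec = chain_head; rec != NULL; rec = rec->prev) {
        handlers_called++;
        if (rec->handler(exception_num) == CONTINUE_EXECUTION)
            return 1;
    }
    return 0;
}

/* Sample handler: not interested in anything. */
unsigned long pass_everything(unsigned long exception_num)
{
    (void)exception_num;
    return CONTINUE_SEARCH;
}

/* Sample handler: claims to have repaired exception number 1 only. */
unsigned long repair_number_one(unsigned long exception_num)
{
    return exception_num == 1UL ? CONTINUE_EXECUTION : CONTINUE_SEARCH;
}
```

If repair_number_one is registered first and pass_everything second, dispatching exception 1 visits pass_everything, then repair_number_one, and stops there; any other number falls off the end of the chain.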

So-called 'signals' such as Ctrl+C or another program issuing DosKillProcess generate a specific class of exception which is handled by the exception handler chain for thread 1 of the program.

The functions provided by OS/2

Exception handlers are registered and deregistered by DosSetExceptionHandler and DosUnsetExceptionHandler. These take the address of an exception registration record, which contains a reserved word (used to maintain the chain of exception handlers) and the address of the exception handling function.

DosUnwindException can be used to call and remove a number of exception handlers from the chain. This function takes an address of the location where execution will resume once all procedures have unwound - it would not usually be called from a high level language directly, because of the problems of adjusting the programming environment, but via some language supported construct such as the 'C' longjmp function.

Specific exceptions can be generated by DosRaiseException - this includes user-defined exceptions as well as system ones.

DosError with parameter FERR_DISABLEEXCEPTION can be used to suppress the default exception popup message if a fatal exception causes program exit.

For signal exceptions the process uses DosSetSignalExceptionFocus to tell OS/2 that it is prepared to receive Ctrl+C or Ctrl+Break. Once it has processed a signal exception it calls DosAcknowledgeSignalException to tell OS/2 it is ready for another one - this prevents a process being overrun with signal exceptions if the user 'leans' on the Ctrl+C key! Another process can call DosSendSignalException to send a signal explicitly, or DosKillProcess to do so a little more indirectly.

Finally, critical pieces of code can be protected from signal exceptions by use of DosEnterMustComplete and DosExitMustComplete - any signals occurring between these two calls are delayed until the 'must complete' region is exited.

Overview of a general exception handler

An exception handler ought to have the following properties:

  1. It should not impact other exception handlers
  2. It must be reliable

Point 1 is necessary to ensure that, for example, the exception caused by reaching the end of the currently allocated stack is allowed to pass on along the chain to the exception handler which will allocate additional stack space and resume execution. It is quite easy to enforce this behaviour by simply ensuring that 0 (or XCPT_CONTINUE_SEARCH) is returned for all exceptions other than the one or ones being explicitly processed.

Point 2 is necessary to ensure your program can be terminated successfully! The exception chain is called on program termination and the program WILL NOT TERMINATE until each exception handler has returned. One other point is that nested exceptions are quite easy to generate (i.e. your exception handler causes an exception!) and will crash or hang your program unless you guard against them. Fortunately there is a flag passed to the exception handler when a nested exception is being processed, and usually if this flag is set you will pass the exception on up the chain of handlers until it gets out of the nested region.

An example framework for a 'do nothing' exception handler in C is:

ULONG APIENTRY handler(
  EXCEPTIONREPORTRECORD *pReport, /* details of this exception */
  EXCEPTIONREGISTRATIONRECORD *pRegRecord, /* registration record for handler */
  CONTEXTRECORD *pContext,       /* machine context at time of fault */
  void *ptr )                    /* dispatcher context (exception specific) */
  {
  /* Exception handling goes in here... */
  return XCPT_CONTINUE_SEARCH;
  }

It might be used like this:

  {
  EXCEPTIONREGISTRATIONRECORD reg_rec = { 0, handler };
  APIRET rc = 0;

  rc = DosSetExceptionHandler( &reg_rec );

  /* Code to be protected goes in here... */

  rc = DosUnsetExceptionHandler( &reg_rec );
  }

Note that OS/2 uses the ADDRESS of the exception registration record to order and search the chain. The record must be allocated on the stack and MUST be deregistered before the procedure exits. If this is not done the record will be overwritten and your program will probably crash or fail to exit, since the chain of registration records will be corrupted.

The most common cause of this error is registering an exception handler at the beginning of a procedure and either not deregistering it at the end, or deregistering it at the end but returning earlier (for example on some error condition). The program may well continue to work fine until it ends and OS/2 attempts to pass the program termination exception down the chain of handlers; disaster then occurs because the data structure for the first handler has been overwritten, and the program locks up. This is particularly true of exception handlers registered at the beginning of the 'main' procedure: you must EITHER deregister them OR ensure the program is ended by calling the exit function rather than by returning from the main procedure.
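One way to avoid the early-return trap is to give the procedure a single exit point, so every path passes through the deregistration call. The sketch below models this in portable C; set_handler, unset_handler, chain_is_empty and guarded_double are invented names standing in for the real registration calls and for your own procedure, and the last-in first-out check in unset_handler is an assumption of this model only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical registration record holding only the chain link. */
struct reg_record {
    struct reg_record *prev;
};

static struct reg_record *chain_head = NULL;

/* Model of registering: push the record on the chain. */
void set_handler(struct reg_record *rec)
{
    assert(rec != NULL);
    rec->prev = chain_head;
    chain_head = rec;
}

/* Model of deregistering: pop the record, which must be the newest.
   Returns 0 on success, -1 if rec is not at the head of the chain. */
int unset_handler(struct reg_record *rec)
{
    if (rec == NULL || chain_head != rec)
        return -1;
    chain_head = rec->prev;
    return 0;
}

/* Non-zero when no records remain on the chain. */
int chain_is_empty(void)
{
    return chain_head == NULL;
}

/* Doubles a non-negative input; returns -1 for a negative one.
   Both paths leave through 'done', so the stack-allocated record
   is always off the chain before this frame disappears. */
int guarded_double(int input)
{
    struct reg_record rec = { NULL };
    int result;

    set_handler(&rec);

    if (input < 0) {
        result = -1;      /* error path: no early 'return' here */
        goto done;
    }
    result = input * 2;   /* the protected work */

done:
    (void)unset_handler(&rec);
    return result;
}
```

Whichever branch is taken, the chain is empty again once guarded_double returns, which is exactly the property the real registration record needs.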

The first parameter of the exception handling function describes the actual exception being processed - the most important field being the ExceptionNum which identifies the exception, for example XCPT_INTEGER_DIVIDE_BY_ZERO or XCPT_ACCESS_VIOLATION. Many exception handlers consist of a switch on this value with the 'default' case returning 0. Additional fields give more information: some are exception specific, such as the address causing the access violation for the second example above; others are more general, such as a flag indicating whether an exception is nested.
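The usual shape of such a handler (check for nesting first, then switch on the exception number, and pass everything else along) might look like the following portable sketch. Everything prefixed demo_ or DEMO_ is a placeholder invented here: the structure is a cut-down stand-in for the real report record, and the numeric values are arbitrary rather than those in the OS/2 headers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only; real values come from the OS/2 headers. */
#define DEMO_CONTINUE_SEARCH     0UL
#define DEMO_CONTINUE_EXECUTION  0xFFFFFFFFUL
#define DEMO_DIVIDE_BY_ZERO      1UL
#define DEMO_ACCESS_VIOLATION    2UL
#define DEMO_STACK_GROWTH        3UL
#define DEMO_FLAG_NESTED         0x10UL

/* Cut-down report record: exception number, flags, faulting address. */
struct demo_report {
    unsigned long exception_num;
    unsigned long flags;
    unsigned long fault_address;
};

unsigned long last_bad_address = 0UL;   /* noted by the handler below */

unsigned long demo_handler(const struct demo_report *report)
{
    assert(report != NULL);

    /* Nested exception: do not try to be clever, pass it along. */
    if (report->flags & DEMO_FLAG_NESTED)
        return DEMO_CONTINUE_SEARCH;

    switch (report->exception_num) {
    case DEMO_DIVIDE_BY_ZERO:
        /* Pretend the cause was repaired, so execution may resume. */
        return DEMO_CONTINUE_EXECUTION;

    case DEMO_ACCESS_VIOLATION:
        /* Only note the address; nothing was repaired, so pass it on. */
        last_bad_address = report->fault_address;
        return DEMO_CONTINUE_SEARCH;

    default:
        /* Anything else (stack growth, say) belongs to another handler. */
        return DEMO_CONTINUE_SEARCH;
    }
}
```

The important line is the 'default' case: a handler that swallows exceptions it does not understand will break handlers further along the chain.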

The second parameter points to the registration record used to register this instance of the exception handler. Typically the exception handler will need some additional parameters to enable proper action to be taken. One option is of course to use some global or module variable, but this can get a little tricky with multiple threads or recursive procedures. A better method is to embed the exception registration record as the first member of a larger structure; since the address of this record is passed to the exception handler, it can then be used to access the extra information. I use this method in the sample program below.
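The embedding trick relies on the C rule that a pointer to a structure also points to its first member, so the cast works in both directions. Here is a minimal portable sketch, with reg_record, extended_record and counting_handler as made-up names and a one-parameter handler for brevity:

```c
#include <assert.h>
#include <stddef.h>

struct reg_record;
typedef unsigned long (*handler_fn)(struct reg_record *rec);

/* Hypothetical stand-in for the system's registration record. */
struct reg_record {
    struct reg_record *prev;
    handler_fn         handler;
};

/* Extended record: the registration record comes FIRST, so a pointer
   to it is also a pointer to the whole structure. */
struct extended_record {
    struct reg_record reg;
    int               times_called;   /* extra per-registration state */
};

unsigned long counting_handler(struct reg_record *rec)
{
    /* Recover the enclosing structure from the pointer we were handed. */
    struct extended_record *ext = (struct extended_record *)(void *)rec;

    assert(offsetof(struct extended_record, reg) == 0);
    ext->times_called++;
    return 0UL;   /* stand-in for 'continue search' */
}
```

Because the state lives in the record rather than in a global, two registrations of the same handler (on different threads, or at different depths of recursion) each keep their own count.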

The third parameter points to the context record describing the machine state when the exception occurred - you can for example read (and modify) the registers. This is of most use to assembler programmers who might for example single step over a failing instruction by modifying the instruction pointer in this record - users of a high level language usually do not have enough control over the machine code generated to do much with this information.

Finally the fourth parameter is used to pass additional information for one or two specific exceptions. See the full OS/2 documentation for more details.

Description of the sample program

Well that's enough (too much?) of these technical details - here is a simple example of how you might use an exception handler in 'C' to check pointers for validity. This is the 'lazy validation' method - invalid pointers are hopefully a rare error so just access the area pointed to and 99% of the time it will work. The exception handler will catch the bad 1% and enable you (in a fully fledged program) to return an error code or take some other avoiding action.

Under earlier versions of OS/2 you couldn't do this - one bad pointer access and your program was unceremoniously aborted.

So how does it work? The code is listed below and consists of a simple user test harness in the 'main' procedure which asks for an address and then calls the verify procedure to check and display the byte at that address. If unsuccessful the routine prints a 'Bad address' message, with the violation error code, instead.

Comments on the program

The first point to note is that the exception handler is localised - ie it is registered and deregistered within the verify procedure as close to where it is required as possible. This means that I can safely assume that I can process ALL access violations without needing to check any further. It is in general a good idea to try and keep exception handlers close to the code they are protecting.

The second point is the use of an 'extended' registration record (MYREC) to enable me to pass the jmp_buf structure into the exception handler.

The third point is that setjmp/longjmp makes use of the OS/2 exception manager. In particular longjmp calls DosUnwindException to remove all exception handlers added since the environment was saved by setjmp. In the code I register VerifyHandler AFTER setjmp has been called. If the pointer access is invalid the exception handler is called, longjmp is executed and the code to deregister the exception handler is never reached. No deregistration is needed in the 'failing' branch of the setjmp call, however: if you try to deregister VerifyHandler there you will find that the call returns an error, because VerifyHandler was already removed from the chain when longjmp called DosUnwindException.
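The control flow is easier to follow with the operating system taken out of the picture. In this portable sketch the 'fault' is simulated by an explicit null check, since standard C has no way to trap a bad pointer; guard, simulated_handler, risky_read and checked_read are all names invented for the illustration. The error code is declared volatile because it is written between the setjmp call and the longjmp, and read afterwards.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* State shared between the protected code and the simulated handler. */
struct guard {
    jmp_buf                env;
    volatile unsigned long error_code;  /* written before longjmp, read after */
};

/* Stands in for the exception handler: record a code and jump back. */
static void simulated_handler(struct guard *g, unsigned long code)
{
    g->error_code = code;
    longjmp(g->env, 1);
}

/* Stands in for the faulting access. A real bad pointer would be trapped
   by the hardware; here a null pointer is detected explicitly. */
static unsigned char risky_read(struct guard *g, const unsigned char *p)
{
    if (p == NULL)
        simulated_handler(g, 1UL);
    return *p;
}

/* Returns 0 and stores the byte in *out, or returns -1 and stores the
   error code in *error if the read "faulted". */
int checked_read(const unsigned char *p, unsigned char *out,
                 unsigned long *error)
{
    struct guard g;

    assert(out != NULL && error != NULL);
    g.error_code = 0UL;

    if (setjmp(g.env) == 0) {
        /* Normal path: runs once, straight after setjmp. */
        *out = risky_read(&g, p);
        return 0;
    }

    /* Failing path: reached only via longjmp. */
    *error = g.error_code;
    return -1;
}
```

Note that nothing after the faulting statement in the normal path runs when the jump is taken, which is why any clean-up placed there (such as deregistering a handler) is skipped.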

A general point to bear in mind is that the exception handler ONLY AFFECTS THE THREAD IN WHICH IT WAS REGISTERED so it's no use to register a single exception handler once in your main program and then create lots of threads - each one will need to have the handler separately registered. The only case where one handler is enough is for signal exceptions since they are ALWAYS passed to thread 1.

The final point is to make sure you have a lot of stack if you define your own exception handlers - they seem to need it, especially when you get nested exceptions. Failure to have enough stack may cause your program to exit or hang since OS/2 will be unable to dispatch the program termination exception properly. In this example I use 0x4000 bytes which for such a small program may be a bit excessive, but under OS/2 v2 lazy stack allocation means that the memory for this stack should only be actually obtained when I need it!

It is another reason why exception handlers should be short and simple (if possible!) since this reduces the stack requirement as well as the likelihood of a nested exception.

Conclusion

With version 2, OS/2 now provides fairly robust exception management. The main problem is that it is still too easy to prevent a program exiting by a careless line or two of code in an exception handler. In addition stack corruption, or failure to deregister an exception handler on exiting from the procedure it was defined in, can cause problems which are not easy to find later on.

I have noticed that a number of IBM's own OS/2 programs (including the command shell itself!) will on occasion fail to exit because of problems with exception handlers, and the program is then unkillable and remains hanging about until you reboot.

On the positive side however it is a nice luxury to be able to protect important programs from unexpected events, especially bad pointers, and without needing to resort to assembler to do so.

I hope more OS/2 programmers will be encouraged to have a go at putting in exception handlers where they are appropriate to make their programs more reliable or to simplify error handling.

Compilation command

[I am using IBM C/C++]

icc /wall+ /b/stack:0x4000 VerAddr.c
------------------------- VerAddr.c --------------------------

#define INCL_DOS

#include <os2.h>
#include <ctype.h>
#include <stdio.h>
#include <setjmp.h>

typedef struct _myrec             /* 'extended' registration record */
  {
  EXCEPTIONREGISTRATIONRECORD RegRecord; /* MUST BE FIRST! */
  ULONG ErrorCode;
  jmp_buf jmpbuf;
  } MYREC, *PMYREC;

ULONG APIENTRY VerifyHandler(
  EXCEPTIONREPORTRECORD *pReport,
  EXCEPTIONREGISTRATIONRECORD *pRegRecord,
  CONTEXTRECORD *pContext,
  void *ptr)

  {
  PMYREC pMyRec = (PMYREC)(PVOID)pRegRecord; /* Get extended structure */

  /* Reference unwanted parameters */
  ptr = ptr;
  pContext = pContext;

  if ( pReport->ExceptionNum  == XCPT_ACCESS_VIOLATION )
     {
     pMyRec->ErrorCode = pReport->ExceptionInfo[ 0 ];
     longjmp( pMyRec->jmpbuf, 1 );
     }

  return XCPT_CONTINUE_SEARCH;
  }

void verify( MPARAM pAddr )
  {
  MYREC except = { {0, VerifyHandler }, 0 };
  APIRET rc = 0;
  PSZ p = (PSZ) pAddr;
  unsigned char ch = '\0';


  if ( setjmp( except.jmpbuf ) == 0 )
     {
     rc = DosSetExceptionHandler(&except.RegRecord);
     if ( rc != 0 )
        printf( "DosSetExceptionHandler: error - %lu\n", (unsigned long) rc );

     ch = *p;

     rc = DosUnsetExceptionHandler( &except.RegRecord );
     if ( rc != 0 )
        printf( "DosUnsetExceptionHandler: error - %lu\n", (unsigned long) rc );

     if ( isprint( ch ) )
        printf("Value: '%c'\n", ch );
     else
        printf("Value: '%2.2x'\n", (unsigned char)ch );
     }
  else
     printf( "Bad address - violation error %lu\n", (unsigned long) except.ErrorCode );


  return;
  }

int main( int argc, char **argv )
  {
  LONG lAddr = 0;
  char buffer[ BUFSIZ ] = { '\0' };

  /* Reference unwanted parameters */
  argc = argc;
  argv = argv;

  while( !feof( stdin ) )
     {
     printf( "Enter address, or press Ctrl+Z: " ); fflush( stdout );
     if ( fgets( buffer, sizeof buffer, stdin ) != NULL ) /* bounded read; gets() can overrun buffer */
        {
        if ( sscanf( buffer, "%li", &lAddr ) != 1 )
           printf( "Bad argument - integer expected\n" );
        else
           verify( MPFROMLONG( lAddr ) );
        }
     }
  return 0;
  }

Roger Orr - 27-Aug-1993