SMPV211 - Avoiding Device Driver Deadlocks

From EDM2

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation


Deadlock can be defined as an unresolved contention for use of a resource. Whenever any mutual exclusion primitive is used, the possibility of deadlock is introduced. This is evident even in uniprocessor systems such as OS/2 with the use of semaphores. The possibility of deadlock is greater in a multiprocessor environment because of the greater need for mutual exclusion. The method of mutual exclusion for device drivers and the OS/2 SMP kernel is the spinlock. Using spinlocks incorrectly can result in deadlock conditions in which an application or device driver hangs. In the case of a device driver, no more activity will take place on that processor if the device driver enters a deadlock state. Writing device drivers and code for OS/2 for SMP V2.11 requires the programmer to identify the resources that need protection, use spinlocks to protect them, and think about the conditions in the code that might cause a deadlock.

While it would be impossible to list every cause of deadlock, a few of the most common situations that can result in deadlock are shown below in pseudo-code. These examples are not exhaustive, but represent the majority of situations that are likely to be encountered. Being aware of these conditions can help you reduce the chances of deadlock within your device drivers or applications.

Use of CLI/STI

As stated above, CLI and STI affect only the processor on which they execute. Therefore, a resource is protected only from code running on that same processor. For example, assume the application maintains a linked list of I/O packets for a device. Whenever packets are inserted or removed, the list must be protected as a critical resource. Under the uniprocessor model, a CLI/STI pair around the manipulation of the list would be sufficient protection. In an MP environment, however, the CLI/STI pair protects the resource only on the processor that executes it. Another processor could enter a section of code that manipulates the same linked list at the same time. The results would be unpredictable, ranging from no effect to deadlock. Code that relies on CLI/STI for mutual exclusion is not reliable in an MP environment and should be eliminated.

The solution is to replace CLI/STIs with spinlocks. Each critical resource has a spinlock associated with it. The spinlock must be acquired before the resource is accessed and released when the access is complete.
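
The idea of pairing each critical resource with its own spinlock can be sketched in portable C. This is only an illustration under stated assumptions: spinlock_t, spin_lock, spin_unlock, packet_t and the queue functions are hypothetical names built on C11 atomics, not the OS/2 spinlock services, which a real driver would use instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel spinlock, built on a C11 atomic flag. */
typedef struct { atomic_flag held; } spinlock_t;

typedef struct packet {
    int id;
    struct packet *next;
} packet_t;

/* The list and the spinlock that protects it live side by side. */
typedef struct {
    spinlock_t lock;
    packet_t  *head;
} packet_queue_t;

static void spin_lock(spinlock_t *l) {
    /* Spin until the flag was previously clear, i.e. until we own it. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) { }
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void queue_init(packet_queue_t *q) {
    atomic_flag_clear(&q->lock.held);
    q->head = NULL;
}

/* Insert at the head of the list while holding the list's spinlock. */
static void queue_push(packet_queue_t *q, packet_t *p) {
    assert(p != NULL);
    spin_lock(&q->lock);
    p->next = q->head;
    q->head = p;
    spin_unlock(&q->lock);
}

/* Remove the head of the list, or return NULL if the list is empty. */
static packet_t *queue_pop(packet_queue_t *q) {
    spin_lock(&q->lock);
    packet_t *p = q->head;
    if (p != NULL)
        q->head = p->next;
    spin_unlock(&q->lock);
    return p;
}
```

Every access to the list goes through queue_push or queue_pop, so no code path can touch the list without owning the lock, whichever processor it runs on.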

Spinlocks Taken Out of Order

One possible cause of deadlock stems from taking spinlocks in different orders in different sections of code. Consider the following two sections of code, each executing on a separate processor at the same time. For both examples all locks are available when the code begins execution.

       Code section 1              Code section 2


    1  Lock spinlock1           1  Lock spinlock2
    2  Do some processing       2  Do some processing
    3  Lock spinlock2           3  Lock spinlock1
    4  More processing          4  More processing
    5  Unlock spinlock2         5  Unlock spinlock1
    6  Unlock spinlock1         6  Unlock spinlock2

In section 1, line 1 locks spinlock1. In section 2, line 1 locks spinlock2. Both sections successfully lock their respective locks and continue normally. Section 1 then tries to lock spinlock2 on line 3; it is already locked by section 2, so section 1 spins. Section 2 then tries to lock spinlock1 on line 3; it is already locked by section 1, so section 2 spins as well. Each section of code is now spinning, waiting for a lock that the other owns. The result is deadlock. Neither section will ever continue executing, so neither will ever release the spinlock that the other needs. This kind of deadlock is very common, but it can be avoided by always taking related spinlocks in the same order.

To fix the above code, code section 2 would be recoded to the following:

      Code section 2


   1  Lock spinlock1
   2  Lock spinlock2
   3  Do some processing
   4  More processing
   5  Unlock spinlock2
   6  Unlock spinlock1

By taking the locks in the same order as code section 1, the deadlock potential is eliminated. The two sections can no longer each be waiting on a resource the other owns. Note that spinlocks should be released in the reverse of the order in which they were locked.
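
One way to make a fixed lock order hard to violate is to give every lock a rank and route all multi-lock acquisitions through a helper. The following is a minimal sketch, assuming a hypothetical ranked_lock_t type and helper names of our own; it is not part of the OS/2 spinlock interface.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_flag held;
    int rank;   /* hypothetical: this lock's fixed position in the lock order */
} ranked_lock_t;

static void lock_one(ranked_lock_t *l) {
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) { }
}

static void unlock_one(ranked_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Lock both locks, lowest rank first, whatever order the caller names them.
   Returns the rank of the lock that was taken first. */
static int lock_pair(ranked_lock_t *a, ranked_lock_t *b) {
    assert(a->rank != b->rank);
    ranked_lock_t *first  = (a->rank < b->rank) ? a : b;
    ranked_lock_t *second = (first == a) ? b : a;
    lock_one(first);
    lock_one(second);
    return first->rank;
}

/* Release in the reverse of the acquisition order. */
static void unlock_pair(ranked_lock_t *a, ranked_lock_t *b) {
    ranked_lock_t *first  = (a->rank < b->rank) ? a : b;
    ranked_lock_t *second = (first == a) ? b : a;
    unlock_one(second);
    unlock_one(first);
}
```

With this helper, a caller that writes lock_pair(&spinlock2, &spinlock1) still acquires spinlock1 first, so the ordering rule is enforced in one place rather than in every code section.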

Blocking With Spinlocks Locked

Another cause of deadlock is blocking with locked spinlocks. Consider the following two sections of code. Section 1 is a task time operation that needs an interrupt to complete. Section 2 is the interrupt code that will execute and unblock section 1.

     Code section 1                  Code section 2
     (Task time)                     (Interrupt time)

     Lock spinlock1                  interrupt received
     start I/O                       lock spinlock1
     block (ProcBlock)               unblock (ProcRun)
                                     release spinlock1
     return from block
     some processing
        (may include a re-block)
     release spinlock1

In the above example code section 1 locks spinlock1 and then blocks (with the spinlock still locked). Code section 2 will then execute when the I/O completes. The interrupt code first tries to lock spinlock1. Because spinlock1 is already locked, the interrupt code will spin waiting for the lock. The lock will never become available, however, because the only way for the spinlock to be unlocked is for section 1 to be unblocked. But the interrupt code, which is responsible for the unblock, can't continue until it acquires the spinlock. The result is deadlock.

A first attempt to solve this problem might be to recode section 1 as follows:

    Lock spinlock1
    start I/O
    release spinlock1
    block (ProcBlock)

    return from block
    lock spinlock1
    some processing
    release spinlock1

The above code sequence appears to correct the problem. It does not, however, and can also result in a deadlock. The reason is that there is a window, between the point where the code releases the spinlock and the point where the thread is blocked, in which an interrupt can occur. Remember that disabling interrupts on one processor does not prevent the interrupt from being handled on another. If an interrupt fires in this window, the interrupt handler (section 2 above) will run. It will acquire the spinlock and attempt to unblock the thread. The thread, however, has not actually blocked yet. When the thread finally does block, the wakeup event has already occurred and will not be repeated. The result once again is deadlock.
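
The window described above can be made concrete with a small single-threaded model. This is a toy sketch, not driver code: sim_t, sim_interrupt and flawed_sequence_hangs are hypothetical names, and the model assumes, as the text describes, that an unblock aimed at a thread that has not yet blocked has no effect.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { THREAD_RUNNING, THREAD_BLOCKED } thread_state_t;

typedef struct {
    thread_state_t state;
    bool io_complete;
} sim_t;

/* Models code section 2: record the completion and unblock the thread.
   Unblocking a thread that is not blocked changes nothing. */
static void sim_interrupt(sim_t *s) {
    s->io_complete = true;
    if (s->state == THREAD_BLOCKED)
        s->state = THREAD_RUNNING;
}

/* Models the flawed code section 1 (release the lock, then block).
   irq_in_window selects whether the interrupt fires between the release
   and the block. Returns true if the thread is left blocked forever. */
static bool flawed_sequence_hangs(bool irq_in_window) {
    sim_t s = { THREAD_RUNNING, false };

    /* lock spinlock1, start I/O, release spinlock1 happen here */
    if (irq_in_window)
        sim_interrupt(&s);          /* wakeup arrives before the block */

    s.state = THREAD_BLOCKED;       /* block unconditionally */

    if (!irq_in_window)
        sim_interrupt(&s);          /* normal case: wakeup after the block */

    assert(s.io_complete);          /* the I/O finished in both cases */
    return s.state == THREAD_BLOCKED;
}
```

In the model, the I/O completes in both orderings, but only the ordering with the interrupt inside the window leaves the thread blocked with no further wakeup pending.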

To solve this particular problem, DevHelp_Block has been modified to release ALL spinlocks that are owned on the current processor. The device driver should call DevHelp_Block with spinlocks locked. The kernel will first put the thread of execution in the blocked state. Then, before dispatching the next thread, it will release all locked spinlocks for the current processor. Because the thread is in the blocked state, it is valid for another processor to execute interrupt code that will do the DevHelp_Run. The result is no deadlock. The code sequences from above should be re-coded to the following to avoid the deadlock:

     Code section 1                  Code section 2

     Lock spinlock1                  interrupt received
     start I/O                       lock spinlock1
     While(block required)           unblock (ProcRun)
        Block                        release spinlock1
        return from block
        Lock spinlock1
     EndWhile
     some processing
     release spinlock1

The above example has been expanded to include the steps required to ensure that, when the thread is woken up, the condition it was waiting for is satisfied before execution continues. This code sequence is analogous to that listed in the description for DevHlp_Block in the Device Helper Services chapter of the Physical Device Driver Reference. It has been modified to use spinlocks instead of disabling interrupts (which will not work).
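
The re-check loop can be walked through with a single-threaded model in which the block call releases the lock and the interrupt handler then runs. Everything here is a hypothetical sketch (sim_lock_t, sim_block, sim_task and so on are invented for the illustration); the real block service and its spinlock handling are provided by the kernel.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool held; } sim_lock_t;   /* model only, not a real spinlock */

typedef struct {
    sim_lock_t lock1;
    bool io_complete;
    int wakeups;
} sim_driver_t;

static bool sim_try_lock(sim_lock_t *l) {
    if (l->held) return false;
    l->held = true;
    return true;
}

static void sim_unlock(sim_lock_t *l) {
    assert(l->held);
    l->held = false;
}

/* Models code section 2. Returns false where a real handler would spin
   forever because spinlock1 is still owned by the blocked thread. */
static bool sim_interrupt(sim_driver_t *d, bool complete_io) {
    if (!sim_try_lock(&d->lock1)) return false;
    if (complete_io) d->io_complete = true;
    d->wakeups++;                       /* unblock (ProcRun) */
    sim_unlock(&d->lock1);
    return true;
}

/* Models the modified block: the lock is released as part of blocking,
   so the interrupt handler is able to run and wake the thread. */
static bool sim_block(sim_driver_t *d, bool complete_io) {
    sim_unlock(&d->lock1);
    return sim_interrupt(d, complete_io);
}

/* Models code section 1. early_wakeups is the number of wakeups that
   arrive before the I/O is really complete. Returns total wakeups seen. */
static int sim_task(sim_driver_t *d, int early_wakeups) {
    bool ok = sim_try_lock(&d->lock1);
    assert(ok);
    /* start I/O */
    while (!d->io_complete) {           /* While(block required) */
        bool woke = sim_block(d, early_wakeups-- <= 0);
        assert(woke);
        ok = sim_try_lock(&d->lock1);   /* re-lock after return from block */
        assert(ok);
    }
    /* some processing */
    sim_unlock(&d->lock1);
    return d->wakeups;
}
```

Calling sim_task with two early wakeups goes around the loop three times before the condition is satisfied, while calling sim_interrupt with lock1 still held reports the case in which the original sequence would deadlock.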

Once again, this list is not exhaustive, but is a representation of the majority of cases that can cause deadlock. By avoiding these situations, the chances of deadlock are reduced considerably. In addition, there are certain system level checks performed to help ensure that deadlock is avoided. If the system detects a situation that could cause deadlock, such as attempting to block while owning a spinlock, it will panic the system and print an internal processing error message.

Blocking

As shown in the last example, there are special considerations that must be followed when blocking in an MP-aware device driver. Because blocking with a spinlock owned can cause deadlock, the DevHelp_Block service will unlock spinlocks as part of the blocking sequence. When the run is done and the blocked thread begins execution again, it must again lock any required spinlocks.

All system components that use spinlocks must be aware of calls that may block. For example, the file system, which calls a device driver to perform I/O, will almost always block in the device driver. The file system therefore should release all spinlocks before calling the device driver. In general, release all spinlocks before making a call that could block.

Interrupt Processing

Interrupt processing should not be affected, except by the need to lock spinlocks for critical resources. When a spinlock is locked, the LockManager will disable interrupts before returning to the device driver. This ensures that no interrupt will occur, on the same processor, between when the spinlock is requested and when the kernel returns to the device driver with the spinlock locked. (This is the same level of function accomplished by a CLI on a single-processor system.) The device driver MUST leave interrupts disabled while owning the spinlock. If interrupts were enabled, a deadlock could occur. Consider the following:


     Task Time                           Int Time

  (ints enabled)
  Lock spinlock1
  STI
                  ---Interrupt--->     Lock spinlock1
                                       some processing
                                       Unlock spinlock1
                                       EOI
                  <-- Return from Int
  Some processing
  Unlock spinlock1

In the above example the task time and interrupt code are running on the same processor. When the task time code locks spinlock1 with interrupts enabled, the LockManager will return with interrupts disabled. If interrupts were then re-enabled with the STI instruction, the interrupt code on the right could run. The first thing the interrupt handler does is try to grab spinlock1. Because spinlock1 is already locked, the interrupt handler will spin. The lock, however, will never become available, because the task time code will not run until the interrupt code completes. The result is deadlock. This is why it is important to leave interrupts disabled while owning a spinlock.

Consider the same code above, but with the task time code running on processor A and the interrupt code running on processor B. For this example, however, interrupts remain disabled (remove the STI). Because the LockManager has disabled interrupts on processor A, processor B will run the interrupt code. When the interrupt code attempts to get the spinlock, it will spin. Because processor A continues executing, the spinlock will be released, thereby allowing the interrupt code on processor B to acquire the spinlock and continue execution. Deadlock is avoided. When processor A returns from the unlock, the LockManager will have restored the interrupt flag to the state it was in before the lock was done.
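
The save, disable and restore behavior of the interrupt flag can be sketched as a small model. This is an illustration only: sim_cpu_t and sim_spinlock_t are hypothetical, and keeping the saved flag inside the lock structure is an assumption made for the sketch, not a description of how the LockManager stores it.

```c
#include <assert.h>
#include <stdbool.h>

/* Models one processor's interrupt flag. */
typedef struct { bool interrupts_enabled; } sim_cpu_t;

typedef struct {
    bool held;
    bool saved_if;   /* assumption: flag state remembered at lock time */
} sim_spinlock_t;

/* Lock: remember the caller's interrupt state, then return with
   interrupts disabled on this processor. */
static void sim_lock(sim_cpu_t *cpu, sim_spinlock_t *l) {
    assert(!l->held);
    l->saved_if = cpu->interrupts_enabled;
    cpu->interrupts_enabled = false;
    l->held = true;
}

/* Unlock: release the lock and restore the interrupt state that was
   in effect before the lock was taken. */
static void sim_unlock(sim_cpu_t *cpu, sim_spinlock_t *l) {
    assert(l->held);
    l->held = false;
    cpu->interrupts_enabled = l->saved_if;
}

/* An interrupt handler that needs the lock can only hang this processor
   if it can be delivered here while the lock is held, which requires
   interrupts to be enabled. */
static bool same_cpu_deadlock_possible(const sim_cpu_t *cpu,
                                       const sim_spinlock_t *l) {
    return l->held && cpu->interrupts_enabled;
}
```

In the model, the unsafe state can only be reached by setting interrupts_enabled by hand while the lock is held, which corresponds to the STI in the example above.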

Another action the device driver must avoid is issuing its own EOI. All EOIs must use the DevHelp_EOI device helper service. The reason for this is that different multiprocessor platforms have defined their own advanced interrupt controllers. Without detailed knowledge of the controller and how it operates, and knowledge of how the kernel is using the controller, the device driver can cause unpredictable results, including deadlock. All MP-aware device drivers must use the EOI service.