NDIS Driver Developer's Tool Kit for OS/2 and DOS Programmer's Performance Guide

Introduction
A Network Driver Interface Specification (NDIS) Media Access Control (MAC) device driver provides the software interface with a Specific communication adapter and makes that communication adapter system resources available to communications protocol stacks through the NDIS programming interface.

This document provides guidance to the creators of NDIS MAC device drivers to help them achieve good performance from their device driver implementations.

The performance that is important to focus on for an NDIS MAC device driver is the performance that the communications users see. Communications users are most sensitive to the time it takes to send their data across the communications network. From the performance point of view this suggests that we focus on making the data flow paths as fast as possible for any quantity of data. Accomplishing fast data transfer requires analysis of three areas:
 * fast data flow between the device driver and the protocol stack
 * fast data flow through the device driver
 * fast data flow between the device driver and the adapter card.

In addition to minimizing the latency of the data flow paths, minimizing the CPU utilization of the NDIS MAC device driver is also important. Optimizing CPU utilization is necessary in servers, since the performance of the server is often a bottleneck in client-server transactions.

The performance of the NDIS MAC device driver itself is important as well as how it interfaces with the rest of the hardware and software system.

This document:
 * explains the key measures used in evaluating NDIS MAC device driver performance
 * discusses overall design considerations
 * identifies NDIS verbs of particular performance importance
 * discusses interrupt processing
 * discusses OS/2 system usage
 * discusses programming practices.

This information may help NDIS MAC device driver creators understand the performance effects of the many choices available to them and result in better performing NDIS MAC device drivers.

Performance Criteria
Performance for a NDIS MAC device driver is evaluated by three quantities: latency, throughput and system processor utilization. Latency is a measure of how fast the transport can put data on, or get data off, the communication medium. Latency can be broken down into transmit latency and receive latency. Latency is often measured by the time used to send and receive a few bytes of data. Smaller latency is better.

Throughput is a measure of how well data can be kept flowing on to or off of the communication medium. Throughput is measured in units of work per unit time. The amount of data used in throughput evaluation is large so that the effects of data handling and buffer management are apparent. For communications applications, throughput is often measured in units of bytes per second, or units of frames per second (for a range of frame sizes). Larger throughput is better.

Finally, system processor utilization shows how much of the system processor is used for flowing data in a throughput environment. Smaller system processor utilization is better.

End users are interested in the performance of the entire communications stack. It is important to run the device driver with different protocol stacks and see how each protocol stack uses the device driver. We recommend using the IBM NetBIOS and IBM 802.2 protocol stacks.

In order to focus on NDIS MAC device driver performance independently of other communication media traffic, performance measurements should be made on an isolated communication network.

Good system performance is the real end goal. NDIS MAC device driver performance is one component of communication system performance. System needs or bottlenecks may suggest the best tuning for communication system components to achieve the best system performance. For example a system that depends on fast communication is likely to need the best latency and throughput. However a system that has big CPU demands and little communication may run best when communication system parameters are tuned for least CPU utilization. NDIS MAC device drivers need to perform well across the spectrum of communication system tuning.

Performance Issues In NDIS MAC Device Driver Design
The performance of an NDIS MAC device driver is a key determinant in communications performance and is often a major factor in overall system performance. Consequently, considerable effort should be expended to ensure that each NDIS MAC device driver is high performing. In this section, we discuss overall design issues which affect the performance of an NDIS MAC device driver.

First, though, one must keep in mind that the recommendations in this document are general in nature. Each of the recommendations in this document has been implemented in an NDIS MAC device driver and has been shown to improve its performance. However, the specifics of the adapter play a large role in determining how effective these suggestions will be in improving performance. These specifics include the communications medium to which the adapter interfaces, the hardware design of the adapter, and the software interface specification by which the MAC device driver talks to the adapter. How much each contributing factor affects performance varies from adapter to adapter. The key point is that it is crucial to understand in detail the adapter's design and specifications.

The first NDIS MAC device driver design issue is related to multiple adapter support. If multiple adapters which use the same MAC driver can be present in a single machine, then there are two basic design choices:
 * up a single device driver model
 * use the multiple device driver model i.e., load a copy of the device driver for each adapter.

The single device driver model is strongly recommended for best performance. It avoids two key drawbacks of the multiple driver model: one, the extra RAM used by multiple instantiations of the code segment; and, two, the need for process and/or context switches to change from servicing one adapter to servicing a different adapter. Using the single device driver model, there would be one code segment operating on multiple data segments, one for each adapter. To move from servicing one adapter to servicing another now requires only a segment register load. Segment register loads are considerably less expensive than (operating system dependent) context switches. Note also that in the default case of a single adapter in a machine, the single device driver model will be the same as the multiple device driver model.

If one adopts the single device driver model, then that device driver is responsible for ensuring that service to each adapter is fair. Fairness is provided by selecting the first adapter to check for a pending interrupt in a round robin order. After that, all work is done for each adapter before proceeding to the next adapter. This ensures that the processing will be interwoven for all adapters. When the device driver is entered, all interrupts for all adapters handled by that device driver should be masked. We note that fairness must be ensured both when multiple adapters present interrupts at the same time and when the processing for one adapter is completed before deciding whether or not to return control to the operating system. In particular, before returning control to the operating system, all adapters should be checked for additional work.

After fairness of service among adapters is ensured, two issues of fairness in the processing of work for a single adapter must be addressed. The first fairness issue relates to how the MAC driver learns of adapter events which require service. Either the adapter will notify the MAC driver via interrupt or the MAC driver will find out by checking the adapter's status. This issue will be revisited in the Interrupt Processing section. The basic idea is that once the interrupt handler is in control through an interrupt driven event, further interrupts from the adapter should be masked until all worked discoverable by checking adapter status has been completed. Each particular NDIS MAC device driver must decide for itself how to balance its own interrupt driven and adapter status checking natures.

The second fairness issue deals with the balance of processing between transmit and receive work. The balance that the device driver must strive to maintain is between keeping the adapter as busy as possible with data to be transmitted and keeping the protocol as busy as possible processing data that has been received.

This balance depends on whether the software interface to the adapter is via shared RAM or via direct memory access (DMA). For adapters which use DMA, the corresponding NDIS MAC device driver may have to allocate its own receive buffers into which the adapter will DMA received data. These buffers will become critical resources and may turn into performance bottlenecks if not handled properly.

Two key adapter events to which the MAC driver must give priority are the processing of receive frames and the completion of previously transmitted frames. Processing receive frames quickly improves protocol responsiveness and frees receive buffers for the next incoming frame. Early detection of the completion of previously transmitted frames keeps the adapter busy transmitting additional frames. This happens because the MAC driver can pass additional data to the adapter once it determines that a previous transmit request has been fulfilled. MAC driver handling of these two key events is dependent upon the software interface to the adapter.

The design of NDIS lends itself particularly well to the use of scatter lists on the receive path and gather lists on the transmit path. If the adapter supports such lists, then the high performing NDIS MAC device driver must take advantage of them. Doing so will eliminate the need to copy segmented, and possibly physically dispersed, frames into separate, contiguous, MAC allocated buffer space. The performance gains using this approach are clear:
 * fewer data copies and, hence, reduced CPU utilization
 * smaller data segments
 * less data buffer manipulation.

Finally, each procedure in an NDIS MAC device driver should be coded as efficiently as possible. Efficient coding translates directly into improved performance by reducing CPU usage. In addition, the device driver should take advantage of any special adapter capabilities which could optimize performance. An example of such a capability would be an adapter that allows the device driver to begin handling the start of received data before the entire transmission is received.

The next two sections, Performance Important NDIS Verbs and Interrupt Processing, describe two areas where special attention must be focused to produce a high performing driver. The sections System Service Usage and Programming Practices provide suggestions which should be followed throughout in the implementation of an NDIS MAC device driver.

Performance Important NDIS Verbs
An NDIS MAC device driver is responsible for supporting a large number of NDIS primitives. However, only those primitives in the 'Direct Primitives' class directly affect critical path transmit and receive performance. Primitives in the other three classes: 'General Requests', 'System Requests', and 'Protocol Manager Primitives' do not usually occur on performance critical paths. While developers should strive to implement each piece of code efficiently, special attention must be paid in coding 'Direct Primitives'.

The primitives which most affect critical path transmit and receive performance are part of the protocol to MAC interface described in Chapter 3 of the NDIS Specification. These primitives are:
 * TransmitChain
 * TransferData

and, to a lesser extent, the remaining 'Direct Primitives':
 * TransmitConfirm
 * ReceiveRelease
 * ReceiveLookAhead
 * ReceiveChain
 * IndicationOn
 * IndicationOff
 * IndicationComplete

For DOS NDIS MAC device drivers, the InterruptRequest primitive must be supported, else the performance of protocol drivers (including the DOS LAN Support Program) will be severely degraded.

The two methods of handling received data merit some discussion. ReceiveChain should be used if all the data received is system addressable when the MAC device driver gets the interrupt. ReceiveLookAhead should be used if either of two conditions hold. First, ReceiveLookAhead should be used if not all of the data is system addressable when the MAC device driver is interrupted. In this case, the data will need to be copied and this is a situation ReceiveLookAhead and the associated TransferData handle well. Second, ReceiveLookAhead should be used if not all the data has been received by the adapter when the interrupt is raised. In this case, protocol drivers may inspect and process the head of the frame without waiting for the entire frame to be received.

MAC driver processing in support of the MAC to adapter interface is also crucial for best critical path performance and will be discussed in the next section, Interrupt Processing. Since much of the processing of direct primitives involves interrupt processing, these two sections are necessarily interrelated.

TransmitChain
TransmitChain is the NDIS primitive used by protocols to transmit data. The protocol passes a transmit buffer descriptor consisting of a pointer to up to 64 bytes of immediate data and a list of data blocks. Each data block contains a pointer to up to 64 kilobytes of data. The key to a high performing implementation of TransmitChain is to pass the data to the adapter as quickly as possible. The actual details will vary with the adapter's software interface, but certain ideas are common to all adapters. The key idea is to eliminate any activities on this critical path which could be done in advance, either in initialization or during other non-critical path processing. Hence, any device driver data areas should be allocated in advance, probably during initialization. The immediate data, which must be copied to device driver buffer space, should be copied using a double word copy. Additionally, the destination buffer for the immediate data should be double word aligned. The number of transmission commands to the adapter should be minimized, though achieving this goal is adapter dependent. There is a performance benefit when several TransmitChains in a row are issued to the MAC driver. Finally, as mentioned in the previous section, high priority must be given to detecting previously submitted frames have been transmitted. If this check is made, then these TransmitConfirms for these frames can be issued directly within the TransmitChain path.

The NDIS specification also provides a performance recommendation for TransmitChain. In the introduction to Chapter 5 of 3Com/Microsoft LAN Manager Network Driver Interface Specification, the following can be found. "It is recommended that a MAC release the internal resources associated with either TransmitChain or a request before calling the confirmation handler. This allows the protocol to submit a new TransmitChain or request from the confirmation handler. Failure to do so may have a significant impact on performance."

TransferData
The TransferData primitive is issued by the protocol to the MAC device driver during a ReceiveLookAhead. The protocol asks the MAC device driver to copy the received data into the buffer spaces specified in a list of data blocks. The key to this routine is setting up the source and destination selectors and deciding how much data to copy. This process may be repeated many times depending upon how many data buffers the received data is stored in and how many buffers it is being transferred to. Each copy should be done using a double word copy. If the received data is in device driver buffers, then they should be double word aligned for fastest copy time. Tied in with these techniques is a suggestion for use in the ReceiveLookAhead routine, which will occur on interrupt. The lookahead buffer that is passed to the protocol should be the actual buffer into which the adapter has placed the data. The idea is to avoid allocating separate lookahead data buffers and the subsequent need to copy data into them. Further, in this method the lookahead buffer can potentially contain the entire frame, eliminating the need for the protocol to issue a TransferData primitive call.

TransmitConfirm
The TransmitConfirm primitive is issued by the MAC device driver to the protocol to indicate completion of a previous TransmitChain. This call serves an important function as an asynchronous indication to the protocol that the MAC device driver is ready to process additional transmit requests. A possible pitfall in NDIS MAC device driver design is to try to eliminate the need for TransmitConfirm calls. This can the done by returning 'success' to the protocol upon TransmitChain. This approach seemingly saves MAC device driver to protocol interactions. However, it may backfire in the following manner. Since the MAC device driver is actually queuing transmit requests internally and not completing them, when it runs out of transmit resources it will be forced to return 'out_of_resource'. This, in turn, will force the protocol to poll the MAC device driver to find out when resources are available again. The protocol must poll since no asynchronous indication, i.e. TransmitConfirm, will be forthcoming from the MAC device driver. Choosing the right polling interval is an intractable problem. If the interval chosen is too short, too much CPU is utilized. If the interval chosen is too long, the protocol will lose synchronization with the MAC device driver and not be passing it data as often as it could. Either way, serious throughput and overall performance degradation may occur. As a result, we strongly recommend that NDIS MAC device drivers use the TransmitConfirm primitive.

Interrupt Processing
The interrupt routine is critical to NDIS MAC device driver performance. Any excessive time spent in the interrupt handler can adversely affect performance. The two goals for an efficient interrupt handler are to minimize the number of interrupts (i.e. the number of times the operating system calls the interrupt handler) and to minimize the time spent processing each interrupt.

The two most common reasons for entering the interrupt routine are the "reception of data and notification of a completed transmission. Adapters may also interrupt the device driver if needed system resources, such as buffer space for DMA, have been depleted. NDIS primitives which the MAC device driver may issue to the protocol on interrupt include TransmitConfirm, ReceiveLookAhead, ReceiveChain, and StatusIndication. The IndicationComplete may be issued on interrupt when running OS/2, but not when running DOS.

There are a few general performance guidelines for efficient interrupt processing. We list them here and then conclude this section with several more specific suggestions for improving performance. The interrupt handler is a time critical routine in and of itself. Interrupts should not remain masked for more than approximately 500 microseconds. This means that the End of Interrupt should be issued to the interrupt controller within 500 microseconds of entry into the device driver's interrupt handler. The MAC driver interrupt handler must perform the following actions before issuing the end of interrupt:
 * determining if the interrupt is for this MAC driver
 * masking the adapter to prevent further interrupts
 * reading any adapter unique status which may be overwritten, and hence lost, due to subsequent events.

Only useful and necessary processing should take place on interrupt. Necessary processing includes time critical adapter interfacing, such as rearming the adapter. In addition, to reduce the overhead inherent in processing interrupts, the interrupt handler should check for additional work before exiting. This decreases latency and reduces CPU utilization by lessening costly context switches back to the operating system.

As mentioned above in the Performance Issues in NDIS MAC Device Driver Design section, the interrupt handler must balance processing efficiently with the need to detect higher priority work. Once the interrupt handler is in control through an interrupt driven event, further interrupts from the adapter should be masked until all work discoverable by checking adapter status has been completed. This will minimize kernel interrupt processing overhead.

The interrupt handler must also balance its processing between handling transmit completes and handling received data. As indicated previously, the balance that must be maintained between keeping the adapter as busy as possible with data to be transmitted and keeping the protocol as busy as possible processing received data. One way to achieve this balance is to limit the number of transmit completes to process before checking for higher priority interrupt and receive frame processing.

OS/2 System Service Usage
Device drivers need to use OS/2 system services to access functions such as memory allocation and address translations. The high performing NDIS MAC device driver will need to limit the use of OS/2 system services, particularly on the critical transmit and receive paths. A discussion of these critical paths can be found in the previous two sections of this document. Limiting operating system usage to only necessary calls will both reduce path length and lessen CPU utilization. One way this may be achieved is to save useful results obtained via system calls, eliminating the need to repeat the call later. Address translations are a good example of calls to which this technique applies.

One specific piece of advice relating to use of system services comes from the OS/2 device driver reference manuals. It is stated there that a device driver strategy routine should be prepared to yield the CPU about every 3 milliseconds. Limiting system service call will help ensure adherence to this rule.

Programming Practices
This document has detailed a number of ideas which improve the performance of NDIS MAC device drivers. However, the fundamental method for producing high performing code is writing efficient code. In that sense, the tips and techniques which we will summarize in this section underlie every other section of this guide.

The ideas in this section all fall under the general heading of code tuning. The goals of code tuning are to:
 * decrease the number of processor cycles required by program
 * decrease the number of address calculations performed
 * decrease the number of memory fetches
 * decrease the size of object code.

The major way to eliminate address calculations and memory fetches is by alignment. With the Intel 80386 and successor architectures, the normal memory fetch is for a double word (four eight bit bytes) of data, aligned on a double word boundary. Hence, proper alignment of any memory will optimize fetch performance, while misalignment will degrade performance. The only exceptions to this rule are possible overrides by the memory management system of the processor for devices whose bus access is fewer than 32 bits wide. Even for those devices, access will almost surely be at a word granularity and following double word alignment guidelines would not degrade performance.

The most important items to align are data structures. The rules to follow are:
 * align the data structure on a double word boundary. This means that the starting (physical) address of the structure should be on a four byte boundary.
 * elements of the data structure containing four or more bytes should be started on double word boundaries.
 * elements of the data structure containing fewer than four bytes should not cross double word boundaries.

Achieving alignment for each element in a data structure may require reordering the elements and the addition of null elements.

Code itself may also be aligned. The most important instructions to align are the targets of jump instructions. These instructions may be double word aligned by using an assembler directive which will automatically generate any needed null, aligning, instructions.

Data copying is one of the most CPU intensive operations. The time to copy data depends upon three factors: the granularity of the move instruction (byte, word or double word), the alignment of the source buffer and the alignment of the target buffer. The fastest copy time is achieved when both buffers are double word aligned, and the double word move instruction (REP MOVSD) is used. If either buffer is not double word aligned, the following technique produces an efficient string move routine. First, copy up to three bytes until the destination address is double word aligned. Second, copy the remaining bytes using the double word string copy instruction. Use of the double word copy move instruction ensures the smallest overhead for data transfer.

Two other suggestions may also help speed up code execution. The first is to order tests so that jumps are to the least likely cases. This ensures that the less costly fan through case (no jump) is to the most likely case. This also assumes that the code can be structured so that there is no jump instruction at the end of the code for the most likely case. It may be difficult to structure the code in this manner. The second suggestion is to take advantage of the 32 bit nature of the Intel 80386 (and successor) chips. One major advantage of the 80386 architecture over the 80286 architecture is the inclusion of too additional segment registers. Careful use of these registers will eliminate unnecessary and costly segment register loads. The additional high order word provided in each general purpose registers may also be used to lessen accesses to secondary memory.

We conclude this section by summarizing a number of the points made in an invaluable book. The book is "Writing Efficient Programs" by Jon Louis Bentley (1982, Prentice-Hall). The rules in the book are general in nature and their indiscriminate application is strongly discouraged. In addition, always keep in mind that the efficiency of a program is secondary to its correctness. Once again, there are two main areas where modifications may improve the efficiency of a program: in the data structures and in the code.

Data structures can be either modified or completely replaced. The latter approach is a system design issue and is not germane to the current discussion. Simple modifications to data structures can help reduce a program's time and/or space. Bentley suggests four methods for trading off more space for less time.


 * Data Structure Augmentation This refers to either adding additional information in the structure to or to changing the structure so its components can be accessed more easily. Data structure alignment, discussed above, is an example of this method.
 * Store Precomputed Results This should be done to save the cost of recomputing an expensive function. Storage of results of OS/2 system service calls fall into this category.
 * Caching Data that is accessed most often should be cheapest to access. Proper use of the two extra segment registers found in the 80386 chip, as opposed to only using 80286 registers, would be an example of this rule.
 * Lazy Evaluation This idea is to only evaluate expressions when necessary, thereby avoiding unneeded work.

It is possible to speed up small pieces of code by making local transformations. These changes fall into four general categories: loops, logic, procedures, and expressions. The referenced book contains a clear and concise discussion of rules to apply in these categories; we shall not repeat it here. Rather, we close by listing four fundamental rules which underlie all of the suggestions for writing efficient programs. These fundamental rules are (excerpted from pages 104 and 105 of Bentley):
 * Code Simplification Most fast programs are simple. Therefore keep code simple to make it faster.
 * Problem Simplification To increase the efficiency of a program, simplify the problem it solves.
 * Relentless Suspicion Question the necessity of each instruction in a time-critical piece of code and each field in a space-critical data structure.
 * Early Binding Move work forward in time. Specifically, do work now just once in hopes of avoiding doing it many times later.

Notices
May 1993

The following paragraph does not apply to the United Kingdom or any country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This publication could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or program(s) described in this publication at any time.

It is possible that this publication may contain reference to, or information about, IBM products (machines and programs), programming, or services that are not announced in your country. Such references or information must not be construed to mean that IBM intends to announce such IBM products, programming, or services in your country.

Copyright Notices
© '''Copyright International Business Machines Corporation 1993. All rights reserved.'''

Note to U.S. Government Users - Documentation related to restricted rights - Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Disclaimers
References in this publication to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any of IBM's intellectual property rights or other legally protectible rights may be used instead of the IBM product, program, or service. Evaluation and verification of operation in conjunction with other products, programs, or services, except those expressly designated by IBM, are the user's responsibility.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the

IBM Director of Commercial Relations,

IBM Corporation,

Purchase, NY 10577.

Trademarks
The following terms, denoted by an asterisk (*) in this publication, are trademarks of the IBM Corporation in the United States and/or other countries:
 * IBM
 * OS/2
 * Extended Services

The following terms, denoted by a double asterisk (**) in this publication, are trademarks of other companies as follows:

All other company, brand, and product names are trademarks or registered trademarks of their respective owners.