Breaking the Warp 9 I/O Barrier: A Faster FASTIO

By Alger Pike and Holger Veit.

As it becomes apparent that Windows 95/NT (Win95/NT) is not suitable for real-time data acquisition, more people are trying Operating System/2 (OS/2). From experience I know that Win95/NT interrupt handlers start to lag behind their applications at about 500 Hz, regardless of the timer. These slow rates are fine for video cards or sound boards, but not acceptable for data acquisition. Controlling stepper motors and acquiring laser ionization time-of-flight mass spectral data both require interrupt rates in excess of 1000 Hz. OS/2 can easily handle such high rates. In our lab we routinely achieve 3 kHz for interrupt-driven data and rates approaching 1 MHz using continuous sampling.

As the interrupt rate climbs, the speed of the interrupt handler and its associated processing code becomes increasingly important. The faster one can process the I/O requests between interrupts, the higher the ultimate interrupt rate that can be achieved. Usually the bottleneck in such applications (especially in protected mode OS's) is the speed at which one can access the device. Optimizing this step of the process can mean large performance gains.

Up until this point, I had been using the method of I/O described by Holger Veit in the August 1995 issue of EDM/2. Briefly, this method configures a call gate which allows a 32-bit ring 3 segment to call code in a ring 0 code segment. This allows the 32-bit segment to perform I/O instructions without much of the overhead usually associated with such operations. The question then becomes, "Can one optimize this process?" Most of the overhead of this I/O method is in using the call gate, so any optimized process will have to do the I/O without one. "Can I/O be done directly from a 32-bit segment without using a call gate?" The answer, sure enough, is yes!

Ring zero is not the only privilege level which is allowed to do I/O. There is a two-bit field in the EFLAGS register called the Input/Output Privilege Level (IOPL). This field determines whether a segment can do I/O: any code whose current privilege level (CPL) is numerically less than or equal to IOPL can do I/O. In fact, having a CPL of IOPL or less means that the code can use the following instructions without generating an exception: IN, OUT, INS, OUTS, CLI and STI. OS/2 normally runs with IOPL set to 2, so ring 3 application code (CPL 3) is shut out. This means that if we can somehow change the IOPL field seen by our application from 2 to 3, we will be able to do direct I/O without a call gate. As I have shown previously, ring 2 IOPL 16-bit DLL's are also out. Although this is a valid I/O technique, it is much slower than using a call gate, so it does not give us the performance enhancement that we seek.
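
The privilege rule can be written down in a few lines of C. This is only a sketch of the check the processor applies, not code from the driver: the names iopl_of and io_allowed are made up for illustration, and the sketch ignores the I/O permission bitmap in the TSS, which the processor also consults for IN and OUT when CPL is greater than IOPL.

```c
#include <stdint.h>

/* IOPL occupies bits 12 and 13 of EFLAGS. */
#define EFLAGS_IOPL_SHIFT 12
#define EFLAGS_IOPL_MASK  (3u << EFLAGS_IOPL_SHIFT)

/* Extract the IOPL value (0..3) from a flags image.
   Hypothetical helper, for illustration only. */
static unsigned iopl_of(uint32_t eflags)
{
    return (unsigned)((eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT);
}

/* The basic rule: IOPL-sensitive instructions are allowed when
   CPL <= IOPL. (The TSS I/O permission bitmap is ignored here.) */
static int io_allowed(unsigned cpl, uint32_t eflags)
{
    return cpl <= iopl_of(eflags);
}
```

With a flags image whose IOPL field is 2, io_allowed(2, flags) is true but io_allowed(3, flags) is false, which is exactly the situation a ring 3 application finds itself in; with the field at 3, ring 3 passes the check as well.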

In order to do I/O directly from our application we must find some way to give it I/O privilege, that is, to run it with the IOPL field set to 3. Simple, we just link the code so that it is granted I/O privilege. Unfortunately, it is not this simple. Although the processor itself would have no objections, the linker does. No OS/2 linker allows 32-bit application code to be given I/O privilege. Only code that resides in a 16-bit dynamic link library can be marked IOPL. (Some linkers don't even allow for this provision.) You might also think that you could patch the appropriate bytes in the executable itself. This method does not work either: the OS/2 program loader overrides the patch when your executable is loaded. The IOPL field must therefore be changed after the program is loaded. In other words, for our application to gain the proper privilege, we must modify the EFLAGS register only after the application has been loaded and is running.

There are assembly instructions which allow one to read and write the EFLAGS register: PUSHF and POPF. The problem is that POPF changes the IOPL field only when it is executed at privilege level 0; at any other level the processor leaves IOPL untouched. The code to change the EFLAGS register will therefore have to be in a device driver.

You might think that you could make the required change in an IOCTL. However, while the IOCTL code is executing, the EFLAGS register holds the flags of the ring 0 code, not those of our application, which is what we want. Changing the IOPL field there would alter the driver's context rather than the application's; this is a non-feature. There must be a way to reach the application's EFLAGS register while in ring 0 code.

Re-enter the concept of the call gate. Recall that the call gate is like a doorway to a more privileged (numerically lower) ring. It allows code at privilege level x to call code at the same or a more privileged level. This technique was used by Holger to implement a fast method of accessing I/O ports. We will use the same technique to change the EFLAGS register of the calling thread.

When in the ring 0 code of a call gate, the EFLAGS register contains the information of the less privileged calling segment. (Please note that each thread of execution has its own EFLAGS register, so the IOPL field will have to be set to 3 for each thread that wants to do I/O.) It now becomes just a matter of reading the EFLAGS register and setting the IOPL field to 3. This allows our application to do I/O. To turn off the ability to do I/O we reset the IOPL field to 2. See Holger's August 1995 article for the details on how to set up the required call gate. Also see my "Hello World Device Driver" series for more information on writing an OS/2 device driver. At this point I will assume the reader already has a driver which does I/O based on the principles given in those articles.

To implement the feature in the device driver we need to add a couple more table entries. One entry will access the EFLAGS register to turn on I/O privilege for our application; the other will turn it off again. By having both of these functions available, the developer can select when his program needs to do I/O. (To be consistent with the ideas behind a protected mode OS, the developer should turn off the I/O ability when he does not need it.) Following is the call gate function table with the two new entries, 13 and 14, added:

iotbl:	dw iofret		;0 reserved
	dw iof1			;1 inb
	dw iof2			;2 inw
	dw iof3			;3 inl
	dw iof4			;4 outb
	dw iof5			;5 outw
	dw iof6			;6 outl
	dw iof7			;7 ints of board on
	dw iof8			;8 ints of board off
	dw iof9			;9 set count to value
	dw iof10		;10 read count value
	dw iof11		;11 machine ints on
	dw iof12		;12 machine ints off
	dw iof13		;13 iopl = 3 allow I/O from 32-bit seg
	dw iof14		;14 iopl = 2 no I/O from 32-bit seg
	dw iofret		;15 reserved
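
A table like this is only safe if the function code is validated before it is used as an index, and if every unused slot leads somewhere harmless. The following C sketch illustrates that idea. It is not Holger's wrapper code: the names io_dispatch, io_ret, io_iopl_on and io_iopl_off are hypothetical, the range check is an assumption about how such validation might look, and the hardware-specific entries 1 through 12 are simply routed to the no-op.

```c
#define IO_TABLE_ENTRIES 16          /* entries 0..15 */

typedef int (*io_handler)(void);

/* Hypothetical stand-ins for the assembly routines. */
static int io_ret(void)      { return 0;  }  /* role of iofret */
static int io_iopl_on(void)  { return 13; }  /* placeholder for iof13 */
static int io_iopl_off(void) { return 14; }  /* placeholder for iof14 */

/* Every slot is filled, so no index can reach an invalid pointer. */
static const io_handler io_table[IO_TABLE_ENTRIES] = {
    io_ret, io_ret, io_ret, io_ret,   /*  0..3  */
    io_ret, io_ret, io_ret, io_ret,   /*  4..7  */
    io_ret, io_ret, io_ret, io_ret,   /*  8..11 */
    io_ret, io_iopl_on, io_iopl_off,  /* 12..14 */
    io_ret                            /* 15     */
};

/* Reject out-of-range codes, then jump through the table. */
static int io_dispatch(unsigned code)
{
    if (code >= IO_TABLE_ENTRIES)
        return io_ret();
    return io_table[code]();
}
```

Here io_dispatch(13) and io_dispatch(14) reach the two IOPL handlers, while a reserved code such as 15 or an out-of-range code such as 16 falls through to the no-op.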

The entries of importance are 13 and 14. Pay no mind to the others; they are just entries I have added which are specific to the hardware that I am using. Also keep in mind that you need to extend your table through entry 15 (16 entries, counting entry 0). This is to be compatible with the way Holger checks whether the entry is valid in the wrapper functions. All of your unused entries should be routed through iofret to return without doing anything. Now you should be ready to define the table entries as follows:

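iof13:	pushf				;push the 16-bit FLAGS image
	pop	ax			;ax = FLAGS
	or	ax, 0011000000000000b	;set bits 12 and 13: IOPL = 3
	push	ax
	popf				;load the modified FLAGS
	retfd				;far return to the 32-bit caller
iof14:	pushf				;push the 16-bit FLAGS image
	pop	ax			;ax = FLAGS
	or	ax, 0010000000000000b	;set bit 13
	and	ax, 1110111111111111b	;clear bit 12: IOPL = 2
	push	ax
	popf				;load the modified FLAGS
	retfd				;far return to the 32-bit caller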

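You can see from the above that after the pushf and pop instructions, ax contains the low 16 bits of the EFLAGS register, which include the IOPL field (bits 12 and 13). Notice how OR and AND are used to force those bits to the desired values: iof13 sets both bits (IOPL = 3), while iof14 sets bit 13 and clears bit 12 (IOPL = 2). Once the IOPL field has been changed, we apply the change by issuing the popf instruction; the new value stays in effect when the far return takes us back to the application segment.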
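
The bit manipulation is easy to check away from the driver. The C sketch below mirrors the OR and AND masks of iof13 and iof14 on a 16-bit flags image; the function names iopl_on and iopl_off are made up for this illustration and do not appear in the driver.

```c
#include <stdint.h>

/* Mirror of iof13: set both IOPL bits (12 and 13), giving IOPL = 3.
   0x3000 is 0011000000000000b. */
static uint16_t iopl_on(uint16_t flags)
{
    return (uint16_t)(flags | 0x3000u);
}

/* Mirror of iof14: set bit 13, then clear bit 12, giving IOPL = 2.
   0x2000 is 0010000000000000b and 0xEFFF is 1110111111111111b. */
static uint16_t iopl_off(uint16_t flags)
{
    return (uint16_t)((flags | 0x2000u) & 0xEFFFu);
}
```

Starting from a flags image of 0x2202 (IOPL = 2), iopl_on yields 0x3202 and iopl_off turns that back into 0x2202. All bits outside the IOPL field pass through unchanged, which is why the handlers can safely write the result back with popf.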

The final step in incorporating this code into the driver is to setup the wrapper functions. These set the function code and are called from your C routine. Your wrapper functions might look something like this:

PUBLIC c_ioplon
c_ioplon 	PROC
	PUSH	EBP
	MOV	EBP,ESP
	PUSH	EBX				; save regs
	MOV	bx, 13				;function code
	CALL FWORD PTR [_ioentry]		;call intersegment indirect 16:32
	POP	EBX
	POP	EBP
	RET
c_ioplon		ENDP

PUBLIC c_ioploff
c_ioploff 	PROC
	PUSH	EBP
	MOV	EBP,ESP
	PUSH	EBX				; save regs
	MOV	bx, 14				;function code
	CALL FWORD PTR [_ioentry]		;call intersegment indirect 16:32
	POP	EBX
	POP	EBP
	RET
c_ioploff		ENDP

Remember to declare these functions with the stack-based C calling convention in your include file, since that is the convention they use. To give your application I/O privilege, simply call c_ioplon; call c_ioploff to give it up again.

Once the IOPL field for the application is set to 3, calling I/O code becomes simple. Include conio.h and use inp and family. With the IOPL field at 3 these functions produce the desired result, i.e. they no longer raise an exception which causes your application to terminate. At this point we are doing I/O as fast as OS/2 will allow; it is direct from our application.
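
The calling pattern on the application side might look like the sketch below: turn the privilege on, do the port reads, and turn it off again, all on the same thread. So that the sketch compiles anywhere, c_ioplon and c_ioploff are replaced by stand-ins that only flip a flag, and sim_inp is a made-up substitute for inp from conio.h; read_block is likewise a hypothetical helper. In a real OS/2 build you would link the driver wrappers and call inp instead.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles anywhere. On OS/2 these would be
   the driver wrappers from this article and inp() from conio.h. */
static int io_enabled = 0;
static void c_ioplon(void)  { io_enabled = 1; }
static void c_ioploff(void) { io_enabled = 0; }

static int sim_inp(unsigned port)
{
    /* Real hardware would raise an exception without I/O privilege;
       the simulation returns -1 instead. */
    return io_enabled ? (int)(port & 0xFFu) : -1;
}

/* Read n bytes from one port, holding I/O privilege only for the
   duration of the loop. Returns the number of bytes read. */
static size_t read_block(unsigned port, unsigned char *buf, size_t n)
{
    size_t i;

    c_ioplon();                 /* must be called on this thread */
    for (i = 0; i < n; i++) {
        int v = sim_inp(port);
        if (v < 0)
            break;
        buf[i] = (unsigned char)v;
    }
    c_ioploff();                /* give the privilege back */
    return i;
}
```

Keeping the on and off calls in one routine makes it hard to leave a thread running with I/O privilege by accident, which is in keeping with the advice to turn the ability off when it is not needed.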

There are also other performance benefits. On a Pentium Pro 200, the difference is a factor of three. Using the call gate I/O method, it takes three seconds to do 1,000,000 I/O instructions. Interestingly, this is the same amount of time it takes on a Pentium 120. This shows that although it is called from a 32-bit segment, the code inside the call gate is 16-bit, and it suffers from the Pentium Pro's 16-bit code bottleneck. The direct I/O method removes this restriction, allowing the I/O to go full speed ahead. Using this method a Pentium Pro executes 1,000,000 I/O instructions in one second, three times faster than a Pentium 120, as it should.
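
To see what these timings mean between interrupts, the arithmetic can be sketched in C. The helper names us_per_io and io_per_interval are made up for illustration, and the second function gives only an upper bound: it assumes the whole interval between two interrupts is spent on I/O instructions, which no real handler achieves.

```c
/* Average time per I/O instruction in microseconds, given the time
   in seconds for a run of 'count' instructions. */
static double us_per_io(double seconds, double count)
{
    return seconds * 1e6 / count;
}

/* Upper bound on the I/O instructions that fit between two
   interrupts at rate_hz, assuming nothing else runs in between. */
static double io_per_interval(double rate_hz, double us_each)
{
    return (1e6 / rate_hz) / us_each;
}
```

With the Pentium Pro 200 figures above, us_per_io(3.0, 1e6) gives 3 microseconds per instruction through the call gate and us_per_io(1.0, 1e6) gives 1 microsecond for direct I/O. At an interrupt rate of 1000 Hz that bounds the work at roughly 333 call gate I/O instructions per interval against 1000 direct ones.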

Hopefully you now have a sense of what it takes to break the WARP 9 I/O barrier. Using the direct I/O method above I have shown you how to achieve a whopping WARP 13. The speed improvements have two direct consequences: 1) you can now do more I/O instructions before the next interrupt occurs, increasing your ultimate interrupt rate, and 2) you eliminate the bottleneck associated with running 16-bit code on a Pentium Pro. (Note: 1 is only true for users who use the interrupt processing method presented in my article From Hello World to Real World- Writing a Device Driver for Your Own Data Acquisition Card Part III.) Using these methods we have achieved interrupt rates of 3 kHz and continuous sampling rates approaching 1 MHz in our lab.