Inside INF
by Peter Childs
Introduction
The Problem
As I sat debating the approach I would adopt for this article it became distressingly obvious that there would be only a small audience of fanatics that would share my enthusiasm for the intricate details of the INF file format.
I also had to painfully admit that others would not even know that INF/HLP files are the backbone of OS/2's online help system, and that most of those that did would be quite happy to consider the issue finished at that point.
The Solution
In this article I will attempt to present an overview of the OS/2 INF/HLP file format in such a manner that a cursory read will leave the reader with a general idea of the file format. I will also explain some of the compression ideas used in the INF files and provide additional information for those wishing to investigate deeper.
As my talents as a programmer are fairly limited I will not include large chunks of source. I am, however, working on developing some C++ classes to allow easy access to the information in INF files. If there is sufficient demand I may do an article showing the possible use of these classes.
What does it mean?
Some Basic Terms
In this document we will be discussing the INF/HLP file format.
INF files and HLP files are basically identical with the exception that INF files are generally designed to be viewed with the view.exe program, like this magazine, and HLP files are developed to provide online help for applications. To the best of my knowledge the file format is identical except for a single flag bit. From here on, when I refer to INF files I mean both INF and HLP files.
INF files are compiled with the IPFC (Information Presentation Facility Compiler) available with most OS/2 compilers, and the OS/2 Toolkits. The source used is a form of fairly simple markup, with the power to do just about anything you could want.
The IPF Online Reference (1st Ed 1994) describes IPF:
The Information Presentation Facility (IPF) is a tool that enables you to create online information, to specify how it will appear on the screen, to connect various parts of the information, and to provide help information that can be requested by the user.
It is important to realize the difference between IPF source markup, and INF files. The INF files are the compiled versions, and the difference is as marked as the difference between C source and an executable. The IPF source markup is well documented, whereas the INF file format is not officially documented.
In the beginning
The Header
The header is described here as a structure. When I began playing with INF files I just defined this structure and then read 155 bytes into it starting at offset 0. Although this worked fine for Borland C++ I had to muck with things with gcc to force the structure to be packed.
Most compilers offer a method of packing structures, but if your code has to be portable then you will also have to consider the big-endian/little-endian issue (i.e. the bytes are stored differently in memory on SPARCs than on PCs). Although probably obvious to most programmers, this had me stumped!
Starting at file offset 0 the following structure resembles the header (packed):
    struct os2infheader
    {
        int_16 ID;               // ID magic word (5348h = "HS")
        int_8  unknown1;         // unknown purpose, could be third letter of ID
        int_8  flags;            // probably a flag word...
                                 //  bit 0: set if INF style file
                                 //  bit 4: set if HLP style file
                                 // patching this byte allows reading HLP files
                                 // using the VIEW command, while help files
                                 // seem to work with INF settings here as well.
        int_16 hdrsize;          // total size of header
        int_16 unknown2;         // unknown purpose
        int_16 ntoc;             // 16 bit number of entries in the tocarray
        int_32 tocstrtablestart; // 32 bit file offset of the start of the
                                 // strings for the table-of-contents
        int_32 tocstrlen;        // number of bytes in file occupied by the
                                 // table-of-contents strings
        int_32 tocstart;         // 32 bit file offset of the start of tocarray
        int_16 nres;             // number of panels with resource numbers
        int_32 resstart;         // 32 bit file offset of resource number table
        int_16 nname;            // number of panels with textual name
        int_32 namestart;        // 32 bit file offset to panel name table
        int_16 nindex;           // number of index entries
        int_32 indexstart;       // 32 bit file offset to index table
        int_32 indexlen;         // size of index table
        int_8  unknown3[10];     // unknown purpose
        int_32 searchstart;      // 32 bit file offset of full text search table
        int_32 searchlen;        // size of full text search table
        int_16 nslots;           // number of "slots"
        int_32 slotsstart;       // file offset of the slots array
        int_32 dictlen;          // number of bytes occupied by the
                                 // "dictionary"
        int_16 ndict;            // number of entries in the dictionary
        int_32 dictstart;        // file offset of the start of the dictionary
        int_32 imgstart;         // file offset of image data
        int_8  unknown4;         // unknown purpose
        int_32 nlsstart;         // 32 bit file offset of NLS table
        int_32 nlslen;           // size of NLS table
        int_32 extstart;         // 32 bit file offset of extended data block
        int_8  unknown5[12];     // unknown purpose
        char8  title[48];        // ASCII title of database
    }
Figure 1) INF header structure
Most of these values come in handy: the offset and count pairs locate the tables described in the rest of this article.
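One way around both the packing and the byte-order problems is to skip the structure overlay and pick the fields out of a byte buffer one at a time. The following is only a sketch under assumptions: the names rd16, rd32, InfHeaderLite and parseInfHeader are made up for illustration, the file is taken to be in PC (little-endian) byte order, the title is assumed to be NUL-padded, and the offsets are simply the running byte totals of the packed structure in Figure 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assemble little-endian integers byte by byte, so the result does not
// depend on the host's byte order or on structure packing.
inline std::uint16_t rd16(const std::vector<unsigned char>& b, std::size_t off) {
    return static_cast<std::uint16_t>(b.at(off) | (b.at(off + 1) << 8));
}
inline std::uint32_t rd32(const std::vector<unsigned char>& b, std::size_t off) {
    return static_cast<std::uint32_t>(b.at(off))
         | (static_cast<std::uint32_t>(b.at(off + 1)) << 8)
         | (static_cast<std::uint32_t>(b.at(off + 2)) << 16)
         | (static_cast<std::uint32_t>(b.at(off + 3)) << 24);
}

// Hypothetical subset of the Figure 1 fields.
struct InfHeaderLite {
    std::uint16_t id;
    std::uint8_t  flags;
    std::uint16_t ntoc;
    std::uint32_t tocstart;
    std::uint16_t nslots;
    std::uint32_t slotsstart;
    std::uint32_t dictlen;
    std::uint16_t ndict;
    std::uint32_t dictstart;
    std::uint32_t imgstart;
    std::string   title;
};

// b holds (at least) the first 155 bytes of the file.
InfHeaderLite parseInfHeader(const std::vector<unsigned char>& b) {
    if (b.size() < 155) throw std::runtime_error("header too short");
    InfHeaderLite h;
    h.id = rd16(b, 0);
    if (h.id != 0x5348) throw std::runtime_error("not an INF/HLP file");
    h.flags      = b[3];
    h.ntoc       = rd16(b, 8);
    h.tocstart   = rd32(b, 18);
    h.nslots     = rd16(b, 62);
    h.slotsstart = rd32(b, 64);
    h.dictlen    = rd32(b, 68);
    h.ndict      = rd16(b, 72);
    h.dictstart  = rd32(b, 74);
    h.imgstart   = rd32(b, 78);
    // Title: up to 48 bytes at offset 107, cut at the first NUL (assumption).
    std::size_t n = 0;
    while (n < 48 && b[107 + n] != 0) ++n;
    h.title.assign(reinterpret_cast<const char*>(&b[107]), n);
    return h;
}
```

With the header parsed this way, (h.flags & 0x01) and (h.flags & 0x10) give the INF and HLP style bits described in the structure comments.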
Our Sample File
Below is a simple IPF (source) file which I have compiled into an INF for use in examples.
    :userdoc.
    :title. Sample INF file...
    :h1.Header One
    :p.This is a test. Os/2, lies, and Windows 95.
    :p.1234.5
    :artwork name='in_inf.bmp'.
    Hello
    :p.
    :artwork name='tocarray.bmp'.
    :i1. This is an index entry
    :euserdoc.
Figure 2) Sample IPF file
Master Dictionary
Each INF file has a master dictionary which holds all of the words and symbols used in the articles that make up the INF file. The dictionary starts at offset dictstart, has a length of dictlen bytes, and consists of ndict words.
In the example case above the dictionary is like this:
    [0] : (,)
    [1] : (.)
    [2] : (/)
    [3] : (1234)
    [4] : (2)
    [5] : (5)
    [6] : (95)
    [7] : (a)
    [8] : (and)
    [9] : (Hello)
    [10] : (is)
    [11] : (lies)
    [12] : (Os)
    [13] : (test)
    [14] : (This)
    [15] : (Windows)
Figure 4) Master dictionary for the IPF sample
Note that the source contains the word Os/2, whereas the dictionary contains the separate entries '/', '2', and 'Os': punctuation is split off into entries of its own. Likewise 1234.5 is stored as '1234', '.', and '5'.
One way of loading the dictionary is detailed below (C++ code snippet).
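    // dictstore: char array of dictlen+1 bytes, holding the dictlen
    // bytes read from file offset dictstart
    dict = new char*[ infHeader.ndict ];   // our array of pointers
    unsigned long i = 0;                   // position in dictstore
    unsigned int  j = 0;                   // next dictionary entry
    // change all length bytes to '\0' and set pointers
    // to start of each word
    while( i < infHeader.dictlen && j < infHeader.ndict )
    {
        unsigned char add = dictstore[i];  // length byte, counts itself
        if( add == 0 ) break;              // corrupt entry, would loop forever
        dict[j++] = &( dictstore[i+1] );
        dictstore[i] = '\0';               // terminates the previous word
        i += add;
    }
    dictstore[infHeader.dictlen] = '\0';   // terminates the last word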
Figure 5) C++ code for loading the dictionary
The method used is irrelevant but you need some way of mapping i to the i'th element of the dictionary. Also don't forget to delete the allocated memory if you use the above sample with:
    delete[] dictstore;
    delete[] dict;
Figure 6) C++ to delete the dictionary
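If manual memory management is not a requirement, the same walk over the dictionary can be written with standard containers. This is a minimal sketch, not the method used by any particular viewer; the function name parseDictionary is invented here, and it assumes (as the loop above does) that each entry is a length byte that counts itself, followed by the characters of the word.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// store: the dictlen bytes read from file offset dictstart.
// ndict: number of entries expected (from the header).
std::vector<std::string> parseDictionary(const std::vector<unsigned char>& store,
                                         std::size_t ndict) {
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < store.size() && words.size() < ndict) {
        const std::size_t len = store[i];               // counts itself
        if (len == 0 || i + len > store.size()) break;  // malformed entry
        // The word is the len-1 bytes after the length byte.
        words.emplace_back(reinterpret_cast<const char*>(&store[i]) + 1, len - 1);
        i += len;
    }
    return words;
}
```

The i'th word is then simply words[i], and nothing needs to be deleted afterwards.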
Articles
Each article in an INF file is made up of one or more slots. There are several structures that deal with slots. One is an array of offsets mapping i to the i'th slot's position in the file. Another is the structure of the slot itself. Each slot also has a local dictionary that maps items in the slot to words in the master dictionary.
The Slots Array
Beginning at file offset slotsstart (from the header) there is an array of nslots int_32 values. Entry i is the file offset at which the i'th slot can be found.
int_32 slots[nslots]
Figure 7) Slots array declaration
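Loading such an array is a short loop. The sketch below assumes the whole file has already been read into a byte vector and that the offsets are stored in little-endian order; readOffsetTable is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read `count` little-endian 32-bit file offsets starting at byte `start`.
// Stops early if the file is shorter than the header promised.
std::vector<std::uint32_t> readOffsetTable(const std::vector<unsigned char>& file,
                                           std::size_t start, std::size_t count) {
    std::vector<std::uint32_t> table;
    for (std::size_t n = 0; n < count; ++n) {
        const std::size_t p = start + 4 * n;
        if (p + 4 > file.size()) break;  // truncated file
        table.push_back(static_cast<std::uint32_t>(file[p])
                      | (static_cast<std::uint32_t>(file[p + 1]) << 8)
                      | (static_cast<std::uint32_t>(file[p + 2]) << 16)
                      | (static_cast<std::uint32_t>(file[p + 3]) << 24));
    }
    return table;
}
```

For the slots array the call would be readOffsetTable(file, slotsstart, nslots), with both values taken from the header.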
The Slots Themselves
Beginning at the file offset slots[i] the following structure can overlay the file:
    {
        int_8  stuff;          // ?? [always seen 0]
        int_32 localdictpos;   // file offset of the local dictionary
        int_8  nlocaldict;     // number of entries in the local dictionary
        int_16 ntext;          // number of bytes in the text
        int_8  text[ntext];    // encoded text of the article
    }
Figure 8) Slots structure
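Read field by field, the slot structure in Figure 8 occupies 8 bytes before the text begins. A minimal sketch, assuming the file is held in memory and stored little-endian; the Slot type and parseSlot are illustrative names only:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct Slot {
    std::uint32_t localdictpos;       // file offset of the local dictionary
    std::uint8_t  nlocaldict;         // entries in the local dictionary
    std::vector<unsigned char> text;  // encoded text, ntext bytes
};

// pos is the slot's file offset, i.e. slots[i] from the slots array.
Slot parseSlot(const std::vector<unsigned char>& file, std::size_t pos) {
    if (pos + 8 > file.size()) throw std::out_of_range("slot header past end of file");
    Slot s;
    // file[pos] is the "stuff" byte; it is skipped here.
    s.localdictpos = static_cast<std::uint32_t>(file[pos + 1])
                   | (static_cast<std::uint32_t>(file[pos + 2]) << 8)
                   | (static_cast<std::uint32_t>(file[pos + 3]) << 16)
                   | (static_cast<std::uint32_t>(file[pos + 4]) << 24);
    s.nlocaldict = file[pos + 5];
    const std::size_t ntext = file[pos + 6] | (file[pos + 7] << 8);
    if (pos + 8 + ntext > file.size()) throw std::out_of_range("slot text past end of file");
    s.text.assign(file.data() + pos + 8, file.data() + pos + 8 + ntext);
    return s;
}
```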
The Local Dictionary
The local dictionary is used to map items in the encoded text of the slot to words in the master dictionary. Take note that the nlocaldict variable in a slot's structure is a byte in size, hence a single slot can only contain a maximum of 255 different words from the master dictionary (really 250, since the byte values 0xfa to 0xff are reserved as control codes in the encoded text, as discussed later).
Beginning at file offset localdictpos (for each article) there is an array:
int_16 localdict[nlocaldict]
Figure 9) Local dictionary declaration
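The two-step lookup (text byte to local entry, local entry to master word) might be sketched as follows. The helper names readLocalDict and wordFor are hypothetical, the master dictionary is assumed to be available as a vector of strings, and little-endian storage is assumed again.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Read nlocaldict 16-bit master-dictionary indices from file offset localdictpos.
std::vector<std::uint16_t> readLocalDict(const std::vector<unsigned char>& file,
                                         std::size_t localdictpos,
                                         std::size_t nlocaldict) {
    std::vector<std::uint16_t> local;
    for (std::size_t n = 0; n < nlocaldict; ++n) {
        const std::size_t p = localdictpos + 2 * n;
        if (p + 2 > file.size()) break;  // truncated file
        local.push_back(static_cast<std::uint16_t>(file[p] | (file[p + 1] << 8)));
    }
    return local;
}

// Resolve one byte of encoded text to a word; throws std::out_of_range
// if either index is invalid.
const std::string& wordFor(unsigned char code,
                           const std::vector<std::uint16_t>& local,
                           const std::vector<std::string>& master) {
    return master.at(local.at(code));
}
```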
The Text Itself
The encoded text is decoded somewhat like the following:
    bool space = true;                  // emit a space after each word?
    for( int i = 0; i < ntext; i++ )    // text[] must hold unsigned bytes
        switch( text[i] )
        {
            case 0xfa:  // end of paragraph, sets space to true
                space = true;
                break;
            case 0xfb:  // [unknown]
                break;
            case 0xfc:  // toggle spacing
                space = !space;
                break;
            case 0xfd:  // line break, set space to true if not in a
                        // monospaced example
                break;
            case 0xfe:  // space
                break;
            case 0xff:  // escape code; text[i+1] is its length, so
                        // this steps over the whole sequence
                i += text[i+1];
                break;
            default:    // output dict[localdict[text[i]]] and, if
                        // space==true, a space.
                break;
        }
Figure 10) Sample code for decoding text
It is pretty obvious that this doesn't leave a lot of space for formatting commands. This is where the escape codes come in. The general format for an escape code is:
    {
        int_8 FF;        // always equals 0xFF
        int_8 esclen;    // length of sequence
                         // (including esclen, excluding 0xFF)
        int_8 escCode;   // which escape code
    }
Figure 11) Escape codes structure
These escape codes define things like margin settings, inter-document links, and the like. We will ignore them here for the moment. They are described in inf03.txt.
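Even when the escape codes are ignored, a decoder has to know where each sequence ends. Going by the structure in Figure 11, the sequence occupies 1 + esclen bytes in total. The following sketch (the Escape type and parseEscape are names invented for this example) pulls one sequence apart and reports where decoding should resume:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct Escape {
    std::uint8_t code;                   // escCode
    std::vector<unsigned char> payload;  // bytes after escCode, if any
    std::size_t next;                    // index of the first byte after the sequence
};

// text[i] must be the 0xFF that starts the sequence.
Escape parseEscape(const std::vector<unsigned char>& text, std::size_t i) {
    if (i + 2 >= text.size() || text[i] != 0xFF)
        throw std::invalid_argument("not an escape sequence");
    const std::size_t esclen = text[i + 1];  // counts itself, not the 0xFF
    if (esclen < 2 || i + 1 + esclen > text.size())
        throw std::invalid_argument("bad escape length");
    Escape e;
    e.code = text[i + 2];
    e.payload.assign(text.data() + i + 3, text.data() + i + 1 + esclen);
    e.next = i + 1 + esclen;
    return e;
}
```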
So Show Me Something That Works!
Ok, here's a snippet that shows the basic idea behind decoding slots. For the sake of simplicity we ignore most of the nasty stuff. You should also note that although each slot contains some text - the way these slots fit together is described next in Table of Contents.
Included with this issue of EDM/2 is the source for a small program, called exttext.cc, that extracts all the textual information from an INF file. Also included is a simple INF header class.
A few quick notes on decoding articles that aren't mentioned in inf02a.doc. When you decode multi-slot articles, the state of space (i.e. true or false) is retained from one slot to the next. Basically, although the local dictionary changes, pretend that each slot is merged onto the next with regard to settings like the left and right margins, fonts, colours, font styles, and space.
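To make the point about retained state concrete, here is a rough sketch of decoding an article slot by slot while carrying the space flag along. It is an illustration under assumptions, not the code from exttext.cc: the DecodeState and SlotData types and both function names are invented, paragraph ends and line breaks are both rendered as a newline, the monospaced case is ignored, and escape sequences are skipped rather than interpreted.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// State that survives from one slot to the next.
struct DecodeState {
    bool space = true;
};

// Hypothetical in-memory slot: encoded text plus its local dictionary.
struct SlotData {
    std::vector<unsigned char> text;
    std::vector<std::uint16_t> localdict;
};

void decodeSlot(const SlotData& slot, const std::vector<std::string>& dict,
                DecodeState& st, std::string& out) {
    const std::vector<unsigned char>& t = slot.text;
    for (std::size_t i = 0; i < t.size(); ++i) {
        const std::size_t code = t[i];
        switch (code) {
        case 0xfa: out += '\n'; st.space = true; break;  // end of paragraph
        case 0xfb: break;                                // unknown
        case 0xfc: st.space = !st.space; break;          // toggle spacing
        case 0xfd: out += '\n'; st.space = true; break;  // line break
        case 0xfe: out += ' '; break;                    // explicit space
        case 0xff:                                       // escape: skip it
            if (i + 1 < t.size()) i += t[i + 1];
            break;
        default:                                         // a dictionary word
            if (code < slot.localdict.size() && slot.localdict[code] < dict.size()) {
                out += dict[slot.localdict[code]];
                if (st.space) out += ' ';
            }
            break;
        }
    }
}

// One DecodeState is shared by all slots of the article.
std::string decodeArticle(const std::vector<SlotData>& slots,
                          const std::vector<std::string>& dict) {
    DecodeState st;
    std::string out;
    for (const SlotData& s : slots) decodeSlot(s, dict, st, out);
    return out;
}
```

The other settings mentioned above (margins, fonts, colours) would become further members of the state structure once the escape codes are interpreted.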
Table of Contents
The Table of contents is created by loading in an array of [ntoc] 32-bit offsets, starting at offset tocstart.
At each offset ( tocentrystart[i] ) a toc entry structure is located that contains information including the title, the item's level in the table of contents, whether it is hidden or not, and, most importantly, how many and which slots make up the item. There is also a 'has_children' flag which, if true, means the following entry has a higher level.
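The on-disk layout of a toc entry is not given here, so the following sketch starts from entries that have already been parsed. The TocEntry type is hypothetical, and levels are assumed to start at 1. It shows how the level and hidden fields turn a flat list into an outline, with the children test derived from the levels in the way the has_children flag is described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, already-parsed toc entry.
struct TocEntry {
    int level;          // nesting level, 1 = top (assumption)
    bool hidden;        // hidden entries are not listed
    std::string title;
};

// True if the entry after i sits at a deeper level.
bool hasChildren(const std::vector<TocEntry>& toc, std::size_t i) {
    return i + 1 < toc.size() && toc[i + 1].level > toc[i].level;
}

// Render visible entries as an indented outline, two spaces per level,
// marking entries that have children with '+' and the rest with '-'.
std::string renderOutline(const std::vector<TocEntry>& toc) {
    std::string out;
    for (std::size_t i = 0; i < toc.size(); ++i) {
        if (toc[i].hidden) continue;
        const int depth = toc[i].level > 1 ? toc[i].level - 1 : 0;
        out.append(static_cast<std::size_t>(depth) * 2, ' ');
        out += hasChildren(toc, i) ? "+ " : "- ";
        out += toc[i].title;
        out += '\n';
    }
    return out;
}
```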
Index
The index is pretty simple and relies on the table of contents a fair bit. Beginning at file offset indexstart there are nindex structures like the following:
    {
        int_8  nword;          // size of name
        int_8  level;          // indent level
        int_8  stuff;
        int_16 toc;            // toc entry number of panel
        char8  word[nword];    // index word [not zero-terminated]
    }
Figure 13) Index structure
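Because the entries are variable in length, the index has to be walked from the start. A sketch under assumptions: entries are taken to be stored back to back, nword is taken to count only the characters of the word, and the file is little-endian and already in memory. IndexEntry and parseIndex are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    std::uint8_t  level;  // indent level
    std::uint16_t toc;    // toc entry number of the panel
    std::string   word;   // index word
};

std::vector<IndexEntry> parseIndex(const std::vector<unsigned char>& file,
                                   std::size_t indexstart, std::size_t nindex) {
    std::vector<IndexEntry> entries;
    std::size_t p = indexstart;
    for (std::size_t n = 0; n < nindex; ++n) {
        if (p + 5 > file.size()) break;          // fixed part is 5 bytes
        const std::size_t nword = file[p];
        if (p + 5 + nword > file.size()) break;  // word runs past end of file
        IndexEntry e;
        e.level = file[p + 1];
        // file[p + 2] is the unexplained "stuff" byte.
        e.toc = static_cast<std::uint16_t>(file[p + 3] | (file[p + 4] << 8));
        e.word.assign(reinterpret_cast<const char*>(file.data() + p + 5), nword);
        entries.push_back(e);
        p += 5 + nword;
    }
    return entries;
}
```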
Bitmaps
I am only going to cover this at a superficial level, and I am only going to describe the compression used in newer INF files. The older (i.e. v1.3 etc.) INF files use a proprietary compression scheme.
The newer INF files use an LZW-based compression scheme. This scheme is basically the same as the one covered in 'LZW Revisited' (Dr. Dobb's Journal, June 1990). You must alter the decompression code to use MAX_BITS 12, or you will spend a long time figuring out why your output is all wrong after input byte 512 [grin].
Getting Started
There seems to be no array of bitmap offsets in the file anywhere but there is a general start for the image information - imgstart.
Here's the basic rundown on decompression:
When, during the decompression of a slot, you come across a sequence something like 0xff + 0x07 + 0x0E + 0x01 + 0x00 + 0x00 + 0x00 + 0x00
you know it is an escape code (0xff) of length 7 bytes (0x07), and that the code is for a bitmap/metafile (0x0E). The next byte is the flags byte and it breaks down like this:
    if( items[i] & 0x01 ) printf ("Left ");    // 00000001
    if( items[i] & 0x02 ) printf ("Right ");   // 00000010
    if( items[i] & 0x04 ) printf ("Center ");  // 00000100
    if( items[i] & 0x08 ) printf ("Fit ");     // 00001000
    if( items[i] & 0x10 ) printf ("Runin ");   // 00010000
Figure 14) Alignment for bitmaps
The next four bytes are a 32-bit offset from imgstart to the bitmap/metafile (i.e. you do an is.seekg( imgstart + offset )). Also remember that if you are writing for cross-platform support you'll have to deal with the big-endian/little-endian issue here as well.
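Put together, the five bytes that follow the 0x0E code can be turned into a small record. This is a sketch only; BitmapRef and decodeBitmapEscape are hypothetical names, and the offset is assembled as a little-endian value so that the result is the same on any host.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

struct BitmapRef {
    bool left, right, center, fit, runin;  // alignment flags
    std::uint32_t fileOffset;              // absolute position: imgstart + offset
};

// payload: the bytes following the 0x0E escape code (flags, then 4 offset bytes).
BitmapRef decodeBitmapEscape(const std::vector<unsigned char>& payload,
                             std::uint32_t imgstart) {
    if (payload.size() < 5) throw std::invalid_argument("bitmap escape too short");
    const unsigned char flags = payload[0];
    BitmapRef r;
    r.left   = (flags & 0x01) != 0;
    r.right  = (flags & 0x02) != 0;
    r.center = (flags & 0x04) != 0;
    r.fit    = (flags & 0x08) != 0;
    r.runin  = (flags & 0x10) != 0;
    const std::uint32_t offset = static_cast<std::uint32_t>(payload[1])
                               | (static_cast<std::uint32_t>(payload[2]) << 8)
                               | (static_cast<std::uint32_t>(payload[3]) << 16)
                               | (static_cast<std::uint32_t>(payload[4]) << 24);
    r.fileOffset = imgstart + offset;
    return r;
}
```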
The Bitmap Header and Colour table
If you do the seek and read in the next two bytes you'll be able to know what sort of image comes next. mf means a metafile (I think - I haven't seen one!). BM means the old bitmap compression, and bM is the one that we are happy to see.
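The check on those two bytes might look like the sketch below; the enumeration and function names are made up for the example.

```cpp
// The three signatures described above, plus a catch-all.
enum class ImageKind { Metafile, OldBitmap, LzwBitmap, Unknown };

// first and second are the two bytes read at imgstart + offset.
ImageKind classifyImage(unsigned char first, unsigned char second) {
    if (first == 'm' && second == 'f') return ImageKind::Metafile;   // "mf"
    if (first == 'B' && second == 'M') return ImageKind::OldBitmap;  // "BM", old compression
    if (first == 'b' && second == 'M') return ImageKind::LzwBitmap;  // "bM", LZW based
    return ImageKind::Unknown;
}
```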
So, if everything is OK (i.e. 'bM'), read in a basic OS2BITMAP_FILEHEADER and OS2BITMAP_INFOHEADER. Something like this:
    {
        // BITMAP FILE HEADER
        char8  usType[2];   // = 'bM';
        int_32 cbSize;
        int_16 xHotSpot;
        int_16 yHotSpot;
        int_32 offBits;     // =size(hdr)+size(colortbl)
        // BITMAP INFO HEADER
        int_32 cbFix;       // =size(info_hdr) (usually = 12?)
        int_16 cx;          // x size
        int_16 cy;          // y size
        int_16 cPlanes;     // color planes
        int_16 cBitCount;
    }
Figure 15) Bitmap information structure
A quick note that if you are going to use this structure "as is" to dump to a bitmap then you'll have to change the offBits like:
offBits = 14 + 12 + ( 3 * ( 1 << cBitCount ) );
Figure 16) Code change to use structure in figure 15
Next up after the header comes the colour table, which is basically an array of ( 1 << cBitCount ) RGB entries (well, actually each entry is stored as 1 byte Blue, 1 byte Green, 1 byte Red).
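The arithmetic for the colour table and for the adjusted offBits is easy to get wrong by one term, so here is a small sketch of both. The names are invented, and the code is only meant for the palette depths where the formula above applies (1, 4 and 8 bits per pixel).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb { std::uint8_t red, green, blue; };

// Number of colour table entries for a given bit depth.
std::size_t paletteEntries(unsigned cBitCount) {
    return std::size_t(1) << cBitCount;
}

// offBits for a standalone bitmap: 14-byte file header, 12-byte info
// header, then 3 bytes per colour table entry.
std::uint32_t standaloneOffBits(unsigned cBitCount) {
    return static_cast<std::uint32_t>(14 + 12 + 3 * paletteEntries(cBitCount));
}

// raw: the bytes following the header; entries are stored Blue, Green, Red.
std::vector<Rgb> parsePalette(const std::vector<unsigned char>& raw, unsigned cBitCount) {
    std::vector<Rgb> pal;
    const std::size_t n = paletteEntries(cBitCount);
    for (std::size_t k = 0; k < n && 3 * k + 2 < raw.size(); ++k) {
        Rgb c;
        c.blue  = raw[3 * k];
        c.green = raw[3 * k + 1];
        c.red   = raw[3 * k + 2];
        pal.push_back(c);
    }
    return pal;
}
```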
Data Blocks
Next up (after all that) comes the Master Data Block, and one (or more) minor data blocks - each with their own compression type.
    {
        // Master Data Block
        int_32 num_to_follow;        // total number of bytes to follow
        int_16 uncompressed_bytes;   // uncompressed bytes in each block
    }
    {
        // Minor Data Block
        int_16 x_bytes_to_follow;    // number of bytes in this block (to follow)
        int_8  comp_type;            // compression type 0=uncompressed, 2=lzw-based
    }
Figure 17) Data block structures
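Reading the two block headers of Figure 17 is straightforward. The sketch below deliberately stops at the headers: exactly which bytes the two "to follow" counts cover is not spelled out above, so no attempt is made to step from one minor block to the next. All names are hypothetical and little-endian storage is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct MasterBlock {
    std::uint32_t numToFollow;        // total number of bytes to follow
    std::uint16_t uncompressedBytes;  // uncompressed bytes in each block
};

struct MinorBlockHeader {
    std::uint16_t bytesToFollow;      // number of bytes in this block
    std::uint8_t  compType;           // 0 = uncompressed, 2 = lzw-based
};

// The master block header is 6 bytes long.
MasterBlock parseMasterBlock(const std::vector<unsigned char>& b, std::size_t pos) {
    if (pos + 6 > b.size()) throw std::out_of_range("master block past end of data");
    MasterBlock m;
    m.numToFollow = static_cast<std::uint32_t>(b[pos])
                  | (static_cast<std::uint32_t>(b[pos + 1]) << 8)
                  | (static_cast<std::uint32_t>(b[pos + 2]) << 16)
                  | (static_cast<std::uint32_t>(b[pos + 3]) << 24);
    m.uncompressedBytes = static_cast<std::uint16_t>(b[pos + 4] | (b[pos + 5] << 8));
    return m;
}

// A minor block header is 3 bytes long.
MinorBlockHeader parseMinorBlockHeader(const std::vector<unsigned char>& b, std::size_t pos) {
    if (pos + 3 > b.size()) throw std::out_of_range("minor block past end of data");
    MinorBlockHeader h;
    h.bytesToFollow = static_cast<std::uint16_t>(b[pos] | (b[pos + 1] << 8));
    h.compType = b[pos + 2];
    return h;
}

std::string compressionName(std::uint8_t compType) {
    switch (compType) {
    case 0:  return "uncompressed";
    case 2:  return "lzw-based";
    default: return "unknown";
    }
}
```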
To help with your understanding here is the output from a program I wrote while working out the decompression routine.
[Editor's note - my word processor completely destroyed the alignment that was present in the output, so I feel forced to delete it. My apologies.]
Parting words
All my knowledge of the INF file format stemmed from the work of others. I hopefully have added some small bits of useful information and my motivation for writing this article is to make that information available to others.
The document that encouraged me into investigating the INF file format is available at OS/2 2.0 Information Presentation Facility (IPF) Data Format and was authored by Carl Hauser, and updated by Marcus Groeber. I lifted lots of stuff out of it for this article and have included inf03.txt as a slightly updated version.