Inside INF

by Peter Childs

The Problem
As I sat debating the approach I would adopt for this article it became distressingly obvious that there would be only a small audience of fanatics that would share my enthusiasm for the intricate details of the INF file format.

I also had to painfully admit that others would not even know that INF/HLP files are the backbone of OS/2's online help system, and that most of those that did would be quite happy to consider the issue finished at that point.

The Solution
In this article I will attempt to present an overview of the OS/2 INF/HLP file format in such a manner that a cursory read will leave the reader with a general idea of the file format. I will also explain some of the compression ideas used in the INF files and provide additional information for those wishing to investigate deeper.

As my talents as a programmer are fairly limited I will not include large chunks of source. I am, however, working on developing some C++ classes to allow easy access to the information in INF files. If there is sufficient demand I may do an article showing the possible use of these classes.

Some Basic Terms
In this document we will be discussing the INF/HLP file format.

INF files and HLP files are basically identical with the exception that INF files are generally designed to be viewed with the view.exe program, like this magazine, and HLP files are developed to provide online help for applications. To the best of my knowledge the file format is identical except for a single flag bit. From here on, when I refer to INF files I mean both INF and HLP files.

INF files are compiled with the IPFC (Information Presentation Facility Compiler) available with most OS/2 compilers, and the OS/2 Toolkits. The source used is a form of fairly simple markup, with the power to do just about anything you could want.

The IPF Online Reference (1st Ed 1994) describes IPF:

The Information Presentation Facility (IPF) is a tool that enables you to create online information, to specify how it will appear on the screen, to connect various parts of the information, and to provide help information that can be requested by the user.

It is important to realize the difference between IPF source markup, and INF files. The INF files are the compiled versions, and the difference is as marked as the difference between C source and an executable. The IPF source markup is well documented, whereas the INF file format is not officially documented.

The Header
The header is described here as a structure. When I began playing with INF files I just defined this structure and then read 155 bytes into it starting at offset 0. Although this worked fine for Borland C++ I had to muck with things with gcc to force the structure to be packed.

Most compilers offer a method of packing structures, but if your code has to be portable then you will also have to consider the big-endian/little-endian issue (i.e. the bytes of a multi-byte value are stored in a different order in memory on SPARCs than on PCs). Although probably obvious to most programmers, this had me stumped!
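One way to sidestep both the packing and the byte-order problem is to read the header into a plain byte buffer and assemble each field arithmetically. The following is a minimal sketch, assuming the file stores multi-byte integers in the PC's little-endian order; the helper names read_le16 and read_le32 are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assemble a 16-bit value from two bytes stored low byte first.
// The result is the same on big- and little-endian hosts because
// the bytes are combined with shifts, not reinterpreted in place.
inline std::uint16_t read_le16(const unsigned char* p) {
    assert(p != nullptr);
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Assemble a 32-bit value from four bytes stored low byte first.
inline std::uint32_t read_le32(const unsigned char* p) {
    assert(p != nullptr);
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

With helpers like these, each header field is read from the buffer at its own offset, so the compiler's structure padding never enters the picture.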

Starting at file offset 0 the following structure resembles the header (packed):

Figure 1) INF header structure

Most of these values come in handy.

Our Sample File
Below is a simple IPF (source) file which I have compiled into an INF for use in examples.

Figure 2) Sample IPF file

Master Dictionary
Each INF file has a master dictionary which holds all of the words and symbols used in the articles that make up the INF file. The dictionary starts at offset dictstart, has a length of dictlen bytes, and comprises ndict words.



In the example case above the dictionary is like this:

Figure 4) Master dictionary for the IPF sample

One thing to note is that the source contains the word Os/2, whereas the dictionary contains the separate words '/', '2', and 'Os'.

One way of loading the dictionary is detailed below (C++ code snippet).

Figure 5) C++ code for loading the dictionary

The method used is irrelevant, but you need some way of mapping i to the i'th element of the dictionary. Also, if you use the above sample, don't forget to delete the allocated memory with:

delete[] dictstore; delete[] dict;

Figure 6) C++ to delete the dictionary
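As an alternative illustration of the i-to-word mapping, here is a minimal sketch that builds the dictionary from a buffer holding the dictlen bytes read from dictstart. It assumes, and this is an assumption not confirmed by the text above, that each entry is a length byte that counts itself, followed by the characters of the word. The function name load_dictionary is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build a vector so that dict[i] is the i'th word of the master dictionary.
// ASSUMED layout: [length byte, counting itself][length-1 characters] ...
std::vector<std::string> load_dictionary(const unsigned char* data,
                                         std::size_t dictlen,
                                         std::size_t ndict) {
    std::vector<std::string> dict;
    dict.reserve(ndict);
    std::size_t pos = 0;
    while (dict.size() < ndict && pos < dictlen) {
        std::size_t entry = data[pos];                 // size of the whole entry
        if (entry == 0 || pos + entry > dictlen) break; // malformed: stop early
        dict.emplace_back(reinterpret_cast<const char*>(data + pos + 1),
                          entry - 1);
        pos += entry;
    }
    return dict;
}
```

Because the words live in a std::vector of std::string, nothing has to be released by hand in this variant.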

Articles
Each article in an INF file is comprised of one or more slots. There are several structures that deal with slots. One is an array of offsets mapping i to the i'th slot's position in the file. Another is the structure of the slot itself. Each slot also has a local dictionary that maps items in the slot to words in the master dictionary.

The Slots Array
Beginning at file offset slotsstart (from the header) there is an array of 32-bit integers. The i'th entry is the offset in the INF file at which the i'th slot can be found.

int_32 slots[nslots]

Figure 7) Slots area declaration
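Loading this array can be sketched as follows. This is a minimal example under two assumptions: the whole INF file has already been read into memory, and the offsets are stored little-endian. The function name read_slot_offsets is made up for illustration; slotsstart and nslots are the header fields named above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the nslots file offsets stored in the array at slotsstart.
// `file` holds the complete INF file. An empty result means the
// array would run past the end of the file.
std::vector<std::uint32_t> read_slot_offsets(const std::vector<unsigned char>& file,
                                             std::uint32_t slotsstart,
                                             std::uint16_t nslots) {
    std::vector<std::uint32_t> slots;
    const std::size_t end = static_cast<std::size_t>(slotsstart) + 4u * nslots;
    if (end > file.size()) return slots;
    slots.reserve(nslots);
    for (std::size_t i = 0; i < nslots; ++i) {
        const unsigned char* p = &file[slotsstart + 4 * i];
        // Combine four bytes, low byte first (assumed little-endian).
        slots.push_back( static_cast<std::uint32_t>(p[0])
                      | (static_cast<std::uint32_t>(p[1]) << 8)
                      | (static_cast<std::uint32_t>(p[2]) << 16)
                      | (static_cast<std::uint32_t>(p[3]) << 24));
    }
    return slots;
}
```

The same routine would serve for any other table of 32-bit offsets in the file, given its start offset and entry count.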

The Slots Themselves
Beginning at the file offset slots[i] the following structure can overlay the file:

Figure 8) Slots structure

The Local Dictionary
The local dictionary is used to map items in the encoded text of the slot to words in the master dictionary. Take note that the nlocaldict variable in a slot's structure is a byte in size, hence a single slot can only have a maximum of 255 (really 250 - we'll discuss that later) different words from the master dictionary in it.

Beginning at file offset localdictpos (for each slot) there is an array:

int_16 localdict[nlocaldict]

Figure 9) Local dictionary declaration
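The two-level lookup can be sketched like this: a byte from the slot's encoded text indexes localdict, and the value found there indexes the master dictionary. This is a minimal sketch, assuming both tables have already been loaded into vectors; the function name lookup_word is hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Resolve one byte of encoded slot text to a master-dictionary word.
// Returns nullptr when the byte is not a usable local-dictionary index.
const std::string* lookup_word(unsigned char code,
                               const std::vector<std::uint16_t>& localdict,
                               const std::vector<std::string>& dict) {
    if (code >= localdict.size()) return nullptr;   // not a word index
    const std::uint16_t master = localdict[code];   // index into master dictionary
    if (master >= dict.size()) return nullptr;      // corrupt or truncated data
    return &dict[master];
}
```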

The Text Itself
The encoded text is decoded somewhat like the following:

Figure 10) Sample code for decoding text

It is pretty obvious that this doesn't leave a lot of space for formatting commands. This is where the escape codes come in. The general format for an escape code is:

Figure 11) Escape codes structure

These escape codes define things like setting margins, inter-document links, and the like. We will ignore them here for the moment. They are described in inf03.txt.
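To make the idea concrete, here is a minimal text-only decoding sketch that skips every escape code. The meanings given to the bytes 0xFA, 0xFC, 0xFD and 0xFE are assumptions for illustration and should be checked against inf03.txt. The escape layout (0xFF, then a length byte that counts itself, then the code and its arguments) is likewise assumed. The function name decode_slot_text is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode one slot's text, ignoring all formatting.
// `spacing` is passed by reference so its state can carry over
// from one slot to the next.
std::string decode_slot_text(const std::vector<unsigned char>& text,
                             const std::vector<std::uint16_t>& localdict,
                             const std::vector<std::string>& dict,
                             bool& spacing) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char b = text[i];
        if (b >= 0xFA) {                       // special codes (ASSUMED meanings)
            switch (b) {
            case 0xFA: out += "\n\n"; break;   // end of paragraph
            case 0xFC: spacing = !spacing; break; // toggle automatic spacing
            case 0xFD: out += '\n'; break;     // line break
            case 0xFE: out += ' '; break;      // explicit space
            case 0xFF: {                       // escape code: skip it entirely
                if (i + 1 >= text.size()) return out;
                const unsigned char len = text[i + 1];
                if (len == 0) return out;      // malformed: stop decoding
                i += len;                      // now on the escape's last byte
                break;
            }
            default: break;                    // 0xFB: unknown, ignored
            }
        } else if (b < localdict.size()) {     // ordinary word index
            const std::uint16_t master = localdict[b];
            if (master < dict.size()) out += dict[master];
            if (spacing) out += ' ';
        }
    }
    return out;
}
```

With spacing switched off, the three dictionary words 'Os', '/' and '2' decode back to the single word Os/2.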

So Show Me Something That Works!
Ok, here's a snippet that shows the basic idea behind decoding slots. For the sake of simplicity we ignore most of the nasty stuff. You should also note that although each slot contains some text - the way these slots fit together is described next in Table of Contents.

Included with this issue of EDM/2 is the source for a small program, called exttext.cc, that extracts all the textual information from an INF file. Also included is a simple INF header class.

A quick note about decoding articles that isn't mentioned in inf02a.doc: when you decode multi-slot articles, the state of SPACE (i.e. true or false) is retained between one slot and the next. Basically, although the local dictionary changes, pretend that each slot is merged onto the next with regard to settings like the left and right margins, fonts, colours, font styles, and space.

Table of Contents
The table of contents is created by loading in an array of ntoc 32-bit offsets, starting at file offset tocstart.

At the offset ( tocentrystart[i] ) a toc entry structure is located that contains information including the title, the item's level in the table of contents, whether it is hidden or not, and, most importantly, how many and which slots make up the item. There is also a 'has_children' flag which, if true, means the following entry has a higher level.



Index
The index is pretty simple and relies on the table of contents a fair bit. Beginning at file offset indexstart there are nindex structures like the following:

Figure 13) Index structure

Bitmaps
I am only going to cover this at a superficial level. I am only going to describe the compression used in newer INF files. The older (ie. v1.3 etc) INF files use a proprietary compression scheme.

The newer INF files use an LZW-based compression scheme. This scheme is basically the same as the one covered in 'LZW Revisited' (Dr. Dobb's Journal, June 1990). You must alter the decompression code to use MAX_BITS 12 or you will spend a long time figuring out that after that last input byte 512 your output is all wrong [grin].

Getting Started
There seems to be no array of bitmap offsets in the file anywhere but there is a general start for the image information - imgstart.

Here's the basic rundown on decompression:

When during the decompression of a slot you come across a sequence something like 0xff + 0x07 + 0x0E + 0x01 + 0x00 + 0x00 + 0x00 + 0x00

you know it is an escape code (0xff) of length 7 bytes (0x07), and that the code is for a bitmap/metafile (0x0E). The next byte is the flags byte and it breaks down like this:

Figure 14) Alignment for bitmaps

The next four bytes are a 32-bit offset from imgstart to the bitmap/metafile (i.e. you do an is.seekg( imgstart + offset )). Also remember that if you are writing for cross-platform support you'll have to deal with the big-endian/little-endian issue.
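Pulling the pieces of that eight-byte sequence apart can be sketched as below. This is a minimal example that assumes the four offset bytes are little-endian; the struct BitmapRef and the function parse_bitmap_escape are names invented for this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Result of decoding a bitmap/metafile escape sequence.
struct BitmapRef {
    bool          valid;        // false if the bytes are not a bitmap escape
    unsigned char flags;        // the flags byte, left undecoded here
    std::uint32_t file_offset;  // imgstart + offset: where to seek in the file
};

// `p` points at the 0xFF byte; `avail` is the number of bytes left in the slot.
BitmapRef parse_bitmap_escape(const unsigned char* p, std::size_t avail,
                              std::uint32_t imgstart) {
    BitmapRef ref = {false, 0, 0};
    if (p == nullptr || avail < 8) return ref;
    if (p[0] != 0xFF || p[1] < 0x07 || p[2] != 0x0E) return ref;
    // Four offset bytes, low byte first (assumed little-endian).
    const std::uint32_t offset =  static_cast<std::uint32_t>(p[4])
                               | (static_cast<std::uint32_t>(p[5]) << 8)
                               | (static_cast<std::uint32_t>(p[6]) << 16)
                               | (static_cast<std::uint32_t>(p[7]) << 24);
    ref.valid = true;
    ref.flags = p[3];
    ref.file_offset = imgstart + offset;
    return ref;
}
```

The file_offset field is the value that would be handed to is.seekg().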

The Bitmap Header and Colour table
If you do the seek and read in the next two bytes you'll be able to know what sort of image comes next. mf means a metafile (I think - I haven't seen one!). BM means the old bitmap compression, and bM is the one that we are happy to see.
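The two-byte check can be written as a small classifier. This is only a sketch; the enum ImageKind and the function classify_image are names made up for illustration.

```cpp
// The kinds of image distinguished by the first two bytes.
enum class ImageKind { Metafile, OldBitmap, NewBitmap, Unknown };

// Classify an image from the two bytes found at imgstart + offset.
ImageKind classify_image(unsigned char first, unsigned char second) {
    if (first == 'm' && second == 'f') return ImageKind::Metafile;  // "mf"
    if (first == 'B' && second == 'M') return ImageKind::OldBitmap; // "BM", old compression
    if (first == 'b' && second == 'M') return ImageKind::NewBitmap; // "bM", LZW-based
    return ImageKind::Unknown;
}
```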

So, if everything is OK (ie 'bM') then read in a basic OS2BITMAP_FILEHEADER and OS2BITMAP_INFOHEADER. Something like this:

Figure 15) Bitmap information structure

A quick note: if you are going to use this structure "as is" to dump to a bitmap file then you'll have to change offBits like so:

offBits = 14 + 12 + ( 3 * ( 1 << cBitCount ) );

Figure 16) Code change to use structure in figure 15

Next up after the header comes the colour table, which is basically an array of ( 1 << cBitCount ) RGB entries (well, actually 1 byte Blue, 1 byte Green, 1 byte Red).
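The arithmetic behind the colour table size and the offBits value can be sketched as two small functions. This assumes a palette-based bit depth (for example 1, 4 or 8 bits per pixel) and takes 14 and 12 as the sizes of the file header and info header from the offBits formula; the function names are hypothetical.

```cpp
#include <cstdint>

// Number of entries in the colour table for a palette-based bitmap.
std::uint32_t palette_entries(unsigned cBitCount) {
    return 1u << cBitCount;
}

// Offset of the pixel data in a dumped bitmap file:
// 14-byte file header + 12-byte info header + 3 bytes (B, G, R) per entry.
std::uint32_t bitmap_data_offset(unsigned cBitCount) {
    return 14u + 12u + 3u * palette_entries(cBitCount);
}
```

For a 16-colour bitmap (cBitCount of 4) this gives 16 palette entries and an offBits of 74.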

Data Blocks
Next up (after all that) comes the Master Data Block, and one (or more) minor data blocks - each with their own compression type.

Figure 17) Data block structures

To help with your understanding here is the output from a program I wrote while working out the decompression routine.

[Editor's note - my word processor completely destroyed the alignment that was present in the output, so I feel forced to delete it. My apologies.]

Parting words
All my knowledge of the INF file format stemmed from the work of others. I hopefully have added some small bits of useful information and my motivation for writing this article is to make that information available to others.

The document that encouraged me into investigating the INF file format is available at OS/2 2.0 Information Presentation Facility (IPF) Data Format and was authored by Carl Hauser, and updated by Marcus Groeber. I lifted lots of stuff out of it for this article and have included inf03.txt as a slightly updated version.