Inside INF

From EDM2
Revision as of 17:31, 15 December 2017 by Ak120 (Talk | contribs)

by Peter Childs

Introduction

The Problem

As I sat debating the approach I would adopt for this article it became distressingly obvious that there would be only a small audience of fanatics that would share my enthusiasm for the intricate details of the INF file format.

I also had to painfully admit that others would not even know that INF/HLP files are the backbone of OS/2's online help system, and that most of those that did would be quite happy to consider the issue finished at that point.

The Solution

In this article I will attempt to present an overview of the OS/2 INF/HLP file format in such a manner that a cursory read will leave the reader with a general idea of the file format. I will also explain some of the compression ideas used in the INF files and provide additional information for those wishing to investigate deeper.

As my talents as a programmer are fairly limited I will not include large chunks of source. I am, however, working on developing some C++ classes to allow easy access to the information in INF files. If there is sufficient demand I may do an article showing the possible use of these classes.

What does it mean?

Some Basic Terms

In this document we will be discussing the INF/HLP file format.

INF files and HLP files are basically identical with the exception that INF files are generally designed to be viewed with the view.exe program, like this magazine, and HLP files are developed to provide online help for applications. To the best of my knowledge the file format is identical except for a single flag bit. From here on, when I refer to INF files I mean both INF and HLP files.

INF files are compiled with the IPFC (Information Presentation Facility Compiler) available with most OS/2 compilers, and the OS/2 Toolkits. The source used is a form of fairly simple markup, with the power to do just about anything you could want.

The IPF Online Reference (1st Ed 1994) describes IPF:

The Information Presentation Facility (IPF) is a tool that enables you to create online information, to specify how it will appear on the screen, to connect various parts of the information, and to provide help information that can be requested by the user.

It is important to realize the difference between IPF source markup, and INF files. The INF files are the compiled versions, and the difference is as marked as the difference between C source and an executable. The IPF source markup is well documented, whereas the INF file format is not officially documented.

In the beginning

The Header

The header is described here as a structure. When I began playing with INF files I just defined this structure and then read 155 bytes into it starting at offset 0. Although this worked fine for Borland C++ I had to muck with things with gcc to force the structure to be packed.

Most compilers offer a method of packing structures, but if your code has to be portable then you will also have to consider the big-endian/little-endian issue (i.e. the bytes of a multi-byte value are stored in a different order in memory on SPARCs than on PCs). Although probably obvious to most programmers, this had me stumped!
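One way to sidestep both the packing and the byte-order problem is to read the file into a byte buffer and assemble each value by hand. The sketch below assumes the file stores values low byte first, which is consistent with the magic word 5348h appearing on disk as "HS"; the helper names readLE16 and readLE32 are hypothetical and not part of any toolkit.

```cpp
#include <cstdint>

// Hypothetical helpers (not from the article): assemble little-endian
// values one byte at a time, so the result is the same on any host CPU.
inline std::uint16_t readLE16(const unsigned char *p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t readLE32(const unsigned char *p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Because no structure is overlaid on the buffer, compiler packing options no longer matter with this approach.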

Starting at file offset 0 the following structure resembles the header (packed):

struct os2infheader
{
   int_16 ID;               // ID magic word (5348h = "HS")
   int_8  unknown1;         // unknown purpose, could be third letter of ID
   int_8  flags;            // probably a flag word...
                            // bit 0: set if INF style file
                            // bit 4: set if HLP style file
                            // patching this byte allows reading HLP files
                            // using the VIEW command, while help files
                            // seem to work with INF settings here as well.
   int_16 hdrsize;          // total size of header
   int_16 unknown2;         // unknown purpose
   int_16 ntoc;             // 16 bit number of entries in the tocarray
   int_32 tocstrtablestart; // 32 bit file offset of the start of the
                            // strings for the table-of-contents
   int_32 tocstrlen;        // number of bytes in file occupied by the
                            // table-of-contents strings
   int_32 tocstart;         // 32 bit file offset of the start of tocarray
   int_16 nres;             // number of panels with resource numbers
   int_32 resstart;         // 32 bit file offset of resource number table
   int_16 nname;            // number of panels with textual name
   int_32 namestart;        // 32 bit file offset to panel name table
   int_16 nindex;           // number of index entries
   int_32 indexstart;       // 32 bit file offset to index table
   int_32 indexlen;         // size of index table
   int_8  unknown3[10];     // unknown purpose
   int_32 searchstart;      // 32 bit file offset of full text search table
   int_32 searchlen;        // size of full text search table
   int_16 nslots;           // number of "slots"
   int_32 slotsstart;       // file offset of the slots array
   int_32 dictlen;          // number of bytes occupied by the
                            // "dictionary"
   int_16 ndict;            // number of entries in the dictionary
   int_32 dictstart;        // file offset of the start of the dictionary
   int_32 imgstart;         // file offset of image data
   int_8  unknown4;         // unknown purpose
   int_32 nlsstart;         // 32 bit file offset of NLS table
   int_32 nlslen;           // size of NLS table
   int_32 extstart;         // 32 bit file offset of extended data block
   int_8  unknown5[12];     // unknown purpose
   char8 title[48];         // ASCII title of database
};

Figure 1) INF header structure

Most of these values come in handy.
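As an illustration, the header can also be picked apart without overlaying a structure at all. The sketch below is only a rough example: the byte offsets are derived by adding up the field sizes in Figure 1 (with tocstart as a 4-byte field directly after tocstrlen, which is what makes the total come to 155 bytes), and the names InfHeaderSummary and parseInfHeader are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical summary type: only a subset of the fields in Figure 1.
struct InfHeaderSummary {
    std::uint16_t ntoc = 0, nslots = 0, ndict = 0;
    std::uint32_t tocstart = 0, slotsstart = 0;
    std::uint32_t dictlen = 0, dictstart = 0, imgstart = 0;
    std::string   title;
};

static std::uint16_t hdrLE16(const unsigned char *p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t hdrLE32(const unsigned char *p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// buf holds the first len bytes of the file. Returns false if the buffer
// is shorter than the 155-byte header or the "HS" magic is missing.
bool parseInfHeader(const unsigned char *buf, std::size_t len,
                    InfHeaderSummary &out)
{
    if (len < 155 || buf[0] != 'H' || buf[1] != 'S')
        return false;

    // Offsets assume the packed layout of Figure 1.
    out.ntoc       = hdrLE16(buf + 8);
    out.tocstart   = hdrLE32(buf + 18);
    out.nslots     = hdrLE16(buf + 62);
    out.slotsstart = hdrLE32(buf + 64);
    out.dictlen    = hdrLE32(buf + 68);
    out.ndict      = hdrLE16(buf + 72);
    out.dictstart  = hdrLE32(buf + 74);
    out.imgstart   = hdrLE32(buf + 78);

    // The title occupies 48 bytes; stop at the first zero byte if any.
    const char *t   = reinterpret_cast<const char *>(buf + 107);
    const void *nul = std::memchr(t, 0, 48);
    std::size_t n   = nul
        ? static_cast<std::size_t>(static_cast<const char *>(nul) - t)
        : 48;
    out.title.assign(t, n);
    return true;
}
```

Reading the first 155 bytes of an INF file into a buffer and passing it to parseInfHeader should yield the values needed by the later sections (dictstart, slotsstart, imgstart and so on).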

Our Sample File

Below is a simple IPF (source) file which I have compiled into an INF for use in examples.

:userdoc.
:title. Sample INF file...
:h1.Header One
:p.This is a test. Os/2, lies, and Windows 95.
:p.1234.5
:artwork name='in_inf.bmp'.
Hello
:p.
:artwork name='tocarray.bmp'.
:i1. This is a index entry
:euserdoc.

Figure 2) Sample IPF file

Master Dictionary

Each INF file has a master dictionary which holds all of the words and symbols used in the articles that make up the INF file. The dictionary starts at file offset dictstart, is dictlen bytes long, and comprises ndict words.

Figure 3) Master dictionary layout (in the file)

In the example case above the dictionary is like this:

[0] : (,)
[1] : (.)
[2] : (/)
[3] : (1234)
[4] : (2)
[5] : (5)
[6] : (95)
[7] : (a)
[8] : (and)
[9] : (Hello)
[10] : (is)
[11] : (lies)
[12] : (Os)
[13] : (test)
[14] : (This)
[15] : (Windows)

Figure 4) Master dictionary for the IPF sample

Some things to note are that the source contains the word Os/2, whereas the dictionary contains the words '/', '2', and 'Os'.

One way of loading the dictionary is detailed below (C++ code snippet).

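// read the raw dictionary from the open INF stream 'is';
// one extra byte terminates the last word
dictstore = new char[ infHeader.dictlen + 1 ];
is.seekg( infHeader.dictstart );
is.read( dictstore, infHeader.dictlen );
dictstore[ infHeader.dictlen ] = '\0';

dict = new char*[ infHeader.ndict ]; // our array of pointers

// change all length bytes to '\0' and set pointers
// to start of each word
unsigned long i = 0;
unsigned j = 0;
while( i < infHeader.dictlen && j < infHeader.ndict )
     {
     unsigned char add = dictstore[i]; // entry length, including this byte
     if( add == 0 ) break;             // malformed entry, avoid looping forever
     dict[j++] = &( dictstore[i+1] );
     dictstore[i] = '\0';
     i += add;
     }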

Figure 5) C++ code for loading the dictionary

The method used is irrelevant, but you need some way of mapping i to the i'th element of the dictionary. If you use the sample above, don't forget to free the allocated memory afterwards:

delete[] dictstore;
delete[] dict;

Figure 6) C++ to delete the dictionary
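Since the method is irrelevant, an alternative that avoids the manual delete[] calls is to copy the words into standard containers. The following is a minimal sketch, assuming (as the loop in Figure 5 implies) that each entry starts with a length byte that counts itself; loadDictionary is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// raw points at the dictlen bytes read from file offset dictstart.
// Each entry is a length byte followed by the word; the length byte
// is taken to include itself, so the word has (length - 1) characters.
std::vector<std::string> loadDictionary(const unsigned char *raw,
                                        std::size_t dictlen,
                                        std::size_t ndict)
{
    std::vector<std::string> dict;
    std::size_t i = 0;
    while (i < dictlen && dict.size() < ndict) {
        std::size_t entry = raw[i];
        if (entry == 0 || entry > dictlen - i)
            break;                       // malformed entry: stop early
        dict.emplace_back(reinterpret_cast<const char *>(raw + i + 1),
                          entry - 1);
        i += entry;
    }
    return dict;
}
```

With this version dict[i] is simply the i'th word, and the memory is released when the vector goes out of scope.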

Articles

Each article in an INF file consists of one or more slots. There are several structures that deal with slots. One is an array of offsets mapping i to the i'th slot's position in the file. Another is the structure of the slot itself. Each slot also has a local dictionary that maps items in the slot to words in the master dictionary.

The Slots Array

Beginning at file offset slotsstart (from the header) there is an array of nslots int_32 values. The i'th element is the file offset at which the i'th slot can be found.

int_32 slots[nslots]

Figure 7) Slots array declaration
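Reading such an offset array from a file image held in memory might look like the sketch below. It assumes little-endian 32-bit values and a buffer containing the whole file; readOffsetArray is a hypothetical helper, and the same routine would serve for any other table of 32-bit offsets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// file/fileLen: the whole INF file in memory (an assumption of this sketch).
// start: file offset of the array (e.g. slotsstart); count: e.g. nslots.
// Returns an empty vector if the array would run past the end of the file.
std::vector<std::uint32_t> readOffsetArray(const unsigned char *file,
                                           std::size_t fileLen,
                                           std::uint32_t start,
                                           std::size_t count)
{
    std::vector<std::uint32_t> offsets;
    if (start > fileLen || count > (fileLen - start) / 4)
        return offsets;

    offsets.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char *p = file + start + 4 * i;
        offsets.push_back( static_cast<std::uint32_t>(p[0])
                        | (static_cast<std::uint32_t>(p[1]) << 8)
                        | (static_cast<std::uint32_t>(p[2]) << 16)
                        | (static_cast<std::uint32_t>(p[3]) << 24));
    }
    return offsets;
}
```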

The Slots Themselves

Beginning at the file offset slots[i] the following structure can overlay the file:

{
     int_8  stuff;         // ?? [always seen 0]
     int_32 localdictpos;  // file offset of the local dictionary
     int_8  nlocaldict;    // number of entries in the local dictionary
     int_16 ntext;         // number of bytes in the text
     int_8  text[ntext];   // encoded text of the article
}

Figure 8) Slots structure
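Because text[] has a variable length, the slot cannot be declared as a fixed C structure; it has to be read field by field. A minimal sketch follows, assuming the whole file is in memory and values are little-endian; the names Slot and parseSlot are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {                          // hypothetical in-memory form of Figure 8
    std::uint32_t localdictpos = 0;    // file offset of the local dictionary
    std::uint8_t  nlocaldict   = 0;    // entries in the local dictionary
    std::vector<unsigned char> text;   // encoded text (ntext bytes)
};

// pos is one of the offsets from the slots array.
// Returns false if the slot would run past the end of the file.
bool parseSlot(const unsigned char *file, std::size_t fileLen,
               std::uint32_t pos, Slot &out)
{
    if (pos > fileLen || fileLen - pos < 8)
        return false;                  // fixed part is 1 + 4 + 1 + 2 bytes

    const unsigned char *p = file + pos;
    // p[0] is the 'stuff' byte, which is skipped here
    out.localdictpos =  static_cast<std::uint32_t>(p[1])
                     | (static_cast<std::uint32_t>(p[2]) << 8)
                     | (static_cast<std::uint32_t>(p[3]) << 16)
                     | (static_cast<std::uint32_t>(p[4]) << 24);
    out.nlocaldict = p[5];
    std::size_t ntext = static_cast<std::size_t>(p[6] | (p[7] << 8));

    if (fileLen - pos - 8 < ntext)
        return false;
    out.text.assign(p + 8, p + 8 + ntext);
    return true;
}
```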

The Local Dictionary

The local dictionary is used to map items in the encoded text of the slot to words in the master dictionary. Take note that the nlocaldict variable in a slot's structure is a byte in size, hence a single slot can only have a maximum of 255 (really 250 - we'll discuss that later) different words from the master dictionary in it.

Beginning at file offset localdictpos (for each slot) there is an array of indexes into the master dictionary:

int_16 localdict[nlocaldict]

Figure 9) Local dictionary declaration

The Text Itself

The encoded text is decoded somewhat like the following:

bool space = true;
for( int i = 0; i < ntext; i++ )
     switch( text[i] )
     {
            case 0xfa: // end of paragraph, sets space to true
                   space = true;
                   break;
            case 0xfb: // [unknown]
                   break;
            case 0xfc: // toggle spacing
                   space = !space;
                   break;
            case 0xfd: // line break, set space to true if not in a
                       // monospaced example
                   break;
            case 0xfe: // space
                   break;
            case 0xff: // escape code; text[i+1] holds its length, so
                       // skip over the whole sequence
                   i += text[i+1];
                   break;
            default:   // output dict[localdict[text[i]]] and, if
                       // space is true, a space.
                   break;
     }

Figure 10) Sample code for decoding text

It is pretty obvious that this doesn't leave a lot of space for formatting commands. This is where the escape codes come in. The general format for an escape code is:

{
     int_8  FF;       // always equals 0xFF
     int_8  esclen;   // length of sequence
                      //   (including esclen, excluding 0xFF)
     int_8  escCode;  // which escape code
}

Figure 11) Escape codes structure

These escape codes define things like setting margins, inter-document links, and the like. We will ignore them here for the moment. They are described in inf03.txt.

So Show Me Something That Works!

Ok, here's a snippet that shows the basic idea behind decoding slots. For the sake of simplicity we ignore most of the nasty stuff. You should also note that although each slot contains some text - the way these slots fit together is described next in Table of Contents.

Included with this issue of EDM/2 is the source for a small program, called exttext.cc, that extracts all the textual information from an INF file. Also included is a simple INF header class.

A few quick notes on decoding articles that aren't mentioned in inf02a.doc. When you decode multi-slot articles, the state of space (i.e. true or false) is retained from one slot to the next. Basically, although the local dictionary changes, pretend that each slot is merged onto the next with regard to settings like the left and right margins, fonts, colours, font styles, and space.
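The decoding loop of Figure 10 can be fleshed out into a small text-only decoder along the following lines. This is a sketch rather than the code from exttext.cc: it skips every escape sequence, treats the 0xfd line break as if the text were never monospaced, and takes the space flag by reference so that its state carries over when the function is called once per slot. The name decodeSlotText is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// text:      encoded bytes of one slot
// localdict: the slot's local dictionary (indexes into dict)
// dict:      the master dictionary
// space:     spacing state, kept by the caller between slots
std::string decodeSlotText(const std::vector<unsigned char> &text,
                           const std::vector<std::uint16_t> &localdict,
                           const std::vector<std::string> &dict,
                           bool &space)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = text[i];
        switch (c) {
        case 0xfa:                      // end of paragraph
            out += '\n';
            space = true;
            break;
        case 0xfb:                      // unknown, ignored
            break;
        case 0xfc:                      // toggle spacing
            space = !space;
            break;
        case 0xfd:                      // line break (assumed not monospaced)
            out += '\n';
            space = true;
            break;
        case 0xfe:                      // explicit space
            out += ' ';
            break;
        case 0xff:                      // escape: esclen counts itself but
            if (i + 1 < text.size())    // not the 0xff byte, so this lands
                i += text[i + 1];       // on the last byte of the sequence
            break;
        default:                        // a word from the local dictionary
            if (c < localdict.size() && localdict[c] < dict.size()) {
                out += dict[localdict[c]];
                if (space)
                    out += ' ';
            }
            break;
        }
    }
    return out;
}
```

For a multi-slot article, the caller would initialise space to true once and then call decodeSlotText for each slot in turn, appending the results.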

Table of Contents

The table of contents is created by loading an array (called tocentrystart here) of ntoc 32-bit offsets, starting at file offset tocstart.

At the offset tocentrystart[i] a TOC entry structure is located that contains information including the title, the item's level in the table of contents, whether it is hidden or not, and, most importantly, how many and which slots make up the item. There is also a 'has_children' flag which, if true, means the following entry has a higher level.

Figure 12) TOC (Table of Contents) entries

Index

The index is pretty simple and relies on the table of contents a fair bit. Beginning at file offset indexstart there are nindex structures like the following:

{
     int_8  nword;        // size of name
     int_8  level;        // indent level
     int_8  stuff;
     int_16 toc;          // toc entry number of panel
     char8  word[nword];  // index word [not zero-terminated]
}

Figure 13) Index structure
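Since each entry carries its own length, the index table has to be walked entry by entry. The sketch below assumes the entries are stored back to back with no padding and that values are little-endian; IndexEntry and parseIndex are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {                 // hypothetical in-memory form of Figure 13
    std::uint8_t  level = 0;        // indent level
    std::uint16_t toc   = 0;        // toc entry number of the panel
    std::string   word;             // index word
};

// table points at the indexlen bytes read from file offset indexstart.
std::vector<IndexEntry> parseIndex(const unsigned char *table,
                                   std::size_t indexlen,
                                   std::size_t nindex)
{
    std::vector<IndexEntry> entries;
    std::size_t pos = 0;
    while (entries.size() < nindex && indexlen - pos >= 5) {
        std::size_t nword = table[pos];
        if (indexlen - pos - 5 < nword)
            break;                  // entry would run past the table
        IndexEntry e;
        e.level = table[pos + 1];
        // table[pos + 2] is the 'stuff' byte, which is skipped here
        e.toc = static_cast<std::uint16_t>(table[pos + 3]
                                           | (table[pos + 4] << 8));
        e.word.assign(reinterpret_cast<const char *>(table + pos + 5), nword);
        entries.push_back(e);
        pos += 5 + nword;           // fixed part plus the word itself
    }
    return entries;
}
```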

Bitmaps

I am only going to cover this at a superficial level, and I will only describe the compression used in newer INF files. The older (i.e. v1.3 etc.) INF files use a proprietary compression scheme.

The newer INF files use an LZW-based compression scheme. This scheme is basically the same as the one covered in 'LZW Revisited' (Dr. Dobb's Journal, June 1990). You must alter the decompression code to use MAX_BITS 12 or you will spend a long time figuring out why your output goes all wrong after input byte 512 [grin].

Getting Started

There seems to be no array of bitmap offsets in the file anywhere but there is a general start for the image information - imgstart.

Here's the basic rundown on decompression:

When, during the decoding of a slot, you come across a sequence something like 0xff + 0x07 + 0x0E + 0x01 + 0x00 + 0x00 + 0x00 + 0x00, you know it is an escape code (0xff) of length 7 bytes (0x07), and that the code is for a bitmap/metafile (0x0E). The next byte is the flags byte and it breaks down like this:

if( items[i] & 0x01 ) printf ("Left ");   // 00000001
if( items[i] & 0x02 ) printf ("Right ");  // 00000010
if( items[i] & 0x04 ) printf ("Center "); // 00000100
if( items[i] & 0x08 ) printf ("Fit ");    // 00001000
if( items[i] & 0x10 ) printf ("Runin ");  // 00010000

Figure 14) Alignment for bitmaps

The next four bytes are a 32-bit offset from imgstart to the bitmap/metafile (i.e. you do an is.seekg( imgstart + offset )). Also remember that if you are writing for cross-platform support you'll have to deal with the big-endian/little-endian issue.
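Put together, pulling the flags and the image offset out of such an escape sequence could look like this. The sketch only accepts the exact shape shown above (length 7, code 0x0E) and assumes a little-endian offset; BitmapRef and parseBitmapEscape are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct BitmapRef {              // hypothetical result type
    std::uint8_t  flags  = 0;   // alignment bits, see Figure 14
    std::uint32_t offset = 0;   // offset of the image relative to imgstart
};

// esc points at the 0xff byte inside the slot text;
// avail is the number of bytes left in the text from there on.
bool parseBitmapEscape(const unsigned char *esc, std::size_t avail,
                       BitmapRef &out)
{
    if (avail < 8 || esc[0] != 0xff || esc[1] != 0x07 || esc[2] != 0x0e)
        return false;           // not the bitmap escape described above

    out.flags  = esc[3];
    out.offset =  static_cast<std::uint32_t>(esc[4])
               | (static_cast<std::uint32_t>(esc[5]) << 8)
               | (static_cast<std::uint32_t>(esc[6]) << 16)
               | (static_cast<std::uint32_t>(esc[7]) << 24);
    return true;
}
```

The image itself would then be found at file offset imgstart + out.offset.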

The Bitmap Header and Colour table

If you do the seek and read in the next two bytes you'll be able to know what sort of image comes next. mf means a metafile (I think - I haven't seen one!). BM means the old bitmap compression, and bM is the one that we are happy to see.

So, if everything is OK (ie 'bM') then read in a basic OS2BITMAP_FILEHEADER and OS2BITMAP_INFOHEADER. Something like this:

{ // BITMAP FILE HEADER
     char8  usType[2];   // = 'bM';
     int_32 cbSize;
     int_16 xHotSpot;
     int_16 yHotSpot;
     int_32 offBits;     // =size(hdr)+size(colortbl)
     // BITMAP INFO HEADER
     int_32 cbFix;       // =size(info_hdr) (usually = 12?)
     int_16 cx;          // x size
     int_16 cy;          // y size
     int_16 cPlanes;     // color planes
     int_16 cBitCount;
}

Figure 15) Bitmap information structure

Note that if you are going to use this structure "as is" to dump the image to a bitmap file, you'll have to recompute offBits like this:

offBits = 14 + 12 + ( 3 * ( 1 << cBitCount ) );

Figure 16) Code change to use structure in figure 15

After the header comes the colour table, which is basically an array of ( 1 << cBitCount ) RGB entries (well, actually 1 byte Blue, 1 byte Green, 1 byte Red).
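The header fields of Figure 15 and the offBits formula of Figure 16 can be combined into a few small helpers. This is only a sketch: it assumes the 26-byte packed layout of Figure 15, little-endian values, and a palettised image (small cBitCount such as 1, 4 or 8); all the names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

struct InfBitmapHeader {            // hypothetical name, fields as in Figure 15
    char          usType[2];
    std::uint32_t cbSize, offBits, cbFix;
    std::uint16_t xHotSpot, yHotSpot, cx, cy, cPlanes, cBitCount;
};

static std::uint16_t bmpLE16(const unsigned char *p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t bmpLE32(const unsigned char *p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// p points at the image data found at imgstart + offset.
bool parseBitmapHeader(const unsigned char *p, std::size_t avail,
                       InfBitmapHeader &h)
{
    if (avail < 26)
        return false;               // 14-byte file header + 12-byte info header
    h.usType[0] = static_cast<char>(p[0]);
    h.usType[1] = static_cast<char>(p[1]);
    h.cbSize    = bmpLE32(p + 2);
    h.xHotSpot  = bmpLE16(p + 6);
    h.yHotSpot  = bmpLE16(p + 8);
    h.offBits   = bmpLE32(p + 10);
    h.cbFix     = bmpLE32(p + 14);
    h.cx        = bmpLE16(p + 18);
    h.cy        = bmpLE16(p + 20);
    h.cPlanes   = bmpLE16(p + 22);
    h.cBitCount = bmpLE16(p + 24);
    return true;
}

// 'bM' marks the newer, LZW-based image type.
bool isLzwBitmap(const InfBitmapHeader &h)
{
    return h.usType[0] == 'b' && h.usType[1] == 'M';
}

// Number of 3-byte (Blue, Green, Red) colour table entries.
std::uint32_t colourTableEntries(const InfBitmapHeader &h)
{
    return 1u << h.cBitCount;
}

// offBits for a standalone bitmap file, as in Figure 16.
std::uint32_t standaloneOffBits(const InfBitmapHeader &h)
{
    return 14 + 12 + 3 * colourTableEntries(h);
}
```

For a 16-colour image (cBitCount of 4) this gives 16 colour table entries and an offBits of 74.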

Data Blocks

Next up (after all that) comes the Master Data Block, and one (or more) minor data blocks - each with their own compression type.

{ // Master Data Block
   int_32 num_to_follow;      // total number of bytes to follow
   int_16 uncompressed_bytes; // uncompressed bytes in each block
}
{ // Minor Data Block
   int_16 x_bytes_to_follow;  // number of bytes in this block (to follow)
   int_8  comp_type;          // compression type 0=uncompressed, 2=lzw-based
}

Figure 17) Data block structures

To help with your understanding here is the output from a program I wrote while working out the decompression routine.

[Editor's note - my word processor completely destroyed the alignment that was present in the output, so I feel forced to delete it. My apologies.]
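In place of the lost output, here is a rough sketch of how the block structure of Figure 17 might be walked to list the minor data blocks. It rests on two assumptions that the figure does not settle: that num_to_follow counts every byte after the num_to_follow field itself, and that x_bytes_to_follow counts the comp_type byte plus the block's data. If either turns out to be wrong for real files, the arithmetic needs adjusting. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct MinorBlock {             // hypothetical description of one minor block
    std::uint8_t compType;      // 0 = uncompressed, 2 = LZW-based
    std::size_t  dataOffset;    // offset of the data, relative to p
    std::size_t  dataLen;       // number of data bytes
};

// p points at the Master Data Block; avail is the number of readable bytes.
std::vector<MinorBlock> listMinorBlocks(const unsigned char *p,
                                        std::size_t avail)
{
    std::vector<MinorBlock> blocks;
    if (avail < 6)
        return blocks;

    std::uint32_t numToFollow =  static_cast<std::uint32_t>(p[0])
                              | (static_cast<std::uint32_t>(p[1]) << 8)
                              | (static_cast<std::uint32_t>(p[2]) << 16)
                              | (static_cast<std::uint32_t>(p[3]) << 24);

    // ASSUMPTION: num_to_follow counts all bytes after its own 4 bytes.
    std::size_t end = (numToFollow < avail - 4) ? 4 + numToFollow : avail;
    if (end < 6)
        return blocks;

    std::size_t pos = 6;        // skip num_to_follow and uncompressed_bytes
    while (end - pos >= 3) {
        std::size_t n = static_cast<std::size_t>(p[pos] | (p[pos + 1] << 8));
        // ASSUMPTION: x_bytes_to_follow = comp_type byte + data bytes.
        if (n < 1 || end - pos - 2 < n)
            break;
        blocks.push_back({ p[pos + 2], pos + 3, n - 1 });
        pos += 2 + n;
    }
    return blocks;
}
```

Each block found this way would then be copied (type 0) or passed to the LZW decompressor (type 2).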

Parting words

All my knowledge of the INF file format stemmed from the work of others. I hopefully have added some small bits of useful information and my motivation for writing this article is to make that information available to others.

The document that encouraged me into investigating the INF file format is available at OS/2 2.0 Information Presentation Facility (IPF) Data Format and was authored by Carl Hauser, and updated by Marcus Groeber. I lifted lots of stuff out of it for this article and have included inf03.txt as a slightly updated version.