Feedback Search Top Backward Forward
EDM/2

Building Smaller OS/2 Executables (Part 1)

Written by Pete Cassetta

 
Part1 Part2

Introduction

Not so long ago, programmers used to pride themselves on how tight and fast their code was. RAM was expensive and scarce, so it was necessary to shoehorn as much functionality as possible into the precious memory available. Today, of course, all that has changed. RAM has become cheap and plentiful, and, even more importantly, modern PC operating systems such as OS/2 support virtual memory. This provides a huge address space to play in, with no more than a fairly minor performance hit when programs need more memory than is physically present and free in the system. Now that memory constraints are less of a concern, programmers naturally focus more attention on other issues such as making deadlines, adding functionality, and getting the bugs out (often in that order). If code size is considered at all, it usually lags well behind these other concerns.

Users, on the other hand, pay a lot of attention to code size. Their hard disk space is always at a premium, and they complain loudly about the disk requirements of today's multi-megabyte programs. Also, while large programs may still run, users notice it when programs take a long time to load or force OS/2 to spend too much time swapping over-committed memory to disk. Making your program smaller will help endear it to your users and give it an edge in competing with other programs for the limited disk space on your potential users' computers.

The good news is that OS/2 actually provides programmers with more facilities for producing optimally small executables than perhaps any other PC operating system available today. Although these features often go unused, they are really very quick and easy to implement. This series of articles will detail five simple steps you can take to get your OS/2 executables down to size.

Executable

I will use executable as a cover term to refer to both EXEs and DLLs. While some of my comments also apply to device drivers, these are really beyond the scope I wish to address, so I won't mention them specifically.

Tradeoffs Between Size and Speed

Before I lose the speed demons out there, I'd better reassure you that it's possible to optimize code size without introducing too much of a speed penalty.

Every program has bottlenecks where special attention must be given to execution speed. These fall into two broad categories: I/O operations involving the disk, screen, or to a lesser extent, the printer, and computationally expensive operations such as searching, sorting, or updating data structures. I suggest code involving I/O operations is the place to optimize for size. The reason for this is that OS/2 programs tend to do I/O at a fairly high level, thanks to the rich set of I/O facilities OS/2 provides via its Dos, Gpi, Win, and other API functions. To optimize the speed of your I/O operations under OS/2 you usually need to improve your algorithms and possibly also the selection and sequence of API calls you use. Once you've done this, you are generally free to optimize the code you've written for size, since any speed degradation this introduces will be negligible. After all, only a minuscule portion of the time required by an I/O operation will be spent in your code; the significant time is spent within OS/2's system code.

Keeping this in mind, it is helpful to organize your source code into modules which will be optimized for size and others where you'll go all-out for speed. This makes it relatively easy to select different compiler options for these two categories of code.

Finally, while optimizing for size and speed are often mutually exclusive goals, there are a number of things you can do that will actually improve both size and speed. As I discuss each step below, I'll detail the side effects on execution speed that you're likely to encounter.

Five Simple Steps

Shown below are five steps you can take to reduce executable size. They are listed according to the order in which you should take them, especially if your time is limited. Earlier steps are both quicker to implement and more likely to show significant results for your effort.

  1. Compress Your Resources
  2. Put Your Linker to Work
  3. Experiment with Compiler Switches
  4. Try a Different Compiler
  5. Use API Wrapper Functions

I'll cover steps 1 and 2 in this article, and leave steps 3 - 5 for next time.

At certain points I'll need to apply my comments to specific compiler packages. I own three OS/2 compilers, Borland C++ 1.5, IBM C Set++ 2.0 (CSD level 9), and Watcom C/C++ 10.0, so these are the ones I'll cover. My apologies if your compiler isn't included.

Step 1 - Compress Your Resources

The OS/2 resource compiler has a little-known switch, -x, which enables compression of resources as they are being bound to the executable. The compression is basically run-length encoding, so it gives best results for resource types that tend to have repeating sequences of data, e.g. bitmaps, icons, pointers, and fonts. If you haven't been using this switch, I encourage you to take a break right now and try it out on one of your programs. Simply re-bind the resources using a command like the following:


     rc -x myprog.res myprog.exe
You may have noticed that this took longer than usual; resource binding is much slower when the -x switch is used. For this reason, you may want to reserve this switch for making release versions of your executables.

Even Better Under Warp

In OS/2 Warp, IBM has made a good thing better by adding a second compression algorithm for resources. The Warp toolkit comes with version 2.02.001 of the resource compiler. This version accepts the switch -x2, which enables both run-length encoding and the new algorithm, which I expect is a form of Lempel-Ziv-Welch. Note that you can still use -x, or its new synonym -x1, if you want run-length encoding only. The bad news is that versions of OS/2 prior to Warp don't know how to load resources bound with - x2. If you want your program to run under OS/2 2.x, you'll have to stick with -x1 for now.

Compression Schemes

Run-length encoding is a simple scheme whereby a sequence of identical data units is replaced by a flag, a count and then one copy of the data unit. So a sequence like "00 00 00 00 00 00" is compressed to "Flag 06 00". This can be quite effective for compressing graphical data, but rarely gives any significant savings for text.

Lempel-Ziv-Welch is an algorithm that replaces repeated strings by shorter tokens. For example, the string " rep" occurs twice in the last sentence so it would be a good candidate for replacement by a token. This algorithm gives very good results for compressing a variety of data types. Variants of it are used in commercial products like PKZIP, Stacker and many others.

A Look at the Results

The following chart shows the results I obtained by using the -x1 and -x2 switches while binding resources to a variety of executables:

Executable Origin Original Size Size with -x1 Size with -x2
I495.EXE EDM 2-7 129,792 68,448 (53%) 50,752 (39%)
BPMCC.DLL Borland C++ 1.5 99,456 84,608 (85%) 64,640 (65%)
BCRES.DLL Borland C++ 1.5 166,042 153,242 (92%) 114,330 (69%)
FPWCAT.DLL OS/2 Warp 237,901 155,581 (65%) 124,573 (52%)
MAHJONGG.EXE OS/2 Warp 620,731 562,139 (91%) 449,199 (72%)
OS2-CIM.EXE OS/2 Warp 1,275,350 1,145,962 (90%) 1,023,670 (80%)
Figure 1 - Effect of compressing resources on the size of various executables.

The first executable, I495.EXE, was supplied as sample code to an accompanying EDM/2 article, so it's not too surprising that it didn't use this switch. But note that all the others were taken from commercial products! I've found that most OS/2 products are released with uncompressed resources, needlessly wasting space on your hard disk and mine.

As a disclaimer, I should mention that I chose these particular executables because they contain a lot of resources. The savings is far less spectacular in executables which have fewer resources. Also, you may be wondering where I got the RES files for all these executables. I had source code for I495.EXE, but for each of the others, I used Borland's Resource Workshop to extract the resources and then saved them as a RES file. I generated new executables with commands like the following:


     rc -x2 bpmcc.res bpmcc.dll

Tradeoffs and Limitations

Compressed resources do add run-time overhead, since OS/2 must decompress the resources as they are loaded. Keep in mind, however, that the time spent in decompression is at least partly offset by the time savings of reading fewer bytes from disk. In effect, using compressed resources shifts some of the run-time burden from I/O (reading resources from disk) to processor (decompression). Since processor speed is improving much more rapidly than hard disk speed these days, this seems to be a sensible tradeoff. In my own informal testing, I've found that programs with compressed resources tend to load up to 5% slower in a 386 machine, but they often load quicker in a machine with a faster processor.

You should be aware that compressing your resources saves nothing but disk space; once loaded, the resources still consume the same amount of RAM as they would otherwise. Nevertheless, the disk space savings can be considerable, and I see no reason why every OS/2 executable shouldn't be built with compressed resources before release.

Step 2 - Put Your Linker to Work

Having done all we can with resources, let's turn our attention to the link step. First I'll describe the four types of optimizations that can be done during a link, then I'll look at how specific linkers support these optimizations.

Aligning Pages of Code and Data

In OS/2 executables, every page of code or data begins at an offset from the beginning of the file that is an even multiple of the page alignment. The alignment must be a power of two, and many linkers use 512 as the default. 512 is used because this is the size of a disk sector, and starting pages on sector boundaries ensures that OS/2 doesn't need to read any more sectors than necessary to load a page. (A sector is the smallest unit of disk space OS/2 can read.) To see how this works, consider what happens when OS/2 must load a 32-byte page that begins at file offset 1000. In this case, two sectors must be read: the one which begins at offset 512 and contains the first 24 bytes of the page, and the one which begins at offset 1024 and contains the final 8 bytes. Note that if this page were to begin at offset 1024, then OS/2 could load it with a single sector read.

The downside here is that the linker must pad pages with null bytes in order to make their length an even multiple of the page alignment. As a result, larger alignment values increase executable size. If the above example were linked with a page alignment of 512, then the 32-byte page would need 512 - 32, or 480 bytes of padding. So you see the tradeoff: linking with a smaller page alignment reduces executable size, but it may also cause more pages to span sector boundaries, increasing load time. Also, keep in mind that using a smaller page alignment saves disk space only; it has no effect on the amount of memory your program will need once loaded.

Eliminating Internal Fixups in EXEs

For every function call made in a program, the linker must somehow supply the function's address. Normally, the address isn't known at link time, so the linker creates what is known as a relocation record, or fixup for short. Fixups come in two flavors: internal and external. Internal fixups reference functions in the EXE or DLL being linked, while external fixups reference functions in external DLLs. Each fixup is resolved at load time, which means that the call statement is patched with the linear run-time address of the function being called. You should minimize the number of fixups whenever possible, because they are bulky and time-consuming to resolve.

As it turns out, internal fixups aren't really necessary in EXEs. This is because OS/2 always loads programs at a starting address of 64K. If you tell the linker that your program will load at a starting address of 64K, it can determine the run-time addresses of all functions local to that EXE, and supply these addresses directly instead of creating fixups. Note that this applies only to EXEs; there is no way to reliably predict the load address of an OS/2 DLL at link time.

I've found that when internal fixups are eliminated, most EXEs shrink by 5 to 15% and load perceptibly quicker. Yet a surprising number of commercial OS/2 products contain EXEs with internal fixups. Here are a couple of examples:

Product EXEs with Internal Fixup
Borland C++ 1.5 BRCC.EXE, IMPDEF.EXE, MAKE.EXE, WORKSHOP.EXE, ...
OS/2 Warp FPWPIM.EXE, GOPHER.EXE, MAHJONGG.EXE, TELNETPM.EXE, ...
Figure 2 - Examples of OS/2 executables which contain internal fixups.

To check whether an EXE contains internal fixups, you can use either IBM's EXEHDR or Borland's TDUMP. If a program has no internal fixups, you'll see the following lines near the beginning of EXEHDR's display:


Module type:   Program
	    NO internal fixups in executable image
When internal fixups are present, the second line is omitted. If you run TDUMP on a program that has no internal fixups, you'll see something similar to the following:

Module flags			     00000312
     Instance init		: No
     Internal fixups removed	     : Yes
When internal fixups are present, the "Yes" is changed to a "No".

Chaining Fixups

As mentioned above, every time you call a function whose address isn't known at link time, the linker creates a fixup. So if your program makes 25 calls to WinSendMsg, the linker will create 25 fixups that all reference WinSendMsg in PMWIN.DLL. This is rather redundant, since much of the information in these 25 fixups is identical.

Fixup chaining is a very elegant optimization that addresses this redundancy. Instead of generating 25 fixups, the linker just generates one fixup and creates a linked list to all the places in the executable where this fixup must be applied. This reduces the size of your executable, and also speeds up loading. The time savings comes about because the loader only needs to determine the address of the function once; it then traverses the list and applies this address at each node on the list.

Compressing Code and Data

Earlier we looked at resource compression. In a similar vein, OS/2 supports compression of code and data, which is done during the link step. OS/2 2.x supports run-length encoding only, but OS/2 Warp also supports a newer scheme that usually yields much better results.

The compression tradeoffs discussed in relation to resources apply here as well. Programs with compressed code and data may load a bit slower in a 386 machine, but often load faster in machines with faster processors. Also, compressing your code and data saves disk space only; it doesn't reduce your program's RAM requirements.

Applying Linker Optimizations

Now I'll take a look at various linkers. For each one, I'll mention which optimizations are supported, and how to enable or disable these optimizations.

TLINK (Borland C++)

Borland's linker, TLINK, has three switches that help reduce the size of a generated executable: /A:dd, /B:0x10000, and /Oc. /A:dd specifies the page alignment for code and data. Remember that the alignment must be a power of two, and dd specifies which power of two you wish to use. For EXEs, /B:0x10000 specifies a load address of 64K (0x10000 in hexadecimal), eliminating all internal fixups. /Oc enables fixup chaining, which is disabled by default. Unfortunately, TLINK doesn't support compression of code or data.

To see how these options affect the size of executables, I've linked the "Clock" sample program from Borland C++ 1.5, using a variety of switches:

Alignment Size Size with
/B:0x10000
Size with
/B:0x10000 /Oc
/A:1 89,350 76,820 75,388
/A:2 89,356 76,824 75,392
/A:4 89,412 76,884 75,444
/A:9 92,724 79,924 78,388
Figure 3 - Effect of various TLINK switches on the size of CLOCK.EXE.

This table has a row for each alignment I tried, and columns for various combinations of the /B:0x10000 and /Oc switches. Recent recommendations I've seen from IBM call for using an alignment of 2 or 4 bytes. Since there's not much difference between the two, I generally use 4 bytes. I included the 16 byte row for comparison purposes, since this value is commonly used by developers. As you can see from the final column, linking with /B:0x10000 /Oc gives a significant savings in size; these switches are highly recommended for release builds.

LINK386 (IBM C Set++)

IBM's linker, LINK386, has four switches that can help you reduce your code size: /A[LIGNMENT]:n, /BAS[E]:0x10000, /E[XEPACK:{1|2}], and /NOS[ECTORALIGNCODE]. The portion of each command which is enclosed in square brackets is optional; I'll use the shorter forms from now on.

The first of these, /A:n, lets you specify the alignment for code and data pages. The parameter n is the desired alignment in bytes, e.g. 4, 16, etc. LINK386 uses 512 as its default alignment when /A:n isn't specified.

I'll discuss /NOS next, since it relates to code alignment. This switch is new, first appearing in LINK386 version 2.02.001, which comes with the Warp toolkit. Previous versions of LINK386 used the /A:n switch for aligning both code and data pages. With version 2.02.001 (and presumably later versions) of LINK386, /A:n affects only data pages; pages of code are always aligned on 512 byte boundaries so that they will begin on sector boundaries. To override this new behavior, you can supply the /NOS switch. This forces LINK386 to use the value given with /A:n for both code and data pages, just as earlier versions of LINK386 did. Note that this switch has no affect when you use an alignment of 512.

Since IBM has changed the default behavior of LINK386, you might imagine they feel pretty strongly about keeping pages of code aligned on sector boundaries. This seems to be the case; all recent build recommendations I've seen from IBM suggest omitting the /NOS switch. Remember that code pages are discardable, so they may be loaded and reloaded a large number of times during a single execution of your program. I guess IBM feels it is very important to optimize this load process, and keeping pages aligned on disk sectors is a help.

For EXEs, /BAS:0x10000 specifies a load address of 64K, eliminating internal fixups.

Before I discuss the final switch, I should mention that fixup chaining is always enabled in LINK386; no switches are provided to disable it.

The final switch, /E:{1|2}, allows you to compress your code and data. /E uses run-length encoding for the compression, and this option has been present in all versions of LINK386. Version 2.02.001 of LINK386 allows you to suffix this option with a number. /E:1 is a synonym for /E; /E:2 enables both run-length encoding and the new compression scheme.

To see how these options affect the size of executables, I've linked the "Clock" sample program from the 2.x toolkit, using a variety of switches (/BAS:0x10000 was used for all links):

Alignment Size Size with /NOS Size with /E:1 Size with /E:1 /NOS Size with /E:2 Size with /E:2 /NOS
/A:2 74,064 73,814 73,552 73,064 56,302 52,828
/A:4 74,068 73,820 73,556 73,072 56,308 52,848
/A:16 74,116 73,876 73,604 73,124 56,356 52,980
/A:512 76,852 76,852 76,340 76,340 59,444 59,444
Figure 4 - Effect of LINK386 switches on the size of CLOCK.EXE.

This table has a row for each alignment I tried, and columns which show various combinations of the /E:{1|2} and /NOS switches. By scanning across the bottom row, you'll see that /NOS indeed has no effect with an alignment of 512. Notice that /NOS makes the most difference when used with the /E:2 switch; it has a much smaller impact elsewhere.

WLINK (Watcom C/C++)

Watcom's linker, WLINK, provides two options that affect executable size. The Alignment option lets you specify page alignment, and the Offset option lets you specify the load address for EXEs. Unlike the other linkers I've discussed, WLINK's defaults produce optimally small executables. You won't be able to reduce executable size any further by overriding these defaults. Unfortunately, WLINK doesn't support fixup chaining or compression of code and data.

To Be Continued...

So far I've shown that judicious selection of switches during linking and resource binding can reduce executable size significantly. Next time I'll look at the compile step and also discuss a simple source code modification that can produce further savings.

In the meantime, if you haven't been using the techniques I've discussed in this article, I encourage you to give them a try. I'd enjoy hearing how much savings they give you. I'll also welcome any other comments, questions, or suggestions you might have on this topic.

Part1 Part2