EDM/2

Building Smaller OS/2 Executables (Part 2)

Written by Pete Cassetta

 

Introduction

This article will conclude a two-part series on making your OS/2 executables as small as possible.

Executable

In this article, I'll continue to use executable as a cover term to refer to both EXEs and DLLs. While some of my comments also apply to device drivers, these are really beyond the scope I wish to address, so I won't mention them specifically.

A Quick Recap

Part 1 of this series appeared in Volume 3, Issue 3 (March 1995) of EDM/2. I began by providing some motivation for optimizing executable size, and then presented the following list of steps that can be taken toward this end:

  • Compress Your Resources
  • Put Your Linker to Work
  • Experiment with Compiler Switches
  • Try a Different Compiler
  • Use API Wrapper Functions

Steps 1 and 2 were covered in detail, while the final three steps were left for this article.

If you have not already done so, I recommend that you read Part 1 before continuing on with this article. While this is not absolutely necessary, it may make it easier to follow my discussion of Steps 4 and 5. Also, the above steps are shown in the order in which they should be taken, especially if your time is limited. Earlier steps are both quicker to implement and more likely to show significant results for your effort, so you won't want to skip the first two steps.

Errata

In Part 1 of this series, a production problem made Figures 3 and 4 very hard to understand. These figures show how various linker options affect executable size. Each figure contains four rows, which correspond to linker alignments of 2, 4, 16, and 512 bytes.
In Figure 3, the first column should have contained the values /A:1, /A:2, /A:4, and /A:9.
In Figure 4, the first column should have contained /A:2, /A:4, /A:16, and /A:512.

A Note about Compilers

At certain points in this article I'll apply my comments to specific compiler packages. I own three OS/2 compilers: Borland C++ 1.5, IBM C Set++ 2.0 (CSD level 9), and Watcom C/C++ 10.0, so these are the ones I'll cover. My apologies if your compiler isn't included.

Step 3 - Experiment with Compiler Switches

If you're like most programmers, you have probably monkeyed around with compiler switches at one time or another to try to optimize the size or speed of your programs. This takes very little effort, and careful selection of compiler switches usually yields a modest reduction in code size over what the default switches produce. As I mentioned in the first article of this series, it is a good idea to separate your code into modules which are optimized for size and others where you will focus on speed. In the following sections I'll discuss selection of compiler switches for modules where you are optimizing for size. I'll start with a few general considerations, and then look at switch selection for specific compilers.
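One convenient way to arrange this split is in your make file, compiling most modules with size-oriented switches and only the speed-critical ones differently. Here is a minimal sketch using IBM C Set++ switch names from this article; the file names, variable names, and the speed-side switches are hypothetical, so substitute your own compiler's flags:

```makefile
# Hypothetical fragment: optimize most modules for size,
# but compile the speed-critical module with its own flags.
CC         = icc
CFLAGS     = -c -Ss
SIZEFLAGS  = -O+ -Ol+       # size-oriented switches (see Figure 2)
SPEEDFLAGS = -O+            # put your speed-oriented switches here

# Default rule: optimize for size.
.c.obj:
        $(CC) $(CFLAGS) $(SIZEFLAGS) $<

# Exception: render.c (hypothetical) is speed-critical.
render.obj: render.c
        $(CC) $(CFLAGS) $(SPEEDFLAGS) render.c
```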

Remove All Debugging Information

Although it's obvious that you should omit debugging information in your release build, it is easy enough to forget to remove some, or even all, of it. Make sure you turn off generation of symbolic information, stack overflow checking, profiling hooks, and anything else that is irrelevant to users of your release build.

386, 486, or Pentium?

Many OS/2 compilers allow you to specify whether to optimize for the 386, 486, or Pentium processor. This is basically a choice of which processor your code will run most efficiently on; it will always run on all of these processors regardless of which one you optimize for.

In general, you'll get the smallest code size by optimizing for the 386. The 486 and Pentium can perform certain optimizations when fed a stream of small, simple instructions, so when you optimize for one of these processors the compiler prefers to generate longer sequences of simpler instructions rather than shorter sequences of more complex instructions. As a result, code optimized for the 486 or Pentium is almost always bulkier than code optimized for the 386.

Optional Language Features

Most compilers have switches that conditionally enable various language or run-time library features. This is especially common for floating point math and for C++. For example, the -xs switch of Watcom C/C++ 10.0 enables C++ exception handling. You'll notice that your executable becomes larger when you use this switch, but without it, you can't compile a C++ program that uses exception handling. In general, this class of switch lets you choose only the features you need for your program, so that you don't incur any unnecessary overhead for unused features. As I discuss various compilers below, I won't cover switches of this type. But the rule of thumb is simple: enable the features you need but nothing more.

Selecting Switches for Various Compilers

Now let's take a look at specific compilers. For each one, I'll provide a list of switches that tend to provide pretty small code. Note that this isn't an exact science; there usually isn't one set of switches that works best for all programs.

Borland C++

The following switches produce code that is about as small as possible under Borland C++:

Switch Description
-d Merge duplicate strings
-k- Disable generation of standard stack frame
-Oa Assume no aliasing
-Ob Eliminate stores into dead variables
-Oc Enable local optimization of code blocks with a single entry/exit
-Oe Enable global register allocation and data flow analysis
-Oi Enable inlining of intrinsic functions
-Os Minimize code size
-Oz Enable function-level optimizations
Figure 1 - Borland C++ switches for producing optimally small code.

Note that Borland also provides the -O1 switch as a shortcut; it is equivalent to -Obces. Also, the -Oa switch isn't safe for all programs; it causes errors in code which uses pointer aliasing. Finally, -Oi reduces code size only when each intrinsic function is called a very small number of times. If your program makes many calls to certain intrinsic functions, you'll usually get smaller code by turning off inlining for those functions. This can be accomplished by means of the intrinsic pragma; for example, to turn off inlining for strcat you can use:


     #pragma intrinsic -strcat

IBM C Set++

The following switches produce code that is about as small as possible under C Set++:

Switch Description
-O+ Turn on optimization
-Ol+ Pass code through the intermediate code linker
Figure 2 - C Set++ switches for producing optimally small code.

The -Ol+ switch isn't well-documented, so I'm a bit fuzzy on just what the intermediate code linker does. At any rate, this switch usually reduces code size, but sometimes it actually increases it. Notice that very few switches need to be specified to optimize for size; most C Set++ switches have default values which produce optimally small code.

Watcom C/C++

The following switches produce code that is about as small as possible under Watcom C/C++:

Switch Description
-3r Target 386 processor, register-based calling
-oa Relax alias checking
-oi Intrinsic functions
-ol Loop optimizations
-ol+ Loop unrolling
-os Space optimizations
-s Disable stack depth checking
Figure 3 - Watcom C++ switches for producing optimally small code.

Watcom C/C++ seems to put a lot of weight on the -os switch. While some of the other switches listed above might be expected to increase code size, such as -ol and -ol+, the compiler is smart enough to apply these optimizations selectively so as to make sure they don't increase code size when -os is used.

As I mentioned for Borland C++, -oa isn't safe for all programs; it causes errors in code which uses pointer aliasing. Also, -oi reduces code size only when each intrinsic function is called a very small number of times. If your program makes many calls to certain intrinsic functions, you'll usually get smaller code by turning off inlining for those functions. This can be accomplished by means of the function pragma; for example, to turn off inlining for strcat you can use:


     #pragma function(strcat);

Step 4 - Try a Different Compiler

There are about a half dozen excellent C++ compilers available for OS/2. Each shines in certain areas and comes up short in other areas, so comparative reviews never seem to find one OS/2 compiler that does it all.

Using More than One

I find it beneficial to set up each project with a variety of compilers. This allows me to use the most appropriate compiler for each stage of development. Often where one product comes up short, another does quite well. For example, I use Borland C++ rather heavily while editing and compiling, because I like its integrated development environment, and because it compiles very quickly. I also use Borland's Resource Workshop quite heavily, and feel it has been worth the price of the entire product for me. When I need to do some serious debugging, however, I usually recompile my project under IBM C/Set++ so that I can use IPMD, my favorite debugger. For release builds, I use whichever compiler produces the most efficient code for the given project. This is usually C/Set++ or Watcom C/C++.

Not for Everybody

While I highly recommend using multiple compilers, this isn't practical for everybody so I might as well say so before some of you conclude I'm out of touch. Some common objections include budget constraints, working for a company that has standardized on a certain compiler, or use of a C++ class library or other add-on that doesn't support multiple compilers. As you'll see below, however, significant savings in code size can often be achieved simply by rebuilding your project under a different compiler.

A Hands-On Test

To show how different compilers measure up in terms of code size, I built the "Clock" and "Clipboard" sample programs with each of the OS/2 C++ compilers I own. I chose these two programs because they come with all my compilers, and because they show that no one compiler produces the smallest code for all projects. Figure 4 shows the results I obtained when building these programs:
Product CLOCK.EXE (2.x) CLOCK.EXE (Warp) CLIPBRD.EXE (2.x) CLIPBRD.EXE (Warp)
Borland C++ 74,467 70,579 51,180 43,981
IBM C/Set++ 70,345 48,268 33,246 23,579
Watcom C/C++ 48,462 44,574 36,115 28,915
Figure 4 - Size of CLOCK.EXE and CLIPBRD.EXE when built with various compilers.

This table has one row for each compiler, and one column for each program and target version of OS/2. Columns marked with "(2.x)" indicate that the executable will run under OS/2 2.x or later; those marked with "(Warp)" will run only under OS/2 Warp (or later).

All builds were optimized for size without any regard to how this might affect execution speed. Builds targeting OS/2 2.x used the resource compiler switch -x; those targeting Warp used -x2. Figure 5 shows the linker switches that I used:

Product Targeting OS/2 2.x Targeting Warp
Borland C++ /A:0 /B:0x10000 /c /Oc /A:0 /B:0x10000 /c /Oc
IBM C/Set++ /A:1 /BAS:0x10000 /E:1 /NOS /A:1 /BAS:0x10000 /E:2 /NOS
Watcom C/C++ (none) (none)
Figure 5 - Linker switches used for the builds of Figure 4.

Note that no special linker switches were needed to minimize code size with Watcom C/C++. The same linker switches were used for all builds under Borland C++. For IBM C/Set++, the Warp-only builds used /E:2 to take advantage of the new compression algorithm for code and data.

Figure 6 shows the compiler switches I used:

Product CLOCK.EXE CLIPBRD.EXE
Borland C++ -c -d -k- -Oabceisz -sm -c -d -k- -Oabceisz
IBM C/Set++ -c -Gm+ -O+ -Ol+ -Ss -c -O+ -Ol+ -Rn -Ss
Watcom C/C++ -3r -oails -ol+ -s -3r -oails -ol+ -s
Figure 6 - Compiler switches used for the builds of Figure 4.

The compiler switches reveal a major difference between the two programs that were built: CLOCK.EXE is multithreaded, but CLIPBRD.EXE is not. The switch -sm selects multithreaded run-time libraries in Borland C++, while -Gm+ serves this purpose in IBM C/Set++. Finally, note the use of -Rn when building CLIPBRD.EXE with IBM C/Set++; this switch selects what IBM calls a subsystem run-time environment. Most of the overhead of the normal C run-time environment is eliminated when you use this switch, giving a smaller executable. Unfortunately, this comes with many restrictions, the most significant being that you can't create multiple threads in such an executable.

Strengths and Weaknesses

A few brief comments are needed to interpret the results of Figure 4.

Borland C++ 1.5 produces relatively large executables. The main problem is that about 18K of C++ exception handling machinery is linked into all executables, whether it is needed or not. Even programs like CLOCK.EXE and CLIPBRD.EXE which use straight C get this overhead. Borland tells me that this situation has been fixed with version 2.0, but I haven't upgraded yet so I can't confirm this.

IBM C/Set++ 2.0 usually produces the smallest executables for simple programs like CLIPBRD.EXE that can use its subsystem run-time environment. This mostly applies to command-line utilities, very simple PM programs, and small DLLs. For multithreaded programs, C/Set++ tends to include a lot of run-time library code, making the executables rather large. C/Set++ uses LINK386, which is the most full-featured linker available for OS/2. It can compress code and data with the efficient algorithm newly added for Warp, so it often produces the smallest executables when you're targeting Warp.

Watcom C/C++ 10.0 produces remarkably small executables, thanks to its efficient code generator and its tendency to link in very little overhead from the run-time library. Watcom's linker is very primitive, however, which makes it even more remarkable that this product frequently generates executables that are smaller than those built with much more capable linkers.

Keep in mind that these sample programs are very simple; they do not use floating-point math or C++, and in general you can't easily extrapolate the code size results presented in Figure 4 to other programs. The main point of this exercise is just to show how much executable size can vary when you rebuild a project with a different compiler product.

Step 5 - Use API Wrapper Functions

The techniques I've discussed so far have been focused on the build process. In general, these have involved editing make files or changing some IDE settings and rebuilding. Several additional optimizations are possible if you are willing to make modifications to your source code. In this section I'll discuss just one of them: using API wrapper functions.

Writing a Wrapper Function

Certain API functions are used quite heavily by typical OS/2 programs. For example, many large PM programs contain well over 100 calls to WinSendMsg. If you have a program like this you'll find that you can reduce its size by writing a small "wrapper" function that does nothing but call WinSendMsg and return the result, and then replacing all calls to WinSendMsg by calls to your wrapper function. You'll need to name this function something other than WinSendMsg; MyWinSendMsg, SendMsg, or _WinSendMsg would all work. Here's what such a wrapper function looks like:


MRESULT  _WinSendMsg(HWND hwnd, ULONG msg, MPARAM  mp1, MPARAM mp2) {
   return (WinSendMsg(hwnd, msg, mp1, mp2));
}

The most transparent way to avoid the naming conflict is to create a source module which contains all your wrapper functions but nothing else. Then define macros like the following, and include these macros in all modules except the one containing the wrapper functions:


#define WinSendMsg _WinSendMsg

Why Wrapper Functions Reduce Executable Size

There are three reasons why wrapper functions reduce executable size.

First, calling an API function always adds an external fixup to your executable. Calling a wrapper function adds an internal fixup if your executable is a DLL, or no fixup at all if it's an EXE. Since an internal fixup is smaller than an external one, a call to a wrapper function always increases the executable size less than a direct call to an API function.

Second, your wrapper function can use a more efficient calling convention than that used for API calls. As a result, calls to your wrapper function will be more compact than direct calls to the API function. The more efficient calling convention is enabled by default in all the OS/2 compilers I'm familiar with, though you can also select it explicitly by declaring wrapper functions with __stdcall under Borland C++ and Watcom C/C++, or _Optlink under IBM C/Set++.

Finally, wrapper functions can often take fewer parameters than the API function. In this situation, the wrapper needs to supply default values for any omitted parameters when it calls the API function. For example, consider the following wrapper for WinQuerySysValue:


LONG _WinQuerySysValue(LONG iSysValue) {
   return (WinQuerySysValue(HWND_DESKTOP, iSysValue));
}

Note that the wrapper supplies HWND_DESKTOP as the first parameter to WinQuerySysValue. Another situation where parameters can be reduced is in pseudo-functions like WinDeleteLboxItem which are declared as macros in the toolkit header files. If you look at the definition for this macro, you'll find that it takes two parameters, but expands into a four-parameter call to WinSendMsg:


#define WinDeleteLboxItem(hwndLbox, index) \
   ((LONG)WinSendMsg(hwndLbox, LM_DELETEITEM, MPFROMLONG(index), \
   (MPARAM)NULL))

If your program uses this macro a lot, it would make sense to write a wrapper function for it:


LONG _WinDeleteLboxItem(HWND hwndLbox, LONG index) {
   return (WinDeleteLboxItem(hwndLbox, index));
}

Note that if you want to use the preprocessor to automatically convert references to WinDeleteLboxItem to _WinDeleteLboxItem, you'll first need to undefine this macro:


#undef  WinDeleteLboxItem
#define WinDeleteLboxItem _WinDeleteLboxItem

How Much Savings?

The amount of savings you'll get from writing a wrapper around an API function varies. The more your program calls the API function, and the more parameters your wrapper eliminates, the more savings you'll see. Also keep in mind that your wrapper function itself will add to the code size, so you probably won't see any savings at all unless your program calls the API function at least three or four times.

A few simple experiments using CLIPBRD.EXE and Borland C++ indicate that each call to the wrapper _WinSendMsg yields 4 bytes in savings as a result of using the more efficient calling convention, and an additional 7 bytes in savings from eliminating the external fixup (less when fixup chaining is enabled). All parameters to WinSendMsg are needed by the wrapper, so no parameter reduction is possible here. This program has eight calls to WinSendMsg, and the fully optimized executable ends up 21 bytes smaller when using a wrapper for this API function.

Other Advantages

In addition to modest improvements in executable size, there are at least two other advantages to using API wrapper functions. First, they help your program load more quickly. It is time-consuming for OS/2 to resolve hundreds or even thousands of fixups while loading a large program, but this number can be reduced dramatically by using wrapper functions. Second, wrapper functions are a convenient place to insert debugging hooks. For example, WinGetPS and WinReleasePS wrappers can maintain a list of handles to active presentation spaces. At times when your program knows there should be no active presentation spaces, it can check this list to make sure that none are still hanging around.

Disadvantages and Limitations

Although wrapper functions can be quite useful, they do take some time to set up. You first need to identify API functions that are called frequently by your program, and then write wrappers and make sure these are called in place of the API functions themselves.

Of course, using wrappers adds some overhead when calling API functions. In most cases this additional overhead is negligible, but in speed critical code that calls relatively quick API functions like WinNextChar, you may want to call the API functions directly to avoid the overhead.

Conclusion

OS/2 programs have somewhat of a reputation for being fat. This is unfortunate, because as I mentioned last time, OS/2 provides programmers with more facilities for producing optimally small executables than perhaps any other PC operating system available today. But these facilities don't do any good unless we use them, and I hope this series of articles has helped motivate and equip you to do just that.

As you implement the techniques I've discussed, I'd enjoy hearing how much savings they give you. I'll also welcome any other comments, questions, or suggestions you might have on this topic.
