EDM/2 - Reducing the Code Load

Reducing the Code Load

Introduction

When it comes to improving application performance the optimization of code layouts is usually at the bottom of a developer's list. More often it does not appear on the list at all.
The surprising fact is that by changing the location of code both between and within executable files developers can reduce the memory used to store application code by up to 50% and gain a significant performance boost in low memory conditions.
This article describes how application code enters memory, the performance issues involved and details some techniques and a new development tool designed to optimize the process.
Loading Code

OS/2 v2.0 introduced fundamental changes to the operating system. For many developers the most important change was the ability to execute 32-bit flat memory model code. But v2.0 introduced another fundamental change, the redesign of the virtual memory system to take advantage of the built in paging abilities of the now required Intel 80386 and later processors.
When applications reference code not currently in memory the processor issues an exception. The operating system handles this exception, locates the required code on disk and loads it into memory. Control is returned to the application which continues to execute unaware of the endeavors performed on its behalf. For this operation to be performed efficiently code must be stored in a file format designed for the purpose.

Figure 1: "New Executable" file format.
OS/2 v1.x memory management was segment based. Code and data were transferred between memory and disk as variable length segments up to 64Kb in size. The "New Executable" (NE) file format used at that time reflected this need with code stored within the file in preformed segments. A segment table provided an index identifying the size and location of each segment. When a code segment was referenced that was not already in memory the loading mechanism translated the code selector to a segment table index. The segment was loaded into memory and the following relocation table used to apply relocations or fixups as they are often known.
With the introduction of "flat" 32 bit code a typical application transformed from a collection of 16 bit code segments to 32 bit code designed to run in a single code segment. If segment based swapping were retained the virtual memory manager would be forced to load and unload entire applications in a single operation. Not only would segment based swapping be impractical and grossly inefficient, the 4Gb limited segments might be larger than available RAM. As a result OS/2 v2.0 changed the fundamental unit of transfer of the virtual memory manager from a variable sized segment to a fixed 4Kb "page" of memory. The Intel 80386 and later processors contain built in memory management designed to allow the efficient management of 4Kb pages.
The segment based NE file does not allow for the efficient loading of 4Kb pages. A new executable file format was needed with page based internal indexing. The "Linear eXecutable" (LX) file format was born.

Figure 2: "Linear eXecutable" file format
Now when an attempt is made to access an instruction that is not in memory the page fault handler uses the index from the LX header to locate and read the page from the executable file. A similar index into the fixup record table identifies all the relocations that need to be applied to the page.
All application code within LX files is loaded on demand. When an EXE or DLL is "loaded" no application code is read, only when an attempt is made to execute an instruction is the containing page read into memory. This applies to all versions of OS/2 since v2.0, including Warp. You may have heard about code loading optimization techniques involving transfers of code pages to the swapfile; forget them, Warp reserves such techniques only for its system DLLs.
Most modern processors work with some form of cache for instructions. Our mental image of executing software often ignores the cache; we think of the CPU as fetching instructions from main memory. The cache is considered, and often is, part of the processor. When software executes in a virtual memory environment perhaps the most accurate view is of a CPU fetching each instruction directly from disk. This potentially performance crippling process has the benefit of a large cache memory (your system RAM) and a cache controller called OS/2.
Performance Effects

Modern superscalar pipelined processors live in a world limited by propagation delays and pipeline "bubbles". Processor designers strive to ensure that all stages of a pipeline are kept busy for maximum efficiency. If an instruction or data for the input stage is unavailable the pipeline continues to operate with an empty stage or bubble in place of a useful operation. On a 100MHz processor each unused stage represents 10 nanoseconds of wasted processing time. The recently announced Intel P6 will attempt to prefetch up to 30 instructions in advance of the current instruction pointer in an effort to keep the pipelines fed and avoid the formation of bubbles.
Compare this with an attempt to execute an instruction that is not in memory. The processor detects the reference to unavailable memory and issues an exception, a page fault. OS/2 handles the exception and locates the page in the executable file. The resulting disk access carries disk seek and rotation time penalties. After this delay the data is transferred to the disk controller and ultimately into memory. With the code now loaded, the suspended thread can resume execution.
When a modern processor hits a page fault we are transported from a nanosecond world bordering on quantum mechanics to one which relies on the physical movement of a metal stick with a magnetic head on the end. The performance impact is quite literally the computing equivalent of instantaneously stopping a 200MPH Ferrari and asking your granny to get out and push.
All application code starts life on disk. We cannot avoid page faults but we can reduce them. There are a few simple strategies that developers can employ which can more than halve the overhead of loading code.
Add More RAM!

A simple and effective solution for applications developed for dedicated hardware. When your machine runs low on memory your least recently used code will be discarded so that the 4Kb page of RAM it occupied can be reused. This code will need to be reloaded if an attempt is later made to execute an instruction within the page. The more RAM you have the less likely it is that code will be discarded. But demand paging loads code pages only when they are referenced. Increasing RAM will usually do little to reduce disk activity at application startup or when a particular application feature is first referenced.
Use As Few DLLs As Possible

Demand paging is not always the best loading strategy and can be seen at its worst when an application employs a large number of DLLs. Not all code page loads are equal, they depend on the response time of the disk hardware. A large component of this is the time taken to move that disk head, the further the head is from the data the longer the disk access will take. A good file system or one which is well maintained by defragmentation utilities will cause files to be relatively contiguous on disk. After a read access subsequent reads from the same file will tend to require little head movement and may often be satisfied by the read buffer in the disk hardware.
When code is demand loaded, the order in which pages are read is dictated by the order that instructions are first referenced. As execution of an application progresses, loads from an EXE file will be interspersed with loads from each of the DLLs. Each of these files may be a significant distance apart on the disk surface sending the disk head shuttling back and forth between files to service each page fault. If your application employs a large number of DLLs and your disk drive does a passable impersonation of a chain- saw on startup, now you know why!
Writing an application as DLLs separated by logical division can be extremely convenient in limiting rebuilds and separating development effort. Combining them into a single DLL (or ideally one large EXE) before you ship to your users can significantly reduce disk activity.
Don't Put Code in a DLL Unless You Need To

A DLL does not have a guaranteed load address and so must contain the internal relocations (fixups) required to allow it to be positioned anywhere in memory. These relocations are stored with the external relocations in the executable file and occupy memory when loaded. The less code in the DLL the fewer relocations there will be. Although EXE files also have relocations they (should - see below) only have external references to other files which are typically far fewer in number.
External relocations provide another good argument for reducing the number of DLLs. External file references (e.g. a CALL to a function in another DLL) require a relocation to be stored in the relocation table (consuming memory) and must be applied to a code page loaded from disk (consuming time). By contrast an internal call can be relative and hence unaffected by code location. These internal references will be resolved by the linker, eliminating the fixup completely.
Give Your EXE a Base Address

The linker will allow you to base your EXE file at 10000H. OS/2 guarantees that an EXE file can load at this address. The linker knows this and will completely remove all internal relocations from an EXE file when you give it a base address. The resulting EXE file will be smaller, use less memory and load faster. There is no down side. Do this!
Give Your DLLs a Base Address

If you cannot avoid using a DLL, give it a base address at link time. The linker can apply all the internal relocations assuming that the DLL will be loaded at that default address. It also sets a flag in the executable file header to tell OS/2 this has been done.
When the DLL is loaded OS/2 will check to see if it can be placed at its default address. If it can then although internal fixups will still be loaded (they are stored with the external fixups) they do not need to be applied, shortening the page load time. Fixups for each page are ordered such that external fixups appear first, when the first internal fixup is encountered the fixup applicator knows not to continue.
This entire strategy depends on selection of a base address at which enough free memory is likely to be available at which to load the entire DLL. You cannot guarantee a free load address but you can give your DLL a good fighting chance by selecting a likely free area. Finding such memory can be complicated by the use of multiple DLLs, particularly if some have been sourced externally. Your debugger can be useful here, both in identifying a potential area of memory and later checking to see if your DLL managed to load there.
Using Ordinals To Export from DLLs

When a fixup target is within another file the fixup must identify the target either by name or ordinal number. If a name is used the target DLLs list of exported names must be searched and string comparisons performed. If the exported DLL functions have been assigned an ordinal this search and comparison is unnecessary reducing the time taken to apply the fixup.
Exported entry points within a DLL are stored as a list. Exported ordinal number 23 must be the 23rd in the list. Each unused ordinal number results in a blank list entry. Although the exported entry table encodes blank entries very efficiently it is likely that these entries are expanded into an array when loaded for performance reasons. When assigning ordinal numbers do not group exports by use such as 1000+ for string functions, 2000+ for mathematical functions etc. For the most efficient numbering start at 1 and assign ordinals to exported functions sequentially.
Working Set Tuning

In the war against code load performance delays, the ultimate weapon in the programmers arsenal is working set tuning. The first section viewed code as executing from disk with system RAM forming a giant cache. The success of caching relies on the non random nature of access requests. The greater the locality of reference the more efficiently a cache operates.
The path of execution through code is not random. Instructions execute in sequences broken up by transfers of control such as CALL or JMP instructions. It is these transfers of control that often produce a cache miss resulting in a disk access. While the targets of these transfers are sometimes some distance from the current instruction pointer the destination is often known or predictable. It is this predictability on which a working set tuner for code is based.
The Theory

When applications are divided into pages it is done without regard to the underlying contents, every 4096 bytes the code is indiscriminately severed. Functions and even individual instructions are split across page boundaries. What appears on each page is dictated by the order of the code within the executable file. What dictates that order? You do.
Within each compilation unit the order in which code appears is generated by the compiler. It builds the instruction sequences and orders them roughly as they appear in the source file. The linker takes each object file and concatenates the contents, appends any used libraries and outputs the executable. The result is that code within an executable will appear in the order that it is typed in the source files and the order in which those files are processed by the linker.
Well so what - if code is only loaded when referenced then there's no problem, right? Take a look at the two code layouts below where A to L are individual code sequences which for simplicity have been grouped 3 to each 4Kb page. A comma represents a page boundary.
Pages: 1 2 3 4 Normal Layout: ABC,DEF,GHI,JKL Tuned Layout: GED,LBC,HIK,AFJ

Reduced Code Memory Requirements

If the normal path of execution is "GEDLBLBLBCHIK" the benefits of the tuned layout become clearer. The normal layout loads four pages in the sequence 3241, the tuned layout loads only three pages in the sequence 123. Both layouts contained and executed the same code but the tuned layout avoided one disk access and used 25% less memory.
This memory saving is a direct result of the sequences A, F and J not being executed. If this seems contrived consider that on average between 30% and 50% of a typical applications code is loaded but not executed during normal operation. If that sounds high examine your own code for error handling of memory allocation, file access or OS/2 API errors. Include with this the code that your users will rarely execute such as the import of that obscure file format, changing application configuration or displaying that list of programmer credits (try clicking on the Warp desktop and then press Ctrl-Alt-Shift-O). Such code often lies within branches of IF or CASE statements, lying dormant inside otherwise active functions. In an untuned code layout much of this code is loaded because it happens to share part of a code page with some other active code sequence.
Performance In Low Memory Conditions

Limit the code sequence to the use of one 4Kb page as a crude simulation of low memory conditions and more significant differences emerge. The normal layout will execute as "GEdLBLBLBcHiK" where an upper case character represents the need to load a page from disk. The result is a total of 10 page loads. The tuned layout sequence is "GedLblblbcHik". The total of 3 page loads has been unchanged by the restriction to one page of memory.
The reduction in page faults is due to the way that tuned code layouts group code by time of execution. When a tuned application executes a CALL or JMP instruction the target is much more likely to be on the same page of memory. In low memory conditions any attempt to execute an instruction on all but the most recently used pages is likely to be punished by a page fault.
While this small example conveys the basic operation, in reality the situation is much more complex. OS/2 recycles pages on approximately a least recently used basis. In low memory conditions applications and the operating system compete for ownership of memory pages, the number of pages available to an application varies dynamically based on current demand. As a result, when you run working set tuned code the reduced memory requirements benefit not just the tuned software but all other executing applications.
Page Load Ordering

The tuned layout has another less obvious advantage. The pages were loaded in the order in which they appeared in the executable file. Most file reads are sequential and disk accessing software and hardware are optimized to take advantage of this. After that first access the following pages are unlikely to require disk head movement and may already have been loaded by disk hardware removing all mechanical delays.
Creating a Tuned Code Layout

There are two ways to working set tune code, do it manually or use a tool.
Manual techniques are based on a developer predicting a programs flow of control and deciding which functions can be expected to be used together. By using compiler pragmas these functions may be placed into named groups which are combined by the linker. Unfortunately there are two problems which severely limit the effectiveness of this approach.
First, the prediction of code use is a non trivial exercise. On a large development project all but the most primitive of groupings are beyond human abilities. Good code arrangement strategies need complex algorithms. These identify temporal dependencies and position functions on pages to maximize the overall probability of their being in memory when their callers require them.
Second, the granularity is wrong. Pragmas only allow the grouping of functions yet what we are interested in is where each indivisible sequence of instructions should be placed. In the example above we had three sequences per page, in reality CALL, JMP and other transfer instructions divide each page into approximately 300 separate code sequences. Even the smallest of applications contain thousands of these sequences, each potentially an unused body of an IF or CASE statement.
To perform this task effectively requires a tool. This tool would need to:

Identify each indivisible code sequence
Trace the execution of the application to see when and from where each sequence is commonly used.
Apply an algorithm to the information gathered creating a code arrangement designed to minimize page faults.

A Tool

LXOPT is a new development tool designed specifically to working set tune code. Applied directly to EXE and DLL files it works in two stages. First it must process the software so that it gathers usage information as it executes. Then it uses this information to create a new version of the application with an optimized code layout.
Below LXOPT is preparing a simple "Hello World" program. Although 32-bit executable files are loaded into a single 32- bit segment, the divisions between code, data and resources remain. Each is loaded into a separate object analogous to segments in 16-bit applications. LXOPT identifies each object in the file and then processes object 1, the largest (and in this case only) 32-bit code object.
[C:\DEV]lxopt hello_w.exe OS/2 Linear Executable Swap Optimisation Utility V1.0 Copyright (C) 1994, Functional Software Ltd. All rights reserved. Objects: Object 1: 32 Bit Code - 16384 bytes Object 2: Stack - 6256 bytes Object 1 selected... Warning LXO0150: Unused area 0x1924 to 0x1933 contains fixups. Warning LXO0150: Unused area 0x2ff5 to 0x301e contains fixups. Warning LXO0150: Unused area 0x31f7 to 0x323b contains fixups. Code area statistics Code: 14754 bytes (90%) Data: 509 bytes (3%) Unused: 1121 bytes (7%) ====== Total: 16384 bytes Existing version of application has been renamed to "hello_w.ori". Recording version of application "hello_w.exe" created.
Figure 3: LXOPT application preparation
The code statistics table may come as a surprise, code objects do not just contain code. Some development tools take advantage of read-only code objects and put read-only data there. Compilers also create data, CASE statements are often compiled into jump tables and the table is positioned within the code. Here 14754 bytes within the code object form processor instructions, 509 bytes are data.
Those of you ever tempted into the occasional piece of DIY will be familiar with one of the "Great Laws of the Universe". If you take apart anything such as a washing machine, telephone or lawnmower and then reassemble it there is always a surplus pile of screws, washers, nuts or bolts. LXOPT has divided the application into its fundamental instruction sequences and reconstructed it. Software is not immune to the "Universal Laws" and in this case we have 1121 bytes of spare parts. These bytes are unreferenced by any part of the application. Some are padding used to align instructions for performance reasons but not all as the LXO0150 warnings show. Some are unreferenced instruction sequences. These are usually unused functions contained within the source or as in this case, pulled in from a library. LXOPT does not hoard spare parts, these unused functions are discarded.
Now we have a have a profiling version, lets run it.
[C:\DEV] hello_w Hello World
Figure 4: Running the application
Not very exciting but a directory listing shows a new item, "hello_w.rec". Within this recording file is a list of where and when each code sequence was executed. This is the information we need to create an optimized code arrangement.
[C:\DEV]lxopt hello_w.exe /arrange OS/2 Linear Executable Swap Optimisation Utility V1.0 Copyright (C) 1994, Functional Software Ltd. All rights reserved. Calculating page faults loading instructions from code object 1 ... Memory (Kb) Old Faults New Faults Percentage Reduction 4 48 10 79% 8 16 2 87% 12 9 2 77% 16 4 2 50% 9794 bytes (59%) of the code was parked. Recording version of application has been renamed to "hello_w.prp". Optimised version of application "hello_w.exe" created.
Figure 5: Creating an Optimized Arrangement
LXOPT creates the new version of the application and uses the recording file to simulate the execution of the old and new versions. A comparative table is produced showing the number of page faults needed to load the executed code with various fixed amounts of free memory. The old arrangement had executed code spread out over the full four pages (16Kb) of code. As memory is progressively constrained to 12, 8 and 4Kb the number of page faults immediately starts to rise. The new arrangement needs only 2 pages and fits all the executed code into 8Kb. Only when 4Kb of memory is free does the restriction begin to bite.
The line after the table tells us 9794 bytes were "parked". These are unexecuted instruction sequences, logically reachable but never used. These are not the unused sequences from the preparation stage, they were discarded when first detected. These sequences are dormant but potentially executable so they must be retained. Now that these unused code fragments are grouped together they will not be loaded into memory unless they are actually used.
How Much Better Does Working Set Tuned Code Run?

The processed "Hello World" program used half the code memory and produced an 87% best case page fault reduction. But "Hello World" is not a typical application. On average the 50% code memory reduction for trivial applications reduces to 30% as code size rises to a few hundred kilobytes. With increasingly larger applications the average reduction begins to rise again and often retains its 50% level as code size exceeds 1Mb. This interesting effect is probably due to the way in which large applications are produced. When you have this volume of code you either have a lot of programmers, a prolific code generator or some large bought in libraries. In all of these cases coding tends to be more defensive, functions handling a wider range of conditions, errors or input parameters that in reality are not normally encountered. In a tuned layout all such code is "parked" where it will not be loaded unless the conditions for execution arise.
Page fault reductions follow a similar pattern. The larger an application the more it is divided by programming effort. But source code is not divided by time of execution. A compilation unit is more likely to consist of string handling or matrix manipulation functions than the code executed in the first three seconds of application startup. As modules increase both in size and number temporally related code becomes more and more separated. When this code is tuned the matrix and string handling initialization functions will be moved along with all other initialization functions to the beginning of the application code space where they belong. Once used it will then be discarded, use of other string or matrix functions will not cause initialization to reload as they no longer share a code page.
Transfer these benefits to specific applications and you will get varying results. Working set tuning addresses two specific issues, the time taken to load code and the amount of memory needed to store it. If you are looking for a performance improvement and your code spends hours looking for prime numbers or generating fractals then you are looking in the wrong place. Tuned code executes the same instructions, they have just been moved to a more convenient location. Your prime numbers may not appear any faster but while the search continues in the background your word processor will appreciate the extra space it gets due to your more frugal memory usage.
Performance improvement and memory reductions are roughly proportional to code size. The "Hello World" example reduced code memory by 50% and page faults by up to 87%. In reality "Hello World" is unlikely to ever be memory restricted so we save 8Kb of RAM and the few milliseconds taken to handle two page faults. Tune a group of large applications and run them concurrently in low memory conditions and the benefit is hard to ignore.
If you are interested in working set tuning of code or are just plain curious to see how much of your application really gets executed there's a demo copy of LXOPT on the Developer Connection CD-ROM. The latest demo also appears as LXOPTnnn.ZIP (where nnn is the version number) on Compuserve (OS2DF1:Dev Tools), ftp-os2.cdrom.com (os2/demos) and the EMEA DAP bulletin board (DEMOS file area).
If you're not tempted now, keep an eye on this technology. The automated tracing of an executing application can be used not just to working set tune code but will progress to tune data accesses and ultimately invade the territory of compiler optimizers. Like it or not, the development cycle is about to get longer. Running your completed application and feeding the results back into a secondary optimizer is a technology that's here to stay.
References

LXSPEC.INF - Linear Executable File Format, Developer Connection CD-ROM