PDGuide - Guide to Instrumenting Your Code

Problem Determination Programmer's Guide

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation

Code instrumentation improves problem analysis. Instrumented components of OS/2 Warp Version 4 use First Failure Support Technology (FFST) and trace. This chapter defines the required steps for instrumentation and describes what you should consider before you instrument your code. It also tells you what to expect when you use the FFSTProbe API and the trace utility.

Introduction to FFST Instrumentation

FFST is a programming concept that uses a set of software tools and services to capture error information at the time of a code failure. You instrument your code by placing calls to the FFSTProbe API in it and specifying which data to collect. You then view the captured error information with the system error log or the PM Dump Facility dump formatter to determine the cause of the problem.

When your properly instrumented code encounters an unexpected or unrecoverable error, the code immediately calls the FFSTProbe API to capture failure-related information. The parameters on the call specify which data to capture. The system creates an error log entry each time your code calls FFSTProbe, and the entry contains the information your code specified on the call. The captured information can include event trace data, program error information, or user-defined data. You can collect additional error information, or trigger a system dump, by using a Probe Control Table (PCT) entry. After the call, the system returns control to your code unless the call triggered a system dump; a system dump automatically restarts the system.
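
The control flow described above can be sketched in C. This is only an illustration under stated assumptions: `probe_stub`, `LogEntry`, `g_log`, and `average` are hypothetical names invented for this sketch and are not the FFSTProbe interface, whose actual parameters are described in Problem Determination APIs. The point is the shape of the flow: detect the error, make one probe call, and continue after control returns.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the error log. */
typedef struct {
    unsigned long probe_id;   /* identifies the call site */
    char text[64];            /* description supplied by the caller */
} LogEntry;

#define LOG_CAPACITY 8
LogEntry g_log[LOG_CAPACITY];
int g_log_count = 0;

/* Hypothetical stand-in for a probe call: stores one entry, then returns
   so that control goes back to the caller. */
void probe_stub(unsigned long probe_id, const char *text)
{
    assert(text != NULL);
    if (g_log_count < LOG_CAPACITY) {
        LogEntry *e = &g_log[g_log_count++];
        e->probe_id = probe_id;
        strncpy(e->text, text, sizeof e->text - 1);
        e->text[sizeof e->text - 1] = '\0';
    }
}

/* Computes an average. A count of zero should never reach this function,
   so it is treated as an unexpected error.
   Returns 0 on success, -1 on failure. */
int average(long total, long count, long *out)
{
    assert(out != NULL);
    if (count == 0) {
        probe_stub(1001UL, "average: count is zero");
        return -1;   /* control is back in our code after the probe */
    }
    *out = total / count;
    return 0;
}
```

Calling `average(10, 0, &result)` in this sketch leaves exactly one entry in `g_log` and returns -1 to the caller, which can then run its normal error handling.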

FFST consists of a collection of functions, commands, and utilities within the Problem Determination Tools folder. Use the utilities to do the following:

collect problem determination data
define the types of data collected
specify where to store the collected error data
override parameters on calls to the FFSTProbe function.

Summary of Functions and Interfaces provides an overview of the interfaces to FFST. Problem Determination APIs provides descriptions of the API functions.

This chapter provides the information you need to instrument your code. It may be helpful to have the OS/2 Warp Version 4 Tools Reference document available while using this book. The associated references are available on the Toolkit CD-ROM.

Benefits of Instrumenting for FFST

Instrumentation is key to providing adequate code serviceability. If problems occur, instrumented code allows you or service personnel to take full advantage of the FFST technology in OS/2 Warp Version 4. The system records problem determination data with no user or additional programmer intervention. Instrumentation decreases the need for reproducing user failures. System dumps and process dumps, however, do require intervention and problem reproduction.

The captured information that is recorded in the error log is essential to problem solving. An error log entry contains information that indicates the failing product, and the time the error occurred. By analyzing the captured information, you can determine the failing components, diagnose the causes of the error, and correct the problems.

Overview of FFSTProbe API

The FFSTProbe API is the key to problem analysis: a call to it signals that your code has encountered a problem. FFSTProbe captures the requested data and stores it in the error log for use in problem analysis.

FFSTProbe Parameters

The FFSTProbe API parameters identify the product that reported the problem. The parameters specify which data to collect for the problem. You can use the parameters to specify the following:

the severity of the call
the associated error message data
the name of the formatting template that is used to display the error information
any system process information
any specific user data that you want collected.

The system stores module name and time stamp automatically. The FFSTProbe API can also initiate a system dump to capture data that resides in the main memory of the system. Refer to Problem Determination APIs for the FFSTProbe API and its parameters.

FFST Flow

Figure: FFST Flow.

The sequence of events shown in the FFST Flow figure illustrates how FFST logs errors and captures data when your code calls FFSTProbe.

In the figure, the application program box represents your code after you develop and install it on the system. Your code calls the FFSTProbe API when it discovers a problem. If you specify user data in the call to FFSTProbe, the system captures that data with the other error-related information.

Your code calls FFSTProbe to gather the following product information:

dump information
error message information
other error-related data

The system records the data in the error log entry. If your code has entries in the Probe Control Table, FFST uses the entry values instead of the parameters specified on the calls to FFSTProbe, and it uses those configuration values to create the error log entry.
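
The override rule above can be sketched as a table lookup. This is a minimal illustration, not the FFST implementation: `ProbeSettings`, its fields, `pct_find`, and `effective_settings` are hypothetical names, and real Probe Control Table entries may hold different values. The sketch only shows the precedence: when a table entry exists for a probe, its values replace the values passed on the call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of per-probe settings; real PCT fields may differ. */
typedef struct {
    unsigned long probe_id;
    int severity;
    int take_dump;     /* nonzero: request a dump */
} ProbeSettings;

/* Returns the control-table entry for probe_id, or NULL when none exists. */
const ProbeSettings *pct_find(const ProbeSettings *table, size_t count,
                              unsigned long probe_id)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (table[i].probe_id == probe_id) {
            return &table[i];
        }
    }
    return NULL;
}

/* A table entry, when present, wins over the values passed on the call. */
ProbeSettings effective_settings(const ProbeSettings *table, size_t count,
                                 ProbeSettings from_call)
{
    const ProbeSettings *entry = pct_find(table, count, from_call.probe_id);
    return entry != NULL ? *entry : from_call;
}
```

With a table that contains an entry for probe 42, a call for probe 42 takes its severity and dump setting from the table, while a call for any other probe keeps the values it passed.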

After FFST gathers the error-related information, it stores the data as an error log entry. The system stores FFST dump information in a file named FFxxxxxx.DMP, where xxxxxx signifies a six-digit identifier. If a trace snapshot is requested, the system creates a file named FFxxxxxx.TRC. If a process dump is requested, the system creates a file named FFxxxxxx.PRC. The error log entry contains the name of the FFST dump file and, if applicable, the name of the trace file.
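
The naming pattern can be illustrated with a small helper, for example in a tool that looks for the files that belong to one log entry. The function name `ffst_file_name` is hypothetical, and the sketch assumes that the six-digit identifier is written as zero-padded decimal digits; the text above says only that it has six digits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "FFxxxxxx.<ext>" into buf. Assumes the identifier is rendered as
   six zero-padded decimal digits (an assumption of this sketch).
   Returns 0 on success, -1 if the id, extension, or buffer size is unsuitable. */
int ffst_file_name(char *buf, size_t size, unsigned long id, const char *ext)
{
    assert(buf != NULL && ext != NULL);
    if (id > 999999UL || strlen(ext) != 3) {
        return -1;
    }
    int n = snprintf(buf, size, "FF%06lu.%s", id, ext);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

Passing the same identifier with the extensions DMP, TRC, and PRC gives the three file names that can accompany a single error log entry.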

Use the SYSLOG utility to view the error log information. SYSLOG uses message files and template files to format and display error log records. Use SYSLOG to control the following log functions:

specify which error log file to use.
suspend or resume error logging.
change the size of the error log.

Steps for Instrumenting for FFST

The steps for instrumenting your code are as follows:

Plan for instrumenting your code.
Code the FFSTProbe API.
Compile the code.
Create the error record template file.
Create message files.
Create DMI MIF files.

The remainder of this chapter describes each step.

Planning for Instrumenting Your Code

There are several things to consider before you begin putting calls to the FFSTProbe API in your code. This section describes the following considerations and steps:

Define and ensure existence of Vital Product Data (VPD). VPD is the description of your code to the system. The system uses VPD to identify the product that is reporting a problem.
Decide how and where you should code your calls to the FFSTProbe function.
Decide what data you want the function to collect for code failures.

Defining Vital Product Data (VPD)

The DMI facility provides a standard way to register the hardware and software on the system. This allows both system software and system-based software (for example, application programs or device drivers) to register with the system. This information is called Vital Product Data (VPD). The system uses VPD to identify the source of error log entries. Various system management applications require access to the VPD information. When a product component uses FFSTProbe to log an error, the error logging function automatically includes the VPD information in the error record.

The VPD information allows systems management applications to assume a base level of VPD for all conforming products on a system. The VPD information for software products differs from the VPD information for hardware products. You can define additional specialized VPD information for your product.

Your component's install object (which the feature installer uses to install your product) and the parameters on your calls to FFSTProbe must contain identical values. This enables DMI to provide the template file that is specified on the call to FFSTProbe. Recommendations for these values are:

Vendor - a description of the organization or company that developed the product that is reporting an error (example: IBM).
Tag - a unique description of the product that is reporting an error (example: FFSTProbe SAMPLE).
Revision - optional description of the development organization's revision level, change level, or version of the product that is reporting an error (example: 1.0.1c). If a component within a product is reporting the error, the revision may not correspond to the revision level of the entire product.

Programmers refer to the Vendor, Tag, and Revision values as the DMI triplet.

When your code calls FFSTProbe, the DMI triplet values you specified must match the DMI values stored in the DMI database for your product. If the values do not match, FFSTProbe cannot find the VPD for your product in the DMI database.

If the DMI triplet for your product matches the DMI triplet of a different product, the results are unpredictable.
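
The matching requirement can be sketched as a comparison of two triplets, for example in a build-time check that the values used in your probe calls agree with the values in your install object. `DmiTriplet` and `triplet_matches` are hypothetical names, and the exact, case-sensitive comparison is an assumption of this sketch; it is not a description of how DMI performs the lookup.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *vendor;    /* for example "IBM" */
    const char *tag;       /* for example "FFSTProbe SAMPLE" */
    const char *revision;  /* optional; NULL or "" when not used */
} DmiTriplet;

/* Treats a missing (NULL) value as an empty string. */
static const char *or_empty(const char *s)
{
    return s != NULL ? s : "";
}

/* Returns 1 when all three values are identical, else 0. The comparison is
   exact and case-sensitive, which is an assumption of this sketch. */
int triplet_matches(const DmiTriplet *a, const DmiTriplet *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(or_empty(a->vendor), or_empty(b->vendor)) == 0
        && strcmp(or_empty(a->tag), or_empty(b->tag)) == 0
        && strcmp(or_empty(a->revision), or_empty(b->revision)) == 0;
}
```

Keeping the three strings in one shared definition that both the install object and the probe calls are generated from is one way to avoid a mismatch in the first place.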

Deciding How and Where to Place Calls to FFSTProbe API

There are two common approaches to instrumentation. The first is to place numerous calls to FFSTProbe throughout your product to get broad coverage. With this approach, each call might capture only a minimum amount of error data, because you know that a probe captures every error. The second approach is to use just a few strategically placed calls that capture greater amounts of error data to better isolate the exact cause of the failure.

The advantage to the broad coverage approach is that errors are most likely to be identified because of the greater number of calls to FFSTProbe. The strategic approach usually involves instrumenting existing exception paths or thoroughly understanding the code to identify where to place the call to FFSTProbe.

You might consider combining both approaches in your code. The broad coverage aspect identifies exactly where the error occurred, and the strategic aspect identifies the cause.

You should use FFSTProbe only to detect problems that would require a program fix or a modification to user operation procedures.

Places to Instrument

The following list contains situations and places in your code you should consider for instrumentation:

When your code generates an error return, create an error log entry for the error condition that caused the error.
Some programmers use Print Debug and Print File for testing code. These instructions print certain variables and messages at various code failure points. Convert the Print Debug and Print File instructions to calls to the FFSTProbe API. The system disables the Print Debug and Print File functions after you install your code.
When you expect return codes, create an error log entry when you receive unexpected return codes.
In environmental situations, that is, circumstances that are not necessarily program errors but are worthy of an error log entry. Examples are resource shortages, time-out conditions, system-hang conditions, and lost physical connections.
In cleanup functions, your code may be tolerant of potential errors and may do some error recovery. The cleanup functions in your code are candidates for logging if the recovery signifies an important event.

Consider that the number and size of the entries you log could cause too much information to be logged. Creating excessive error log entries can cause the error log to wrap, which overwrites previously logged information.

One of the most frequently asked questions about FFST is where and when to use it. When instrumenting your product, consider the following places:

Exception Paths: Many programmers already take some actions in various exception conditions. These actions often include cleaning up execution environments, closing files, and ending the program. Your code should call the FFSTProbe API to create an error log entry that contains the following information:

the program or module that failed

why the failure occurred

what corrective actions to take.

Incorrect Conditionals (for example, switch case): As developers write programs, they make assumptions of what can or cannot happen, and add various conditionals and execution blocks to programs.

Conditionals that are not valid are ideal candidates for a call to the FFSTProbe API to log these failures. By calling the FFSTProbe function at these points, you can quickly and accurately pinpoint the failure and capture the associated data at the time of failure.

External Calls: OS/2 Warp Version 4 does not expect calls to external programs to fail. However, unexpected return codes, when not handled, can result in program failure. Your code should call FFSTProbe after each external call that results in an unexpected return code.
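
Two of the places listed above, incorrect conditionals and external calls, can be sketched as follows. The names `probe_stub`, `state_name`, and `check_external_rc`, the probe IDs, and the return code values are all hypothetical and chosen only for illustration; `probe_stub` merely counts calls and stands in for a real call to the FFSTProbe API.

```c
#include <assert.h>
#include <stddef.h>

int g_probe_calls = 0;
unsigned long g_last_probe_id = 0;   /* 0 means "no probe called yet" */

/* Hypothetical stand-in for a probe call; it records only the probe ID. */
void probe_stub(unsigned long probe_id)
{
    assert(probe_id != 0UL);
    g_probe_calls++;
    g_last_probe_id = probe_id;
}

typedef enum { STATE_IDLE, STATE_RUNNING, STATE_STOPPED } State;

/* Incorrect conditional: the default branch "cannot happen", so reaching
   it is reported. Returns NULL for an invalid state. */
const char *state_name(int state)
{
    switch (state) {
    case STATE_IDLE:    return "idle";
    case STATE_RUNNING: return "running";
    case STATE_STOPPED: return "stopped";
    default:
        probe_stub(2001UL);
        return NULL;
    }
}

/* External call: report a return code that is not in the set the caller
   is prepared to handle. Returns rc unchanged. */
int check_external_rc(unsigned long probe_id, int rc,
                      const int *expected, size_t count)
{
    assert(expected != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (expected[i] == rc) {
            return rc;   /* anticipated result, nothing to log */
        }
    }
    probe_stub(probe_id);
    return rc;
}
```

A caller would wrap the result of an external call, for example `rc = check_external_rc(3001UL, rc, handled, 2);`, and then continue with its normal handling of `rc`. Using a different probe ID at each call site keeps the log entries distinguishable.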

Some development groups have someone other than the developer instrument all calls. Other groups spend more time anticipating the potential problem areas and place probes only in those areas. Do what is achievable for the current circumstances, and then evaluate how well your calls to FFSTProbe work before you begin the next development cycle.

For FFSTProbe API calls to be useful in debugging a problem, the calls must specify:

A unique probe ID
Descriptive text explaining the problem
Data that is relevant to the failure
Instructions on how to resolve the problem if appropriate.
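
The list above can be turned into a simple review aid. The following sketch assumes a hypothetical `ProbeInfo` structure that describes each probe site in a product; the field names are invented here, and treating a probe ID of 0 as "not set" is an assumption. It checks that the required items are present and that no probe ID is used twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one probe site; field names are invented. */
typedef struct {
    unsigned long probe_id;   /* must be unique within the product */
    const char *text;         /* descriptive text explaining the problem */
    const void *data;         /* data relevant to the failure */
    size_t data_len;
    const char *resolution;   /* optional: how to resolve the problem */
} ProbeInfo;

/* Returns 1 if the required items are present, else 0.
   A probe ID of 0 is treated as "not set" in this sketch. */
int probe_info_is_complete(const ProbeInfo *p)
{
    assert(p != NULL);
    return p->probe_id != 0UL
        && p->text != NULL && p->text[0] != '\0'
        && p->data != NULL && p->data_len > 0;
}

/* Returns 1 if no two entries share a probe ID, else 0. */
int probe_ids_are_unique(const ProbeInfo *probes, size_t count)
{
    assert(probes != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (probes[i].probe_id == probes[j].probe_id) {
                return 0;
            }
        }
    }
    return 1;
}
```

The resolution text is left optional because the list above asks for it only where appropriate.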

With well-instrumented code, several benefits of using the FFSTProbe function are evident. You can use FFST to capture error information. You can also identify areas in your code that did not cause the problem: if instrumented code made no calls to FFSTProbe, you can focus on the code that has no calls to FFSTProbe.

Your code should not call the FFSTProbe function inside a loop. Call FFSTProbe only once per error situation. Repeated calls may cause system performance problems and may cause FFST data to wrap by storing unnecessary data.
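
One way to follow this rule is to count failures inside the loop and make a single probe call after the loop ends. The sketch below assumes hypothetical names (`probe_stub`, `validate_records`) and a made-up validity rule; `probe_stub` stands in for a real call to the FFSTProbe API and only keeps the summary it receives.

```c
#include <assert.h>
#include <stddef.h>

int g_probe_calls = 0;
int g_reported_failures = 0;

/* Hypothetical stand-in for a probe call; it keeps the failure count. */
void probe_stub(unsigned long probe_id, int failures, size_t first_bad_index)
{
    (void)probe_id;
    (void)first_bad_index;
    g_probe_calls++;
    g_reported_failures = failures;
}

/* Validates every record (negative values are invalid in this sketch).
   Failures are only counted inside the loop; the probe is called once,
   after the loop, with a summary. Returns the number of bad records. */
int validate_records(const int *records, size_t count)
{
    int failures = 0;
    size_t first_bad = 0;

    assert(records != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (records[i] < 0) {
            if (failures == 0) {
                first_bad = i;
            }
            failures++;
        }
    }
    if (failures > 0) {
        probe_stub(4001UL, failures, first_bad);
    }
    return failures;
}
```

Three bad records in one pass produce one probe call that carries the count and the index of the first bad record, rather than three separate log entries.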

Problem-prone components in products are good candidates for the FFSTProbe function.

Your decision about using instrumentation depends on the possible errors and the cost of solving an error.

Deciding What Data You Want to Collect

After you decide where to call the FFSTProbe function, you need to decide what data to capture. The question to ask is, "What data would I need to see to have a good chance of determining the source of the error?" Consider capturing data items that are global variables and control blocks.

Other things to consider are:

How much data do you need to determine the cause of the problem?
Has this problem been encountered before?
How complex is the code?
Are other components or products being called?
Does the error message information point to the problem?

Make every effort to collect enough data to solve the problem without requiring the user to re-create the problem.

The amount of data collected could also be affected by the amount of system storage that is available or allocated to store error data.

Error Types to Consider

When an error occurs, call the FFSTProbe function to log the error. The following examples describe several error types you should consider when you instrument your code and the types of data to capture for the error:

Error return: Determine the severity of an error so that you call the FFSTProbe function only when the error return indicates a serious problem. When a calling program has a significant failure that causes an error return, the program calls FFSTProbe. Be careful not to cause a "cascade" of calls to FFSTProbe as the system passes error returns back up through a set of higher level function calls.