PDGuide - Guide to Instrumenting Your Code

From EDM2
Problem Determination Programmer's Guide
  1. Introduction to Collecting and Managing Problem Determination Data
  2. Guide to Instrumenting Your Code
  3. Controlling FFSTProbe Calls
  4. Viewing and Analyzing Error Log Entries
  5. Analyzing Performance and Debugging Problems Using Trace
  6. Capturing and Saving Failure-Related Information through Dumps
  7. The Desktop Management Interface
  8. Summary of Functions and Interfaces
  9. Problem Determination APIs

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation

Code instrumentation improves problem analysis. Instrumented components of OS/2 Warp Version 4 use First Failure Support Technology (FFST) and trace. This chapter defines the required steps for instrumentation, and things you should consider before you instrument your code. This chapter also tells you what to expect when you use the FFSTProbe API and trace utility.


Introduction to FFST Instrumentation

FFST is a programming concept that uses a set of software tools and services to capture error information at the time of a code failure. You view the error information using the system error log or the PM Dump Facility dump formatter to determine the cause of the problem. You capture error information by placing a call to the FFSTProbe API in your code. You instrument your code by calling FFSTProbe and specifying which data to collect.

When your properly instrumented code encounters an unexpected or unrecoverable error, it immediately calls the FFSTProbe API to capture failure-related information. The parameters your code passes on the call specify which data to capture. The system creates an error log entry each time your code calls the FFSTProbe function, and the entry contains the information your code specifies in the call. The captured information in the error log entry can include event trace data, program error information, or user-defined data. You can collect additional error information, or trigger a system dump, by using a Probe Control Table (PCT) entry. After the call, the system returns control to your code unless the call triggered a system dump; a system dump automatically restarts the system.

FFST consists of a collection of functions, commands, and utilities within the Problem Determination Tools folder. Use the utilities to do the following:

  • collect problem determination data
  • define the types of data collected
  • specify where to store the collected error data
  • override parameters on calls to the FFSTProbe function.

Summary of Functions and Interfaces provides an overview of the interfaces to FFST. Problem Determination APIs provides descriptions of the API functions.

This chapter provides the information you need to instrument your code. It may be helpful to have the OS/2 Warp Version 4 Tools Reference document available for reference while using this book. The associated references are available on the Toolkit CD ROM.

Benefits of Instrumenting for FFST

Instrumentation is key to providing adequate code serviceability. If problems occur, instrumented code allows you or service personnel to take full advantage of the FFST technology in OS/2 Warp Version 4. The system records problem determination data with no user or additional programmer intervention. Instrumentation decreases the need for reproducing user failures. System dumps and process dumps, however, do require intervention and problem reproduction.

The captured information that is recorded in the error log is essential to problem solving. An error log entry contains information that indicates the failing product, and the time the error occurred. By analyzing the captured information, you can determine the failing components, diagnose the causes of the error, and correct the problems.

Overview of FFSTProbe API

The FFSTProbe API is the key to problem analysis: calling it signals that your code has encountered a problem. FFSTProbe captures the requested data and stores it in the error log for use in problem analysis.

FFSTProbe Parameters

The FFSTProbe API parameters identify the product that reported the problem. The parameters specify which data to collect for the problem. You can use the parameters to specify the following:

  • the severity of the call
  • the associated error message data
  • the name of the formatting template that is used to display the error information
  • any system process information
  • any specific user data that you want collected.

The system stores module name and time stamp automatically. The FFSTProbe API can also initiate a system dump to capture data that resides in the main memory of the system. Refer to Problem Determination APIs for the FFSTProbe API and its parameters.
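The parameter list above can be pictured as a request structure that the caller fills in and hands to the probe. The following is a minimal sketch under stated assumptions, not the real FFST interface: every name in it (`sketch_probe_request`, `sketch_log_entry`, `sketch_probe`, the severity constants) is hypothetical, and the log entry is kept in memory instead of the system error log. Refer to Problem Determination APIs for the actual FFSTProbe parameters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-ins. These are NOT the real FFST types or constants. */
enum sketch_severity { SKETCH_SEV_INFO = 1, SKETCH_SEV_WARNING, SKETCH_SEV_ERROR };

/* What the caller specifies on the call. */
struct sketch_probe_request {
    unsigned long probe_id;          /* identifies the call site */
    enum sketch_severity severity;   /* severity of the call */
    const char *message_text;        /* associated error message data */
    const char *template_name;       /* template used to display the entry */
    const void *user_data;           /* optional user data, may be NULL */
    size_t user_data_len;
};

/* What ends up in the (simulated) error log. */
struct sketch_log_entry {
    unsigned long probe_id;
    enum sketch_severity severity;
    char message_text[128];
    char template_name[32];
    size_t user_data_len;
    time_t logged_at;                /* added by the logger, not the caller */
};

/* Stand-in for the probe call: builds a log entry from the request and
   returns 0 so that the caller keeps control after reporting the error. */
int sketch_probe(const struct sketch_probe_request *req,
                 struct sketch_log_entry *entry)
{
    assert(req != NULL && entry != NULL);
    memset(entry, 0, sizeof *entry);
    entry->probe_id = req->probe_id;
    entry->severity = req->severity;
    snprintf(entry->message_text, sizeof entry->message_text, "%s",
             req->message_text ? req->message_text : "");
    snprintf(entry->template_name, sizeof entry->template_name, "%s",
             req->template_name ? req->template_name : "");
    entry->user_data_len = req->user_data ? req->user_data_len : 0;
    entry->logged_at = time(NULL);
    return 0;
}
```

In this sketch the time stamp is filled in by the logger, which mirrors the statement that the system records it automatically.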

FFST Flow

Figure: FFST Flow.

The FFST Flow figure shows the sequence of events by which FFST logs errors and captures data when your code calls FFSTProbe.

After you develop and install your code on the system, the application program box that is shown in the diagram above signifies your code. The FFSTProbe API is called when your code discovers a problem. If you specify to capture user data in the call to FFSTProbe, the system captures the data with the other error-related information.

Your code calls FFSTProbe to gather the following product information:

  • dump information
  • error message information
  • other error-related data

The system records the data in the error log entry. If your code has entries in the Probe Control Table, FFST uses the entry values instead of the parameters that are specified on the calls to FFSTProbe, and uses these configuration values to create the error log entry.
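The override rule can be sketched as a simple lookup: if a table entry exists for the probe, its values win; otherwise the values from the call are used. This is only an illustration. The structure names, the three capture flags, and the idea of keying the table by a numeric probe ID are assumptions made for the example, not the real PCT layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option set; the real PCT entry holds more than this. */
struct sketch_probe_options {
    int capture_dump;
    int capture_trace;
    int capture_process_dump;
};

/* Hypothetical table entry, keyed by probe ID for the sake of the example. */
struct sketch_pct_entry {
    unsigned long probe_id;
    struct sketch_probe_options options;
};

/* Returns the table entry's options when one exists for probe_id,
   otherwise the options that were passed on the call. */
struct sketch_probe_options
sketch_effective_options(unsigned long probe_id,
                         struct sketch_probe_options from_call,
                         const struct sketch_pct_entry *table,
                         size_t count)
{
    size_t i;

    assert(table != NULL || count == 0);
    for (i = 0; i < count; ++i) {
        if (table[i].probe_id == probe_id) {
            return table[i].options;   /* table entry replaces call values */
        }
    }
    return from_call;                  /* no entry: the call's values stand */
}
```

The point of the design is that data collection for an installed product can be changed without recompiling the code that calls the probe.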

After FFST gathers the error-related information, it stores the data as an error log entry. The system stores FFST dump information in a file named FFxxxxxx.DMP, where xxxxxx signifies a six-digit identifier. If a trace snapshot is requested, a file named FFxxxxxx.TRC will be created. If a process dump is requested, a file named FFxxxxxx.PRC will be created. The error log information contains the name of the FFST dump file along with the trace file if applicable.
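The naming pattern above can be illustrated with a small helper that builds the three file names from one identifier. This is a sketch only: the function and enum names are hypothetical, and it assumes the six-digit identifier is a zero-padded decimal number, which the text does not state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for the three kinds of captured file. */
enum sketch_capture_kind { SKETCH_DUMP, SKETCH_TRACE, SKETCH_PROCESS };

/* Writes "FFxxxxxx.DMP", "FFxxxxxx.TRC", or "FFxxxxxx.PRC" into buf.
   Assumption: xxxxxx is the identifier as six zero-padded decimal digits.
   Returns 0 on success, -1 if the id, kind, or buffer size is unusable. */
int sketch_capture_file_name(unsigned long id, enum sketch_capture_kind kind,
                             char *buf, size_t buflen)
{
    static const char *const ext[] = { "DMP", "TRC", "PRC" };
    int n;

    assert(buf != NULL);
    if (id > 999999UL) {
        return -1;                    /* does not fit in six digits */
    }
    if (kind < SKETCH_DUMP || kind > SKETCH_PROCESS) {
        return -1;
    }
    n = snprintf(buf, buflen, "FF%06lu.%s", id, ext[kind]);
    /* 2 + 6 + 1 + 3 = 12 characters, plus the terminating NUL. */
    return (n == 12 && (size_t)n < buflen) ? 0 : -1;
}
```

Because the dump, trace, and process dump files share the same identifier in this sketch, a log entry that names one of them makes the others easy to locate.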

Use the SYSLOG utility to view the error log information. SYSLOG uses message files and template files to format and display error log records. Use SYSLOG to control the following log functions:

  • specify which error log file to use.
  • suspend or resume error logging.
  • change the size of the error log.

Steps for Instrumenting for FFST

The steps for instrumenting your code are as follows:

  • Plan for instrumenting your code.
  • Code the calls to the FFSTProbe API.
  • Compile the code.
  • Create the error record template file.
  • Create message files.
  • Create DMI MIF files.

The remainder of this chapter describes each step.

Planning for Instrumenting Your Code

There are several things to consider before you begin putting calls to the FFSTProbe API in your code. This section describes the following considerations and steps:

  • Define and ensure existence of Vital Product Data (VPD). VPD is the description of your code to the system. The system uses VPD to identify the product that is reporting a problem.
  • Decide how and where you should code your calls to the FFSTProbe function.
  • Decide what data you want the function to collect for code failures.

Defining Vital Product Data (VPD)

The DMI facility provides a standard way to register the hardware and software on the system. This allows both system software and system-based software (for example, application programs or device drivers) to register with the system. This information is called Vital Product Data (VPD). The system uses VPD to identify the source of error log entries. Various system management applications require access to the VPD information. When a product component uses FFSTProbe to log an error, the error logging function automatically includes the VPD information in the error record.

The VPD information allows systems management applications to assume a base level of VPD for all conforming products on a system. The VPD information for software products differs from the VPD information for hardware products. You can define additional specialized VPD information for your product.

Both your component's install object (that the feature installer uses to install your product) and the FFSTProbe parameter information must have identical information. This enables DMI to provide the template file that is specified on the call to FFSTProbe. Recommendations for these values are:

  • Vendor - a description of the organization or company that developed the product that is reporting an error (example: IBM).
  • Tag - a unique description of the product that is reporting an error (example: FFSTProbe SAMPLE).
  • Revision - optional description of the development organization's revision level, change level, or version of the product that is reporting an error. (example: 1.0.1c). If a component within a product is reporting the error, the revision may not correspond to the revision level of the entire product.

Programmers refer to the Vendor, Tag, and Revision values as the DMI triplet.

When your code calls FFSTProbe, the DMI triplet values you specified must match the DMI values stored in the DMI database for your product. If the values do not match, FFSTProbe cannot find the VPD for your product in the DMI database.

If the DMI triplet for your product matches the DMI triplet of a different product, the results are unpredictable.
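The matching requirement can be sketched as a field-by-field comparison of two triplets. The structure and function names below are hypothetical, and the sketch assumes an exact, case-sensitive string comparison with an absent Revision represented as NULL; the real lookup in the DMI database may apply different rules.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical representation of the Vendor, Tag, Revision triplet. */
struct sketch_dmi_triplet {
    const char *vendor;
    const char *tag;
    const char *revision;   /* optional: NULL when not supplied */
};

/* Two fields match when both are absent or both hold identical text. */
static int sketch_same_text(const char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        return a == b;
    }
    return strcmp(a, b) == 0;
}

/* Returns 1 when every field of the triplet passed on the call matches
   the registered triplet, 0 otherwise. */
int sketch_triplet_matches(const struct sketch_dmi_triplet *from_call,
                           const struct sketch_dmi_triplet *registered)
{
    assert(from_call != NULL && registered != NULL);
    return sketch_same_text(from_call->vendor, registered->vendor)
        && sketch_same_text(from_call->tag, registered->tag)
        && sketch_same_text(from_call->revision, registered->revision);
}
```

A practical way to avoid mismatches is to define the three strings once and use the same definitions for both the install object and the probe calls.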

Deciding How and Where to Place Calls to FFSTProbe API

There are two common approaches to instrumentation. The first is to place numerous calls to FFSTProbe throughout your product to get broad coverage. With this approach, each call might capture only a minimum amount of error data, because you know that a probe would capture every error. The second approach is to use just a few strategically placed calls that capture greater amounts of error data to better isolate the exact cause of the failure.

The advantage to the broad coverage approach is that errors are most likely to be identified because of the greater number of calls to FFSTProbe. The strategic approach usually involves instrumenting existing exception paths or thoroughly understanding the code to identify where to place the call to FFSTProbe.

You might consider combining both approaches in your code. The broad coverage aspect identifies exactly where the error occurred, and the strategic aspect identifies the cause.

You should use FFSTProbe only to detect problems that would require a program fix or a modification to user operation procedures.

Places to Instrument

The following list contains situations and places in your code you should consider for instrumentation:

  • When your code generates an error return, create an error log entry for the error condition that caused the error.
  • Some programmers use Print Debug and Print File for testing code. These instructions print certain variables and messages at various code failure points. Convert the Print Debug and Print File instructions to calls to the FFSTProbe API. The system disables the Print Debug and Print File functions after you install your code.
  • When you expect return codes, create an error log entry when you receive unexpected return codes.
  • In environment situations (circumstances that are not necessarily program errors but are worthy of creating an error log entry). For example, resource shortages, time-out conditions, system-hang conditions, or lost physical connections.
  • In cleanup functions, your code may be tolerant of potential errors and may do some error recovery. The cleanup functions in your code are candidates for logging if the recovery signifies an important event.

Keep in mind that the number and size of the entries you log could cause too much information to be logged. Creating excessive error log entries can cause the error log to wrap, which overwrites previously logged information. One of the most frequently asked questions about FFST is where and when to use it. When instrumenting your product, consider the following places:

Exception Paths
Many programmers already take some actions in various exception conditions. These actions often include cleaning up execution environments, closing files, and ending the program. Your code should call the FFSTProbe API to create an error log entry that contains the following information:

  • the program or module that failed
  • why the failure occurred
  • what corrective actions to take.

Incorrect Conditionals (for example, switch case)
As developers write programs, they make assumptions of what can or cannot happen, and add various conditionals and execution blocks to programs.
Conditionals that are not valid are ideal candidates for a call to the FFSTProbe API to log these failures. By calling the FFSTProbe function at these points, you can quickly and accurately pinpoint the failure and capture the associated data at the time of failure.

External Calls
OS/2 Warp Version 4 does not expect calls to external programs to fail. However, unexpected return codes, when not handled, can result in program failure. Your code should call FFSTProbe after each external call that results in an unexpected return code.

Some development groups have someone other than the developer instrument all calls. Other groups spend more time anticipating the potential problem areas and placing probes only in those areas. Your code should do what is achievable for the current circumstances. You should then evaluate how well your calls to FFSTProbe work before you begin the next development cycle.
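As an illustration of instrumenting a conditional that should never be reached, the sketch below places a probe in the default branch of a switch. The `sketch_probe` function is a hypothetical stand-in that only records the last probe ID; it is not the FFSTProbe signature, and the state names and probe ID are invented for the example.

```c
#include <assert.h>
#include <string.h>

enum sketch_state { SKETCH_IDLE, SKETCH_RUNNING, SKETCH_DONE };

/* Records the most recent probe so the effect can be observed. */
unsigned long sketch_last_probe_id = 0;
int sketch_last_bad_value = 0;

/* Hypothetical stand-in for a probe call; a real call would create an
   error log entry with the unexpected value as user data. */
static void sketch_probe(unsigned long probe_id, int bad_value)
{
    sketch_last_probe_id = probe_id;
    sketch_last_bad_value = bad_value;
}

/* Maps a state to a printable name. The default branch is the
   "cannot happen" case, so it is the place to report the failure. */
const char *sketch_state_name(int state)
{
    switch (state) {
    case SKETCH_IDLE:
        return "idle";
    case SKETCH_RUNNING:
        return "running";
    case SKETCH_DONE:
        return "done";
    default:
        sketch_probe(4001UL, state);   /* unique ID for this call site */
        return "unknown";
    }
}
```

The same shape applies to external calls: compare the return code against the values you expect, and probe in the branch that handles everything else.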

For FFSTProbe API calls to be useful in debugging a problem, the calls must specify:

  • A unique probe ID
  • Descriptive text explaining the problem
  • Data that is relevant to the failure
  • Instructions on how to resolve the problem if appropriate.

Well-instrumented code brings several benefits. You can use FFST to capture error information. You can also identify areas in your code that did not cause the problem: if instrumented code made no calls to FFSTProbe, you can focus on the code that contains no calls to FFSTProbe.

Your code should not call the FFSTProbe function inside a loop. Call FFSTProbe only once per error situation. Repeated calls may cause system performance problems and cause wrapping of FFST data by storing unnecessary data.
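One way to follow this advice is to collect a summary inside the loop and make a single probe call after it. The sketch below assumes a hypothetical `sketch_probe` stand-in that only counts how often it is called; the record format, the validity rule (negative values are bad), and the probe ID are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Counts calls to the stand-in so the behavior can be checked. */
unsigned long sketch_probe_calls = 0;

/* Hypothetical stand-in for a probe call. A real call would log the
   failure count and the index of the first bad record as user data. */
void sketch_probe(unsigned long probe_id, size_t failures, size_t first_bad)
{
    (void)probe_id;
    (void)failures;
    (void)first_bad;
    sketch_probe_calls++;
}

/* Scans all records and reports the error situation once, after the
   loop, instead of once per bad record. Returns the number of bad
   records found. */
size_t sketch_validate_records(const int *records, size_t n)
{
    size_t i;
    size_t failures = 0;
    size_t first_bad = 0;

    assert(records != NULL || n == 0);
    for (i = 0; i < n; ++i) {
        if (records[i] < 0) {
            if (failures == 0) {
                first_bad = i;         /* remember where it started */
            }
            failures++;
        }
    }
    if (failures > 0) {
        sketch_probe(2001UL, failures, first_bad);   /* one call only */
    }
    return failures;
}
```

With three bad records in the input, this produces one log entry rather than three, which keeps the error log from filling with near-identical entries.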

Problem-prone components in products are good candidates for the FFSTProbe function.

Your decision about using instrumentation depends on the possible errors and the cost of solving an error.

Deciding What Data You Want to Collect

After you decide where to call the FFSTProbe function, you need to decide what data to capture. The question to ask is, "What data would I need to see to have a good chance of determining the source of the error?" Consider capturing data items that are global variables and control blocks.

Other things to consider are:

  • How much data do you need to determine the cause of the problem?
  • Has this problem been encountered before?
  • How complex is the code?
  • Are other components or products being called?
  • Does the error message information point to the problem?

Make every effort to collect enough data to solve the problem without requiring the user to re-create the problem.

The amount of data collected could also be affected by the amount of system storage that is available or allocated to store error data.

Error Types to Consider

When an error occurs, call the FFSTProbe function to log the error. The following examples describe several error types you should consider when you instrument your code and the types of data to capture for the error:

Error return
Determine the severity of an error so that you call the FFSTProbe function only when the error return indicates a serious problem. When a calling program has a significant failure that causes an error return, the program calls FFSTProbe. Be careful not to cause a "cascade" of calls to FFSTProbe as the system passes error returns back up through a set of higher level function calls.
Failure-related data may include:

  • Return code
  • Input parameters to the function
  • Returned values from the function
  • Any internal variables that determine or affect the erroneous results

Damaged data structures
Product data structures can become damaged with data that is not valid. To capture data for this type of error during normal processing, your code could have a method for periodically checking important internal data structures. Such logic is an important step toward improving the reliability and availability of the product.
Failure-related data may include:

  • Data structures
  • Historical information that indicates when your code found the product data structure to be correct
  • General system data showing other programs in use by the system when the error occurred.

Time-outs and detected hangs
To detect time-outs and hangs, design your code to sense how long a given request should take.
Failure-related data may include:

  • Current time-out values
  • Any state information that describes what the timed-out function is currently doing
  • States of resources that may relate to the time-out or hang
  • Historical information that describes what the timed-out function had been doing before the error occurred.

Slow performance of a service
To detect a slow performance condition, design your code to sense how long a given service should take before calling FFSTProbe.
Failure-related data may include:

  • Internal resource states that may relate to the slow performance of the service
  • Historical information that indicates who has been using that service and what requests the user made of the service.

Traps
It is difficult to detect a failure within a product and determine its cause after an exception management routine has received control. Well-designed and instrumented code can detect a failure before exception management routines get control.
Failure-related data may include:

  • Exception blocks that contain the hardware state when the trap occurred
  • Context data that indicates what was being run when the trap occurred (for example, call stacks or internal state variables)
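The warning about a "cascade" of calls under Error return can be addressed in several ways. One possible approach, sketched below, is to carry a "logged" flag along with the error so that only the level that first detects the failure reports it. All names here are hypothetical, the three-level call chain is invented for the example, and the probe is replaced by a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Error state passed up the call chain; 'logged' marks that some
   lower level has already reported the failure. */
struct sketch_error {
    int code;
    int logged;
};

/* Counts simulated probe calls so the effect can be checked. */
unsigned long sketch_probe_count = 0;

/* Reports the error unless it is absent or was already reported. */
static void sketch_log_once(struct sketch_error *err, unsigned long probe_id)
{
    (void)probe_id;                    /* a real call would pass this on */
    if (err->code != 0 && !err->logged) {
        sketch_probe_count++;
        err->logged = 1;
    }
}

/* Lowest level: detects the failure and is the first to report it. */
int sketch_read_block(int simulate_failure, struct sketch_error *err)
{
    assert(err != NULL);
    err->code = simulate_failure ? 5 : 0;
    err->logged = 0;
    sketch_log_once(err, 3001UL);
    return err->code;
}

/* Middle level: would report, but finds the error already logged. */
int sketch_load_file(int simulate_failure, struct sketch_error *err)
{
    int rc = sketch_read_block(simulate_failure, err);
    sketch_log_once(err, 3002UL);
    return rc;
}

/* Top level: same check, so the error return does not cascade. */
int sketch_open_document(int simulate_failure, struct sketch_error *err)
{
    int rc = sketch_load_file(simulate_failure, err);
    sketch_log_once(err, 3003UL);
    return rc;
}
```

Reporting at the lowest level keeps the entry close to the data that explains the failure, while the higher levels still see the return code and can recover or clean up.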

Ways to Collect Data

FFST takes care of storing the collected data in the error log and optional FFST dump. This information is available for viewing through use of the SYSLOG utility. For more information on error logs and the SYSLOG utility, see Viewing and Analyzing Error Log Entries.

User data could be data areas, control blocks, complete files, or any other form of data that could be used to determine the problem.

Two parameters on the FFSTProbe function allow user-specified data to be collected:

  • The pDmpUsrData parameter saves information in the FFST Dump. You can specify up to 30 items, each having a maximum size of 32 KB. For example, you use this parameter to save large control structures or buffers.
  • The LogUsrData parameter saves user data in the error log. The maximum amount of logged data is 2 KB. You need to determine what data you want as part of the 2 KB. For example, you use this parameter to save small items such as return codes, function names, or system names.
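The limits above can be checked before a call is made. The following sketch uses hypothetical names throughout (it does not use the real dump or log data structures) and assumes that 1 KB means 1024 bytes and that the stated maximums are inclusive.

```c
#include <assert.h>
#include <stddef.h>

/* Limits taken from the text; 1 KB is assumed to be 1024 bytes. */
#define SKETCH_MAX_DUMP_ITEMS     30u
#define SKETCH_MAX_DUMP_ITEM_SIZE (32u * 1024u)
#define SKETCH_MAX_LOG_DATA_SIZE  (2u * 1024u)

/* Hypothetical description of one data area destined for the dump. */
struct sketch_dump_item {
    const void *addr;
    size_t len;
};

/* Returns 1 when the dump items and the error log user data both stay
   within the documented limits, 0 otherwise. */
int sketch_user_data_fits(const struct sketch_dump_item *items,
                          size_t item_count, size_t log_data_len)
{
    size_t i;

    assert(items != NULL || item_count == 0);
    if (item_count > SKETCH_MAX_DUMP_ITEMS) {
        return 0;                      /* more than 30 dump items */
    }
    if (log_data_len > SKETCH_MAX_LOG_DATA_SIZE) {
        return 0;                      /* more than 2 KB of log data */
    }
    for (i = 0; i < item_count; ++i) {
        if (items[i].len > SKETCH_MAX_DUMP_ITEM_SIZE) {
            return 0;                  /* a single item exceeds 32 KB */
        }
    }
    return 1;
}
```

A check like this makes the split explicit: small identifying values go into the error log, and anything large goes into the dump.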

FFST Dump Data

The system creates FFST dumps when you use the pDumpUserData parameter with the FFSTProbe function. The Enable FFST Dump option on the FFST Probe Control Table Entry Summary window must be selected before the system will create an FFST dump (see Probe Control Table (PCT) Entry Add or Change Summary Window).

If you did not specify the parameter to collect the FFST dump in the original call to the FFSTProbe function, you can dynamically change the call. You use the Probe Control Table (PCT) to request the FFST dump the next time the specified call to FFSTProbe occurs. The system stores the FFST dump information in the file that is defined on the FFST Setup (FFSTCONF) window (see Using FFST Setup (FFSTCONF)). You can select the path but not the file name.

To delete FFST dumps, use the FFSTCONF command and select the Actions option. Then choose the Dumps option to display the FFST Dump File Summary window. Select the dump file to delete and click on the File menu bar option. Click on Delete to delete the dump file.

FFST dump data can be of various types. Some types may not be part of a dump, depending on the availability of data. The system displays the information when you format the dump. The various types are: process environment data, process status data, trace buffer data, user data, error log data, and process errors.

Process Environment Data

If you requested the process environment data, FFST collects and stores the data as part of the FFST dump. The system displays this information when you use the PM Dump Facility dump formatter.

You can specify to have the process environment data captured by selecting the Capture Process Environment checkbox on the FFST PCT Entry window (see Probe Control Table (PCT) Entry Add or Change Summary Window).

System Process Status Data

Process status data is a record of all processes and threads that are running on the system. This information is similar to information you get when you use the PSTAT command.

If you requested process status data, FFST collects and stores the data as part of the FFST dump. The system displays this information when you use the PM Dump Facility dump formatter.

You can specify to have the process status data captured by selecting the Capture System Processes checkbox on the FFST PCT Entry window (see Probe Control Table (PCT) Entry Add or Change Summary Window).

Trace Data

When you enable your code for trace, FFST collects and stores trace data. You can specify to have the trace snapshot captured by selecting the Capture Trace Snapshot checkbox on the FFST PCT Entry window (see Probe Control Table (PCT) Entry Add or Change Summary Window). The system stores trace information in a separate file.

You display trace information either by using the TRACEFMT command or by using the Display Trace File option in the SYSLOG Tools menu. For information about using the trace functions and formatter, see Analyzing Performance and Debugging Problems Using Trace.

User Storage Data

If you requested user storage data, FFST collects and stores the data as part of the dump. Using the function parameters, you can capture up to 30 data areas. The system displays this information when you use the PM Dump Facility dump formatter. For information about using the dump functions and formatter, see Capturing and Saving Failure-Related Information through Dumps.

Additional Error Log User Data

If you requested additional error log user data in the FFSTProbe call, FFST generates an FFST dump and stores the data as part of the dump. This information is identical to the error log entry that is stored in the error log. The system displays this information when you use the PM Dump Facility dump formatter. For information about using the error log functions and formatter, see Viewing and Analyzing Error Log Entries.

FFST Dump Process Errors

FFST collects and stores FFST dump-processing errors that occurred when the system creates the dump. The system includes error message identifiers in this information. The system displays this information when you use the PM Dump Facility dump formatter.

Process Dumps

FFST collects and stores a process dump only when the Capture process dump option is selected on the Probe Control Table (PCT) Entry Summary window. Refer to FFST Probe Control Table Entry Summary Window.

System Dumps

FFST collects and stores a system dump only when the Capture system dump option is selected on the Probe Control Table (PCT) Entry Summary window. Refer to FFST Probe Control Table Entry Summary Window.

Coding the FFSTProbe Functions

       Direct Calls
       Using Macros to Call the FFSTProbe Function
       Using Subroutines to Call the FFSTProbe Function 

Creating Template Files

       Why Template Files Are Important 

Creating Message Files

Setting Up (Instrumenting) for Trace

What Is Trace?

Creating a Trace File Entry Using the TraceCreateEntry Function

Defining Trace Information Format

       Creating Trace Entry Formatting Directives 

Creating DMI MIF Files

       How to Install Your Software Product 

FFST-Related Functions

Examples of Code when Instrumenting for FFST and Trace

       Example of an Application Program Using a Subroutine
       Creating an Error Record Template Input File
           Template File Tips 
       Message Input File Example
       MIF File Example