PDGuide - Introduction to Collecting and Managing Problem Determination Data

Problem Determination Programmer's Guide

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation

Ideally, programs run forever, error-free, and within their performance targets. However, code failures do occur. When this happens, programmers need problem analysis tools and techniques that help find problems as quickly as possible.

OS/2 Warp Version 4 provides the architecture, tools, and support to help you collect and manage problem determination data. The goal is to get your code back on track (that is, service the code) as quickly as possible.

This chapter introduces the problem analysis elements and techniques that are available to you. The remainder of the book provides details about these elements.

An Approach to Collecting and Managing Problem Determination Data

Although there is no right or wrong way to repair code problems, this book presents a three-step approach to collecting and managing problem determination data.

Step 1 - Check the Error Log and Data Areas

The first places to check when code fails are the error log and its associated data areas. OS/2 Warp has a technology called FFST (pronounced "fist"), which stands for First Failure Support Technology. When code that uses FFST fails, the system collects useful problem analysis information and stores it in the error log and in data areas. This can happen without user or programmer intervention. Error log data provides information such as the date and time an error occurred, the module in which it occurred, its severity, and a description of the error. You can use data areas to store user-specified information (data structures, for example) when code fails. The system saves the data area with the error log entry. Many code problems can be pinpointed by analyzing what the error log and data areas tell you. A utility called SYSLOG controls error logging and works with error log contents.

How does a programmer capture problem determination data? With FFST, a technique for capturing problem determination data at the time of a code failure. You use FFST by placing calls to the FFSTProbe API in strategic places in your code. Each time you call FFSTProbe for a problem, the system writes an entry to the error log. The system also writes entries to data areas if you have coded the FFSTProbe call to do so. The data in the error log entry includes information that identifies the product that encountered the problem. The system stores product information in a database that conforms to the Desktop Management Interface (DMI) standards.
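The idea of a probe that records a log entry plus an optional data area can be sketched in portable C. This is only an illustration of the concept: every name here (`probe`, `log_entry_t`, `error_log`, `open_config`) is hypothetical, the log is an in-memory array, and nothing below reflects the actual FFSTProbe signature or its parameter structures, which are described in the later chapters.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical names throughout: this is NOT the FFSTProbe API. */
typedef enum { SEV_INFO = 1, SEV_WARNING = 2, SEV_ERROR = 3 } severity_t;

typedef struct {
    time_t        when;             /* date and time of the failure      */
    char          module[32];       /* module that detected the failure  */
    severity_t    severity;
    char          description[80];
    unsigned char data_area[64];    /* caller-supplied data, if any      */
    size_t        data_len;
} log_entry_t;

#define LOG_CAPACITY 16
log_entry_t error_log[LOG_CAPACITY];
size_t      error_log_count = 0;

/* Record one entry. Returns 0 on success, -1 if the log is full. */
int probe(const char *module, severity_t sev, const char *desc,
          const void *data, size_t data_len)
{
    log_entry_t *e;

    if (error_log_count >= LOG_CAPACITY)
        return -1;

    e = &error_log[error_log_count];
    e->when = time(NULL);
    snprintf(e->module, sizeof e->module, "%s", module);
    e->severity = sev;
    snprintf(e->description, sizeof e->description, "%s", desc);

    /* Copy the optional data area, truncating to the space available. */
    e->data_len = 0;
    if (data != NULL && data_len > 0) {
        e->data_len = data_len < sizeof e->data_area
                          ? data_len : sizeof e->data_area;
        memcpy(e->data_area, data, e->data_len);
    }
    error_log_count++;
    return 0;
}

/* Example call site: the probe sits on the failure path only. */
int open_config(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        probe("CONFIG", SEV_ERROR, "cannot open configuration file",
              path, strlen(path) + 1);
        return -1;
    }
    fclose(f);
    return 0;
}
```

Calling `open_config` with a path that does not exist leaves one entry in `error_log` whose data area holds the path. Placing the probe only on the failure path is the design point: the normal path pays nothing, and the failure is captured the first time it happens.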

FFSTProbe is a powerful function and requires some planning and setup. You instrument your code by placing calls to FFSTProbe at specific points in your code, along with your instructions for the function. Guide to Instrumenting Your Code provides an introduction to this API and the steps for planning and using it.

You may decide later that you want to override the calls to FFSTProbe. A Probe Control Table, with a graphical utility, provides this function. Entries in the Probe Control Table override the coded probe functions. This dynamic override capability is very useful because it allows you to change what the probes do without having to recode and recompile your programs.
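The override behavior can be pictured as a table lookup that takes precedence over whatever the call site requested. The sketch below assumes a much simpler layout than the real Probe Control Table: `pct_entry_t`, `probe_actions_t`, and `effective_actions` are hypothetical names, and the three action flags are chosen only to mirror the log, data area, and system dump actions mentioned in this chapter.

```c
#include <stddef.h>

/* Hypothetical names; not the real Probe Control Table format. */
typedef struct {
    int write_log;     /* write an error log entry */
    int save_data;     /* save the data area       */
    int system_dump;   /* request a system dump    */
} probe_actions_t;

typedef struct {
    unsigned        probe_id;  /* which probe this entry overrides */
    probe_actions_t actions;   /* actions to use instead           */
} pct_entry_t;

/* Return the table's actions for probe_id when an entry exists;
 * otherwise return the actions coded at the call site. */
probe_actions_t effective_actions(const pct_entry_t *table, size_t n,
                                  unsigned probe_id,
                                  probe_actions_t coded)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (table[i].probe_id == probe_id)
            return table[i].actions;
    }
    return coded;
}
```

Because the table is consulted at run time, changing an entry changes what a probe does without touching or recompiling the program that contains it.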

Guide to Instrumenting Your Code provides detailed information about this first phase of problem analysis. It explains the data collected and provides guidance for planning and instrumenting your code. Capturing and Saving Failure-Related Information through Dumps provides more details about how to use the PM Dump Facility dump formatter to view the FFST dump. Viewing and Analyzing Error Log Entries provides more details about the error log table, what it contains, and how you can work with the log.

Step 2 - Use Trace Facilities

Step 1 requires no user or programmer intervention with failing code. Step 2 involves setting trace points and using trace data, and it does require intervention. The following functions are available in OS/2 Warp Version 4 to help you use trace effectively:

A function to insert trace points in your code
A command that describes the trace file to be used
A command that turns trace on and off
A utility called Trace Customization that lets you format entries in the trace file
A utility called Trace Formatter that displays the contents of the trace file

Traces allow you to see and follow the course of events in code that lead to a failure. You can use trace data:

To understand the order or determine the operating path of the code
To understand parameter data changes during processing
To examine inputs to functions and outputs from functions
To examine resulting return codes
To save intermediate data values
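As a rough picture of what trace points buy you, the following sketch records inputs and results at chosen points and lets a switch turn recording on and off. It does not use the OS/2 trace function or commands: `trace_point`, `trace_switch`, `trace_buf`, and `scale` are hypothetical names, and the trace "file" is just an in-memory buffer.

```c
#include <stdio.h>
#include <string.h>

#define TRACE_CAPACITY 32
#define TRACE_TEXT_LEN 64

/* Hypothetical names; not the OS/2 trace API. */
int  trace_enabled = 0;
char trace_buf[TRACE_CAPACITY][TRACE_TEXT_LEN];
int  trace_count = 0;

/* Turn tracing on (nonzero) or off (zero). */
void trace_switch(int on)
{
    trace_enabled = on;
}

/* Record one named value at a trace point. Does nothing when
 * tracing is off or the buffer is full. */
void trace_point(const char *where, const char *label, long value)
{
    if (!trace_enabled || trace_count >= TRACE_CAPACITY)
        return;
    snprintf(trace_buf[trace_count], TRACE_TEXT_LEN,
             "%s: %s=%ld", where, label, value);
    trace_count++;
}

/* Example function with a trace point on entry and before return. */
long scale(long input, long factor)
{
    long result;

    trace_point("scale", "input", input);
    result = input * factor;
    trace_point("scale", "result", result);
    return result;
}
```

With tracing off, `scale(2, 3)` records nothing. After `trace_switch(1)`, a call to `scale(4, 5)` leaves the entries `scale: input=4` and `scale: result=20` in the buffer, which shows both the path taken and the intermediate values.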

Analyzing Performance and Debugging Problems Using Trace contains more information about the Trace facilities.

Step 3 - System Dump

If steps 1 and 2 do not help you determine the cause of a code failure, step 3, system dump, is the recommended final step. This is the primary tool used by service personnel to solve system code problems and application code problems. Use a debugger for application problems. OS/2 Warp Version 4 has the technology to initiate a system dump when code fails. The dump that is stored on disk provides information about the reasons for the code failure. Treat this as the final step because your system restarts after it stores the dump.

This third step of problem analysis has three phases:

Configure: add the TRAPDUMP statement to the CONFIG.SYS file, select the Enable System Dump choice on the Probe Control Table, and allocate disk space for the system dump by using a disk partition. System dumps can also be stored on diskettes, but the number of diskettes required depends on the amount of main memory your system has.

Trigger: a system dump starts in one of these ways:
- by a keyboard sequence
- by a call to the DosForceSystemDump API
- by a Probe Control Table (PCT) entry that overrides the values specified in a call to FFSTProbe
- when an unhandled trap occurs

Format and view: examine the system dump data using the PM Dump Facility dump formatter.

Refer to Capturing and Saving Failure-Related Information through Dumps for more information about System Dumps and the PM Dump Facility.

Summary

The information in this book describes First Failure Support Technology and the supporting trace and dump facilities that are used during problem analysis. Note that trace utilities are not part of FFST. FFST provides the tools (functions, commands, and graphical interface utilities) that help you instrument and service your code.

Summary of Functions and Interfaces provides a summary of the functions, commands, utilities, and interfaces that make up FFST.

Although problem determination can be done whether or not your code is instrumented, any time taken to instrument code is well spent. By instrumenting your code, you will be able to take full advantage of the FFST, trace, and dump tools if code problems occur.