DBCS Enabling of Applications

From EDM2

by John Howard

In parts of Asia, computers use writing systems that are all descendants of the Chinese character writing system. These writing systems are encoded with a Double Byte Character Set (DBCS). DBCS is used only in Japan, Korea, the People's Republic of China (mainland), and the Republic of China (Taiwan), which together represent about half the population of Asia (about two billion people). The remainder of Asia uses a Single Byte Character Set (SBCS).

The Background of the Languages

The Chinese characters that form the basis for all of the DBCS writing systems are called Hanzi. Japanese uses some Chinese characters, called Kanji, as part of its system. Japanese also uses kana characters, a shorthand way to refer to the additional sets of Japanese characters called Hiragana and Katakana. Finally, Korean uses Chinese characters called Hanja. The Korean writing system also has a form of alphabet composed of Jamo. All these writing systems use both the Roman alphabet and Arabic numbers to some extent. The Chinese system additionally includes Chinese character numbers and a numbering system.

The characters used in DBCS are called ideographs, which represent a syllable or a word. Each DBCS language has thousands of ideographs. In order to enter these from a keyboard, a tool called an Input Method Editor is used. This tool allows the user to enter a representation of the character either phonetically or by writing strokes. The editor then presents the user with all of the ideographs that match the input. The user selects the ideograph that is intended. Japan uses kana to enter phonetic characters as the basis for describing potential candidate characters. In China, an editor is used that understands how characters are formed from strokes laid down in a particular sequence (called 5-stroke).

The Basics

Although the Asian languages are difficult to learn to read and write, you don't need any Asian language writing skills to enable your applications. The DBCS code pages include English characters and punctuation, in both single-byte and double-byte forms. Hence, you can use the double-byte English characters for most testing. These can be entered easily from an English keyboard using a simple toggle key.

The terms single-byte, double-byte, and code page, as well as the thousands of Chinese characters, might have you confused. However, a code page is nothing more than a mapping between a character (or glyph) and a binary value. The glyph is what you see on the monitor or printer, and the binary value is what programmers know as the character. Most code pages have 256 entries because they represent single-byte characters. To represent the large number of Asian characters, a combined single- and double-byte code page is used. In this code page, certain ranges of single-byte characters are not characters at all but first bytes of double-byte characters. The second byte of a double-byte character also has a range. The following table will help clarify:

DBCS-Enab-Fig-1.gif

Figure 1. Japanese combined code page

Figure 1 shows the valid ranges of first bytes of double-byte characters by looking down the column on the left. Where the column contains an s, the character is a single-byte character. Where the column contains a 1, the character is the first byte of a double-byte character. In the middle of the table, the boxes with 2s in them indicate the valid second byte ranges of double-byte characters. This table is valid for code page 942. (The ranges differ based on the code page.)

It is worth noting that 5C and 7E are special characters. They differ from the standard ASCII found in PC SBCS code pages. For Korean and Japanese code pages, 5C (usually the backslash) is replaced with the country's currency symbol (Won and Yen, respectively). The tilde is replaced with the overline.

Enabling Techniques

While reading and writing the languages are difficult, enabling programs to handle DBCS characters is not. This article gives you a few simple techniques that cover any situation you are likely to run into. The "difficult" part is that the work is tedious because you must do it in a variety of places. The good news is that the system provides you with a C runtime and system defaults that make the job easier.

What kinds of problems are you faced with? They boil down to three basic categories:

  • Handling keyboard input
  • Path and file name processing
  • Dialog box problems (in Presentation Manager programs)

Before discussing the first problem, let's take a look at data in a DBCS application. The first thing you notice is that the data stream can be mixed. That is, it contains both single- and double-byte characters. This mixed data stream can occur in any of the following:

  • Programming language literals
  • Programming language comments
  • Keyboard input
  • Name space
  • File data
  • Process data
  • Menus and messages
  • Display and printed output

Figure 2 shows the string IBM JAPAN using English with code page 437/850 and then Japanese with code page 932/942 encodings.

DBCS-Enab-Fig-2.gif

Figure 2.

You can see from Figure 2 that Japan is written as two DBCS characters (pronounced NI and HON) in code page 932/942.

OS/2 provides several APIs that make dealing with the mixed data stream a little easier. The APIs let an application be single source and single object. This is accomplished by adapting at run time to the code page and data stream architecture. The application makes a call to either DosGetDBCSEv or DosQueryDBCSEnv, depending on whether it is a 16-bit or 32-bit application, respectively. The API fills a buffer supplied by the caller. In Windows, the call is IsDBCSLeadByte, which returns either a TRUE or FALSE indicator. DOS programs can use INT 21H, AX=6507H. The code snippet in Figure 3 shows a 32-bit example.

#define INCL_DOSNLS              /* National Language Support values */
#include <os2.h>
#include <stdio.h>

int main(void)
{
    ULONG       Length;            /* Length of data area provided */
    COUNTRYCODE Structure;         /* Input data structure */
    UCHAR       MemoryBuffer[12];  /* DBCS environmental vector */
    APIRET      rc;                /* Return code */

    Length = 12;                   /* A length of 12 bytes is sufficient */
    Structure.country = 0;         /* Use the default system country */
    Structure.codepage = 0;        /* Return DBCS info for the */
                                   /* current process code page */

    rc = DosQueryDBCSEnv(Length, &Structure, MemoryBuffer);
    if (rc != 0) {
        printf("DosQueryDBCSEnv error: return code = %lu\n", rc);
        return 1;
    }
    return 0;
}

Figure 3. Obtaining a DBCS environment vector

The memory buffer will be filled with pairs of bytes. Each pair gives the low and high ends of one range of valid first bytes, and a pair of zeros ends the list. These pairs take the form shown in Table 1. Notice how the results for Japanese in the table correlate with the example in Figure 1.

Code Page     Environmental Vector   DBCS First Byte Code Range
Chinese       0x81FC0000             81-FC
Japanese      0x819FE0FC0000         81-9F, E0-FC
Korean        0x81FE0000             81-FE
English       0x0000
German etc.   0x0000

Table 1.

To adapt at run time, code examining a text string simply calls the API and then proceeds to process each character by examining the byte value and determining if the byte is a valid first byte of a double-byte character. If not, it goes to the next byte. For double-byte characters, it skips the next byte because it is part of the DBCS character. You must follow this process for every character.
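The scan described above can be sketched in portable C. This is a minimal illustration, not the article's code: the function names are hypothetical, and the Japanese vector from Table 1 is hard-coded so the sketch runs anywhere. On OS/2 the vector would be the buffer filled in by DosQueryDBCSEnv.

```c
#include <stddef.h>

/* Assumption: Japanese vector copied from Table 1. On OS/2 this would be
   the buffer returned by DosQueryDBCSEnv instead of a fixed array. */
const unsigned char japanese_vector[] = { 0x81, 0x9F, 0xE0, 0xFC, 0x00, 0x00 };

/* Vector of an SBCS code page: just the terminating pair of zeros. */
const unsigned char english_vector[] = { 0x00, 0x00 };

/* Returns 1 if b lies inside any first-byte range of the vector.
   The vector is a list of (low, high) byte pairs ended by a 0,0 pair. */
int is_dbcs_lead_byte(const unsigned char *vector, unsigned char b)
{
    size_t i;
    for (i = 0; vector[i] != 0 || vector[i + 1] != 0; i += 2) {
        if (b >= vector[i] && b <= vector[i + 1])
            return 1;
    }
    return 0;
}

/* Counts characters (not bytes) in a NUL-terminated mixed string. */
size_t count_characters(const unsigned char *vector, const unsigned char *text)
{
    size_t count = 0;
    while (*text != '\0') {
        if (is_dbcs_lead_byte(vector, *text) && text[1] != '\0')
            text += 2;   /* first byte plus its second byte */
        else
            text += 1;   /* single-byte character */
        count++;
    }
    return count;
}
```

Because the same code runs against whatever vector is supplied, a single binary behaves as a plain byte counter under an SBCS code page and as a character counter under a DBCS one.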

Handling Keyboard Input

Keyboard input is a problem in DBCS because the data stream contains a mix of single- and double-byte characters. Care must be taken when processing mixed data streams. Figure 4 provides a comprehensive example of all text processing problems. Each problem is described below the figure starting from the top moving left to right and then going to the bottom and working left to right. Refer to the figure while reading the explanations.

DBCS-Enab-Fig-4.gif

Figure 4.

Explanations

  1. Always use the DBCS environmental vector to determine what you are looking at. It could be either a single character or the first or second byte of a double-byte character.
  2. When you replace a double-byte character with a single-byte character in a line of text, you are effectively performing a delete. Every character to the end of the line must be moved to the left.
  3. When you replace a single-byte character with a double-byte character in a line of text, you are effectively performing an insert. Every character to the end of the line must be moved to the right. Care must be taken that they will fit.
  4. Do not insert a character between the first byte and second byte of a double-byte character. The results could be disastrous.
  5. Be sure that there is space to insert a double-byte character into the line. It takes two bytes instead of one.
  6. When examining text, always scan from the beginning of the field. This is the only safe way to determine what the character you are looking at might be.
  7. Never attempt to case convert a string that has double-byte characters in it. The second bytes of double-byte characters have values that are valid as single-byte characters.
  8. Remember, a number of second bytes of double-byte characters are valid as single-byte characters.
  9. When the need arises to back up a pointer in a text string, the safe thing to do is go back to the beginning and re-examine the string. Looking at the previous byte cannot tell you whether you are looking at the second byte of a double-byte character or a single-byte character.
  10. Never truncate the second byte of a double-byte character. It could combine with the next character and look okay even though it is definitely wrong. It will be a different double-byte character.
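Rules 1, 6, and 9 all come down to the same scan from the start of the field. The following sketch shows one possible shape for it. The function names and the enum are hypothetical, and the vector is the Japanese one from Table 1, hard-coded here as an assumption in place of a DosQueryDBCSEnv call.

```c
#include <stddef.h>

enum byte_kind { BYTE_SINGLE, BYTE_LEAD, BYTE_TRAIL };

/* Assumption: Japanese vector copied from Table 1. */
const unsigned char japanese_vector[] = { 0x81, 0x9F, 0xE0, 0xFC, 0x00, 0x00 };

/* Returns 1 if b lies inside any first-byte range of the vector. */
static int is_dbcs_lead_byte(const unsigned char *vector, unsigned char b)
{
    size_t i;
    for (i = 0; vector[i] != 0 || vector[i + 1] != 0; i += 2) {
        if (b >= vector[i] && b <= vector[i + 1])
            return 1;
    }
    return 0;
}

/* Classifies text[offset] by scanning from the beginning of the field. */
enum byte_kind classify_byte(const unsigned char *vector,
                             const unsigned char *text, size_t offset)
{
    size_t i = 0;
    while (text[i] != '\0') {
        if (is_dbcs_lead_byte(vector, text[i]) && text[i + 1] != '\0') {
            if (i == offset)
                return BYTE_LEAD;
            if (i + 1 == offset)
                return BYTE_TRAIL;
            i += 2;
        } else {
            if (i == offset)
                return BYTE_SINGLE;
            i += 1;
        }
    }
    return BYTE_SINGLE;  /* offset is at or past the terminator */
}

/* Backs up one character: returns the offset of the last character that
   starts before `offset`, or 0 if there is none. */
size_t previous_char_offset(const unsigned char *vector,
                            const unsigned char *text, size_t offset)
{
    size_t i = 0, prev = 0;
    while (i < offset && text[i] != '\0') {
        prev = i;
        if (is_dbcs_lead_byte(vector, text[i]) && text[i + 1] != '\0')
            i += 2;
        else
            i += 1;
    }
    return prev;
}
```

With the bytes 41 93 FA 96 7B 42, the byte FA is a second byte even though its value also falls in the first-byte range E0-FC, and 7B is a second byte even though it is a valid single-byte character. Only the scan from the start tells them apart.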

Path and File Name Processing

In addition to all of the problems that occur for text strings, path and file name processing also have these problems:

First, the DBCS code pages have both a single-byte and a double-byte space character. You must honor both as white space characters on the command line. The code point of the double-byte space changes by country.
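As a rough sketch of honoring both kinds of space, the function below skips leading white space and takes the two bytes of the double-byte space as parameters, since the code point differs by country. The function name is hypothetical. The value 0x8140 used in the usage note is an assumption for the Japanese code pages and should be checked against the code page you target.

```c
#include <stddef.h>

/* Skips leading single-byte blanks and tabs as well as the double-byte
   space whose two bytes are passed in by the caller. Scanning starts at
   the beginning of the field, so every position examined is a character
   boundary. space_lead must be nonzero. */
const unsigned char *skip_white_space(const unsigned char *text,
                                      unsigned char space_lead,
                                      unsigned char space_trail)
{
    for (;;) {
        if (*text == ' ' || *text == '\t')
            text += 1;                       /* single-byte white space */
        else if (text[0] == space_lead && text[1] == space_trail)
            text += 2;                       /* double-byte space */
        else
            return text;                     /* first non-blank, or NUL */
    }
}
```

For a Japanese command line, a call might look like skip_white_space(args, 0x81, 0x40), under the assumption stated above.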

Drive names are not changed in a DBCS system. However, pathname processing must be done carefully. The second byte of a double-byte character can legitimately look like a backslash (\). Hence, you cannot use any of the normal C library string scanning functions to locate a backslash. You must go through one character at a time and make the determination. If your path processing normally backs up to reanalyze the string, this will not work successfully on DBCS strings. Again, that's because the second byte of a DBCS character could have the hexadecimal value of a backslash but not be a backslash character.
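A forward scan that locates the last real backslash might look like the sketch below. The function names are hypothetical, and the Japanese vector from Table 1 is hard-coded as an assumption. On OS/2 it would come from DosQueryDBCSEnv.

```c
#include <stddef.h>

/* Assumption: Japanese vector copied from Table 1. */
const unsigned char japanese_vector[] = { 0x81, 0x9F, 0xE0, 0xFC, 0x00, 0x00 };

/* Returns 1 if b lies inside any first-byte range of the vector. */
static int is_dbcs_lead_byte(const unsigned char *vector, unsigned char b)
{
    size_t i;
    for (i = 0; vector[i] != 0 || vector[i + 1] != 0; i += 2) {
        if (b >= vector[i] && b <= vector[i + 1])
            return 1;
    }
    return 0;
}

/* Returns a pointer to the last single-byte backslash (0x5C) in path,
   or NULL if there is none. A 0x5C that is the second byte of a
   double-byte character is skipped together with its first byte. */
const unsigned char *find_last_backslash(const unsigned char *vector,
                                         const unsigned char *path)
{
    const unsigned char *last = NULL;
    while (*path != '\0') {
        if (is_dbcs_lead_byte(vector, *path) && path[1] != '\0') {
            path += 2;               /* whole double-byte character */
        } else {
            if (*path == 0x5C)
                last = path;         /* a genuine path separator */
            path += 1;
        }
    }
    return last;
}
```

For the bytes of C:\DIR\ followed by 83 5C and .TXT, this returns the separator after DIR, whereas a byte-wise search such as strrchr would stop on the 5C inside the double-byte character.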

File system names can have a maximum of 8 bytes in FAT file systems, and these bytes can be from mixed strings. File names must be truncated to fit within the 8 bytes. As a result, a DBCS file name can have a maximum of 4 DBCS characters. File extensions have a maximum size of 3 bytes in FAT file systems. Therefore, they can contain only 1 DBCS character. To help you, follow these truncation and fix up rules:

  • 8-byte file names and 3-byte extensions
    • If the eighth (third) byte of the file name is SBCS, truncate to 8 (3).
    • If the eighth (third) byte is the second byte of a DBCS character, truncate to 8 (3).
    • If the eighth (third) byte is the first byte of a DBCS character, truncate to 7 (2) bytes.
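The three truncation rules collapse into one scan that never keeps half of a double-byte character. The sketch below is one way to write it. The function names are hypothetical, and the Japanese vector from Table 1 is hard-coded as an assumption.

```c
#include <stddef.h>

/* Assumption: Japanese vector copied from Table 1. */
const unsigned char japanese_vector[] = { 0x81, 0x9F, 0xE0, 0xFC, 0x00, 0x00 };

/* Returns 1 if b lies inside any first-byte range of the vector. */
static int is_dbcs_lead_byte(const unsigned char *vector, unsigned char b)
{
    size_t i;
    for (i = 0; vector[i] != 0 || vector[i + 1] != 0; i += 2) {
        if (b >= vector[i] && b <= vector[i + 1])
            return 1;
    }
    return 0;
}

/* Returns how many bytes of name to keep so that the result is at most
   limit bytes long and does not end on the first byte of a double-byte
   character. Use limit 8 for a FAT file name and 3 for an extension. */
size_t dbcs_safe_length(const unsigned char *vector,
                        const unsigned char *name, size_t limit)
{
    size_t i = 0;
    while (name[i] != '\0') {
        size_t width =
            (is_dbcs_lead_byte(vector, name[i]) && name[i + 1] != '\0') ? 2 : 1;
        if (i + width > limit)
            break;                   /* next character would not fit */
        i += width;
    }
    return i;
}
```

When the eighth byte is a single-byte character or a second byte, the function returns 8. When the eighth byte is a first byte, the character would straddle the limit and the function returns 7.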

The LAN server name space also can contain mixed strings. These occur in path names, domain names, computer names, user names, net names, aliases, messaging, and application IDs. Use the rules listed above.

Dialog Box Problems in PM

It's not possible to describe a specific solution in this section. However, you should be aware of the problems. OS/2 uses dialog boxes in PM. Dialog boxes are easy to use and a handy way to present information, and they are measured in dialog units. The purpose of dialog units is to retain the same relative dimensions regardless of the display resolution. However, the dialog units are also related to system font size. Here are some examples:

Display       Screen Size   System Font Size   Point Size   Screen Size in Dialog Units
VGA (SBCS)    640 x 480     6 x 16             10           426 x 240
VGA (DBCS)    640 x 480     8 x 18             12           320 x 213
SVGA (DBCS)   1024 x 768    11 x 24            12           378 x 256

Table 2.

If you examine what happens when you change from VGA to SVGA, you'll see that the screen size in dialog units changes. As a result the following things change: the dialog box size, the dialog box placement, and the amount of text.
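The relationship between font size and dialog units can be illustrated with a little arithmetic. The sketch below assumes the usual convention that a horizontal dialog unit is one quarter of the system font's character width and a vertical dialog unit is one eighth of its height. The article does not state this formula, and the struct and function names are hypothetical. Under that assumption the code reproduces both VGA rows of Table 2 and the SVGA height. The SVGA width listed in the table (378) does not follow from an 11-pixel character width, so treat the formula as approximate.

```c
struct dialog_size {
    long width;    /* screen width in horizontal dialog units */
    long height;   /* screen height in vertical dialog units */
};

/* Converts a screen size in pixels to dialog units for a system font of
   font_width x font_height pixels. Assumes 4 horizontal units per
   character width and 8 vertical units per character height. */
struct dialog_size screen_in_dialog_units(long pixel_width, long pixel_height,
                                          long font_width, long font_height)
{
    struct dialog_size result;
    result.width = pixel_width * 4 / font_width;
    result.height = pixel_height * 8 / font_height;
    return result;
}
```

A dialog box laid out for the 426 x 240 space of an SBCS VGA screen therefore has noticeably less room on a DBCS VGA screen, where the larger system font leaves only 320 x 213 units.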

This means that a box designed for VGA will be larger when displayed in SVGA. It will extend to the right and below the placement of the same box in VGA. If you design the dialog box too big, then a switch of video resolution might cause the box to not fit horizontally or vertically on the screen. This size change might cause the box to overlap or cascade over other dialog boxes. Care must be taken in placing boxes too close to the right edge of the screen or the bottom of the screen.

If the design of the dialog box is too small, then a switch of video resolution might cause larger text sizes to not fit within the box. This might cause DBCS text to overlap other objects. Vertical scroll bars may alleviate some of these problems. Even if the size of the box is sufficient for the text, the box's position might cause it to cascade over other boxes. Care should also be taken to select a font that supports DBCS characters.

Finally, the dialog box must be enabled for DBCS input. This is done by setting the FCF_DBE_APPSTAT flag in dialog templates. This flag should be logically OR'ed with the other flags that are part of the DIALOG statement (for example, FCF_DLGBORDER | FCF_TITLEBAR | FCF_DBE_APPSTAT and so on). In the PM program, the flFrameFlags should have the FCF_DBE_APPSTAT flag OR'ed into the other flags that are part of the definition of this ULONG (for example, FCF_TITLEBAR | FCF_SYSMENU | FCF_TASKLIST | FCF_DBE_APPSTAT and so on). The addition of FCF_DBE_APPSTAT to a DIALOG statement or to the flFrameFlags has no effect on single-byte systems. As a result, just OR in the flag and the input will be enabled for DBCS.

Summary

This article goes through many of the problems that you can encounter when enabling an application for DBCS support. The DBCS languages can be difficult to learn to speak or write. However, enabling an application for DBCS involves just a few simple concerns. The work required for each of these concerns is fairly simple and straightforward, as you can see in the examples presented. I hope you'll now see that the difficulty with DBCS enablement is not in the individual changes required, but only that these changes are required in many places in an application.

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation