An Introduction to Universal Language Support

From EDM2

by Lisa Abbott

Universal Language Support (ULS) is a new feature, to be offered in future releases of Workplace, that supports the internationalization of our products and applications. An internationalized software product correctly displays all culturally-sensitive information, including language, on demand. ULS is a set of APIs and utilities that let you manipulate characters and character strings conforming to the Unicode Standard, access culturally-sensitive information, and convert character data between code pages.

The strategic direction of Workplace is to develop a fully internationalized operating system over a staged period of development. Our goal is conformance to the internationalization programming model for all portions of the system. All utilities, interfaces, and subsystems will provide dynamic support for processing and configuring culturally-sensitive information. Workplace will also be capable of supporting new personalities, applications, and other implementations conforming to the internationalization programming model.

Internationalization Programming Model

Internationalized software implementations do not require users to restart their systems or their applications to change cultural configurations. Instead, the implementations offer a dynamic facility for changing the cultural configuration seamlessly. For a single system or application to run seamlessly in all areas of the world, it must follow the three basic rules of the internationalization programming model:

  • Use the character encodings (bit representations) provided by the Unicode Standard to store and manipulate all textual data.
  • Support the cultural conventions requested by the user on demand.
  • Isolate all translatable material from source code. Translatable material includes text, messages, audio output, animations, windows, panels, helps, tutorials, diagnostics, clip art, icons, and the presentation controls necessary to convey the information (window placement and sizing information).

The Unicode Standard and Code Pages

The Unicode Standard was developed by the Unicode Consortium, a non-profit organization consisting of software and hardware manufacturers, including IBM. (The Unicode Consortium can be reached at 1965 Charleston Road, Mountain View, CA 94043; e-mail: unicode_inc@hq.metaphor.com; phone: (415) 961-4189.) The Unicode Standard defines an encoding for the characters of all significant languages in use in the world today (more than 30,000 characters). Each character encoding in the Unicode code page is 16 bits wide. An application conforms to the Unicode Standard if it uses independent fixed-width 16-bit characters and uses Unicode code points (bit representations) to represent Unicode-defined characters.

The Unicode Standard further defines rules for the use of the Unicode encoding. For example, the Unicode Standard defines an implicit algorithm for uniquely interpreting bi-directional character streams and rules for formal Unicode compliance. Workplace will use the latest version of the Unicode Standard, Version 1.1. Version 1.1 aligns with the ISO DIS 10646 UCS-2 (Universal Character Set containing 2 bytes) standard for multibyte character encoding. ISO DIS 10646 UCS-2 also is referred to as the Basic Multilingual Plane (BMP) of ISO DIS 10646, where most useful characters found in existing worldwide standards are assigned character codes.

A code page is simply a mapping of bit representations to characters. Without code-page information, a bit representation of text is ambiguous. For example, the char value 157 (0x9D) denotes a different character depending on the code page selected: it denotes the character ¥ in code page 437, while the same value denotes the character Ø in code page 850. Most code pages on PCs conform to the ASCII7 standard and contain a common subset of 128 characters, the ASCII7 character set, which has the same encoding in each code page. The Unicode encoding also might be thought of as a code page.
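The ambiguity described above can be sketched in a few lines of C. This is only an illustration, not part of ULS: the names DemoCodePage and demo_to_unicode are hypothetical, and the table covers just the ASCII7 range and the single byte 0x9D discussed in the text. Every other byte is reported as U+FFFD (the Unicode replacement character) rather than guessed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical identifiers for the two code pages in the example. */
typedef enum { DEMO_CP437, DEMO_CP850 } DemoCodePage;

/* Map one byte to a 16-bit Unicode code point under the given code page.
 * Only two cases are modeled:
 *   - bytes 0x00..0x7F (ASCII7), which are identical in both code pages
 *   - byte 0x9D, which is the yen sign in 437 and O-with-stroke in 850
 * Anything else returns 0xFFFD because this sketch has no table for it. */
static uint16_t demo_to_unicode(DemoCodePage cp, unsigned char byte)
{
    assert(cp == DEMO_CP437 || cp == DEMO_CP850);

    if (byte < 0x80) {
        return byte;                       /* shared ASCII7 subset */
    }
    if (byte == 0x9D) {
        return (cp == DEMO_CP437) ? 0x00A5 /* YEN SIGN */
                                  : 0x00D8 /* LATIN CAPITAL LETTER O WITH STROKE */;
    }
    return 0xFFFD;                         /* not covered by this sketch */
}
```

Calling demo_to_unicode with the same byte and two different code pages yields two different code points, which is why text must carry (or be converted away from) its code page before it is processed.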

The following shows different types of encoding and their result.

BITS  STANDARD    BINARY             HEX  DEC  CHAR
7     ASCII       1000001            41   65   A
8     ISO 8859-1  01000001           41   65   A
16    Unicode     00000000 01000001  41   65   A

Internationalized Programming Rule #1: Use Unicode

Internationalized implementations must use the Unicode encoding to store and manipulate textual data, so they can recognize all the significant characters used in the world. Use the following coding guidelines when implementing the Unicode encoding.

  1. ULS defines the type UniChar to represent the Unicode 16-bit character type to Workplace programmers. Modify every function, data structure, or other instance of the char data type to use the UniChar data type. If you need a data type that represents one byte, typedef one called BYTE.
  2. Any call requesting memory will have to be examined and possibly changed to use sizeof(UniChar) as a modifier to the size of memory requested.
  3. Every instance of pointer arithmetic with a char * type variable needs to be examined and possibly replaced with a UniChar * pointer.
  4. Identify areas of your code where you'll have to import or export textual data to another environment that might not recognize Unicode textual data. Use the ULS APIs for converting textual data between Unicode and other code pages in these instances.
  5. Your implementation must be code-page independent. Perform all textual processing using Unicode encoded textual data. Convert any non-Unicode textual data to Unicode textual data before processing.
  6. If you need to define string literals, the definition must change to include the ANSI 'L' modifier in front of the literal. The 'L' modifier creates a 16-bit character (named a wide character), rather than an 8-bit character. For example:
    char definition              UniChar definition
    
    char *chs = "string";        UniChar *uchs = L"string";
    char ch = 'A';               UniChar uch = L'A';
    Note: Only use string literals when the contents of the string will not be translated.
  7. Avoid the use of string literals that contain characters outside of the ASCII7 character set. Many compilers do not store string literals as Unicode unless the characters belong to the ASCII7 character set. Unicode encodes the ASCII7 character set with the same values as the ASCII-based PC code pages. In the following string, all characters are encoded as Unicode, with the exception of ♀. Therefore, the string literal should be avoided:
    UniChar *ucs = L"A string with the character ♀ will not be stored as Unicode";
  8. Your API must be specified in terms of UniChar. However, you might also need an ASCII-based API for compatibility. For example, DosOpen() and DosOpenUni() will both exist in Workplace.
  9. Do not assume characters have any specific properties. Determine character attributes dynamically by using the ULS APIs, such as UniQueryCharAttr().
  10. Change any calls to character or string manipulation functions to ULS API calls, for example UniStrlen(), UniStrcmp(), UniStrcat(), and UniStrpbrk().
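Guidelines 1 through 3 (the character type, memory sizes, and pointer arithmetic) can be sketched as follows. This is a stand-in, not ULS code: the typedef below uses the C11 type char16_t in place of the real ULS header, and demo_unistrlen, demo_unibytes, and demo_unistrdup are hypothetical helpers written for this illustration, not the ULS functions UniStrlen() and friends. The sketch also uses the C11 u"..." literal, because on some compilers the 'L' modifier produces a character wider than 16 bits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

/* Assumption: a 16-bit character type standing in for the ULS UniChar. */
typedef char16_t UniChar;

/* A literal that contains only ASCII7 characters, stored as 16-bit units. */
static const UniChar demo_sample[] = u"string";

/* Hypothetical helper: number of characters before the terminating zero.
 * The pointer advances one UniChar (not one byte) per step. */
static size_t demo_unistrlen(const UniChar *s)
{
    assert(s != NULL);
    const UniChar *p = s;
    while (*p != 0) {
        p++;
    }
    return (size_t)(p - s);
}

/* Hypothetical helper: bytes needed to store the string and its terminator.
 * The character count is scaled by sizeof(UniChar), as guideline 2 asks. */
static size_t demo_unibytes(const UniChar *s)
{
    return (demo_unistrlen(s) + 1) * sizeof(UniChar);
}

/* Hypothetical helper: heap copy of a string; the caller frees the result. */
static UniChar *demo_unistrdup(const UniChar *s)
{
    size_t bytes = demo_unibytes(s);
    UniChar *copy = malloc(bytes);
    if (copy != NULL) {
        memcpy(copy, s, bytes);
    }
    return copy;
}
```

The common porting mistake this guards against is passing a character count to malloc() or memcpy() where a byte count is expected; with 16-bit characters the two differ by a factor of sizeof(UniChar).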

Representing Cultural Conventions

Cultural conventions differ from region to region around the world. ULS has packaged the cultural conventions pertaining to a region of the world in an object called a locale. The information contained in a locale refers to a region, which can be several countries, one country, or a portion of a country. Locales contain cultural information pertaining to time and date formats, collating sequences, character classifications, case mappings, monetary formats, and language.

Locales are named for the region that they represent. For example, a locale that contains information particular to the English language and the cultural representations (date, time, monetary formats, and so forth) required for the USA might be named en_US.

ULS implements the locale as a locale object (LocaleObject), that is, a data type that encapsulates locale-dependent data and functions. Any ULS function that is dependent on locale information will have a LocaleObject argument.

Internationalized Programming Rule #2: Support Locales

By using the ULS API to access the information contained in locales, implementations become sensitive to the characteristics of each culture. Use the following guidelines for locale-sensitive programming.

  1. Make no assumptions about the country or language being used. Processing such as text collation, date or numeric formatting, or text manipulation requires a generic internationalized ULS API: for example, UniStrcoll() for collation, UniStrftime() for creating a culturally correct time and date string, and UniTransformStr() for upper- and lower-case transformations.
  2. Write all code to be multilingual through Unicode and locale support. Avoid using country codes and code pages wherever possible. Access all country information through the ULS APIs.
  3. Code page and keyboard settings are entirely independent of the locale - do not assume the locale setting based on them.
  4. The locale might be set differently across the system; multilingual applications might create several locale objects.
  5. Locales are maintained by the application, not by the system, in the form of a LocaleObject. The Workplace registry will maintain a locale name, which applications can use as the default when they create a locale object.
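The idea behind guidelines 1, 4, and 5 (no hard-coded conventions, several locale objects held by one application, and a locale passed to every locale-dependent function) can be illustrated with a toy model. Everything in this sketch is hypothetical: DemoLocale is not the ULS LocaleObject, demo_format_date is not UniStrftime(), and a real locale carries far more than a date order and a separator. The two conventions shown (month/day/year with slashes for en_US, day.month.year with periods for de_DE) are the commonly used numeric date forms for those regions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily simplified locale: only numeric date layout. */
typedef enum { DEMO_ORDER_MDY, DEMO_ORDER_DMY } DemoDateOrder;

typedef struct {
    const char   *name;        /* region name, such as "en_US" */
    DemoDateOrder date_order;  /* which field comes first */
    char          date_sep;    /* separator between fields */
} DemoLocale;

/* An application may hold several locale objects at the same time. */
static const DemoLocale demo_en_US = { "en_US", DEMO_ORDER_MDY, '/' };
static const DemoLocale demo_de_DE = { "de_DE", DEMO_ORDER_DMY, '.' };

/* Format a date according to the locale passed in. The function itself
 * contains no knowledge of any particular country; all conventions come
 * from the locale argument. Returns the value returned by snprintf(). */
static int demo_format_date(const DemoLocale *loc,
                            int year, int month, int day,
                            char *out, size_t out_size)
{
    assert(loc != NULL && out != NULL);

    int first  = (loc->date_order == DEMO_ORDER_MDY) ? month : day;
    int second = (loc->date_order == DEMO_ORDER_MDY) ? day : month;

    return snprintf(out, out_size, "%02d%c%02d%c%04d",
                    first, loc->date_sep, second, loc->date_sep, year);
}
```

Passing &demo_en_US or &demo_de_DE to the same call produces a differently ordered date string, which mirrors how a ULS function takes a LocaleObject argument instead of consulting a global country setting.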

Isolating Translatable Material

All of the language-sensitive information passing between the user and a product is translated into a number of languages, depending on market requirements. The implementation must be able to dynamically change the cultural conventions in use (including the locale and the language of translated material) at the request of the user. To satisfy this requirement, all translatable material must be dynamically loadable. This implies that the translatable material must be isolated from any source code and executables in the implementation.

The following items are translatable:

  • Messages
  • Audio output
  • Animations
  • Windows
  • Panels
  • Helps
  • Tutorials
  • Diagnostics
  • Clip art
  • Icons
  • Presentation controls

The following items are not considered to be translatable.

  • Operating system utility, command, and parameter names
  • Reserved device names (for example, mouse, pointer, scrn, kbd, prn, lpt1 and 2, com1 and 2)
  • Font family names and type names
  • Extended attribute content and keywords
  • SQL command names
  • REXX command names
  • Name of system objects
  • File, directory, and path names

Internationalized Programming Rule #3: Isolate Translatable Material

Isolation of translatable material is mandatory in Workplace so that the system can respond dynamically to changes in cultural configuration. Follow these isolation guidelines when writing internationalized code.

  1. All translatable material must be isolated from executable code at the source and load module level. Messages should not be contained in source code. From your source code, call a message facility to retrieve messages, as needed, from a message file. Build resources into shared libraries that contain no executable code.
  2. Provide for effective presentation of text after it has expanded because of translation. The numbers in the following table represent a statistical average of the additional space required for translation of English into another language. Use this table as a guideline.
    Number of Characters in Text  Additional Space Required
    Up to 10                      100 to 200%
    11 to 20                      80 to 100%
    21 to 30                      60 to 80%
    31 to 50                      40 to 60%
    51 to 70                      31 to 40%
    Over 70                       30%
  3. Functions dependent on location of panel elements must not be inhibited by display position changes caused by text expansion. The position of one panel element often is influenced by the position and size of others. This causes the translated version of a panel to relocate some elements. The code must respond properly despite the relocation.
  4. Design products for multiple languages, rather than a single language. ULS APIs must be called at all points where functions that are dependent on national language or country support will be required. If any source code or any module of executable code in a product must be modified to add (retrofit) support for a new language, the product is not enabled.
  5. Substitution variables must be permitted to assume any location and order within a display field. Dynamic messages usually employ substitution variables. When the operating system message facility prepares a message, the current information replaces the substitution variable. Each spoken language has its own syntax; therefore, you might have to change the position and order of the substitution variables to meet the syntax requirements.
  6. Messages must be complete entities, not constructed from individual words or phrases because syntax requirements change between languages.
  7. If an icon cannot be interpreted worldwide, it will be translated. To avoid icon translation, do not use words, letters, crosses, stars, or hand gestures in your icon.
  8. The support for one country should not interfere with support for another. In addition, support for a particular country should not force any reduction in function of the product.
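Guidelines 5 and 6 (substitution variables that can appear in any position and order, inside messages that are complete entities) can be sketched with a small positional formatter. This is not the operating system message facility: demo_format_message is a hypothetical function, and the %1 to %9 placeholder syntax is an assumption chosen for the illustration. The point is that the code always supplies the arguments in the same order, while each translated template decides where they appear.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical formatter: expands %1..%9 in tmpl with args[0..argc-1].
 * "%%" produces a single '%'. Returns 0 on success, or -1 if the output
 * buffer is too small or a placeholder refers to a missing argument. */
static int demo_format_message(const char *tmpl,
                               const char *const args[], size_t argc,
                               char *out, size_t out_size)
{
    assert(tmpl != NULL && out != NULL);
    if (out_size == 0) {
        return -1;
    }

    size_t used = 0;
    for (const char *p = tmpl; *p != '\0'; p++) {
        const char *piece = p;   /* default: copy this one character */
        size_t len = 1;

        if (*p == '%' && p[1] >= '1' && p[1] <= '9') {
            size_t idx = (size_t)(p[1] - '1');
            if (idx >= argc) {
                return -1;       /* template asks for a missing argument */
            }
            piece = args[idx];
            len = strlen(piece);
            p++;                 /* skip the digit */
        } else if (*p == '%' && p[1] == '%') {
            p++;                 /* "%%": copy one '%', skip the other */
        }

        if (used + len >= out_size) {
            return -1;           /* leave room for the terminator */
        }
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

With the arguments {"report.txt", "C:"}, the template "Copied %1 to drive %2" and a reordered template such as "Drive %2 received %1" (standing in for a translation with different syntax) both work without any change to the calling code. In a real product the templates would be loaded from a message file rather than written in the source.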

Conclusion

This article introduces internationalization programming concepts and shows how Workplace supports the model. Use these concepts to start enabling your code for internationalization today: isolate translatable material, and design for locale and Unicode use in the future on the Workplace platform.

Stay tuned to future issues of The Developer Connection News for descriptions of the ULS APIs.

About the Author - Lisa Abbott

Lisa Abbott is a Staff Programmer working on the internationalization of Workplace. Lisa joined IBM in 1988 where she has continually worked on OS/2. For the past two years, she has been working on Workplace.

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation