Issues Specific to Converting Unicode to MARC-8 (Character Sets: Conversion)

The MARC-8 repertoire contains over 16,000 characters; the Unicode repertoire contains over 100,000 characters. Direct mappings using the tables in Part 5 are sufficient for Unicode to MARC-8 conversion only for a record that contains no characters that are outside the MARC-8 repertoire. Additional techniques are needed for the more general case in which non-MARC-8 characters may be present in a Unicode record that is to be converted.

Two generally applicable methods for conversion from Unicode to MARC-8 encoding have been approved to aid conversion of MARC 21 records containing characters outside the MARC-8 repertoire. The two methods must not be used in the same record.

Lossy conversion to MARC-8 encoding

The lossy conversion method is intended for use in situations in which the loss of data beyond the large MARC-8 repertoire is not a concern. Each character that is not in the MARC-8 repertoire is replaced with an ASCII vertical bar (7C(hex)) during conversion. This method is called lossy because the substitution of a generic placeholder for every unconvertible Unicode code point loses data that cannot be recovered in a reconversion to Unicode.
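The substitution step of the lossy method can be sketched as follows. This is a minimal illustration, not a full converter: it only performs the placeholder substitution and leaves the actual Unicode to MARC-8 mapping aside. The function name `lossy_substitute` and the predicate `ascii_only` are hypothetical; a real implementation would test membership against the mapping tables in Part 5 rather than against ASCII.

```python
from typing import Callable

PLACEHOLDER = "|"  # ASCII vertical bar, 7C(hex)


def lossy_substitute(text: str, in_marc8: Callable[[str], bool]) -> str:
    """Replace every character outside the repertoire with a vertical bar."""
    return "".join(ch if in_marc8(ch) else PLACEHOLDER for ch in text)


def ascii_only(ch: str) -> bool:
    """Stand-in predicate for illustration only: treats ASCII as the
    whole repertoire. The real MARC-8 repertoire is far larger."""
    return ord(ch) < 0x80


print(lossy_substitute("a\u2603b", ascii_only))
```

Because every unconvertible code point collapses to the same placeholder, nothing in the output identifies which character was replaced.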

Lossless conversion to MARC-8 encoding

In the lossless conversion method, a Unicode character that is not in the MARC-8 repertoire is replaced by a hexadecimal Numeric Character Reference (NCR) identifying the specific unconvertible Unicode code point. This method preserves precisely the information content of the Unicode record, although the result may be a cryptic display, and additional conversion techniques will be required to reconstruct the record exactly in Unicode. The Numeric Character Reference consists only of ASCII characters and thus can be carried into the MARC-8 target record.

The structure of the NCR is &#xXXXX; where:

& and ; (the ampersand and semicolon) surround the Reference data

#x designates that the value expressed is in hexadecimal notation

XXXX is the hexadecimal representation of the code point for the Unicode character expressed in hex digits 0123456789ABCDEF. Some characters, primarily infrequently encountered CJK ideographs, may require more than four hexadecimal digits. The NCR can contain more than four digits if they are needed.

It is not correct to represent a non-ASCII character in an NCR by its UTF-8 octets; only the scalar value of the code point is allowed.
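The NCR structure described above can be sketched in a few lines. The helper names `to_ncr` and `from_ncrs` are hypothetical, and two details are assumptions rather than requirements stated here: the value is zero-padded to at least four digits to match the &#xXXXX; pattern, and the decoder accepts four to six hexadecimal digits.

```python
import re


def to_ncr(ch: str) -> str:
    """Build an NCR from the scalar value of one code point.

    ord() yields the scalar value, never the UTF-8 octets. Values above
    FFFF(hex) simply produce more than four digits.
    """
    return f"&#x{ord(ch):04X};"


# Assumption: four to six hex digits between "&#x" and ";".
_NCR = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def _decode(match: "re.Match[str]") -> str:
    value = int(match.group(1), 16)
    # Leave anything beyond the Unicode range untouched.
    return chr(value) if value <= 0x10FFFF else match.group(0)


def from_ncrs(text: str) -> str:
    """Replace each NCR in text with the character it names."""
    return _NCR.sub(_decode, text)


print(to_ncr("\u2603"))       # four-digit code point
print(to_ncr("\U00020000"))   # five-digit code point
```

For example, the snowman character U+2603 yields `&#x2603;`, whereas its UTF-8 octets (E2 98 83) must not appear in the reference.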

Enhancing the Unicode to MARC-8 conversion

Either the lossy or the lossless conversion method can be applied directly to a Unicode record, but better results will be obtained if characters outside the MARC-8 repertoire are first converted, as far as possible, into approximately equivalent MARC-8 characters or character sequences. This minimizes the number of vertical bars or NCRs in the output and yields a more readily usable output record. Techniques of this sort are frequently referred to as normalization. Unicode defines four normalization forms for use within the Unicode environment. The optimal normalization for conversion to MARC-8 is a variant of the one called Compatibility Decomposition, or KD.

The code charts on the Unicode web site list valid decomposition sequences for all decomposable characters. These sequences are of two kinds: canonical and compatibility. A common example of the canonical type is the decomposition of a letter with a diacritical mark: e with acute accent (00E9(hex)) decomposes to e (0065(hex)) + acute (0301(hex)). Compatibility decompositions differ from the canonical in that they "do not attempt to retain or emulate the formatting of the original character." (Unicode Standard 5.0, Section 17.1). Some examples of characters with compatibility equivalents are the ellipsis character (2026(hex)) that decomposes to a sequence of three periods (002E(hex)); the circled digit four (2463(hex)) that becomes simply 4 (0034(hex)); the Roman numeral IV (2163(hex)) that decomposes to I (0049(hex)) + V (0056(hex)); and any of the spaces of different width (2000-2008(hex)) that can decompose in one or two steps to the ASCII space (0020(hex)).
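The decompositions cited above can be inspected with Python's standard `unicodedata` module, which implements the Unicode normalization forms. This is only an illustrative sketch: the helper `code_points` is a hypothetical name, and the results reflect the Unicode data bundled with the Python interpreter in use.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


samples = {
    "\u00e9": "e with acute",
    "\u2026": "ellipsis",
    "\u2463": "circled digit four",
    "\u2163": "Roman numeral IV",
    "\u2000": "en quad space",
}

for ch, label in samples.items():
    nfd = unicodedata.normalize("NFD", ch)    # canonical only
    nfkd = unicodedata.normalize("NFKD", ch)  # canonical + compatibility
    print(f"{label}: NFD={code_points(nfd)}  NFKD={code_points(nfkd)}")
```

Form D leaves the ellipsis, the circled digit, and the Roman numeral unchanged because their decompositions are of the compatibility kind; only form KD reduces them to ASCII.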

Unicode normalization form D specifies only canonical decompositions. The MARC-8 repertoire includes several precomposed characters that can be decomposed in Unicode but should not be decomposed during conversion to MARC-8. These characters are specified in Table 4.1 below.

Table 4.1
Characters not requiring canonical decomposition for conversion from Unicode to MARC-8 encoding.
All code points are shown in hexadecimal notation.
MARC-8 code points are shown in the G0 range for all sets except Extended Latin.
Character name	Unicode code points (u.c., l.c.)	MARC-8 G0 code points (u.c., l.c.)	MARC-8 character set
Cyrillic Short I	(0419, 0439)	(4A, 6A)	Basic Cyrillic
Cyrillic Io	(0401, 0451)	(44, 64)	Extended Cyrillic
Cyrillic Gje	(0403, 0453)	(42, 62)	Extended Cyrillic
Cyrillic Yi	(0407, 0457)	(47, 67)	Extended Cyrillic
Cyrillic Kje	(040C, 045C)	(4C, 6C)	Extended Cyrillic
Cyrillic Short U	(040E, 045E)	(4D, 6D)	Extended Cyrillic
Arabic Alef, Madda above	0622	42	Basic Arabic
Arabic Alef, Hamza above	0623	43	Basic Arabic
Arabic Waw, Hamza above	0624	44	Basic Arabic
Arabic Alef, Hamza below	0625	45	Basic Arabic
Arabic Yeh, Hamza above	0626	46	Basic Arabic
Latin O with horn	(01A0, 01A1)	(AC, BC)	Extended Latin (ANSEL)
Latin U with horn	(01AF, 01B0)	(AD, BD)	Extended Latin (ANSEL)

Unicode normalization form KD, the optimal form for conversion from Unicode to the MARC-8 repertoire, specifies that decompositions of both types, canonical and compatibility, should be done until no further decomposition is possible. The full KD normalization, however, may not be desired because of the canonical decompositions in the above table and other issues with the compatibility decompositions, such as loss of formatting with superscripts and subscripts.
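One way to approximate such a variant of KD is to decompose recursively while stopping at the characters listed in Table 4.1. The sketch below is an assumption-laden illustration, not an approved algorithm: the names `KEEP_PRECOMPOSED`, `decompose`, and `marc8_kd` are hypothetical, the input is first composed with form C so that the Table 4.1 characters appear in precomposed form, and canonical reordering of combining marks and Hangul decomposition are both omitted. It also makes no attempt to address the superscript and subscript issue.

```python
import unicodedata

# Unicode code points from Table 4.1 (upper and lower case where both exist).
KEEP_PRECOMPOSED = frozenset(map(chr, [
    0x0419, 0x0439, 0x0401, 0x0451, 0x0403, 0x0453, 0x0407, 0x0457,
    0x040C, 0x045C, 0x040E, 0x045E,               # Cyrillic
    0x0622, 0x0623, 0x0624, 0x0625, 0x0626,       # Arabic
    0x01A0, 0x01A1, 0x01AF, 0x01B0,               # Latin O and U with horn
]))


def decompose(ch: str) -> str:
    """Decompose one character fully, stopping at Table 4.1 characters."""
    if ch in KEEP_PRECOMPOSED:
        return ch
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    parts = mapping.split()
    if parts[0].startswith("<"):  # compatibility tag such as <compat>
        parts = parts[1:]
    return "".join(decompose(chr(int(p, 16))) for p in parts)


def marc8_kd(text: str) -> str:
    """Apply the KD-like decomposition with the Table 4.1 exceptions."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(decompose(ch) for ch in composed)


print(marc8_kd("\u2026") == "...")                 # compatibility applied
print(marc8_kd("\u0438\u0306") == "\u0439")        # Short I kept precomposed
```

Under this sketch, o with horn and hook above (1EDF(hex)) becomes o with horn (01A1(hex)) followed by the combining hook above (0309(hex)), rather than three separate code points.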

Additional considerations for converting characters with combining marks

A further complication arises when a character with a combining mark cannot be normalized into components belonging to the MARC-8 repertoire. Proper treatment depends on whether it is the base character or the combining mark that is absent from the repertoire. If the base character can be converted, then the combining mark should be replaced by an NCR or placeholder properly repositioned before the base character in the output record. If the base character cannot be converted, either the lossless or the lossy technique can be applied directly, preferably to the character in its precomposed form, so that it will generate a single NCR or placeholder rather than two. This treatment is preferred whether or not the combining character is also missing from the MARC-8 repertoire.
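The treatment described above can be sketched for a single base character followed by its combining marks. Several things here are assumptions made only for illustration: the function name `convert_cluster`, the toy repertoire `toy_repertoire` (ASCII plus the combining acute, far smaller than real MARC-8), and the fallback used when an unconvertible base has no single precomposed form, in which case every code point of the cluster is substituted in its original order. The output is still a Python string in MARC-8 order (marks before base), not MARC-8 bytes.

```python
import unicodedata
from typing import Callable


def ncr(ch: str) -> str:
    """Lossless substitute: hexadecimal NCR for one code point."""
    return f"&#x{ord(ch):04X};"


def convert_cluster(cluster: str,
                    in_marc8: Callable[[str], bool],
                    substitute: Callable[[str], str] = ncr) -> str:
    """Convert one base character plus its trailing combining marks."""
    base, marks = cluster[0], cluster[1:]
    if not in_marc8(base):
        # Prefer the precomposed form so that one substitute is emitted.
        composed = unicodedata.normalize("NFC", cluster)
        return "".join(substitute(ch) for ch in composed)
    # Base converts: substitute only the missing marks and place every
    # mark before the base.
    out_marks = "".join(m if in_marc8(m) else substitute(m) for m in marks)
    return out_marks + base


def toy_repertoire(ch: str) -> bool:
    """Stand-in for a real repertoire test: ASCII plus combining acute."""
    return ord(ch) < 0x80 or ch == "\u0301"


print(convert_cluster("e\u20dd", toy_repertoire))            # mark missing
print(convert_cluster("\u03b1\u0301", toy_repertoire))       # base missing
print(convert_cluster("e\u20dd", toy_repertoire, lambda ch: "|"))  # lossy
```

With the toy repertoire, the first call places `&#x20DD;` before the letter e, and the second yields the single reference `&#x03AC;` for the precomposed form instead of two separate substitutes.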
