Implementation (Character Sets: Unicode)

UTF-8 encoding form

Unicode specifies three encoding forms (UTF-8, UTF-16, and UTF-32), of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records. UTF-8 represents each Unicode code point as a sequence of 8-bit units (octets), whereas UTF-32 uses 32-bit units and UTF-16 uses 16-bit units. A Unicode character is represented in a single octet or in a sequence of two, three, or four octets, depending on its code point.
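As an illustration of the variable length described above, the following minimal Python sketch encodes one sample character from each length range. The sample characters and the helper name utf8_octets are chosen here for illustration only and are not part of the specification.

```python
# Sample characters, one per UTF-8 length range (illustrative choices).
SAMPLES = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+4E2D CJK UNIFIED IDEOGRAPH-4E2D": "\u4e2d",
    "U+1D11E MUSICAL SYMBOL G CLEF": "\U0001d11e",
}


def utf8_octets(char):
    """Return the UTF-8 octets that encode a single character."""
    return char.encode("utf-8")


for name, char in SAMPLES.items():
    octets = utf8_octets(char)
    # bytes.hex(" ") needs Python 3.8 or later.
    print(f"{name}: {len(octets)} octet(s): {octets.hex(' ').upper()}")

# Prints:
# U+0041 LATIN CAPITAL LETTER A: 1 octet(s): 41
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: 2 octet(s): C3 A9
# U+4E2D CJK UNIFIED IDEOGRAPH-4E2D: 3 octet(s): E4 B8 AD
# U+1D11E MUSICAL SYMBOL G CLEF: 4 octet(s): F0 9D 84 9E
```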

Only code points from 00(hex) to 7F(hex) require a single octet, and their UTF-8 encoding is identical to ASCII. This is why only ASCII characters are allowed in the leader and other parts of the MARC 21 record on which parsing of the record depends, and conversely why UTF-8 is the only Unicode encoding form currently permitted in MARC 21.

In many contexts it is unnecessary to know what the transformed code points look like; knowing the scalar values is sufficient. In other situations, such as examining a dump of a MARC 21 record, or creating certain tables of values, it is necessary to be able to interpret the transformed octets. (See the section UTF-8 Transformation Details for more information.)
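When reading a dump, the octets can be turned back into code points with any UTF-8 decoder. The sketch below assumes the dump is available as a string of hexadecimal octet values; the function name code_points_from_dump and the sample octets are hypothetical.

```python
def code_points_from_dump(hex_dump):
    """Decode hex-dumped UTF-8 octets and list the code points in U+XXXX form."""
    # bytes.fromhex ignores the spaces between octet values.
    text = bytes.fromhex(hex_dump).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT (octets CC 81).
print(code_points_from_dump("63 61 66 65 CC 81"))
# Prints: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
```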

Expressing lengths

Lengths in MARC 21 records are generally expressed in octets rather than characters. This distinction is important in Unicode encoding because of the variability of character length inherent in UTF-8. The record length in Leader positions 0-4 and the field lengths and starting positions in directory entries are counts of octets, not characters.
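The difference between the two counts can be seen with a short Python sketch. The sample string and the helper name octet_length are illustrative only; a real field length would also count indicators, subfield codes, and the field terminator, which are omitted here.

```python
def octet_length(text):
    """Count the octets in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Three of the letters are precomposed characters that need two octets each.
data = "Dvo\u0159\u00e1k, Anton\u00edn"

print(len(data))           # characters: 15
print(octet_length(data))  # octets: 18
```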

MARC 21 encoding marker

A Unicode-encoded MARC 21 record must have value a in Leader position 9 (Character coding scheme).
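A check of the encoding marker might be sketched as follows, assuming the 24-character leader has already been read as a string. The function name and the sample leaders are hypothetical.

```python
def is_unicode_encoded(leader):
    """Return True if Leader position 9 holds the value 'a'."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    return leader[9] == "a"


print(is_unicode_encoded("00714cam a2200205 a 4500"))  # True
print(is_unicode_encoded("00714cam  2200205 a 4500"))  # False (blank in position 9)
```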

MARC field 066

Field 066 (Character Sets Present) is not used in Unicode-encoded MARC 21 records. During conversion of MARC 21 records from MARC-8 encoding to Unicode, field 066 should be deleted.
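The deletion step of a conversion could look like the sketch below. It assumes a hypothetical in-memory record held as a list of (tag, content) pairs; this structure, the function name, and the sample field contents are illustrative and do not come from any particular MARC library.

```python
def drop_field_066(fields):
    """Return the fields of a record with every occurrence of field 066 removed."""
    return [(tag, content) for tag, content in fields if tag != "066"]


# Hypothetical record content, for illustration only.
record_fields = [
    ("066", "  $c(3"),
    ("245", "10$aSample title"),
    ("880", "10$6245-01/(3/r$aSample title in Arabic script"),
]

print([tag for tag, _ in drop_field_066(record_fields)])
# Prints: ['245', '880']
```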

MARC subfield $6 (Linkage)

Subfield $6 (Linkage) is used in MARC 21 records to link alternate graphic representations of the same data, to identify the presence of specific scripts in a field, and to flag fields in which the display/print directionality of data is right-to-left (e.g., for Arabic script). The subfield $6 script identification code in MARC-8-encoded MARC 21 records identifies MARC-8 character sets rather than scripts per se. The code is therefore irrelevant in the Unicode environment, where the character set is always UCS, which has no script identification code value. The script identification code should be dropped from subfield $6 when converting to Unicode from MARC-8 encoding. The field orientation code, which flags a field as having right-to-left display directionality, should be used in Unicode-encoded MARC 21 records. When present, the field orientation code is separated from the subfield $6 tag linkage data by two solidus (slash) characters (002F(hex)).
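The conversion rule for subfield $6 can be sketched as a small string transformation. The sketch assumes the subfield content has the form tag linkage, solidus, script identification code, solidus, field orientation code, with the last two parts optional; the function name and the sample values are hypothetical.

```python
def strip_script_code(linkage):
    """Drop the script identification code from subfield $6 content.

    The tag linkage data is kept. A field orientation code, if present,
    is kept and separated from the tag linkage by two solidus characters.
    """
    parts = linkage.split("/")
    tag_linkage = parts[0]
    orientation = parts[2] if len(parts) > 2 else ""
    if orientation:
        return f"{tag_linkage}//{orientation}"
    return tag_linkage


print(strip_script_code("880-01/(3/r"))  # 880-01//r
print(strip_script_code("880-02/(N"))    # 880-02
print(strip_script_code("880-03"))       # 880-03
```

Applying the function to content that has already been converted, such as 880-01//r, leaves it unchanged.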

Combining marks (diacritics)

Unicode requires that separately encoded diacritical marks and similar combining characters used with base letters from the Latin and other scripts be encoded following the base letter they modify. This is the opposite of the MARC-8 rule for encoding order. Further, the rules that apply to base letters with more than one combining mark differ between the encodings. In MARC-8, the rule is to encode the combining marks from top to bottom. In Unicode, if one of the marks displays below the base letter and the other above, it is preferable to encode the one below the letter first. Multiple marks in the same typographic space (e.g., above the letter) should be encoded starting with the one that appears nearest the base letter, or, when at the same height, in the order in which they appear in the writing direction of the script, reading left to right (or right to left with right-to-left scripts).
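The Unicode ordering can be illustrated with Python's standard unicodedata module. Normalization is not discussed in this section; it is used here only as an assumption-free way to show that Unicode's canonical ordering places a mark below the base letter (combining class 220) before a mark above it (combining class 230).

```python
import unicodedata

# Base letter first, then the combining mark: e + U+0301 COMBINING ACUTE ACCENT.
e_acute = "e\u0301"
print(unicodedata.normalize("NFC", e_acute) == "\u00e9")  # True

# Canonical combining classes of a mark below and a mark above.
print(unicodedata.combining("\u0323"))  # 220 (COMBINING DOT BELOW)
print(unicodedata.combining("\u0301"))  # 230 (COMBINING ACUTE ACCENT)

# Marks entered above-then-below are reordered to below-then-above.
above_then_below = "a\u0301\u0323"
canonical = unicodedata.normalize("NFD", above_then_below)
print([f"U+{ord(ch):04X}" for ch in canonical])
# Prints: ['U+0061', 'U+0323', 'U+0301']
```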

Directionality of text

Data are recorded in logical order, from the first character to the last, regardless of field orientation. In the scripts included in the MARC-8 repertoire there are no exceptions to this rule. (One known exception outside that repertoire occurs in the Thai script where, to conform with Thai data input standards, vowel characters that display before the consonants they are associated with are recorded before the consonants, instead of after them, which would be the logical order.) In bidirectional scripts such as Arabic and Hebrew, where the dominant writing direction is right to left but numbers are written left to right, the logical order rule still applies. A full explanation of bidirectionality in Unicode encoding will be found in: The Bidirectional Algorithm; Unicode Standard Annex #9.
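Logical order can be illustrated by inspecting the bidirectional class of each stored character. In the sketch below the sample string (three Hebrew letters, a space, and a number) is an illustrative choice; the characters are stored in the order in which they are read, and it is the display process, not the stored data, that arranges them visually.

```python
import unicodedata

# Hebrew letters shin, nun, tav, then a space and a number, in logical order.
text = "\u05e9\u05e0\u05ea 1948"

classes = [unicodedata.bidirectional(ch) for ch in text]
print(classes)
# Prints: ['R', 'R', 'R', 'WS', 'EN', 'EN', 'EN', 'EN']
# R = right-to-left letter, WS = whitespace, EN = European number.
```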
