Specification (Character Sets: UCS/Unicode)

Character Repertoire:

A subset of the total characters in UCS/Unicode should be used at this time. The subset is the UCS characters that correspond to the over 16,000 characters defined in MARC 21 for the MARC-8 character sets. This is called the MARC 21 repertoire of characters. The correspondences between the MARC-8 (8- and 24-bit) and UCS/Unicode (16-bit) character codes are shown in the character set lists in Part 3: Code Tables. The Chinese, Japanese, and Korean character correspondences are at the following website: http://www.unicode.org/charts.

Encoding:

Unicode characters are encoded according to the rules of UTF-8 (UCS Transformation Format 8), which uses designated bits to indicate whether a UCS/Unicode character is represented by 1 octet (8 bits) or by multiple octets. This encoding has the advantage of allowing the Basic Latin (ASCII) subset of the MARC 21 repertoire to be encoded the same as in MARC-8 (with 1 octet), thus preserving the basic structural elements of the MARC 21 record, while enabling record content to be multiscript. A brief description of UTF-8 encoding follows; a fuller description is carried in the UCS and Unicode standards.

UTF-8 is an encoding form for the UCS/Unicode 16-bit repertoire. It represents characters in a systematic way as 1, 2, or 3 octets, using the left-most bits of each octet to indicate how the octet is to be interpreted.

Left-most bits	Meaning of left-most bits for character encoding
0	character composed of 1 octet
110	first octet of 2 octet character
1110	first octet of 3 octet character
10	octet is not the first octet of a character; it is the 2nd or 3rd octet of a multi-octet character
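
As an illustration of the table above, the following Python sketch classifies a single octet by its left-most bits. The function name `octet_role` and the labels it returns are illustrative assumptions, not part of the MARC 21 specification. Octets that match none of the four patterns listed (for example, the first octet of a 4-octet sequence) are reported as unsupported.

```python
def octet_role(octet: int) -> str:
    """Classify one octet (0-255) by its left-most bits.

    The labels returned here are hypothetical names chosen for this sketch.
    """
    if octet & 0b10000000 == 0b00000000:
        return "single"        # 0xxxxxxx: character composed of 1 octet
    if octet & 0b11000000 == 0b10000000:
        return "continuation"  # 10xxxxxx: 2nd or 3rd octet of a character
    if octet & 0b11100000 == 0b11000000:
        return "lead2"         # 110xxxxx: first octet of a 2-octet character
    if octet & 0b11110000 == 0b11100000:
        return "lead3"         # 1110xxxx: first octet of a 3-octet character
    return "unsupported"       # pattern not covered by the table above


# Example: the two octets 11010111 10101010.
roles = [octet_role(b) for b in bytes([0b11010111, 0b10101010])]
```

For the two octets shown, `roles` holds a 2-octet lead followed by one continuation octet.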

The following transformation is used when converting UCS/Unicode 16-bit characters to UTF-8.

UCS/Unicode range	UCS/Unicode form	UTF-8 values
0000 to 007F hex	00000000 0xxxxxxx	0xxxxxxx
0080 to 07FF hex	00000xxx xxyyyyyy	110xxxxx 10yyyyyy
0800 to FFFF hex	xxxxyyyy yyzzzzzz	1110xxxx 10yyyyyy 10zzzzzz

Note: x indicates the part of the UCS/Unicode character encoding that will be transferred to the first UTF-8 octet; y the part that will be transferred to the second UTF-8 octet; and z the part that will be transferred to the third UTF-8 octet.
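
The transformation above can be sketched in Python with shifts and masks. This is a minimal illustration, not a reference implementation: the function name `ucs2_to_utf8` is an assumption, and rejecting surrogate values (D800 to DFFF hex) is a choice made for this sketch, since surrogate pairs are outside the transformation described here.

```python
def ucs2_to_utf8(code: int) -> bytes:
    """Convert one 16-bit UCS/Unicode value to its UTF-8 octets."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only 16-bit values are handled in this sketch")
    if 0xD800 <= code <= 0xDFFF:
        raise ValueError("surrogate values are not handled in this sketch")

    if code <= 0x007F:
        # 00000000 0xxxxxxx -> 0xxxxxxx
        return bytes([code])
    if code <= 0x07FF:
        # 00000xxx xxyyyyyy -> 110xxxxx 10yyyyyy
        return bytes([
            0b11000000 | (code >> 6),
            0b10000000 | (code & 0b00111111),
        ])
    # xxxxyyyy yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz
    return bytes([
        0b11100000 | (code >> 12),
        0b10000000 | ((code >> 6) & 0b00111111),
        0b10000000 | (code & 0b00111111),
    ])


# Example: 05EA hex falls in the 0080 to 07FF range, so it needs 2 octets.
tav = ucs2_to_utf8(0x05EA)
```

The three branches correspond one-to-one with the three ranges in the table, so each range's octet count is fixed by the range test alone.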

Examples:

Character	MARC-8 encoding	UCS/Unicode encoding	UTF-8 encoding
Comma	00101100 (2C hex)	00000000 00101100 (002C hex)	00101100
Latin small letter h	01101000 (68 hex)	00000000 01101000 (0068 hex)	01101000
Macron	11100101 (E5 hex)	00000011 00000100 (0304 hex)	11001100 10000100
Hebrew letter tav	01111010 (7A hex)	00000101 11101010 (05EA hex)	11010111 10101010
Combining ligature left half	11101011 (EB hex)	11111110 00100000 (FE20 hex)	11101111 10111000 10100000
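
The UTF-8 column of the examples can be cross-checked against Python's built-in UTF-8 codec. The sketch below is only a consistency check; the names `EXAMPLES` and `utf8_bits` are assumptions introduced for illustration, and the values are copied from the table above.

```python
# UCS/Unicode value and expected UTF-8 bit pattern for each example row.
EXAMPLES = {
    "Comma": (0x002C, "00101100"),
    "Latin small letter h": (0x0068, "01101000"),
    "Macron": (0x0304, "11001100 10000100"),
    "Hebrew letter tav": (0x05EA, "11010111 10101010"),
    "Combining ligature left half": (0xFE20, "11101111 10111000 10100000"),
}


def utf8_bits(code: int) -> str:
    """Return the UTF-8 octets of a code point as space-separated bit strings."""
    return " ".join(f"{octet:08b}" for octet in chr(code).encode("utf-8"))


# Names of any rows whose table value disagrees with the codec's output.
mismatches = [
    name
    for name, (code, expected) in EXAMPLES.items()
    if utf8_bits(code) != expected
]
```

An empty `mismatches` list indicates that every row of the table agrees with the codec.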

The transformation of surrogate pairs is not described above as they are not currently required for any of the MARC-8 repertoire. See the Unicode Standard 3.0 documentation for information on surrogate pairs.

UCS/Unicode Markers:

In the MARC 21 record, Leader character position 9 contains the value a if the record is encoded using UCS/Unicode. Field 066 is not used in a MARC 21 record encoded according to UCS/Unicode as specific character sets are not distinguished in the UCS/Unicode environment.
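
A program reading a record might test the marker as sketched below. The function name `is_ucs_unicode_record` and the sample Leader strings are made up for illustration; the sketch assumes the 24-character Leader has already been extracted as text and that character positions are counted from 0, so position 9 is the tenth character.

```python
def is_ucs_unicode_record(leader: str) -> bool:
    """Return True if Leader character position 9 contains the value 'a'."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 Leader is 24 characters long")
    return leader[9] == "a"


# Hypothetical Leaders: the first has 'a' in position 9, the second a blank.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader = "00714cam  2200205 a 4500"

flags = (is_ucs_unicode_record(unicode_leader),
         is_ucs_unicode_record(marc8_leader))
```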

Standard UCS/Unicode:

MARC 21 uses standard UCS/Unicode. The contents of a field in a MARC 21 record are consistent with Unicode conventions. For example, the diacritics are encoded after the character they modify, rather than before as in MARC-8. Bidirectional data is recorded in logical order, from the first character to the last, regardless of field orientation, and with no exceptions.
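
The difference in diacritic order can be illustrated with a short Python sketch. It assumes, purely for illustration, that each MARC-8 character has already been mapped to its UCS/Unicode value but that the sequence is still in MARC-8 order (diacritics before the character they modify). The function name `marc8_order_to_unicode_order` is hypothetical, and keeping several diacritics on one character in their original relative order is an assumption of this sketch, not a rule stated in this specification.

```python
import unicodedata


def marc8_order_to_unicode_order(text: str) -> str:
    """Move each run of combining marks to follow the next base character."""
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)  # marks now follow the base character
            pending.clear()
    out.extend(pending)  # marks with no following base are left at the end
    return "".join(out)


# MARC-8 order: macron (0304 hex) first, then the letter o.
reordered = marc8_order_to_unicode_order("\u0304o")
```

Here `reordered` is the letter o followed by the macron, which is the order expected in a UCS/Unicode record.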

Private Use Area values have been used for a small number of Chinese, Japanese and Korean characters to preserve their integrity when mapping to the East Asian Character Code (EACC) used in the MARC-8 environment. Procedures have been initiated to add these characters to standard UCS/Unicode where possible.
