Cataloger's Reference Shelf
MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media
Specification (Character Sets, Part 2: UCS/Unicode Environment)
Only a subset of the characters in UCS/Unicode should be used at this time: the UCS characters that correspond to the more than 16,000 characters defined in MARC 21 for the MARC-8 character sets. This subset is called the MARC 21 repertoire of characters. The correspondences between the MARC-8 (8- and 24-bit) and UCS/Unicode (16-bit) character codes are shown in the character set lists in Part 3: Code Tables. The Chinese, Japanese, and Korean character correspondences are at the following website: http://www.unicode.org/charts.
Unicode characters are encoded according to the rules of UTF-8 (UCS Transformation Format 8), which uses designated bits to indicate whether a UCS/Unicode character is represented by 1 octet (8 bits) or by multiple octets. This encoding has the advantage of allowing the Basic Latin (ASCII) subset of the MARC 21 repertoire to be encoded the same as in MARC-8 (with 1 octet), thus preserving the basic structural elements of the MARC 21 record while enabling record content to be multiscript. A brief description of UTF-8 encoding follows; a fuller description is given in the UCS and Unicode standards.
UTF-8 is an encoding form for the UCS/Unicode 16-bit repertoire. It represents characters in a systematic way as 1, 2, or 3 octets, using the left-most bits of each octet to indicate how the octet is to be interpreted.
Left-most bits | Meaning of left-most bits for character encoding
0 | character composed of 1 octet
110 | first octet of 2 octet character
1110 | first octet of 3 octet character
10 | octet is not the first octet for a character, it is the 2nd or 3rd octet of a multi-octet character
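The table above can be turned into a small classifier. The following is a minimal Python sketch, not part of the specification; the function names `classify_octet` and `split_characters` and the returned labels are illustrative assumptions.

```python
def classify_octet(octet: int) -> str:
    """Label one UTF-8 octet by its left-most bits (labels are hypothetical)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if octet >> 7 == 0b0:
        return "single"          # 0xxxxxxx: character composed of 1 octet
    if octet >> 6 == 0b10:
        return "continuation"    # 10xxxxxx: 2nd or 3rd octet of a character
    if octet >> 5 == 0b110:
        return "lead-2"          # 110xxxxx: first octet of a 2-octet character
    if octet >> 4 == 0b1110:
        return "lead-3"          # 1110xxxx: first octet of a 3-octet character
    return "outside-table"       # 1111xxxx: not covered by the table above


def split_characters(data: bytes) -> list[bytes]:
    """Group octets into characters by attaching continuation octets
    to the most recent non-continuation octet."""
    chars: list[bytes] = []
    for octet in data:
        if classify_octet(octet) == "continuation" and chars:
            chars[-1] += bytes([octet])
        else:
            chars.append(bytes([octet]))
    return chars
```

This sketch only groups octets; it does not check that a lead octet is followed by the expected number of continuation octets, which a real decoder would need to do.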
The following transformation is used when converting UCS/Unicode 16-bit characters to UTF-8.
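UCS/Unicode range | UCS/Unicode form | UTF-8 1st octet | UTF-8 2nd octet | UTF-8 3rd octet
0000 to 007F hex | 00000000 0xxxxxxx | 0xxxxxxx | - | -
0080 to 07FF hex | 00000xxx xxyyyyyy | 110xxxxx | 10yyyyyy | -
0800 to FFFF hex | xxxxyyyy yyzzzzzz | 1110xxxx | 10yyyyyy | 10zzzzzz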
Note: x indicates the part of the UCS/Unicode character encoding that will be transferred to the first UTF-8 octet; y the part that will be transferred to the second UTF-8 octet; and z the part that will be transferred to the third UTF-8 octet.
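The transformation can be written out with bit shifts and masks. This is a minimal Python sketch under the document's 16-bit assumption; the function name `utf8_octets` is hypothetical, and surrogate values are rejected because the text does not cover them.

```python
def utf8_octets(code_point: int) -> bytes:
    """Convert one 16-bit UCS/Unicode value to its UTF-8 octets."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit values are handled in this sketch")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not handled in this sketch")

    if code_point <= 0x007F:
        # 00000000 0xxxxxxx -> 0xxxxxxx
        return bytes([code_point])

    if code_point <= 0x07FF:
        # 00000xxx xxyyyyyy -> 110xxxxx 10yyyyyy
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b00111111),
        ])

    # xxxxyyyy yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz
    return bytes([
        0b11100000 | (code_point >> 12),
        0b10000000 | ((code_point >> 6) & 0b00111111),
        0b10000000 | (code_point & 0b00111111),
    ])
```

For values in this range the result should agree with Python's built-in codec, that is, with `chr(code_point).encode("utf-8")`.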
Examples:
Character | MARC-8 | UCS/Unicode encoding | UTF-8 encoding
Comma | 00101100 | 00000000 00101100 | 00101100
Latin small letter h | 01101000 | 00000000 01101000 | 01101000
Macron | 11100101 | 00000011 00000100 | 11001100 10000100
Hebrew letter tav | 01111010 | 00000101 11101010 | 11010111 10101010
Combining ligature left half | 11101011 | 11111110 00100000 | 11101111 10111000 10100000
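The UCS/Unicode and UTF-8 columns of these examples can be cross-checked with Python's standard library. The sketch below is illustrative only; the helper names `bits` and `example_row` are assumptions. The MARC-8 column is not reproduced because Python ships no MARC-8 codec.

```python
import unicodedata

# Code points taken from the UCS/Unicode column of the examples.
EXAMPLE_CODE_POINTS = [0x002C, 0x0068, 0x0304, 0x05EA, 0xFE20]


def bits(data: bytes) -> str:
    """Render octets as space-separated 8-bit binary strings."""
    return " ".join(f"{octet:08b}" for octet in data)


def example_row(code_point: int) -> tuple[str, str]:
    """Return the (UCS/Unicode 16-bit, UTF-8) bit patterns for one value."""
    ucs = code_point.to_bytes(2, "big")         # 16-bit form, high octet first
    utf8 = chr(code_point).encode("utf-8")      # built-in UTF-8 encoder
    return bits(ucs), bits(utf8)


if __name__ == "__main__":
    for cp in EXAMPLE_CODE_POINTS:
        ucs, utf8 = example_row(cp)
        # unicodedata.name gives the Unicode character name for the value
        print(f"{unicodedata.name(chr(cp)):30} {ucs}  {utf8}")
```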
The transformation of surrogate pairs is not described above because they are not currently required for any character in the MARC-8 repertoire. See the Unicode Standard 3.0 documentation for information on surrogate pairs.
In the MARC 21 record, Leader character position 9 contains the value a if the record is encoded using UCS/Unicode. Field 066 is not used in a MARC 21 record encoded according to UCS/Unicode as specific character sets are not distinguished in the UCS/Unicode environment.
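A reader of MARC 21 records might apply these two rules as follows. This is a minimal Python sketch; the function names and the sample leaders in the usage note are made up for illustration, and the 24-character leader length comes from the MARC 21 record structure rather than from this section.

```python
LEADER_LENGTH = 24  # a MARC 21 Leader is 24 character positions long


def is_unicode_record(leader: str) -> bool:
    """True when Leader character position 9 contains the value 'a'."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError("a MARC 21 leader must be 24 characters long")
    return leader[9] == "a"  # positions are counted from 0


def field_066_is_consistent(leader: str, field_tags: list[str]) -> bool:
    """Field 066 should be absent from a record encoded in UCS/Unicode.

    `field_tags` is a hypothetical list of the tags present in the record.
    """
    return not (is_unicode_record(leader) and "066" in field_tags)
```

For example, with the invented leader `"00714cam a2200205 a 4500"`, `is_unicode_record` returns `True`; with `"00714cam  2200205 a 4500"` (position 9 blank) it returns `False`.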
MARC 21 uses standard UCS/Unicode. The contents of a field in a MARC 21 record are consistent with Unicode conventions. For example, the diacritics are encoded after the character they modify, rather than before as in MARC-8. Bidirectional data is recorded in logical order, from the first character to the last, regardless of field orientation, and with no exceptions.
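The change in diacritic order can be illustrated with a short Python sketch. It assumes the characters have already been mapped one-to-one to Unicode values but are still in MARC-8 order (diacritics before the character they modify), and it treats any character with a nonzero canonical combining class as a diacritic. Both assumptions and the function name are illustrative, not part of the specification.

```python
import unicodedata


def marc8_order_to_unicode_order(text: str) -> str:
    """Move each run of combining marks so it follows its base character."""
    out: list[str] = []
    pending: list[str] = []  # diacritics seen before their base character
    for ch in text:
        if unicodedata.combining(ch) != 0:
            pending.append(ch)       # hold the mark until the base arrives
        else:
            out.append(ch)           # base character first ...
            out.extend(pending)      # ... then its marks, in original order
            pending.clear()
    out.extend(pending)              # trailing marks with no base are kept
    return "".join(out)
```

For instance, a macron (U+0304) followed by `a` becomes `a` followed by the macron, which normalizes under NFC to the single character U+0101.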
Private Use Area values have been used for a small number of Chinese, Japanese and Korean characters to preserve their integrity when mapping to the East Asian Character Code (EACC) used in the MARC-8 environment. Procedures have been initiated to add these characters to standard UCS/Unicode where possible.
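Software exchanging records may want to flag such characters. The following minimal Python sketch reports Private Use Area characters in a string; it assumes the 16-bit Private Use Area range U+E000 to U+F8FF defined by the Unicode Standard, and the function name is hypothetical. It does not say which of these values MARC 21 actually assigns.

```python
PUA_START = 0xE000  # first Private Use Area value in the 16-bit range
PUA_END = 0xF8FF    # last Private Use Area value in the 16-bit range


def private_use_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every Private Use Area character."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if PUA_START <= ord(ch) <= PUA_END
    ]
```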