Character Repertoire:

Only a subset of the characters in UCS/Unicode should be used at this time.  This subset consists of the UCS characters that correspond to the more than 16,000 characters defined in MARC 21 for the MARC-8 character sets, and it is called the MARC 21 repertoire of characters.  The correspondences between the MARC-8 (8- and 24-bit) and UCS/Unicode (16-bit) character codes are shown in the character set lists in Part 3: Code Tables.  The Chinese, Japanese, and Korean character correspondences are at the following website:  http://www.unicode.org/charts.

Encoding:

Unicode characters are encoded according to the rules of UTF-8 (UCS Transformation Format 8), which uses designated bits to indicate whether a UCS/Unicode character is represented by 1 octet (8 bits) or by multiple octets.  This encoding has the advantage of allowing the Basic Latin (ASCII) subset of the MARC 21 repertoire to be encoded the same as in MARC-8 (with 1 octet), thus preserving the basic structural elements of the MARC 21 record, while enabling record content to be multiscript.  A brief description of UTF-8 encoding follows; a fuller description is given in the UCS and Unicode standards.

UTF-8 is an encoding form for the UCS/Unicode 16-bit repertoire.  It represents characters in a systematic way as 1, 2, or 3 octets, using the left-most bits of each octet to indicate how the octet is to be interpreted.

Left-most bits    Meaning of left-most bits for character encoding

0                 character composed of 1 octet
110               first octet of 2 octet character
1110              first octet of 3 octet character
10                octet is not the first octet for a character, it is the 2nd or 3rd octet of a multi-octet character
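The table of left-most bits can be read as a set of bit-mask tests.  The following is a minimal sketch in Python, not part of the specification; the function name octet_role and the labels it returns are hypothetical and chosen only for illustration.

```python
def octet_role(octet: int) -> str:
    """Classify one octet by its left-most bits (hypothetical helper)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if octet & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return "single"                    # character composed of 1 octet
    if octet & 0b11000000 == 0b10000000:   # 10xxxxxx
        return "continuation"              # 2nd or 3rd octet of a character
    if octet & 0b11100000 == 0b11000000:   # 110xxxxx
        return "lead2"                     # first octet of 2 octet character
    if octet & 0b11110000 == 0b11100000:   # 1110xxxx
        return "lead3"                     # first octet of 3 octet character
    return "other"                         # pattern not listed in the table
```

For instance, octet_role(0xCC) returns "lead2" and octet_role(0x84) returns "continuation", which matches the two octets of a 2-octet character.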

The following transformation is used when converting UCS/Unicode 16-bit characters to UTF-8.

UCS/Unicode values                         UTF-8 values

Range               Form

0000 to 007F hex    00000000 0xxxxxxx      0xxxxxxx
0080 to 07FF hex    00000xxx xxyyyyyy      110xxxxx 10yyyyyy
0800 to FFFF hex    xxxxyyyy yyzzzzzz      1110xxxx 10yyyyyy 10zzzzzz

Note:  x indicates the part of the UCS/Unicode character encoding that will be transferred to the first UTF-8 octet; y the part that will be transferred to the second UTF-8 octet; and z the part that will be transferred to the third UTF-8 octet.
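The transformation above can be sketched as a short Python function.  This is an illustrative sketch only: the name ucs2_to_utf8 is hypothetical, and the decision to reject surrogate code values (D800 to DFFF hex) is an assumption made here, since surrogate pairs are outside the scope of this description.

```python
def ucs2_to_utf8(code: int) -> bytes:
    """Convert one 16-bit UCS/Unicode value to its UTF-8 octets (sketch)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit value (0000 to FFFF hex)")
    if 0xD800 <= code <= 0xDFFF:
        # Assumption: surrogates are rejected rather than encoded.
        raise ValueError("surrogate values are not handled by this sketch")
    if code <= 0x007F:
        # 00000000 0xxxxxxx -> 0xxxxxxx
        return bytes([code])
    if code <= 0x07FF:
        # 00000xxx xxyyyyyy -> 110xxxxx 10yyyyyy
        x = code >> 6          # the five x bits
        y = code & 0x3F        # the six y bits
        return bytes([0xC0 | x, 0x80 | y])
    # xxxxyyyy yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz
    x = code >> 12             # the four x bits
    y = (code >> 6) & 0x3F     # the six y bits
    z = code & 0x3F            # the six z bits
    return bytes([0xE0 | x, 0x80 | y, 0x80 | z])
```

Called with 0304 hex (macron), the function returns the two octets CC 84 hex; called with FE20 hex, it returns the three octets EF B8 A0 hex.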

Examples:

Character                       MARC-8 encoding       UCS/Unicode encoding             UTF-8 encoding

Comma                           00101100 (2C hex)     00000000 00101100 (002C hex)     00101100
Latin small letter h            01101000 (68 hex)     00000000 01101000 (0068 hex)     01101000
Macron                          11100101 (E5 hex)     00000011 00000100 (0304 hex)     11001100 10000100
Hebrew letter tav               01111010 (7A hex)     00000101 11101010 (05EA hex)     11010111 10101010
Combining ligature left half    11101011 (EB hex)     11111110 00100000 (FE20 hex)     11101111 10111000 10100000
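The UTF-8 column of the examples can be cross-checked against a general-purpose UTF-8 codec.  The sketch below uses Python's built-in str.encode for this purpose; the names EXAMPLES and utf8_bits are hypothetical, and the code points are copied from the UCS/Unicode column above.

```python
# Code points taken from the UCS/Unicode column of the examples.
EXAMPLES = {
    "Comma": 0x002C,
    "Latin small letter h": 0x0068,
    "Macron": 0x0304,
    "Hebrew letter tav": 0x05EA,
    "Combining ligature left half": 0xFE20,
}


def utf8_bits(code: int) -> str:
    """Return the UTF-8 octets of one code point as space-separated binary."""
    octets = chr(code).encode("utf-8")  # built-in codec, not a MARC-specific API
    return " ".join(f"{octet:08b}" for octet in octets)


for name, code in EXAMPLES.items():
    print(f"{name}: {code:04X} hex -> {utf8_bits(code)}")
```

Each printed bit pattern should agree with the UTF-8 encoding given for that character in the examples.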

The transformation of surrogate pairs is not described above because surrogate pairs are not currently required for any character in the MARC-8 repertoire.  See the Unicode Standard 3.0 documentation for information on surrogate pairs.

UCS/Unicode Markers:

In the MARC 21 record, Leader character position 9 contains the value a if the record is encoded using UCS/Unicode.  Field 066 is not used in a MARC 21 record encoded according to UCS/Unicode as specific character sets are not distinguished in the UCS/Unicode environment.
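A program reading records can test the Leader marker before choosing how to decode the record.  The sketch below is illustrative only; it assumes the record is held as raw bytes beginning with the 24-character Leader, that character positions are counted from 0, and the function name is hypothetical.

```python
LEADER_LENGTH = 24  # the MARC 21 Leader occupies character positions 0-23


def is_ucs_unicode_record(record: bytes) -> bool:
    """Return True if Leader character position 9 contains the value a."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than the 24-character Leader")
    # Positions are counted from 0, so position 9 is the tenth octet.
    return record[9:10] == b"a"
```

With a Leader such as "00714cam a2200205 a 4500" the function returns True; with a blank in position 9 it returns False.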

Standard UCS/Unicode:

MARC 21 uses standard UCS/Unicode.  The contents of a field in a MARC 21 record are consistent with Unicode conventions.  For example, the diacritics are encoded after the character they modify, rather than before as in MARC-8.  Bidirectional data is recorded in logical order, from the first character to the last, regardless of field orientation, and with no exceptions.
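The difference in diacritic order can be illustrated with a small reordering routine.  The sketch below is not a MARC-8 converter: it assumes, hypothetically, that the characters have already been mapped to UCS/Unicode code points but are still in MARC-8 sequence (diacritics before the character they modify), and it treats any character with a non-zero canonical combining class as a diacritic.  The relative order of several diacritics on one character is left unchanged, which is also an assumption.

```python
import unicodedata


def marc8_order_to_unicode_order(chars: str) -> str:
    """Move each run of combining marks to follow the next base character."""
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # trailing marks with no base character are kept
    return "".join(out)
```

Given a macron (0304 hex) followed by the letter a, the function returns the letter a followed by the macron.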

Private Use Area values have been used for a small number of Chinese, Japanese and Korean characters to preserve their integrity when mapping to the East Asian Character Code (EACC) used in the MARC-8 environment.  Procedures have been initiated to add these characters to standard UCS/Unicode where possible.
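Because Private Use Area values have no standard meaning outside this mapping, a system exchanging records may want to locate them.  The sketch below assumes the Private Use Area of the 16-bit repertoire, E000 to F8FF hex; the function name is hypothetical, and the sketch does not identify which EACC character a given value stands for.

```python
from typing import List

PUA_FIRST = 0xE000  # first Private Use Area value in the 16-bit repertoire
PUA_LAST = 0xF8FF   # last Private Use Area value in the 16-bit repertoire


def private_use_positions(text: str) -> List[int]:
    """Return the index of every Private Use Area character in text."""
    return [
        index
        for index, ch in enumerate(text)
        if PUA_FIRST <= ord(ch) <= PUA_LAST
    ]
```

An empty result indicates that the text contains no Private Use Area values.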
