Cataloger's Reference Shelf
MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media
UTF-8 Transformation Details (Character Sets, Part 3: Unicode Encoding Environment)
The UTF-8 transformation of a Unicode scalar value into an octet sequence is accomplished by reallocating its bits into octets that begin with bit sequences identifying the function of the octet.
Left-most bits | Meaning of left-most bits for character encoding
0 | character composed of 1 octet
110 | first octet of 2-octet character
1110 | first octet of 3-octet character
11110 | first octet of 4-octet character
10 | octet is not the first octet of a character; it is the 2nd, 3rd, or 4th octet of a multi-octet character
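The table above can be read as a prefix test on a single octet. The following is a minimal sketch in Python, not part of the specification; the function name `classify_octet` and its return labels are illustrative assumptions. It checks only the left-most bits, so it does not reject octets that match a pattern but never occur in well-formed UTF-8.

```python
def classify_octet(octet: int) -> str:
    """Describe the role of one UTF-8 octet from its left-most bits.

    The name and the returned labels are hypothetical, chosen for this sketch.
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if octet >> 7 == 0b0:
        return "single"        # 0xxxxxxx: character composed of 1 octet
    if octet >> 6 == 0b10:
        return "continuation"  # 10xxxxxx: 2nd, 3rd, or 4th octet
    if octet >> 5 == 0b110:
        return "lead-2"        # 110xxxxx: first octet of 2-octet character
    if octet >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx: first octet of 3-octet character
    if octet >> 3 == 0b11110:
        return "lead-4"        # 11110xxx: first octet of 4-octet character
    return "invalid"           # 11111xxx matches no row of the table


# Roles of the three octets E2 84 93
print([classify_octet(b) for b in b"\xe2\x84\x93"])
```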
The following patterns show how bits of the scalar value are allocated to UTF-8 octets.
Range (hex) | Unicode scalar value | UTF-8 value
0000 to 007F | 00000000 0xxxxxxx | 0xxxxxxx
0080 to 07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx
0800 to FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx
10000 to 10FFFF | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu 10uuzzzz
NOTE: x, y, z, and u are used to show how bits get distributed among UTF-8 octets; the final two octets of a four-octet sequence, not shown here, are identical to the final two of the three-octet sequence. Observe that the first hex digit of a four-octet sequence will always be F because the initial octet begins with bits 11110. Similarly, the first hex digit of a three-octet sequence will always be E, and the first hex digit of a two-octet sequence will be C or D. A second or subsequent octet will begin with hex 8, 9, A, or B.
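The bit allocation patterns can be sketched as shifts and masks. The following Python function is an illustration under stated assumptions, not a reference implementation: the name `utf8_octets` is hypothetical, and the rejection of the surrogate range D800 to DFFF reflects the Unicode definition of a scalar value rather than anything stated in the table.

```python
def utf8_octets(scalar: int) -> bytes:
    """Encode one Unicode scalar value following the bit patterns above.

    Hypothetical helper for illustration; Python's built-in str.encode
    does the same job in practice.
    """
    # Assumption: surrogates (D800-DFFF) are not scalar values and are rejected.
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar <= 0x7F:
        # 0xxxxxxx
        return bytes([scalar])
    if scalar <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0b11000000 | (scalar >> 6),
                      0b10000000 | (scalar & 0b111111)])
    if scalar <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0b11100000 | (scalar >> 12),
                      0b10000000 | ((scalar >> 6) & 0b111111),
                      0b10000000 | (scalar & 0b111111)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0b11110000 | (scalar >> 18),
                  0b10000000 | ((scalar >> 12) & 0b111111),
                  0b10000000 | ((scalar >> 6) & 0b111111),
                  0b10000000 | (scalar & 0b111111)])


# The leading hex digit of each result reflects the sequence length.
for value in (0x2C, 0x0304, 0x2113, 0x10000):
    print(f"{value:04X} -> {utf8_octets(value).hex().upper()}")
```

Printing the results in hexadecimal makes the observation in the note easy to check: the two-octet result starts with C, the three-octet result with E, and the four-octet result with F.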
Example of three encodings expressed in binary and hexadecimal notation:
Character | MARC-8 | Unicode scalar value | Unicode UTF-8
Comma | 00101100 | 00000000 00101100 | 00101100
Latin small letter h | 01101000 | 00000000 01101000 | 01101000
Macron | 11100101 | 00000011 00000100 | 11001100 10000100
Hebrew letter tav | 01111010 | 00000101 11101010 | 11010111 10101010
Script small l | 11000001 | 00100001 00010011 | 11100010 10000100 10010011
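The UTF-8 column of the examples can be cross-checked against Python's built-in encoder. This is a small sketch only: the dictionary `rows` and the helper `utf8_binary` are names invented here, the scalar values are the table's binary values written in hex, and the MARC-8 column is not checked.

```python
# Scalar values and expected UTF-8 bit strings, copied from the table above.
rows = {
    "Comma": (0x002C, "00101100"),
    "Latin small letter h": (0x0068, "01101000"),
    "Macron": (0x0304, "11001100 10000100"),
    "Hebrew letter tav": (0x05EA, "11010111 10101010"),
    "Script small l": (0x2113, "11100010 10000100 10010011"),
}


def utf8_binary(scalar: int) -> str:
    """Return the UTF-8 encoding of a scalar value as space-separated octets."""
    return " ".join(f"{octet:08b}" for octet in chr(scalar).encode("utf-8"))


for name, (scalar, expected) in rows.items():
    # Fails loudly if a table row and the built-in encoder disagree.
    assert utf8_binary(scalar) == expected, name
    print(f"{name}: U+{scalar:04X} -> {utf8_binary(scalar)}")
```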
No example of a four-octet sequence is shown, but the preceding table gives the pattern. Characters beyond FFFF (hex) are very rarely encountered in MARC 21.
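For readers who want to see the four-octet pattern applied, here is a hedged worked example in Python. The character chosen, U+1D11E (MUSICAL SYMBOL G CLEF), is an arbitrary illustration and does not come from the MARC 21 specification; the variable names follow the u, z, y, x letters used in the pattern table.

```python
scalar = 0x1D11E  # illustrative choice, not taken from the specification

# Write the scalar value as 21 bits: uuuuu zzzz yyyyyy xxxxxx
bits = f"{scalar:021b}"
u, z, y, x = bits[:5], bits[5:9], bits[9:15], bits[15:]

# Distribute the bits as 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
octets = ["11110" + u[:3], "10" + u[3:] + z, "10" + y, "10" + x]
encoded = bytes(int(octet, 2) for octet in octets)

print(" ".join(octets))
print(encoded.hex().upper())

# Cross-check against Python's built-in encoder.
assert encoded == chr(scalar).encode("utf-8")
```

The first octet begins with bits 11110, so its first hex digit is F, and the last two octets are built exactly as in the three-octet pattern.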