UTF-8 Transformation Details (Character Sets: Unicode)

The UTF-8 transformation of a Unicode scalar value into an octet sequence is accomplished by reallocating its bits into octets that begin with bit sequences identifying the function of the octet.

Left-most bits	Meaning of left-most bits for character encoding
0	character composed of 1 octet
110	first octet of 2-octet character
1110	first octet of 3-octet character
11110	first octet of 4-octet character
10	not the first octet of a character; it is the 2nd, 3rd, or 4th octet of a multi-octet character
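
The table above can be read as a simple classification rule. The following Python sketch is illustrative only: the function name octet_role and its return labels are hypothetical, and it checks the left-most bits only, not whether the octet is otherwise legal in UTF-8.

```python
def octet_role(octet: int) -> str:
    """Classify one octet by its left-most bits (pattern check only)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if octet >> 7 == 0b0:
        return "single"        # 0xxxxxxx: character composed of 1 octet
    if octet >> 6 == 0b10:
        return "continuation"  # 10xxxxxx: 2nd, 3rd, or 4th octet
    if octet >> 5 == 0b110:
        return "lead-2"        # 110xxxxx: first octet of 2-octet character
    if octet >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx: first octet of 3-octet character
    if octet >> 3 == 0b11110:
        return "lead-4"        # 11110xxx: first octet of 4-octet character
    return "invalid"           # 11111xxx matches no row of the table


# Classify each octet of the sequence E2 84 93 (hex).
roles = [octet_role(b) for b in bytes.fromhex("E28493")]
print(roles)
```

Because every non-initial octet begins with 10, a reader of the octet stream can find the start of a character from any position by skipping octets classified as "continuation".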

The following patterns show how bits of the scalar value are allocated to UTF-8 octets.

Range (hex)	Unicode scalar value	UTF-8 value
0000 to 007F	00000000 0xxxxxxx	0xxxxxxx
0080 to 07FF	00000yyy yyxxxxxx	110yyyyy 10xxxxxx
0800 to FFFF	zzzzyyyy yyxxxxxx	1110zzzz 10yyyyyy 10xxxxxx
10000 to 10FFFF	000uuuuu zzzzyyyy yyxxxxxx	11110uuu 10uuzzzz

NOTE: x, y, z, and u show how bits are distributed among UTF-8 octets; the final two octets of a four-octet sequence, not shown here, are identical to the final two of the three-octet sequence. Observe that the first hex digit of a four-octet sequence will always be F because the initial octet begins with bits 11110. Similarly, the first hex digit of a three-octet sequence will always be E, and the first hex digit of a two-octet sequence will be C or D. A second or subsequent octet will begin with hex 8, 9, A, or B.
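
The bit allocation can be expressed with shifts and masks. The following Python sketch is one possible rendering under stated assumptions: the function name utf8_encode is hypothetical, and the rejection of surrogate code points (D800 to DFFF hex) reflects the Unicode definition of a scalar value rather than anything stated in the table.

```python
def utf8_encode(scalar: int) -> bytes:
    """Reallocate the bits of a Unicode scalar value into UTF-8 octets."""
    # Assumption: surrogates are excluded, since they are not scalar values.
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar <= 0x7F:                       # 0xxxxxxx
        return bytes([scalar])
    if scalar <= 0x7FF:                      # 110yyyyy 10xxxxxx
        return bytes([0b11000000 | (scalar >> 6),
                      0b10000000 | (scalar & 0b111111)])
    if scalar <= 0xFFFF:                     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0b11100000 | (scalar >> 12),
                      0b10000000 | ((scalar >> 6) & 0b111111),
                      0b10000000 | (scalar & 0b111111)])
    # 11110uuu 10uuzzzz, then the same final two octets as the 3-octet form
    return bytes([0b11110000 | (scalar >> 18),
                  0b10000000 | ((scalar >> 12) & 0b111111),
                  0b10000000 | ((scalar >> 6) & 0b111111),
                  0b10000000 | (scalar & 0b111111)])


print(utf8_encode(0x0304).hex().upper())
```

In practice Python's built-in str.encode("utf-8") does the same work; the explicit version is shown only to make the bit patterns visible.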

Example of three encodings expressed in binary and hexadecimal notation:

Character	MARC-8	Unicode scalar value	Unicode UTF-8
Comma	00101100 2C (hex)	00000000 00101100 002C (hex)	00101100 2C (hex)
Latin small letter h	01101000 68 (hex)	00000000 01101000 0068 (hex)	01101000 68 (hex)
Macron	11100101 E5 (hex)	00000011 00000100 0304 (hex)	11001100 10000100 CC84 (hex)
Hebrew letter tav	01111010 7A (hex)	00000101 11101010 05EA (hex)	11010111 10101010 D7AA (hex)
Script small l	11000001 C1 (hex)	00100001 00010011 2113 (hex)	11100010 10000100 10010011 E28493 (hex)
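
The UTF-8 column of the table can be cross-checked with Python's built-in encoder. This is a small verification sketch; the variable names are illustrative, and the MARC-8 column is left out because the Python standard library has no MARC-8 codec.

```python
# (character name, Unicode scalar value, expected UTF-8 in hex) from the table
examples = [
    ("Comma", 0x002C, "2C"),
    ("Latin small letter h", 0x0068, "68"),
    ("Macron", 0x0304, "CC84"),
    ("Hebrew letter tav", 0x05EA, "D7AA"),
    ("Script small l", 0x2113, "E28493"),
]

rows = []
for name, scalar, expected_hex in examples:
    encoded = chr(scalar).encode("utf-8")
    binary = " ".join(f"{octet:08b}" for octet in encoded)
    rows.append((name, binary, encoded.hex().upper()))
    # Fail loudly if the built-in encoder disagrees with the table.
    assert encoded.hex().upper() == expected_hex, name

for name, binary, hex_form in rows:
    print(f"{name}\t{binary}\t{hex_form} (hex)")
```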

No example of a four-octet sequence is shown, but the preceding table gives the pattern. Characters beyond FFFF (hex) are very rarely encountered in MARC 21.
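
For readers who want to see the four-octet pattern applied, the following Python sketch works through one value bit by bit. The choice of 1D11E (hex), the musical symbol G clef, is purely illustrative and does not come from the MARC 21 documentation.

```python
scalar = 0x1D11E  # illustrative choice, not taken from the table above

# Write the scalar value as 21 bits: uuuuu zzzz yyyyyy xxxxxx
bits = f"{scalar:021b}"
u, z, y, x = bits[:5], bits[5:9], bits[9:15], bits[15:]

octets = [
    "11110" + u[:3],   # 11110uuu
    "10" + u[3:] + z,  # 10uuzzzz
    "10" + y,          # 10yyyyyy (same as in the three-octet form)
    "10" + x,          # 10xxxxxx (same as in the three-octet form)
]
encoded = bytes(int(octet, 2) for octet in octets)

# Cross-check against Python's built-in encoder.
assert encoded == chr(scalar).encode("utf-8")
print(" ".join(octets), encoded.hex().upper())
```

As expected from the note above, the first hex digit of the result is F and each later octet begins with 8, 9, A, or B.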
