The UTF-8 transformation of a Unicode scalar value into an octet sequence is accomplished by reallocating its bits into octets that begin with bit sequences identifying the function of the octet.

| Left-most bits | Meaning of left-most bits for character encoding |
|---|---|
| 0 | character composed of 1 octet |
| 110 | first octet of a 2-octet character |
| 1110 | first octet of a 3-octet character |
| 11110 | first octet of a 4-octet character |
| 10 | not the first octet of a character; it is the 2nd, 3rd, or 4th octet of a multi-octet character |
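
As an illustration of the prefixes listed above, the following minimal Python sketch classifies a single octet by its left-most bits. The function name `octet_role` and the labels it returns are hypothetical and chosen only for this example. The sketch checks the bit prefix alone and does not test whether a whole sequence is well-formed UTF-8.

```python
def octet_role(octet: int) -> str:
    """Describe the function of a UTF-8 octet from its left-most bits.

    The name and the returned labels are illustrative assumptions.
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if octet >> 7 == 0b0:        # 0xxxxxxx
        return "single"          # character composed of 1 octet
    if octet >> 6 == 0b10:       # 10xxxxxx
        return "continuation"    # 2nd, 3rd, or 4th octet of a character
    if octet >> 5 == 0b110:      # 110xxxxx
        return "lead2"           # first octet of a 2-octet character
    if octet >> 4 == 0b1110:     # 1110xxxx
        return "lead3"           # first octet of a 3-octet character
    if octet >> 3 == 0b11110:    # 11110xxx
        return "lead4"           # first octet of a 4-octet character
    return "invalid"             # 11111xxx matches none of the prefixes


# Classify each octet of the three-octet sequence E2 84 93.
roles = [octet_role(b) for b in bytes.fromhex("E28493")]
print(roles)
```

For the sequence E2 84 93, the first octet is reported as the start of a 3-octet character and the remaining two as continuation octets.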

The following patterns show how bits of the scalar value are allocated to UTF-8 octets.

| Range (hex) | Unicode scalar value | UTF-8 value |
|---|---|---|
| 0000 to 007F | 00000000 0xxxxxxx | 0xxxxxxx |
| 0080 to 07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx |
| 0800 to FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx |
| 10000 to 10FFFF | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu 10uuzzzz |

NOTE: x, y, z, and u show how bits are distributed among UTF-8 octets. The final two octets of a four-octet sequence, not shown here, are identical to the final two octets of the three-octet sequence. Because the initial octet of a four-octet sequence begins with bits 11110, its first hex digit is always F. Similarly, the first hex digit of a three-octet sequence is always E, and that of a two-octet sequence is C or D. A second or subsequent octet begins with hex 8, 9, A, or B.
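
The bit allocation described above can be sketched in Python with shifts and masks. This is a minimal illustration, not a reference implementation. The function name `utf8_encode_scalar` is hypothetical, and the rejection of surrogate code points (D800 to DFFF) is an assumption based on the definition of a Unicode scalar value rather than on anything stated in this section.

```python
def utf8_encode_scalar(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 by redistributing its bits.

    Hypothetical helper for illustration only.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the sketch with Python's built-in encoder for a few scalar values.
for cp in (0x002C, 0x0304, 0x05EA, 0x2113, 0x10000):
    assert utf8_encode_scalar(cp) == chr(cp).encode("utf-8")
    print(f"{cp:04X} -> {utf8_encode_scalar(cp).hex().upper()}")
```

In practice Python's built-in `str.encode("utf-8")` does the same work. The explicit version is shown only to make the movement of the x, y, z, and u bits visible.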

Examples of the three encodings, expressed in binary and hexadecimal notation:

| Character | MARC-8 | Unicode scalar value | Unicode UTF-8 |
|---|---|---|---|
| Comma | 00101100, 2C (hex) | 00000000 00101100, 002C (hex) | 00101100, 2C (hex) |
| Latin small letter h | 01101000, 68 (hex) | 00000000 01101000, 0068 (hex) | 01101000, 68 (hex) |
| Macron | 11100101, E5 (hex) | 00000011 00000100, 0304 (hex) | 11001100 10000100, CC84 (hex) |
| Hebrew letter tav | 01111010, 7A (hex) | 00000101 11101010, 05EA (hex) | 11010111 10101010, D7AA (hex) |
| Script small l | 11000001, C1 (hex) | 00100001 00010011, 2113 (hex) | 11100010 10000100 10010011, E28493 (hex) |

No example of a four-octet sequence is shown, but the preceding table gives the pattern. Characters beyond FFFF (hex) are very rarely encountered in MARC 21.
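
For readers who want to see the four-octet pattern applied, the following Python sketch prints the scalar value and UTF-8 octets of one character beyond FFFF (hex). The character, U+1D11E MUSICAL SYMBOL G CLEF, is an arbitrary choice made for this illustration and is not taken from the MARC 21 specification. The variable names are likewise hypothetical.

```python
# U+1D11E is an arbitrary example of a character beyond FFFF (hex).
ch = "\U0001D11E"
encoded = ch.encode("utf-8")  # Python's built-in UTF-8 encoder

# Scalar value as three groups of eight bits: 000uuuuu zzzzyyyy yyxxxxxx
scalar_bits = format(ord(ch), "024b")
scalar_bits_grouped = " ".join(scalar_bits[i:i + 8] for i in range(0, 24, 8))

# UTF-8 octets in binary: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
octet_bits = " ".join(format(b, "08b") for b in encoded)

print(scalar_bits_grouped)     # 00000001 11010001 00011110
print(octet_bits)              # 11110000 10011101 10000100 10011110
print(encoded.hex().upper())   # F09D849E
```

As the pattern predicts, the first hex digit of the sequence is F and each of the three following octets begins with bits 10.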
