General Conversion Issues (Character Sets: Conversion)

The following points need to be considered when converting between the MARC-8 and Unicode encodings, in either direction.

MARC 21 encoding marker

When converting from MARC-8 to Unicode, Leader position 9 (Character coding scheme) must be set to "a" to indicate that the converted record uses Unicode encoding. When converting from Unicode to MARC-8, Leader position 9 must be set to "blank" (20(hex)).
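The Leader change can be sketched in Python as below. This is a minimal illustration, not a reference implementation: the function name `set_coding_scheme` and the target labels "unicode" and "marc8" are assumptions made for the example, and the Leader is assumed to be available as a 24-character string.

```python
def set_coding_scheme(leader: str, target: str) -> str:
    """Return a copy of a MARC 21 Leader with position 9 set for `target`.

    `target` is "unicode" or "marc8" (labels invented for this sketch).
    """
    if len(leader) != 24:
        raise ValueError("a MARC 21 Leader is exactly 24 characters long")
    # Position 9 (zero-based) holds the character coding scheme:
    # "a" for Unicode, blank (20 hex) for MARC-8.
    codes = {"unicode": "a", "marc8": " "}
    if target not in codes:
        raise ValueError("target must be 'unicode' or 'marc8'")
    return leader[:9] + codes[target] + leader[10:]


if __name__ == "__main__":
    marc8_leader = "00714cam  2200205 a 4500"
    print(set_coding_scheme(marc8_leader, "unicode"))
```

Only position 9 is touched here; a real converter would also have to recompute the record length and base address in the Leader, since the byte length of the data changes with the encoding.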

Escape sequences and MARC field 066

Neither field 066 (Character Sets Present) nor any escape sequence is allowed in a Unicode MARC 21 environment. Escape sequences and the 066 field in a MARC-8-encoded record must be removed during conversion to Unicode.

When converting to MARC-8, escape sequences and a 066 field must be constructed where appropriate. Field 066 is required in a MARC-8-encoded record whenever it contains a technique 2 escape sequence, as described in Part 2. If there are no such escapes, field 066 is not used.

MARC subfield $6 (Linkage)

Subfield $6 (Linkage) is used in MARC 21 records to link alternate graphic representations of the same data, to identify the presence of specific scripts in a field, and to flag fields in which the display/print directionality of data is right-to-left (e.g., for Arabic script). The subfield $6 script identification code in MARC-8-encoded MARC 21 records identifies MARC-8 character sets, rather than scripts per se; hence the code is irrelevant in the Unicode environment because the character set is always UCS. The script identification code should be dropped from subfield $6 when converting from MARC-8 to Unicode encoding. The field orientation code, which flags a field as having right-to-left display directionality, should be used in Unicode-encoded MARC 21 records. When present, the field orientation code is separated from the subfield $6 linking tag and occurrence number by two solidus (slash) characters (002F(hex)), because the position of the dropped script identification code between them is left empty. In conversion from Unicode to MARC-8, the script identification code should be restored, typically to a code recorded in subfield $c of the 066 field.
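The subfield $6 handling described above might look like the following Python sketch. It assumes the subfield value has the shape linking tag, hyphen, occurrence number, then an optional solidus and script identification code, then an optional solidus and field orientation code (for example "880-01/(2/r"). The function names and the regular expression are assumptions made for illustration, and the script code to restore is supplied by the caller rather than derived from the data.

```python
import re

# <3-digit linking tag>-<2-digit occurrence number>[/<script code>[/<orientation>]]
_LINKAGE = re.compile(
    r"^(?P<tag>\d{3})-(?P<occ>\d{2})"
    r"(?:/(?P<script>[^/]*)(?:/(?P<orient>[^/]*))?)?$"
)


def _parse(value: str) -> "re.Match[str]":
    match = _LINKAGE.match(value)
    if match is None:
        raise ValueError(f"unrecognized subfield $6 value: {value!r}")
    return match


def linkage_to_unicode(value: str) -> str:
    """Drop the script identification code; keep any orientation code."""
    m = _parse(value)
    head = f"{m['tag']}-{m['occ']}"
    # The emptied script position leaves two solidi before the orientation code.
    return f"{head}//{m['orient']}" if m["orient"] else head


def linkage_to_marc8(value: str, script_code: str) -> str:
    """Restore a script identification code chosen by the caller."""
    m = _parse(value)
    result = f"{m['tag']}-{m['occ']}/{script_code}"
    if m["orient"]:
        result += f"/{m['orient']}"
    return result


if __name__ == "__main__":
    print(linkage_to_unicode("880-01/(2/r"))
    print(linkage_to_marc8("880-01//r", "(2"))
```

Choosing the right script code when converting back to MARC-8 depends on which character sets the converted field actually uses, which this sketch leaves to the caller.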

Combining characters (diacritics)

In moving from MARC-8 to Unicode it is necessary to re-order combining characters and base characters so that the base character precedes the combining character(s). When converting from Unicode to MARC-8, combining marks must be moved to precede the base characters. The differing rules for proper sequencing of combining marks when a base letter has more than one are specified in Part 3 (i.e., top down vs. inside out). Best practice during conversion is to reorder the multiple marks according to the rule for the output encoding, but this is not considered mandatory.
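As a rough illustration of the reordering, the Python sketch below works on text whose characters have already been mapped to Unicode code points but are still in the source encoding's order. The function names are invented for this example, nonspacing marks are detected with the standard `unicodedata` module (general category "Mn"), and the relative order of multiple marks on one base letter is deliberately left unchanged, so the Part 3 sequencing rules are not implemented here.

```python
import unicodedata


def _is_mark(ch: str) -> bool:
    # Treat Unicode nonspacing marks as combining characters (an assumption
    # that covers the usual diacritics, not a complete definition).
    return unicodedata.category(ch) == "Mn"


def marks_after_base(text: str) -> str:
    """MARC-8 order (marks, then base) to Unicode order (base, then marks)."""
    out = []
    pending = []
    for ch in text:
        if _is_mark(ch):
            pending.append(ch)   # hold marks until their base letter arrives
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)          # stray marks with no following base
    return "".join(out)


def marks_before_base(text: str) -> str:
    """Unicode order (base, then marks) to MARC-8 order (marks, then base)."""
    out = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and _is_mark(text[j]):
            j += 1
        out.append(text[i + 1:j])  # the marks that followed this base
        out.append(text[i])        # then the base itself
        i = j
    return "".join(out)


if __name__ == "__main__":
    print(marks_after_base("\u0301e") == "e\u0301")
```

A production converter would normally do this reordering while decoding the MARC-8 bytes rather than as a separate pass over the string.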

Directionality of text

When converting from MARC-8 to Unicode, the conversion should determine whether multi-digit numbers used in bidirectional scripts have been entered in logical or visual order. If visual order has been used, best practice requires that the digits be re-ordered from visual order to logical order. If logical order has been used, no re-ordering is necessary.

Folding of certain MARC-8 Codes

The numbers, punctuation marks, and symbols found in ASCII 21-3F, 5B, 5D (hex) are also, in full or in part, allocated code points in the MARC-8 sets for Hebrew, Cyrillic, and Arabic. These are mapped (folded) into a single, identical set of Unicode code points, as specified in the mappings in the code tables in Part 5; hence mappings are not perfectly reversible because these characters will always be mapped to the ASCII set during reconversion to MARC-8. The resultant record may contain more escape sequences than the same record originally encoded in MARC-8 required. This is an acceptable result that should not interfere with processing or display of the record.
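The loss of reversibility can be seen in a small Python sketch. The table below is a tiny illustrative stand-in for the Part 5 code tables, not a copy of them: the set labels are invented for the example, and only one folded code point (21 hex, the exclamation mark) is shown.

```python
# Illustrative stand-in for the code tables: (character set, code) -> Unicode.
# The same punctuation code in several sets folds to one Unicode code point.
TO_UNICODE = {
    ("ascii", 0x21): "\u0021",
    ("hebrew", 0x21): "\u0021",
    ("cyrillic", 0x21): "\u0021",
    ("arabic", 0x21): "\u0021",
}


def build_reverse(table):
    """Build a Unicode -> (set, code) map that prefers the ASCII set."""
    reverse = {}
    for (charset, code), char in table.items():
        if char not in reverse or charset == "ascii":
            reverse[char] = (charset, code)
    return reverse


TO_MARC8 = build_reverse(TO_UNICODE)


def round_trip(charset: str, code: int):
    """Convert one MARC-8 code to Unicode and back again."""
    return TO_MARC8[TO_UNICODE[(charset, code)]]


if __name__ == "__main__":
    # The Arabic-set exclamation mark comes back as the ASCII one.
    print(round_trip("arabic", 0x21))
```

Because the round trip lands in the ASCII set, a converter that is in the middle of a run of Arabic, Hebrew, or Cyrillic text has to escape to ASCII and back around such characters, which is where the additional escape sequences come from.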

The characters (alpha, beta, gamma) of the custom MARC-8 Greek Symbols set are mapped to the regular Greek letters in Unicode and consequently are not reversible when reconverting to MARC-8. Use of the Greek Symbols set is strongly discouraged for that reason. In MARC-8 contexts where escaping to the standard Basic Greek set is not feasible, textual equivalents of the symbols (i.e., [alpha], [beta], [gamma]) should be preferred over use of the Greek Symbols set.

The space character (20 (hex)) is defined only in the Unicode and ASCII sets but is recognized by all the standard graphic character sets in MARC-8 even though not included in those sets. When converting from Unicode to MARC-8, the space character can be converted (unchanged) without being preceded by an escape sequence no matter which of the standard sets is the current working set. Optionally, the escape sequence to ASCII may be included before the space character. However, when the output working set is a custom set accessed by escape technique 1, the technique 1 escape sequence to ASCII is required before the space character.
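The rule for the space character could be expressed as in the Python sketch below. The function name and the boolean flag are assumptions for illustration. The technique 1 escape to ASCII is taken here to be ESC followed by lowercase "s" (1B 73 hex); that value is an assumption of this sketch and should be checked against the escape sequences listed in Part 2.

```python
SPACE = b"\x20"

# Assumed technique 1 escape back to ASCII (ESC s); verify against Part 2.
TECHNIQUE_1_ESCAPE_TO_ASCII = b"\x1b\x73"


def emit_space(in_technique_1_custom_set: bool) -> bytes:
    """Return the MARC-8 bytes to write for one Unicode space character.

    `in_technique_1_custom_set` is True when the current working set is a
    custom set that was reached by escape technique 1.
    """
    if in_technique_1_custom_set:
        # The escape is required here; afterwards the working set is ASCII,
        # so the caller must update its own record of the working set.
        return TECHNIQUE_1_ESCAPE_TO_ASCII + SPACE
    # In any standard set the space passes through without an escape.
    return SPACE


if __name__ == "__main__":
    print(emit_space(False), emit_space(True))
```

The optional escape to ASCII before a space in a standard working set is omitted, which keeps the output free of escapes that the text above says are not required.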
