Constraints on Unicode Repertoire (Character Sets: Unicode)

Exclusions a priori

There are many undefined code points in the Unicode codespace; they are reserved for character set expansion and may not be used by any application. Unicode also designates a small number of code points as noncharacters and a small number of assigned characters as deprecated; none of these should be included in a MARC 21 record. Nor are surrogate pairs allowed for representing code points beyond FFFF (hex), because surrogates (D800 to DFFF (hex)) have meaning only in a UTF-16 context. UTF-8 allows code points beyond FFFF (hex) to be encoded directly.
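The exclusions above can be checked mechanically. The following is a minimal sketch, not part of the MARC 21 specification: the function names are illustrative, and the "unassigned" test depends on the version of the Unicode database bundled with the Python interpreter, so results for recently assigned code points may vary. Deprecated characters are not covered.

```python
import unicodedata

def is_surrogate(cp: int) -> bool:
    # Surrogate code points are meaningful only in UTF-16.
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # Unicode noncharacters: U+FDD0..U+FDEF, plus the last two
    # code points of every plane (xxFFFE and xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def excluded_code_points(text: str) -> list:
    """Return (code point, reason) pairs for characters excluded a priori."""
    problems = []
    for ch in text:
        cp = ord(ch)
        if is_surrogate(cp):
            problems.append((cp, "surrogate"))
        elif is_noncharacter(cp):
            problems.append((cp, "noncharacter"))
        elif unicodedata.category(ch) == "Cn":
            # "Cn" = unassigned in the interpreter's Unicode database.
            problems.append((cp, "unassigned"))
    return problems

# A code point beyond FFFF (hex) is encoded directly in UTF-8 as four
# bytes, with no surrogates involved.
supplementary_bytes = "\U00020000".encode("utf-8")
```

Here `excluded_code_points` returns an empty list for text that contains none of the excluded code points, so it can serve as a simple pre-exchange check.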

MARC 21 as a matter of policy avoids the use of characters in the Private Use Area (PUA) (E000 to F8FF (hex)) as detrimental to effective information exchange. While the initial mapping of EACC to Unicode assigned several ideograph variants and certain other CJK characters to the PUA, those assignments have subsequently been remapped to standard Unicode code points. The latter should be used to represent those characters in future exchanges and, wherever feasible, to replace existing instances of the PUA code points.
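One way to locate and replace PUA code points is sketched below. It assumes the Private Use Area of the Basic Multilingual Plane as Unicode defines it (U+E000 to U+F8FF). The function names are illustrative, and the `remap` table is a hypothetical input: the actual EACC remapping table is not reproduced here and must be supplied by the caller.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area per Unicode

def find_pua(text: str) -> list:
    """Return (index, 'U+XXXX') for each PUA character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if PUA_START <= ord(ch) <= PUA_END
    ]

def replace_pua(text: str, remap: dict) -> str:
    """Replace PUA characters that have an entry in the caller's table.

    Characters without an entry are left unchanged so they can be
    reported by find_pua afterwards.
    """
    return "".join(remap.get(ch, ch) for ch in text)

# Placeholder mapping for illustration only; NOT a real EACC assignment.
example_remap = {"\ue000": "\u4e00"}
cleaned = replace_pua("a\ue000b\ue001", example_remap)
remaining = find_pua(cleaned)  # PUA characters with no known replacement
```

Running `find_pua` again after `replace_pua` shows which PUA code points still lack a standard equivalent in the supplied table.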

Cautions

The original restriction of the MARC 21 Unicode character repertoire to the MARC-8 repertoire is no longer practicable because of the increased availability of Unicode-encoded data sources that are not bound by such a limitation. Through a variety of techniques, the most common being copy-and-paste, non-MARC-8 characters can and do get introduced into MARC 21 records. Frequently these characters will escape detection when a record is created, or even when it is used locally, but they may impede the effectiveness of the data interchange that is the primary purpose of MARC 21. Characters such as the single quotation marks and the apostrophe, compressed to a single character in ASCII because of space limitations, are among the most common to be encountered accidentally. Data in European languages are likely to contain precomposed Latin characters. Users of CJK data may discover characters from the Halfwidth and Fullwidth Forms block (FF00 to FFEF (hex)).
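The three kinds of accidental characters mentioned above can be surfaced with a short scan. This is a sketch under stated assumptions, not a MARC-8 repertoire check: it treats U+2018 and U+2019 as the typographic single quotation marks, and it treats any character with a canonical decomposition in Python's `unicodedata` as "precomposed". The function name and labels are illustrative.

```python
import unicodedata

SINGLE_QUOTES = {0x2018, 0x2019}  # left/right single quotation marks

def flag_characters(text: str) -> list:
    """Return (character, label) pairs for characters worth reviewing."""
    flags = []
    for ch in text:
        cp = ord(ch)
        decomp = unicodedata.decomposition(ch)
        if 0xFF00 <= cp <= 0xFFEF:
            flags.append((ch, "halfwidth/fullwidth form"))
        elif cp in SINGLE_QUOTES:
            flags.append((ch, "typographic single quotation mark"))
        elif decomp and not decomp.startswith("<"):
            # A canonical decomposition (no <tag> prefix) means the
            # character is equivalent to a base letter plus combining marks.
            flags.append((ch, "precomposed character"))
    return flags

report = flag_characters("Caf\u00e9 \u2018test\u2019 \uff21")
```

A record creator could run such a scan at input time, when the unexpected characters are easiest to correct.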

It is infeasible to identify a particular collection of Unicode characters to be prohibited from MARC 21 records, but creators of MARC 21 records should take into account the capabilities of their likely exchange partners as they choose to expand their working repertoire. For limited distributions, agreements among exchange partners can support aggressive repertoire expansion. For general distribution, a more conservative approach is warranted. Such an approach would minimize or avoid entirely the use of certain types of characters. For example, characters in the CJK Compatibility Ideographs area and the several Presentation Forms blocks were included in the Unicode repertoire primarily to accommodate pre-existing standards. In the future, fewer applications can be expected to continue supporting the old standards, so avoiding these characters is wise.
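A conservative policy for general distribution might be expressed as a table of block ranges to avoid. The sketch below is an illustration only: the choice of blocks is an assumption based on the examples named above (the BMP CJK Compatibility Ideographs block and the Alphabetic and Arabic Presentation Forms blocks), and exchange partners may agree on a different list.

```python
# (first, last, block name) using Unicode block boundaries.
# Which blocks to avoid is a local policy decision, assumed here.
CONSERVATIVE_AVOID = [
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0xFB00, 0xFB4F, "Alphabetic Presentation Forms"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]

def blocks_to_avoid(text: str, policy=CONSERVATIVE_AVOID) -> dict:
    """Map each avoided block name to the characters from it found in text."""
    found = {}
    for ch in text:
        cp = ord(ch)
        for first, last, name in policy:
            if first <= cp <= last:
                found.setdefault(name, []).append(ch)
                break
    return found

# U+FB01 is the "fi" ligature; U+F900 is a CJK compatibility ideograph.
result = blocks_to_avoid("of\ufb01ce \uf900")
```

Passing a shorter `policy` list models a limited distribution in which partners have agreed to a wider repertoire.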

The control function codes defined for MARC 21 are listed in Part 1. In addition, there are a few other format control characters available in Unicode encoding that may be useful for controlling bidirectional display. Aside from those, introduction of new control characters into MARC 21 records should be done only with the greatest caution. In particular, code points in the 0 to 1C (hex) range should not be used.
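A simple scan can separate the disallowed range from format characters that merely need review. This is a sketch: the 0 to 1C (hex) range is taken from the text above, while using the Unicode general category "Cf" to find candidate format controls is an assumption of this example, not a MARC 21 rule. The function name is illustrative.

```python
import unicodedata

def control_report(text: str) -> tuple:
    """Return (disallowed, review) lists for control-like characters.

    disallowed: (index, 'U+XXXX') for code points in 0..1C (hex).
    review:     (index, name) for Unicode format characters (category Cf),
                which include the bidirectional controls.
    """
    disallowed, review = [], []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp <= 0x1C:
            disallowed.append((i, f"U+{cp:04X}"))
        elif unicodedata.category(ch) == "Cf":
            review.append((i, unicodedata.name(ch, f"U+{cp:04X}")))
    return disallowed, review

# A tab (U+0009) is in the disallowed range; U+200F is a bidi format mark.
bad, to_review = control_report("a\tb\u200f")
```

The `review` list gives a human-readable name for each format character so that a cataloger can confirm it is an intended bidirectional control.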
