210 Reports and Working Papers

Inclusion of Nonroman Character Sets

The following document was prepared by staff of the Library of Congress as a working paper for discussions on incorporating the techniques described into the MARC communications format. The document defines the principles for inclusion of nonroman alphabet character sets in the MARC communications format and the procedural changes needed to allow implementation of the principles. This technique was agreed upon at the MARBI Committee meeting on February 2, 1981. Any questions on the description of the inclusion of nonroman character sets in the MARC communications format should be addressed to: Library of Congress, Processing Services, Attention: Mrs. Margaret Patterson, Washington, DC 20540.

1. INTRODUCTION

The cataloging rules followed by American libraries favor recording the title page data in the original script when possible. This helps those who consult catalogs to read the most essential information about the book. (Reading his or her name in romanized form is just as difficult for someone who knows Arabic as reading your name when it's written in Arabic.) The new cataloging rules also specify that names and titles in notes be given in their original script (AACR2 1.7A3).

Technological advances have made it possible to provide many, if not all, nonroman alphabets in machine-readable cataloging records. OCLC and RLIN are in the process of enhancing their systems so they can handle some nonroman writing systems. The Library of Congress has entered into a cooperative agreement with RLIN for the development and use of an augmented RLIN system for East Asian (i.e., Chinese, Japanese, and Korean) bibliographic data. Although the Library itself will not be creating and distributing MARC records with nonroman characters in the near term, the goal of this proposal is to define how these data can be included now so others can do so soon.
The technique known as an escape sequence announces that the codes which follow will represent letters in a specific different alphabet instead of the roman letters the codes would otherwise stand for.

2. PRINCIPLES

The following principles will govern inclusion of other alphabets in MARC records. Note that these deal only with the MARC communications format record, not the details of its processing (keying, sorting, display, etc.) by any bibliographic agency or utility. These principles are a slightly revised version of ones reviewed and approved in principle by the MARBI Character Set Committee in 1976. The earlier version was also distributed that year as working paper N77 of ISO TC46/SC4/WG1.

(1) Standard character sets should be used when available.
(2) Standard escape sequences should be used when available.
(3) Escape sequences should be used only when needed.
(4) Escape sequences are locking within a subfield but revert at any delimiter or field or record terminator code.

Example: (For demonstration purposes only, EC represents escape to Cyrillic and EA escape to ASCII)

245 10$aECRussian title proper :$bECRussian subtitle. F
not
245 10$aECRussian title proper :EA$bECRussian subtitle. EAF
and not
245 10$aECRussian title proper :$bRussian subtitle. F

(5) Records which contain an escape sequence will also contain a special field which specifies what unusual character sets are present.

3. IMPLEMENTATION

The following will be done to realize these principles.

• The ALA character set will be redefined (see table 1).
• A new character sets present field will be defined.
• Details of application such as distribution, filing indicator values, etc., will be defined.

3.1 Discussion-ALA Character Set

A character set is a list of characters with the code used to represent each one. Using this definition, the ALA character set as given in appendixes III.B and III.C of MARC Formats for Bibliographic Data actually consists of eight character sets:
(1) ASCII and ALA diacritics and special characters with their eight-bit code.
(2) Superscript zero to nine, plus, minus, open and close parentheses with their eight-bit code.
(3) Subscript zero to nine, plus, minus, open and close parentheses with their eight-bit code.
(4) Greek lowercase alpha, beta, and gamma with their eight-bit code.
(5-8) The same characters with their six-bit codes.

Table 1. Proposed Revised ALA Character Set (code chart not recoverable; legend: ASCII, Proposed Change, ALA Extension of ASCII)

212 Journal of Library Automation Vol. 14/3 September 1981

The six-bit character sets are used to distribute MARC records on seven-track tapes. There are very few subscribers. It is unlikely that a method can be devised for distribution of nonroman character sets records on such tapes. The present seven-track subscribers should be asked if they know of any way to do so. If they do not, the alternatives are to cease distribution of seven-track tapes entirely or limit them to those records containing only roman alphabet characters, that is, those without a character sets present field. In the latter case, they should pay proportionately less for their subscription.

The present four eight-bit character sets and their escape sequences do not conform to present standards. The present standards did not exist when the character sets were being defined. To avoid creating and distributing records containing both standard and nonstandard character sets and standard and nonstandard escape sequences, the ALA character set should be redefined.

This change will be much less traumatic than it sounds. No new characters will be added; only the codes used to represent subscript, superscript, and Greek characters will be changed. These characters were found in the title field of 859 out of 1.1 million records. If, as seems plausible, most or all MARC subscribers translate tapes into their own character set codes as a first step and for communication translate from their own codes into the ALA character set as the last step before distribution, only these two programs would need to be changed.

The proposed redefined ALA character set is shown in table 1. On it, columns two through seven are the American Standard Code for Information Interchange (ASCII), which is a recognized standard with a registered escape sequence. Columns ten through fifteen are the ALA extension of ASCII with special characters and the three Greek letters in columns ten and eleven, superscripts in column twelve, subscripts in thirteen, and diacritics in columns fourteen and fifteen. (It should be noted that six ASCII codes will not occur in MARC records: codes 5/14 circumflex, 5/15 underline, 6/0 grave, and 7/14 tilde are redundant with the codes for these diacritics in columns fourteen and fifteen; codes 7/11 left brace and 7/13 right brace never occur because these characters do not occur in bibliographic data. No change in this practice is proposed. It is the fact that these last two codes are used in some nonroman alphabet standard character sets that makes nonroman six-bit codes impossible.)
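The column/row positions cited above (for example, 5/14) can be turned into byte values mechanically. The following Python sketch assumes the usual reading of the code chart, in which the column number supplies the high four bits and the row number the low four bits. The function and variable names are illustrative only and do not come from the MARC documentation.

```python
def code(column, row):
    """Return the byte value for a code-chart position written as column/row."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0..15")
    return column * 16 + row


# The six ASCII positions that, according to the text, will not occur in MARC records.
EXCLUDED_POSITIONS = {
    "circumflex": (5, 14),
    "underline": (5, 15),
    "grave": (6, 0),
    "tilde": (7, 14),
    "left brace": (7, 11),
    "right brace": (7, 13),
}


def excluded_characters():
    """Map each excluded position's name to the ASCII character at that position."""
    return {name: chr(code(col, row)) for name, (col, row) in EXCLUDED_POSITIONS.items()}


def in_ascii_columns(byte):
    """True if the byte falls in columns 2-7 of the chart."""
    return 2 <= byte >> 4 <= 7


def in_extension_columns(byte):
    """True if the byte falls in columns 10-15 of the chart."""
    return 10 <= byte >> 4 <= 15
```

Under this reading, position 5/14 is byte value 94 (hex 5E), the ASCII circumflex, which agrees with the name given in the text.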
The ALA extension of ASCII is not an official standard now; it does not have an escape sequence yet. In addition to the ALA extension of ASCII, there is a draft international standard extended Latin alphabet character set for bibliographic use, ISO DIS 5426 (table 2). While both sets are identical in purpose, they differ in the characters they contain and the codes used to represent them. The ABACUS group has agreed that ISO 5426 be used for international distribution of MARC records among the bibliographic agencies they represent once it is an approved international standard (cf. LC Information Bulletin, November 16, 1979, p. 475). The Library will, however, continue to use the ALA extension for U.S. distribution. Some of the characters only on the ISO set could be added to the ALA extension without affecting existing records. An ANSI Z39 subcommittee has been established to consider this possibility. While some changes may be desirable to the ALA character repertoire, it is important that this issue not delay the separate matter of providing for inclusion of nonroman alphabets in MARC.

3.2 Discussion-Escape Sequence

For purposes of this discussion, escape sequences are defined as a combination of three characters. (See table 3.) The first is an escape character, hex 1/11. The second character specifies which codes are having different characters assigned to them, those in columns 2-7 or those in columns 10-15. The third character defines what characters are being assigned to these codes, e.g., Cyrillic, Greek, etc. This is a greatly simplified explanation of the escape sequence standards, ISO 2022 and ANSI X3.41. (Both are in the process of revision.) These standards provide for two types of escape sequences: public ones which reference registered character sets, and private ones for unregistered character sets.
While the meaning of the latter is governed by an agreement between the sender and the receiver, they are in conformity with the standard. Until the ALA extension of ASCII has a registered escape sequence, such a "private" escape sequence could be defined for it in the character set appendix and used.

The second character of an escape sequence which changes the meaning of the codes in columns 2-7 is either an open parenthesis, hex 2/8, or a comma, hex 2/12. The second character of an escape sequence which changes the meaning of the codes in columns 10-15 is either a close parenthesis, hex 2/9, or a hyphen, hex 2/13. The third character of escape sequences for certain registered character sets has been defined as follows:

Set                                        Code
ASCII
Russian (1967 GOST Standard) (table 3)     registration applied for, code pending
ISO Greek                                  5/8, uppercase X
ISO extended Cyrillic (table 3)            5/7, uppercase W

The sixteen codes in column three can be used to designate sixteen different "private" character sets. In MARC records, ASCII and Russian would be assigned to columns 2-7, while Greek and the extended Cyrillic (and the ALA extension of ASCII) would be assigned to columns 10-15.

Table 2. Extended Latin Alphabet Character Set (code chart not recoverable)

Escape sequences would be given where needed in data fields. If necessary, it is permissible to embed escape sequences within a word.
For example, a Latin diacritic might be needed with an extended Cyrillic letter to represent a letter in one of the non-Slavic languages of Central Asia which uses the Cyrillic alphabet.

Table 3. Escape Sequence Character Set (code chart not recoverable; legend: GOST 13052-67 Russian, ISO DIS 5427 Extended Cyrillic)

In addition to escape sequences for nonroman alphabets described above in which one code stands for one letter, the escape standards also define escape sequence procedures for changing to multiple byte character sets. Because the ideographic writing systems of East Asia use thousands of different characters, it will be necessary to use two or three bytes/codes to identify a single specific character uniquely. The Japanese Industrial Standard character set, JIS 6226, uses two bytes per character, and it has been submitted to ISO to obtain a registered escape sequence. The first volume of the Chinese Character Code for Information Interchange, CCCII, has been issued; the second is expected in December. It uses three bytes per character. In all probability the LC/RLIN East Asian cooperative project will adopt either these character sets and their escape sequences or machine reversible adaptations of them.
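As a rough illustration of principle (4) and of the three-character escape sequences described in this section, the Python sketch below labels each byte of a field with the character set in force. It is a sketch under stated assumptions, not a definitive implementation. The delimiter and terminator values (hex 1F, 1E, and 1D) are those of the MARC communications format and are not given in this paper. Only the final characters W and X are taken from the text. All names are hypothetical, and multiple byte character sets are not handled.

```python
ESC = 0x1B                               # escape character, position 1/11
REVERT_AT = {0x1F, 0x1E, 0x1D}           # subfield delimiter, field and record terminators (assumed values)
COLUMNS_2_7 = {0x28, 0x2C}               # second characters 2/8 and 2/12
COLUMNS_10_15 = {0x29, 0x2D}             # second characters 2/9 and 2/13
DEFAULT_LOW = "ASCII"
DEFAULT_HIGH = "ALA extension of ASCII"
# Third characters named in the text; any other value is reported as unregistered.
SETS = {ord("W"): "ISO extended Cyrillic", ord("X"): "ISO Greek"}


def label_runs(data):
    """Return a list of (character set name, bytes) runs for one field's data.

    Escape sequences lock until the next delimiter or terminator, at which
    point both halves of the code table revert to their defaults.
    """
    low, high = DEFAULT_LOW, DEFAULT_HIGH
    runs = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            if i + 2 >= len(data):
                raise ValueError("truncated escape sequence")
            second, third = data[i + 1], data[i + 2]
            name = SETS.get(third, "unregistered set %r" % chr(third))
            if second in COLUMNS_2_7:
                low = name
            elif second in COLUMNS_10_15:
                high = name
            else:
                raise ValueError("unsupported escape sequence")
            i += 3
            continue
        if b in REVERT_AT:
            low, high = DEFAULT_LOW, DEFAULT_HIGH   # revert to the defaults
            label = "control"
        elif 2 <= b >> 4 <= 7:
            label = low
        elif 10 <= b >> 4 <= 15:
            label = high
        else:
            label = "control"
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + bytes([b]))
        else:
            runs.append((label, bytes([b])))
        i += 1
    return runs
```

In this sketch, a byte in columns 10-15 that follows a second subfield delimiter is labeled with the default extension set again unless a new escape sequence precedes it, which is the behavior the "not" examples under principle (4) are meant to rule in.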
The need to expand East Asian character sets constantly to provide for infrequently used characters poses problems whose solutions cannot be predicted at this time.

3.3 Discussion-Character Sets Present Field

As specified in the fifth principle, there is need for a special field which specifies what character sets are present whenever a set other than ASCII and the ALA extension of ASCII is present in a record. The proposed field will use tag 066 and be defined as follows:

066 Character Sets Present

This field specifies what character sets are present in the record other than ASCII and the ALA extension of ASCII. The field is not repeatable. Both indicators are unused and will contain blanks.

$a This subfield will contain all but the first character of the escape sequence to the default character set in columns 2-7 whenever the default character set is not ASCII. This is not likely to occur in records created in the United States. Since there can be only one default character set, the subfield is not repeatable.

$b This subfield will contain all but the first character of the escape sequence to the default character set in columns 10-15 whenever the default character set is not the ALA extension of ASCII. This is not likely to occur in records created in the United States. Since there can be only one default extension character set, this subfield is not repeatable.

$c This subfield will contain all but the first character of every escape sequence found in the record; the same applies if a longer escape sequence is used. If the same escape sequence occurs more than once, it will be given only once in this subfield. The subfield is repeatable. This subfield does not identify the default character sets.

Example:

ƀƀ$c)W        A record containing the ISO extended Cyrillic character set.
ƀƀ$c)W$c)X    A record containing both the ISO Greek and extended Cyrillic character sets.

3.4 Discussion-Other Details
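The rule for subfield $c of field 066 (section 3.3) can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the escape character is taken to be hex 1B (position 1/11), every escape sequence is assumed to be three characters long, and the sketch follows the literal reading that every escape sequence found in the record is listed once.

```python
ESC = 0x1B  # escape character, position 1/11


def escape_sequences(record):
    """Return each distinct escape sequence, minus its escape character.

    Sequences are listed in order of first appearance and assumed to be
    exactly three characters long.
    """
    found = []
    i = 0
    while i < len(record):
        if record[i] == ESC:
            tail = record[i + 1:i + 3]
            if len(tail) < 2:
                raise ValueError("truncated escape sequence")
            text = tail.decode("ascii")
            if text not in found:      # a repeated sequence is given only once
                found.append(text)
            i += 3
        else:
            i += 1
    return found


def subfield_c_066(record):
    """Build the $c subfields of field 066, or an empty string when none are needed."""
    return "".join("$c" + seq for seq in escape_sequences(record))
```

For a record whose data contain escapes to the extended Cyrillic set, then the Greek set, then the extended Cyrillic set again, the sketch yields `$c)W$c)X`, matching the second example given for the field.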
When a field has an indicator to specify the number of leading characters to be ignored in filing and the text of the field begins with an escape sequence, the length of the escape sequence will not be included in the character count.

When fields contain escape sequences to languages written from right to left, the field will still be given in its logical order. For example, the first letter of a Hebrew title would be the eighth character in a field (following the indicators, a delimiter, a subfield code, and a three-character escape sequence). The first letter would not appear just before the end of field character and proceed backwards to the beginning of the field.

A convention exists in descriptive cataloging fields that subfield content designation generally serves as a substitute for a space. An escape sequence can occur within a word, after a subfield code, or between two words not at a subfield boundary. For simplicity, the convention that an escape sequence does not replace a space should be adopted. One other convention is also advocated: when a space, subfield code, or punctuation mark (except open quote, parenthesis or bracket) is adjacent to an escape sequence, the escape sequence will come last.

Wayne Davison of RLIN raised the following issue. After the Library of Congress has prepared and distributed an entirely romanized cataloging record for a Russian book, a library with access to automated Cyrillic input and display capability will create a record for the same book with the title in the vernacular. (Since AACR2 says to give the title in the original script "wherever practicable," the library could be said to be obligated to do so.) In such an event the local record could have all the authoritative Library of Congress access points.
To keep this record current when the Library of Congress record is revised and redistributed, it would be necessary to carry the LC control number in the local record. Most automated systems are hypersensitive to the presence of two records with the same control number. The two records can be easily distinguished: in the Library of Congress record, the modified record byte in field 008 will be set to "o" and it will not have any 066 (character sets present) field.

A Comparison of OCLC, RLG/RLIN, and WLN

University of Oregon Library

The following comparison of three major bibliographic utilities was prepared by the University of Oregon Library's Cataloging Objectives Committee, Subcommittee on Bibliographic Utilities. Members of the subcommittee were Elaine Kemp, acting assistant university librarian for technical services; Rod Slade, coordinator of the library's computer search service; and Thomas Stave, head documents librarian.

The subcommittee attempted to produce a comparison that was concise and jargon-free for use with the university community in evaluating the bibliographic utilities under consideration. The University Faculty Library Committee was enlisted to review this document in draft form and held three meetings with the subcommittee for that purpose. The document was also shared with library faculty and staff in order to elicit suggestions for revision.