

210 

Reports and Working Papers 

Inclusion of Nonroman 
Character Sets 

The following document was prepared by 
staff of the Library of Congress as a work-
ing paper for discussions on incorporating 
the techniques described into the MARC 
communications format. 

The document defines the principles for 
inclusion of nonroman alphabet character 
sets in the MARC communications format 
and the procedural changes needed to allow 
implementation of the principles. This 
technique was agreed upon at the MARBI 
Committee meeting on February 2, 1981. 

Any questions on the description of the 
inclusion of nonroman character sets in the 
MARC communications format should be 
addressed to: Library of Congress,
Processing Services, Attention: Mrs. Margaret
Patterson, Washington, DC 20540.

1. INTRODUCTION 

The cataloging rules followed by Ameri-
can libraries favor recording the title page 
data in the original script when possible. 
This helps those who consult catalogs to 
read the most essential information about 
the book. (For someone who reads Arabic,
reading his or her own name in romanized
form is just as difficult as reading your
name written in Arabic would be for you.)
The new cataloging rules also specify that
names and titles in notes be given in their
original script (AACR2 1.7A.3). Technological
advances have made it possible to provide
many, if not all, nonroman alphabets in
machine-readable cataloging records. OCLC and
RLIN are in the process of enhancing their 
systems so they can handle some nonroman 
writing systems. The Library of Congress 
has entered into a cooperative agreement 
with RLIN for the development and use of 
an augmented RLIN system for East Asian 
(i.e., Chinese, Japanese, and Korean)
bibliographic data. Although the Library itself
will not be creating and distributing MARC
records with nonroman characters in the
near term, the goal of this proposal is to
define how these data can be included now 
so others can do so soon. 

The technique known as an escape se-
quence announces that the codes which fol-
low will represent letters in a specific differ-
ent alphabet instead of the roman letters the 
codes would otherwise stand for. 

2. PRINCIPLES 

The following principles will govern in-
clusion of other alphabets in MARC rec-
ords. Note that these deal only with the 
MARC communications format record, not 
the details of its processing (keying,
sorting, display, etc.) by any bibliographic
agency or utility. These principles are a 
slightly revised version of ones reviewed 
and approved in principle by the MARBI 
Character Set Committee in 1976. The ear-
lier version was also distributed that year as 
working paper N77 of ISO TC46/SC4/WG1.

(1) Standard character sets should be 
used when available. 

(2) Standard escape sequences should be 
used when available. 

(3) Escape sequences should be used only 
when needed. 

(4) Escape sequences are locking within 
a subfield but revert at any delimiter 
or field or record terminator code. 

Example: (For demonstration purposes
only, EC represents escape to Cyrillic and
EA escape to ASCII)

245 10$aECRussian title proper :$bECRussian subtitle. F
not
245 10$aECRussian title proper :EA$bECRussian subtitle. EAF
and not
245 10$aECRussian title proper :$bRussian subtitle. F


(5) Records which contain an escape se-
quence will also contain a special 
field which specifies what unusual 
character sets are present. 
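
Principle (4) can be illustrated with a small Python sketch. It is not part of the proposal: the function name, the set labels, and the final character "R" standing for Cyrillic are hypothetical, and the control code values (hex 1B escape, 1F subfield delimiter, 1E field terminator) are the customary MARC values, assumed here rather than taken from this paper. The sketch treats an escape sequence as three characters and resets the active character set at every delimiter or field terminator.

```python
ESC, DELIM, FT = "\x1b", "\x1f", "\x1e"  # assumed MARC control codes


def decode_runs(field, names, default="ascii"):
    """Split one field into (character set, text) runs.

    An escape sequence locks the active set until the next subfield
    delimiter or field terminator, where the default set returns.
    """
    runs, buf, current = [], [], default

    def flush():
        # Emit the text collected so far under the active set.
        if buf:
            runs.append((current, "".join(buf)))
            buf.clear()

    i = 0
    while i < len(field):
        ch = field[i]
        if ch == ESC:
            flush()
            # Three characters: ESC, intermediate, final (hypothetical labels).
            current = names.get(field[i + 2], "unknown")
            i += 3
        elif ch == DELIM:
            flush()
            current = default            # revert at the delimiter
            runs.append(("subfield", field[i + 1]))
            i += 2
        elif ch == FT:
            flush()
            current = default            # revert at the field terminator
            i += 1
        else:
            buf.append(ch)
            i += 1
    flush()
    return runs


names = {"R": "cyrillic"}  # hypothetical final character
right = DELIM + "a" + ESC + "(R" + "title :" + DELIM + "b" + ESC + "(R" + "subtitle." + FT
wrong = DELIM + "a" + ESC + "(R" + "title :" + DELIM + "b" + "subtitle." + FT
right_runs = decode_runs(right, names)
wrong_runs = decode_runs(wrong, names)
```

In the second string the escape sequence is not repeated after $b, so the subtitle is read in the default set, which is why the third form in the example above is incorrect.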

3. IMPLEMENTATION 

The following will be done to realize 
these principles. 

• The ALA character set will be
redefined (see table 1).

• A new character sets present field will 
be defined. 

• Details of application such as distribu-
tion, filing indicator values, etc., will 
be defined. 

3.1 Discussion-ALA Character Set

A character set is a list of characters with 
the code used to represent each one. Using 
this definition, the ALA character set as
given in appendixes III.B and III.C of 
MARC Formats for Bibliographic Data ac-
tually consists of eight character sets. 

(1) ASCII and ALA diacritics and spe-
cial characters with their eight-bit 
code. 

(2) Superscript zero to nine, plus, 
minus, open and close parentheses 
with their eight-bit code. 

Table 1. Proposed Revised ALA Character Set
[Code chart not recoverable from the source scan. Surviving labels: rows 0-15, keyed by bits 4 3 2 1; column group "ASCII".]

(3) Subscript zero to nine, plus, 
minus, open and close parentheses 
with their eight-bit code. 

(4) Greek lowercase alpha, beta, and 
gamma with their eight-bit code. 

(5-8) The same characters with their six-
bit codes. 

The six-bit character sets are used to dis-
tribute MARC records on seven-track tapes. 
There are very few subscribers. It is un-
likely that a method can be devised for dis-
tribution of nonroman character sets rec-
ords on such tapes. The present seven-track 
subscribers should be asked if they know of 
any way to do so. If they do not, the alterna-
tives are to cease distribution of seven-track 
tapes entirely or limit them to those records 
containing only roman alphabet
characters (those without a character sets
present field). In the latter case, they should
pay proportionately less for their subscrip-
tion. 

[Table 1, continued. Code chart not recoverable from the source scan. Surviving label: "ALA Extension of ASCII".]

212 Journal of Library Automation Vol. 14/3 September 1981

The present four eight-bit character sets
and their escape sequences do not conform
to present standards. The present standards
did not exist when the character sets were
being defined. To avoid creating and
distributing records containing both standard
and nonstandard character sets and standard
and nonstandard escape sequences,
the ALA character set should be redefined. 
This change will be much less traumatic 
than it sounds. No new characters will be 
added; only the codes used to represent sub-
script, superscript, and Greek characters 
will be changed. These characters were 
found in the title field of 8.59 out of 1.1 
million records. If, as seems plausible, most 
or all MARC subscribers translate tapes into 
their own character set codes as a first step 
and for communication translate from their 
own codes into the ALA character set as the 
last step before distribution, only these two 
programs would need to be changed. 

The proposed redefined ALA character 
set is shown in table 1. On it, columns two 
through seven are the American Standard
Code for Information Interchange (ASCII),
which is a recognized standard with a regis-
tered escape sequence. Columns ten 
through fifteen are the ALA extension of 
ASCII with special characters and the three 
Greek letters in columns ten and eleven, su-
perscripts in column twelve, subscripts in 
thirteen, and diacritics in columns fourteen 
and fifteen. (It should be noted that six ASCII
codes will not occur in MARC records:
codes 5/14 circumflex, 5/15 underline, 6/0
grave, and 7/14 tilde are redundant with
the codes for these diacritics in columns
fourteen and fifteen; codes 7/11 left brace
and 7/13 right brace never occur because
these characters do not occur in biblio-
graphic data. No change in this practice is 
proposed. It is the fact that these last two 
codes are used in some nonroman alphabet 
standard character sets that makes nonro-
man six-bit codes impossible.) The ALA ex-
tension of ASCII is not an official standard 
now; it does not have an escape sequence 
yet. 
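
The column/row notation used above (for example 5/14) maps directly onto an eight-bit code value: the column supplies the high four bits and the row the low four bits. The following Python sketch shows the arithmetic; the function names and the "other" label for the remaining columns are illustrative assumptions, not part of the proposal.

```python
def code(column, row):
    """Convert column/row notation (e.g. 5/14) to an eight-bit code value."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0-15")
    return column * 16 + row


def region(value):
    """Name the part of the code chart that a code value falls in."""
    column = value >> 4
    if 2 <= column <= 7:
        return "ASCII"
    if 10 <= column <= 15:
        return "ALA extension of ASCII"
    return "other"  # columns 0-1 and 8-9


# The six ASCII codes that the text says will not occur in MARC records.
UNUSED_ASCII = {
    (5, 14): "circumflex",
    (5, 15): "underline",
    (6, 0): "grave",
    (7, 14): "tilde",
    (7, 11): "left brace",
    (7, 13): "right brace",
}

unused_characters = [chr(code(c, r)) for (c, r) in UNUSED_ASCII]
```

Under this arithmetic, 5/14 is hex 5E and 7/13 is hex 7D, which are the ASCII circumflex and right brace.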

In addition to the ALA extension of 
ASCII, there is a draft international stan-
dard extended Latin alphabet character set 
for bibliographic use, ISO DIS 5426 (table
2). While both sets are identical in purpose,
they differ in the characters they contain 
and the codes used to represent them. The 
ABACUS group has agreed that ISO 5426 
be used for international distribution of 
MARC records among the bibliographic 
agencies they represent once it is an approved
international standard; cf. LC Information
Bulletin, November 16, 1979, p. 475. The
Library will, however, continue to use the
ALA extension for U.S. distribution. Some of
the characters found only in the ISO set
could be added to the ALA extension
without affecting existing records. An ANSI 
Z39 subcommittee has been established to 
consider this possibility. While some 
changes may be desirable to the ALA char-
acter repertoire, it is important that this is-
sue not delay the separate matter of provid-
ing for inclusion of nonroman alphabets in 
MARC. 

3.2 Discussion-Escape Sequence 

For purposes of this discussion, escape se-
quences are defined as a combination of 
three characters. (See table 3.) The first is 
an escape character, hex 1/11. The second
character specifies which codes are having
different characters assigned to them, those
in columns 2-7 or those in columns 10-15.
The third character defines what characters 
are being assigned to these codes, e.g., Cy-
rillic, Greek, etc. This is a greatly simplified 
explanation of the escape sequence stan-
dards, ISO 2022 and ANSI X3.41. (Both are 
in the process of revision.) These standards 
provide for two types of escape sequences: 
public ones which reference registered 
character sets, and private ones for unregis-
tered character sets. While the meaning of 
the latter is governed by an agreement be-
tween the sender and the receiver, they are 
in conformity with the standard. Until the 
ALA extension of ASCII has a registered es-
cape sequence, such a "private" escape se-
quence could be defined for it in the charac-
ter set appendix and used. 

The second character of an escape se-
quence which changes the meaning of the 
codes in columns 2-7 contains either an 
open parenthesis, hex 2/8, or a less than 
sign, hex 2/12. The second character of an 
escape sequence which changes the mean-
ing of the codes in columns 10-15 contains 
either a close parenthesis, hex 2/9, or an 
equal sign, hex 2/13. 

The third character of escape sequences 
for certain registered character sets has 
been defined as follows: 


Set                                        Code
ASCII                                      [code not legible in source]
Russian (1967 Gost Standard) (table 3)     registration applied for, code pending
ISO Greek                                  5/8, uppercase X
ISO extended Cyrillic (table 3)            5/7, uppercase W

The sixteen codes in column three can be
used to designate sixteen different "private"
character sets. In MARC records, ASCII
and Russian would be assigned to columns
2-7, while Greek and the extended Cyrillic
(and the ALA extension of ASCII) would be
assigned to columns 10-15.

Table 2. Extended Latin Alphabet Character Set
[Code chart not recoverable from the source scan. Surviving labels: bits b7 through b1; rows 0-15.]


Escape sequences would be given where 
needed in data fields. If necessary, it is per-
missible to embed escape sequences within 
a word. For example, a Latin diacritic 
might be needed with an extended Cyrillic 
letter to represent a letter in one of the non-
Slavic languages of Central Asia which uses 
the Cyrillic alphabet. 
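
The three-character escape sequences discussed in this section can be sketched in Python as follows. This is only an illustration: the function and table names are made up, only the two parenthesis forms of the second character are modelled, and only the two third characters whose codes are given in the text (X for ISO Greek, W for ISO extended Cyrillic) are included.

```python
ESC = "\x1b"  # escape character, column 1 row 11

# Second character: which columns of the code chart are being reassigned.
TARGETS = {
    "(": "columns 2-7",     # open parenthesis, 2/8
    ")": "columns 10-15",   # close parenthesis, 2/9
}

# Third character: which character set is being assigned.
FINALS = {
    "X": "ISO Greek",               # 5/8
    "W": "ISO extended Cyrillic",   # 5/7
}


def make_escape(target, final):
    """Build a three-character escape sequence."""
    if target not in TARGETS or final not in FINALS:
        raise ValueError("unknown second or third character")
    return ESC + target + final


def parse_escape(sequence):
    """Return (columns affected, character set) for a three-character sequence."""
    if len(sequence) != 3 or sequence[0] != ESC:
        raise ValueError("not a three-character escape sequence")
    return TARGETS[sequence[1]], FINALS[sequence[2]]


to_cyrillic = make_escape(")", "W")
parsed = parse_escape(to_cyrillic)
```

Here `to_cyrillic` assigns the ISO extended Cyrillic set to columns 10-15, matching the assignment described above for MARC records.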

Table 3. Escape Sequence Character Set
[Code chart not recoverable from the source scan. Surviving labels: rows 0-15, keyed by bits 4 3 2 1; GOST 13052-67 Russian; ISO DIS 5427 Extended Cyrillic.]

In addition to escape sequences for
nonroman alphabets described above in which
one code stands for one letter, the escape
standards also define escape sequence
procedures for changing to multiple byte
character sets. Because the ideographic writing
systems of East Asia use thousands of
different characters, it will be necessary to use
two or three bytes/codes to identify a single 
specific character uniquely. The Japanese 
Industrial Standard character set, JIS 6226, 
uses two bytes per character, and it has been 
submitted to ISO to obtain a registered es-
cape sequence. The first volume of the Chi-
nese Character Code for Information Inter-
change, CCCII, has been issued; the second 
is expected in December. It uses three bytes 
per character. In all probability the LC/ 
RLIN East Asian cooperative project will 
adopt either these character sets and their 
escape sequences or machine reversible ad-
aptations of them. The need to expand East 
Asian character sets constantly to provide 
for infrequently used characters poses prob-
lems whose solutions cannot be predicted at 
this time. 
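
Once a multiple byte character set is in effect, a receiving program reads the data in fixed-width groups rather than one code at a time. A minimal Python sketch, assuming a fixed width per set (two bytes for a JIS 6226 style set, three for a CCCII style set); the function name and the sample byte values are arbitrary placeholders, not real character codes:

```python
def split_characters(data, width):
    """Split a run of bytes into fixed-width character codes."""
    if width < 1:
        raise ValueError("width must be at least 1")
    if len(data) % width != 0:
        raise ValueError("run ends in the middle of a character")
    return [data[i:i + width] for i in range(0, len(data), width)]


# Placeholder byte values, chosen only to show the grouping.
two_byte = split_characters(b"\x30\x21\x30\x22", 2)
three_byte = split_characters(b"\x21\x30\x21\x21\x30\x22", 3)
```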

3.3 Discussion-Character Sets Present Field

As specified in the fifth principle, there is
need for a special field which specifies what
character sets are present whenever a set
other than ASCII and the ALA extension of
ASCII is present in a record. The proposed
field will use tag 066 and be defined
as follows:

066 Character Sets Present 
This field specifies what character sets
are present in the record other than ASCII and the
ALA extension of ASCII. The field is not
repeatable. Both indicators are unused and 
will contain blanks. 

$a This subfield will contain all but the 
first character of the escape sequence 
to the default character set in 
columns 2-7 whenever the default 
character set is not ASCII. This is not 
likely to occur in records created in 
the United States. Since there can 
only be one default character set, the 
subfield is not repeatable. 

$b This subfield will contain all but the 
first character of the escape sequence 
to the default character set in 
columns 10-15 whenever the default
character set is not the ALA extension 
of ASCII. This is not likely to occur in 
records created in the United States. 
Since there can be only one default 
extension character set, this subfield 
is not repeatable. 


$c This subfield will contain all but the
first character of every escape sequence
found in the record, including any escape
sequence longer than three characters. If
the same escape sequence occurs more than
once, it will be given only once in this
subfield. The subfield is repeatable. This
subfield does not identify the default
character sets.
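
The rule for subfield $c can be sketched in Python as a scan over the data fields of a record. The sketch below is hypothetical: it assumes three-character escape sequences, lists them in order of first occurrence (the proposal does not specify an order), and prints the subfield delimiter as "$" and the two blank indicators as spaces.

```python
ESC = "\x1b"  # escape character


def subfield_c_values(fields, length=3):
    """Collect each distinct escape sequence, minus its first character."""
    seen = []
    for data in fields:
        i = data.find(ESC)
        while i != -1:
            tail = data[i + 1:i + length]   # all but the escape character
            if tail not in seen:            # each sequence is given only once
                seen.append(tail)
            i = data.find(ESC, i + length)
    return seen


def field_066(fields):
    """Return the printable content of field 066, or None if it is not needed."""
    values = subfield_c_values(fields)
    if not values:
        return None
    return "  " + "".join("$c" + value for value in values)


# Placeholder field data; the letters stand in for nonroman text.
fields = [ESC + ")W" + "ABC", ESC + ")X" + "DEF" + ESC + ")W" + "GHI"]
content = field_066(fields)
```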

Example: ƀƀ$c)W        A record containing the ISO extended
                       Cyrillic character set.

         ƀƀ$c)W$c)X    A record containing both the ISO Greek
                       and extended Cyrillic character sets.

3.4 Discussion-Other Details

When a field has an indicator to specify 
the number of leading characters to be ig-
nored in filing and the text of the field be-
gins with an escape sequence, the length of 
the escape sequence will not be included in 
the character count. 
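
A minimal Python sketch of this counting rule, assuming three-character escape sequences and a hypothetical final character "R"; the function name is illustrative. Leading escape sequences are set aside before the nonfiling characters are counted, so the indicator value counts only characters of the text itself.

```python
ESC = "\x1b"  # escape character


def filing_text(text, nonfiling, escape_length=3):
    """Return (leading escape sequences, text used for filing).

    The escape sequence at the start of the field is not included
    in the nonfiling character count.
    """
    i = 0
    while text.startswith(ESC, i):
        i += escape_length
    return text[:i], text[i + nonfiling:]


# Roman letters stand in for nonroman text; "R" is a hypothetical final character.
prefix, filed = filing_text(ESC + "(R" + "The title", 4)
```

With an indicator value of 4, the four characters "The " are skipped and the escape sequence is kept aside so the remaining text can still be interpreted in the right character set.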

When fields contain escape sequences to 
languages written from right to left, the 
field will still be given in its logical order. 
For example, the first letter of a Hebrew 
title would be the eighth character in a field 
(following the indicators, a delimiter, a 
subfield code, and a three-character escape 
sequence). The first letter would not appear 
just before the end of field character and 
proceed backwards to the beginning of the 
field. 
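
The position arithmetic in this example can be checked with a short Python sketch. The final character "H" and the stand-in letters are hypothetical, and the delimiter value hex 1F is the customary MARC value, assumed here.

```python
ESC, DELIM = "\x1b", "\x1f"  # assumed MARC control codes

indicators = "10"                # two indicator positions
escape = ESC + "(" + "H"         # three-character escape sequence, hypothetical final
title = "XYZ"                    # stand-ins for Hebrew letters in logical order

field = indicators + DELIM + "a" + escape + title

# 2 indicators + 1 delimiter + 1 subfield code + 3 escape characters = 7,
# so the first letter of the title is the eighth character of the field.
first_letter_position = field.index(title[0]) + 1
```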

A convention exists in descriptive cata-
loging fields that subfield content designa-
tion generally serves as a substitute for a 
space. An escape sequence can occur within 
a word, after a subfield code, or between 
two words not at a subfield boundary. For 
simplicity, the convention that an escape 
sequence does not replace a space should be 
adopted. One other convention is also advo-
cated: when a space, subfield code, or 
punctuation mark (except open quote,
parenthesis, or bracket) is adjacent to an escape
sequence, the escape sequence will come 
last. 

Wayne Davison of RLIN raised the fol-
lowing issue. After the Library of Congress 
has prepared and distributed an entirely ro-
manized cataloging record for a Russian 
book, a library with access to automated 
Cyrillic input and display capability will 
create a record for the same book with the 
title in the vernacular. (Since AACR2 says 
to give the title in the original script "wher-
ever practicable," the library could be said 
to be obligated to do so.) In such an event 
the local record could have all the authori-
tative Library of Congress access points. To 
keep this record current when the Library 
of Congress record is revised and redistrib-
uted, it would be necessary to carry the LC 
control number in the local record. Most
automated systems are hypersensitive to the 
presence of two records with the same con-
trol number. The two records can be easily 
distinguished: in the Library of Congress 
record, the modified record byte in field 
008 will be set to "o" and it will not have 
any 066, character sets present field. 
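
The test described in the last sentence can be sketched in Python. The function name is hypothetical, and the position of the modified record byte within field 008 is not stated in this paper; character position 38 is assumed here and passed as a parameter so it can be changed.

```python
def is_lc_romanized(field_008, tags, modified_position=38):
    """Return True for a record that is entirely romanized.

    field_008 is the content of field 008; tags is the collection of
    field tags in the record. modified_position is an assumption.
    """
    return field_008[modified_position] == "o" and "066" not in tags


# Placeholder 008 content: blanks except for the modified record byte.
romanized_008 = " " * 38 + "o" + " "
lc_record = is_lc_romanized(romanized_008, {"008", "245"})
local_record = is_lc_romanized(" " * 40, {"008", "066", "245"})
```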

A Comparison of OCLC, 
RLG/RLIN, and WLN 

University of Oregon Library 

The following comparison of three major 
bibliographic utilities was prepared by the 
University of Oregon Library's Cataloging 
Objectives Committee, Subcommittee on 
Bibliographic Utilities. Members of the sub-
committee were Elaine Kemp, acting assis-
tant university librarian for technical ser-
vices; Rod Slade, coordinator of the 
library's computer search service; and 
Thomas Stave, head documents librarian. 

The subcommittee attempted to produce 
a comparison that was concise and jargon-
free for use with the university community 
in evaluating the bibliographic utilities un-
der consideration. The University Faculty 
Library Committee was enlisted to review 
this document in draft form and held three
meetings with the subcommittee for that 
purpose. The document was also shared 
with library faculty and staff in order to 
elicit suggestions for revision.