Ibérica 24 (2012): 75-86

ISSN 1139-7241

Looking back or looking forward in corpus linguistics: What can the last 20 years suggest about the next?

Mike Scott

Aston University (United Kingdom)

mike@lexically.net

Abstract

Starting with a description of the software and hardware used for corpus linguistics in the late 1980s to early 1990s, this contribution discusses difficulties faced by the software designer when attempting to allow users to study text. Future human-machine interfaces may develop to be much more sophisticated, and certainly the aspects of text which can be studied will progress beyond plain text without images. Another area which will develop further is the study of patternings involving not just single words but word-relations across large stretches of text.

Keywords: history of corpus linguistics, MicroConcord, concordance, software design, collocational span.

Resumen

Mirando hacia atrás o mirando hacia delante en la lingüística del corpus: ¿Qué nos sugieren estos 20 años para las próximas décadas?

Esta aportación comienza con una descripción de los programas y los equipos informáticos que se emplearon en la lingüística del corpus a finales de los años 80 y comienzos de los 90 y seguidamente estudia las dificultades que han tenido que afrontar los diseñadores de programas informáticos para conseguir que los usuarios pudieran estudiar un texto. Cabe la posibilidad de que en el futuro se desarrollen interfaces hombre-máquina mucho más sofisticadas, y con toda seguridad se avanzará en aquellos aspectos textuales que puedan ser objeto de estudio superando el texto plano sin imágenes. Otro aspecto en el que se continuará avanzando será el estudio de los modelos o patrones que contienen no sólo palabras sueltas sino relaciones de palabras en tramos extensos de texto.

Palabras clave: historia de la lingüística del corpus, MicroConcord, concordancia, diseño de programas informáticos, colocaciones.

Introduction

The last twenty years have seen a revolution in the circulation of  information

and opinion just as important as the development of  metal-smelting, of

writing or of  printing in previous epochs. An apparently simple change in

technology (heating crushed rock, the creation of  a character set, a machine

to copy text quickly, a means of  linking up millions of  computers) brings

about an enormous social change: it becomes possible to store knowledge so

that generations can learn from their ancestors, to distribute it so that

everyone in a community can know more, and now with the Internet, to

enable discussion and exchange of  ideas with very little regard for where or

with whom one happens to be. Most of  the psychological and social impacts

of  those technological changes could not have been imagined by the

inventors and first users of  those technologies.  The first uses of  writing

were mostly for storing records concerning ownership and conquest but the

effects soon began to include a spread of  ideas and opinion and a sense of

history; early printing centred on culturally-sanctioned official works but

before very long led to challenges to the status quo, chiefly concerned with

the right to freedom of  speech. We are still in the very earliest years of  the

Internet revolution and we do not yet know all the changes it is bringing

about.

One of  them is a plague of  zombies: public spaces full of  people present in

body but not in mind. About twenty years ago I had my first experience of

Internet body/mind separation¹, sitting at my desk in my university office in

Liverpool but mentally dislocated to an Australian university, reading

documents stored on their servers. These online documents were highly

factual, like the earliest Sumerian records from 5000 years ago (and about

equally gripping). Terms like “online” and “server” were yet to be met and

of  course “social” did not yet collocate with “networking”.

What follows is my own personal retrospective, leading I hope to

anticipation of  some possible future developments, but with the warning

already implicit in what was written above: most of  the interesting

developments cannot easily be imagined in advance, even if  they seem

obvious in hindsight. That the motor car might lead to a network of  surfaced

highways, with road signs and indeed with traffic accidents probably was

predictable in the year 1896 or 1897, but Los Angeles’ enormous suburban

sprawl, the development of  out-of-town shopping malls and the blight on

town centres were not.

Changes

In the early 1990s most educated people had never heard of a “concordance”; the few that had associated it with study of religious text and an enormous amount of manual compilation. The word “concordancer” was even more restricted.² I had come across the form “concordancing” myself in the previous decade, thanks to Tim Johns: he and I were both officially concerned with English for Specific Purposes and that is really why we came into contact (Scott, 2012). However, Tim’s enthusiasm for what micro-computers (as they were then known³) could be made to do matched my interest and so in the late 1980s we collaborated on a concordancer, which was published by Oxford University Press as MicroConcord (Scott & Johns, 1993): the element “micro” and the mid-word capital letter very much matching the spirit of the times.

Figure 1 shows 1993 technology: the large floppy disk which held the

program, the manual, and a card to place above the function keys on the

keyboard. There were two floppies for different capacities of  disk-drive;

some users had no hard disk so had to run the program direct from the

floppy.

Figure 1. MicroConcord (Scott & Johns, 1993), 5.25” floppy disk and F-key card.

At that time computer memory was very limited. MicroConcord required 200k of RAM; a standard micro-computer would have at least 640k but usually a whole megabyte.⁴ Disk space was at a premium too: not only did some micro-computers have no hard drive at all, but costs were quite high. In October 1992, IBM launched the ThinkPad (mid-word capitals again) starting at US$4,350, with between 4 and 16 MB of RAM, a 10.4 inch colour display (640 by 480 pixels) and a huge 120MB hard drive (Stengel, 2012). That was a leading-edge computer, much more advanced than the university and lecturer machines for which MicroConcord was designed. In May 1992, WordPerfect 5.1 was issued; it cost US$495.⁵ In 1994, Apple Computer ran a magazine ad with a photo of what a bike messenger cum screenwriter would have on his (of course) Apple PowerBook (mid-word caps here too): “My first screenplay, a dictionary, a thesaurus, a spellchecker ... the number of a girl from my screenwriting class, a list of good bike repair shops, a detailed map of downtown, my résumé, my grocery list, Microsoft Word, the number of Ray Bazire who owes me money”, etc. (Wichary, 2012). Clearly the PowerBook copywriters are suggesting that a computer does not only help users to do their work but also stores a variety of non-text resources and personal notes. 2012 advertising copy would be able to presume that many of these resources do not need to be stored on the user’s computer but can be found online.

This little foray into a history which anyone aged 40 or more may remember,

whether sharply or fuzzily, shows two things. First, the technology has

changed very fast indeed, and second but more interestingly, user

expectations and knowledge have changed importantly too. 

In the early 1990s I was responsible, amongst other things, for introducing new overseas post-graduate students arriving at Liverpool University to the computer facilities they would have to use in their pre-sessional EAP classes. Students had to prepare a sizeable piece of academic writing concerned with their specialist subject, deliver a mini-presentation on the same topic, and design a poster for a poster session at the end of their 6-, 10- or 13-week course, so we wanted them all to feel confident with the university’s computer systems. I do not think email was yet a requirement but the ability to type and print out one’s academic writing was, because the student’s own department would very soon require this too. At that time it was not uncommon for students from certain countries in the Middle East and South Asia not to know their way around the computer keyboard, so much hunting and pecking took place; students from the EEC (it became the EU in 1993) typically had some basic familiarity with the keyboard but still needed a lot of help in getting started with logging into the university computers, using email and getting started with word-processing.⁶

There were two main problems besides hunting and pecking. First, our university systems were somewhat complex and tricky, designed by computer engineers and not yet tuned to the knowledge and skills of ordinary users. Logging in (at the University of Liverpool in those days) assumed an awareness of the differences between Unix and Windows, since the university’s core system was Unix but the software we ran was within a Windows environment running on top (not to mention another system to handle the network of different servers)! In the same way, starting a car engine in the 1900s required the user to get out a starting handle and crank it. The software of the 1990s was much less standardised than it is now, so users did not know what was possible, what each function was called, or how to access it! Likewise, even as late as the 1980s you might meet someone who would claim to know how to drive a VW but not a Ford, because the basic functions were not yet standardised.

Second, many of my adult students were afraid to experiment, for fear of breaking something or looking foolish. This was the same problem of risk-taking that I was very used to as an EAP and EFL teacher, but here it was amplified for some by the fear the humanities-trained student has of science and engineering.

At that time, the idea of a concordance was a tricky one. Any student (or worse, because usually more conservative, any teacher) found it immediately daunting, because we are all trained to read linearly. To see a screenful of text where each line was unrelated to the line above and where words were incomplete at left and right edges of the KWIC concordance was very distracting and impeded face validity. Even now it is hard to learn to read a concordance vertically, sorting on the collocates at L1 or R1 position to gain an impression of word-patternings.
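
As an illustration of reading a concordance vertically, the following is a minimal sketch in Python, not the code of MicroConcord or WordSmith Tools. The function names (`concordance`, `sort_on`, `show`), the word-based context window and the sample sentences are all assumptions made for this example. It builds concordance lines for a node word and sorts them on the first word to the left (L1) or to the right (R1).

```python
import re

def concordance(text, node, span=4):
    """Return (left, node, right) token lists for each hit (hypothetical helper)."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def sort_on(lines, position):
    """Sort concordance lines on the collocate at 'L1' or 'R1'."""
    def key(line):
        left, _, right = line
        if position == "L1":
            return left[-1].lower() if left else ""
        return right[0].lower() if right else ""
    return sorted(lines, key=key)

def show(lines, width=30):
    """Lay the lines out with the node word aligned in a central column."""
    rows = []
    for left, node, right in lines:
        rows.append(f"{' '.join(left):>{width}}  {node}  {' '.join(right)}")
    return "\n".join(rows)

# Invented sample text, for illustration only.
sample = ("The old bank closed. A river bank can flood. "
          "She went to the bank yesterday. The bank manager resigned.")

by_r1 = sort_on(concordance(sample, "bank"), "R1")
by_l1 = sort_on(concordance(sample, "bank"), "L1")
print(show(by_r1))
```

Sorting on R1 brings together lines whose right-hand neighbours are alphabetically close, which is what lets the eye run down the column next to the node word. A real concordancer would also cut the context by characters rather than by words, which is why words appear incomplete at the screen edges.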

The problem of  unfamiliarity has diminished greatly because we are all in the

habit of  using search-engines, and essentially these give a view which is

rather similar to a concordance, with a set of  unrelated entries, usually with

our search-word highlighted and centred in the display.


Design problems

My own work in devising corpus linguistic software had to solve three kinds

of  problem: 

• making each function work as it should;

• allowing the user to know what the appropriate settings are and alter

them; and

• explaining the point of  each function.

Compare that with the designers of  Microsoft Word, where most functions

such as inserting a footnote or a picture with text flowing around it, or a

section in italics, are ones which the literate reader and writer already knew

about in printed essays, books, magazines, etc. even if  they did not know

how to carry them out in MS Word. And many of the technical terms,

“footnote”, “italics” and so on, were already in standard use (admittedly

users did have to learn a new meaning for the verbs “crash” and “back up”).

Accordingly, MS Word’s designers did not need to worry much about the

third problem. I have found it in some ways the hardest of  the three, in the

sense that a lot of  time needs to be spent on designing the Help system and

then on software demonstrations, training workshops and follow-up email

support. Empathy is required, an ability to guess what the other person

imagines, knows and does not know; an ability all teachers need.

The first of the problems above is technical: it is not always easy but it

requires patience and problem-solving, in thinking of  ways to get something

done despite limitations of  machines. On the whole that has probably taken

up about one-third of  my time developing corpus software. In the early years

the shortage of  memory and disk space meant that ways had to be found to

compute in the most economical way possible; this made it hard to read the

code later and remember the various space-saving tricks employed – in other

words, debugging and improving was made harder. In later years, software

has become bloated because there is no longer the same need to find really

economical ways of  solving a problem. Moore’s Law (Moore, 1965) has

meant dramatic improvements in the speed of  the chips inside the computer,

the size of  the disk drives and memory, which in turn has made corpus

procedures work faster. We may now require more text to be processed and

corpora have also grown in size, but still a procedure which ten years ago

would have slowed things down unacceptably can now be allowed, so more

error-checking to avoid crashes can be brought in. At the same time,

programmers like to find elegant solutions, and I will sometimes

unnecessarily spend hours refining a routine in a search for elegance and

efficiency simply because I want my code to satisfy me aesthetically. 

The second problem concerns the human-machine interface and is much

trickier. To make the way to operate a tool so intuitive that a user can succeed

without looking for help is the aim,⁷ and part of the solution is to keep things

simple so that the burden of  choosing does not overload the user. In the case

of  a pencil or a pen there isn’t a problem anyway since the tool can only do

one thing, but a multi-faceted tool like an automobile involves lots of  choices

about fuel, road traction, safety etc., so much so that we do not let people drive

one without quite a lot of training and basic skill testing. On the other hand,

simplicity means strait-jacketing the user, and it seems silly to restrict what is

possible merely because it is hard to show the user what the choices are. 

Corpus linguistics in the future

Let us now turn to the future. Will corpus linguistics grow as a discipline, or

maybe even die out? What directions may corpus research follow?

Sound and video

Twenty years ago the PC already was capable of colour and sound, but corpus software typically only used three or four of 16 basic colours and no sounds except beeps indicating an error. Even now, corpora with sound and video accompaniment are few and at an early stage. There are many databases of isolated utterance recordings made for general computational linguistic projects (for example at the Linguistic Data Consortium or LDC⁸), but as these do not consist of normal running text or conversation, they do not contribute straightforwardly to corpus linguistics. The International Corpus of English (ICE) corpora do contain sound files (ICE-GB has a total of 70 hours of recorded speech, for example).⁹ MICASE at the University of Michigan has 200 hours, just under 2 million words of Academic English.¹⁰ The British National Corpus has 10m words of speech, but did not include the corresponding sound files.¹¹

In the future, I believe ways could be found to transcribe corpora based on radio and TV programmes and movies. As always, copyright is the main stumbling block, but many TV programmes are pre-scripted or transcribed in the production process so nearly all the work is done and technology to align the sound/video file with the text transcript is already available. Copyright is not necessarily a problem if the resulting video is sold, as the presence of DVD stands in supermarkets shows.

Individual corpora

One of the strengths of WordSmith Tools and similar software is the ability to process any corpus the user has access to, which can include official corpora like the BNC but, more interestingly, corpora that users build up themselves, by downloading texts or by collecting them institutionally from colleagues or students. As electronic resources on the web are increasing, it is very likely that more and more informal or home-made corpora will be built and used.

Corpus linguistics and other disciplines

It was already clear twenty years ago that corpus tools have a lot to offer

language teachers (not that all teachers expressed interest, or in my opinion

need to) but getting corpus resources used by students is still not a general

practice in schools world-wide. With time, I foresee increasing expansion.

However, the schools and colleges in most countries are still seriously

restricted in computer (as opposed to mobile phone) availability and

resources are simply not there for the purchase of  software. It is likely that

increasing use will be made of  free software like AntConc (Anthony, 2012),

which shares many facilities of WordSmith.¹² It is not likely that interest in

language-learning will diminish over the foreseeable future, though the mix

of  languages studied will carry on changing.  At the same time corpus

resources and corpus tools will become standard resources. Not everyone

will need to use them, for not every gardener uses a spade, but they will

become standard tools used by a wider range of  professionals: historians,

biologists, medical and political science students, for example. That may

mean that corpus linguistics itself  ceases to be a discipline in its own right

(there is no department of  Spade Sciences that I’m aware of). 

Certainly corpus tools will continue to develop. Corpus Linguistics is not just about data resources, such as corpora. It is about “adding value to data”. That is, we might drown in a flood of data if we were not able to filter it and seek out patterns in it. For example, after early emphasis on single words or phrases of interest to a researcher, studying their Keyword-in-Context (KWIC) contexts trying to establish or refute typicality, we have now moved further into much more focussed study of collocation, in other words looking at how patternings within the context give a richer understanding of word patterning.

Collocation

Collocation must here be understood in its widest sense. The influential OSTI Report, reporting on work carried out in the 1960s, appeared to show that although “each node has an infinite region of influence, the influence decreasing the further away from the node you go …” (Sinclair, Jones & Daley, 2004: 48), it is very difficult in practice to find significant collocates of a word if one’s horizons extend beyond four words to the left or right. Sinclair, Jones and Daley (2004) do not put it metaphorically, but it seems they viewed collocation as one might view magnetism, an attractive force which tails off rapidly with distance. On the other hand, armchair experimentation tells us that if say “eat” and “bananas” are collocates, it would be very possible for a large number of words or even clauses or sentences to be found between the two tokens, as in a story beginning “Let me tell you what to eat and what not to eat when travelling in …”. Accordingly, perhaps we should think of the attraction between node and collocate as also sharing some qualities with gravity, which, though it weakens, still acts over very great distances. Further, some space must be left for a kind of negative collocation, where a node shows a tendency to avoid a given collocate (much as human beings avoid each other). Hoey’s theories (Hoey, 2005; but also in numerous other publications) stress that words keep or avoid company over much greater spans than just four or five words. Similarly we have seen increasing interest in multi-word units, n-grams, bundles, clusters, and concgrams (Cheng, Greaves & Warren, 2006).
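
The effect of the collocational span can be shown with a small sketch in Python. It is only an illustration under stated assumptions: the function name `collocates` is hypothetical, the continuation of the story after “travelling in” is invented for the example, and raw co-occurrence counts stand in for the significance testing a real study would need.

```python
import re
from collections import Counter

def collocates(text, node, span=4):
    """Count the words found within `span` tokens either side of each node hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Window of up to `span` tokens to the left and to the right.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# The ending of this sentence is invented purely for illustration.
story = ("Let me tell you what to eat and what not to eat when travelling "
         "in the tropics: the local bananas are wonderful.")

near = collocates(story, "eat", span=4)   # the traditional four-word horizon
far = collocates(story, "eat", span=15)   # a much wider horizon

print(near["bananas"], far["bananas"])
```

With a span of four, “bananas” never falls inside the window of either occurrence of “eat”; with a span of fifteen it is counted once for each occurrence. The price of the wider span is that many more unrelated words enter the window too, which is one reason significant collocates are hard to find at a distance.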

Conclusion

Corpus Linguistics may or may not survive as a discipline but I am very

confident that the ideas and resources built on foundations going right back

to the 1930s will continue to develop and shape resources and tools. This will

in turn keep corpus linguists and members of  many other disciplines, and for

that matter readers of  Ibérica, busy for many years to come. 

[Paper received 13 March 2012]

[Revised paper accepted 21 May 2012]

References

Anthony, L. (2012). AntConc. URL: http://www.antlab.sci.waseda.ac.jp/antconc_index.html [01/03/12]

Cheng, W., C. Greaves & M. Warren (2006). “From n-gram to skipgram to concgram”. International Journal of Corpus Linguistics 11: 411-433.

Gibson, W. (1986). Count Zero. New York: Victor Gollancz.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London: Routledge.

Moore, G.E. (1965). “Cramming more components onto integrated circuits”. Electronics vol. 38 No. 8. URL: ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf [01/03/12]

Scott, M. & T. Johns (1993). MicroConcord. Oxford: Oxford University Press.

Scott, M. (2012). URL: http://www.lexically.net/personal_pages/memories%20of%20Tim%20Johns.html [01/03/12]

Sinclair, J., S. Jones & R. Daley. [1970] (2004). English Collocation Studies: The OSTI Report. London & New York: Continuum.

Stengel, S. (2012). URL: http://oldcomputers.net/ [01/03/12]

Wichary, M. (2012). URL: http://www.aresluna.org/attached/computerhistory/ads/international/apple/pics/annual94-powerbook 5 [01/03/12]

Originally qualified as a language teacher, teaching English for the British Council in Brazil and Mexico, Mike Scott eventually moved to Liverpool University where he worked first in Applied Linguistics with emphasis on language teaching and English for Specific Purposes. A parallel interest in Corpus Linguistics and software design and development, however, eventually led to the publication of first MicroConcord and then WordSmith Tools. Nowadays he works and researches in Corpus Linguistics while maintaining the development of WordSmith Tools and supporting its extensive community of users in many parts of the world.

Notes

1 See the novels of William Gibson, for instance Count Zero published in 1986, for early awareness of this.

2 Still is. A new copy of MS Word 2010 underlines the word in red.

3 Computers at the time were either mainframe, mini- or micro-computers. The term PC had not yet come into common use, nor had its association with the Windows operating system. The size of the computer’s box has now given way to whether it is placed on one’s lap or one’s desk, and though huge air-conditioned computers still exist these are likely to be called super-computers.

4 A 2012 version of Windows with the 2012 WordSmith Tools would be uncomfortable with less than one thousand times as much RAM.

5 See Polsson’s Chronology of Personal Computers at URL: http://pctimeline.info/comp1992.htm

6 Another very dated word: “word-processing” had come in with dedicated “word-processors” in the 1980s, that is specialised computers which could only be used for preparing documents, because computing still seemed redolent of white-coated technicians, air conditioning, expensive machinery.

7 I am very aware that in this aim I get a very mediocre score!

8 URL: http://www.ldc.upenn.edu/Catalog/topten.jsp

9 For ICE, see URL: http://ice-corpora.net/ice/index.htm. For ICE-GB see http://www.ucl.ac.uk/english-usage/projects/ice-gb/index.htm

10 URL: http://micase.elicorpora.info/

11 They are being edited and may be available by the time you read this; see URL: http://www.phon.ox.ac.uk/SpokenBNC. But also see Dave Lee’s Devoted to Corpora site for more sound archives at URL: http://tiny.cc/corpora

12 And has been produced by Laurence Anthony in Japan with my blessing and encouragement since 2002.