Melania Sánchez Reed & Antonio Miranda García, Selim 16 (2009): 121–147
ISSN: 1132–631X

A SEMI-AUTOMATIC PART-OF-SPEECH TAGGING SYSTEM FOR MIDDLE ENGLISH CORPORA: OVERCOMING THE CHALLENGES

Abstract

Historical corpus annotation is very much a manual, time-consuming task. The last few years have witnessed advances in the use of computational tools for the annotation of Middle English corpora. In 2007 a semi-automatic system for part-of-speech (POS) tagging, based on the use of parallel texts, was developed at the University of Texas. Although this work still revealed manual annotation to be more accurate, it proved the potential of computational tools for the creation of tagging systems. We propose the development of a semi-intelligent and semi-automatic POS tagging program for ME corpora capable of tagging any given ME text with a high rate of success; no such computational system is currently available. This task entails challenges of a twofold nature: (a) linguistic difficulties; and (b) computational limitations. This paper discusses these difficulties and provides possible solutions to them in order to create a tool that will facilitate POS tagging and help in searching for linguistic information.

Keywords: POS tagging, Middle English, historical corpora, computational linguistics.

Resumen

The annotation of historical corpora is largely a manual and laborious task. Recent years have witnessed many advances in the use of computational tools for the tagging of Middle English corpora. In 2007 the University of Texas developed a semi-automatic morphological tagging system based on the use of parallel texts and, although the study still showed manual tagging to be more accurate, it demonstrated the potential of computational tools for the creation of tagging systems. We propose the development of a semi-intelligent and semi-automatic morphological tagger for Middle English corpora capable of tagging any text with great accuracy; at present, no such system is available. This task poses both linguistic and computational challenges. This article analyses these problems and offers possible solutions with a view to creating a tool that facilitates morphological tagging and assists in the search for linguistic information.

Keywords: morphological tagging, Middle English, historical corpora, computational linguistics.

1 Introduction

Electronic corpora are almost inexhaustible sources of linguistic knowledge. However, without the appropriate annotations most of this information would be as lost as a needle in a haystack. Part-of-speech (henceforth POS) annotation/tagging is undoubtedly the most common type of corpus annotation, simply because it stands as the basis of all corpus studies. Assigning POS tags to raw corpora is essential for performing further analyses, such as syntactic parsing and semantic field annotation (McEnery and Wilson 1997), and furthermore for performing collocation studies and obtaining word frequency lists, among others. All this is of great help in fields such as lexicography and language teaching and learning.

Many automatic POS taggers are available on-line nowadays that can tag large amounts of raw text in a matter of seconds. However, this task was entirely manual prior to 1971, when Greene and Rubin developed TAGGIT, the first POS tagging program.
Although the TAGGIT system was very primitive and at first could guarantee a success rate of just 71%, many different systems have been developed over the years, each one providing new improvements, such as CLAWS (1983), developed by UCREL at Lancaster; the Brill Tagger (1993); or GENIA (2006), which also performs shallow parsing and named-entity recognition for biomedical texts. Indeed, the team behind CLAWS, which served to tag the well-known BNC, worked on improving the system for a number of years after its development in 1983, and by 1994 it could already claim a success rate of up to 97–98%. In light of this, most computational linguists today consider automatic POS tagging to be a closed case, although there is still much controversy as to what extent it is actually solved (see Giesbrecht and Evert’s 2009 discussion on the nature of five current German tagging systems). If, as Wolfgang Fischl claims, “human annotators agree in just 96% of the cases” (2009: 7), and this is the same percentage of success that an automatic tagger can feasibly achieve on average, then the remaining percentage can be attributed to “the ambiguity in the language itself” (2009: 7) and not, therefore, to the tagging programme’s limitations. All in all, and bearing this in mind, we could easily consider the task of automatic POS tagging of English texts to be virtually resolved.

On the other hand, work on English historical corpora has lagged behind its modern counterpart: in fact, not until the last few decades has historical linguistics even become “strictly corpus-based”. The common procedure was to take “a selective approach to empirical data” and simply to “look for evidence of a particular phenomena […] making rough estimates at frequency” (McEnery and Wilson 1997).
However, since 1984, when the Helsinki Corpus of English Texts: Diachronic and Dialectal (the most famous historical corpus of English) was compiled, many other historical corpora have been developed or are currently in the making: the Innsbruck Computer Archive of Middle English Texts (1994), the Corpus of Early Middle English Tagged Texts and Maps (1997), or the Corpus of Early English Correspondence (1998), to quote but a few. Here in Spain we should mention the Coruña Corpus, developed at its namesake University, and The Corpus of Late Middle English Scientific Prose, currently being compiled with the collaboration of the Universities of Málaga, Oviedo, Murcia, Jaén and Glasgow. While similar in scope (both deal with scientific English prose), the two projects also differ in important respects. Most importantly, the former corpus is tagged and diachronic, while the latter is POS-annotated and synchronic. The Málaga Corpus, as we can call it for short, pursues the electronic editing of the Middle English material housed in the Hunterian Collection at Glasgow University Library. This corpus currently holds approximately 250,000 words, and the final target is to reach no less than half a million words.

Given how late historical corpora themselves were created, it stands to reason that the development of automatised POS tagging systems for such corpora has been equally slow. The current state of the art reveals only two attempts at creating a system for the automatic POS tagging of English historical texts, the first regarding Old English corpora and the second dealing with the tagging of Middle English texts, which is in fact the object of our present study.
A part-of-speech tagger for OE was developed at Zurich University (Switzerland) by Beni Ruef; it consisted of a rule-based tagging system following transformation-based learning.¹ A manually tagged corpus of 108,000 words was employed for training the program to learn the rules of the language. The overall rate of successfully tagged words was 88.5% (91.5% accuracy for known tokens and 56.5% for unknown tokens). As we can see, the main problem with this system is that it could not recognise items that had not previously been included during the training process.

¹ See also Miranda-García et al. 2000 and 2001 on the implementation of a POS tagger of OE, developed at the University of Málaga.

In turn, the ME POS tagging system, developed by the University of Texas at Austin in 2007, attempted to create a semi-automatised tagger for ME based on the alignment of already tagged parallel contemporary English texts. The parallel texts chosen were excerpts taken from the Bible. This new tagger was trained using the modern tagged texts as a basis; through multiple alignment with the ME texts the appropriate tag was to be identified. Moreover, to ensure a higher rate of success and further automatisation, a bigram tagger was trained on these alignments. Finally, the C&C (Curran and Clark) maximum entropy tagger, which was initially employed to tag the modern version of the Bible, was then bootstrapped onto the ME text which had been, in turn, tagged by the trained bigram tagger (Moon and Baldridge 2007: 393). They also attempted the use of unsupervised bootstrap methods to train the tagger without previously having to tag the texts manually. However, results revealed that “a manually annotated training set of 400–800 sentences surpassed our best bootstrapped tagger”. Overall, their methods managed to obtain “an accuracy of 84%” (Moon and Baldridge 2007: 391).
Note further that this method presents a clear limitation, as it relies on the existence of a text written in two languages and, consequently, proves unfeasible for the tagging of ME texts that do not have modernised equivalents, which is our case.

As we can see, even though the last few years have witnessed this important advance in the development of a semi-automatic system for POS tagging of historical texts, much work still needs to be done before we can claim success. Nevertheless, these findings are encouraging as to the potential of computational tools for the creation of tagging systems and will undoubtedly lay the groundwork for the development of a system of these characteristics.

We propose to devise a semi-intelligent and semi-automatic part-of-speech tagging program for ME corpora that is capable of tagging any given ME text with a very high rate of success, higher than that of any comparable computational system that we know of. However, this task confronts us with several challenges. If contemporary English POS taggers pose problems when it comes to ambiguity and unknown words (among others), the range of difficulties encountered in the creation of an automatic POS tagger for ME words is considerably wider due to the nature of the language, mainly its orthographical variation. Consequently, we face challenges of a twofold nature: (a) linguistic difficulties; and (b) computational limitations. The present paper discusses the nature of each of these difficulties and provides solutions, whenever possible, to overcome them, in order to create a useful tool that will facilitate the POS tagging process and, therefore, help the linguist’s search for linguistic information.²

The present paper is organised into four sections. Section 2 deals with the challenges and is, accordingly, divided into two subsections.
The first (2.1) enumerates and discusses linguistic difficulties, and the second (2.2) accounts for computational limitations. Section 3, in turn, provides possible solutions to overcome the difficulties enumerated in the previous section. Finally, section 4 provides the conclusions.

² The present research, entitled Etiquetador Morfológico (POS-tagger) de Inglés Medio, is funded by the Autonomous Government of Andalusia (project P09-HUM–4790). This grant is hereby gratefully acknowledged.

2 The challenges

Before we begin to discuss the difficulties posed by the design of a semi-automatic ME tagger, we consider it important to establish, if only at a very basic level, the main steps involved in the creation of an algorithm for any automatic POS tagger, regardless of the language. We follow Wolfgang Fischl’s summary of the task. He divides the process into three basic steps. First comes tokenization, wherein “the text is divided into tokens”, including “end-of-sentence punctuation marks and word-like units”. Ambiguity look-up then follows. Here each token that has been previously identified will be provided with a number of “possible part of speech tags”. For example, ME bath would be tagged initially as both a noun and a verb. The final step is disambiguation: every word that has been assigned more than one tag in the previous phase will be given a single, correct tag. The program will have to choose the correct POS tag and assign it to the token in question. Homonyms and polysemic words are particularly bound to undergo the latter process. This complicated task can be solved by using two different types of taggers: rule-based and stochastic ones (Fischl 2009: 2). Section 2.2, which deals with computational matters, expounds further on the nature of these two systems.
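The three steps summarised above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the tag labels, the toy lexicon and the single contextual rule (prefer a noun reading after a determiner, otherwise a verb reading) are hypothetical choices made for the example; they are not part of Fischl's description or of the system proposed in this paper. The sample phrases with bathe are taken from MS Hunter 497.

```python
import re

# Hypothetical toy lexicon: each spelling maps to its possible tags.
LEXICON = {
    "a": {"DET"},
    "bathe": {"NOUN", "VERB"},
    "of": {"PREP"},
    "hem": {"PRON"},
    "wasshe": {"VERB"},
    "or": {"CONJ"},
    "well": {"ADV"},
    "þe": {"DET"},
    "heed": {"NOUN"},
}


def tokenize(text):
    """Step 1: split the text into word-like units and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def lookup(tokens):
    """Step 2: attach every candidate tag; unknown words receive {'UNK'}."""
    return [(tok, set(LEXICON.get(tok, {"UNK"}))) for tok in tokens]


def disambiguate(candidates):
    """Step 3: reduce each candidate set to a single tag."""
    tagged = []
    previous_tag = None
    for tok, tags in candidates:
        if len(tags) == 1:
            tag = next(iter(tags))
        elif previous_tag == "DET" and "NOUN" in tags:
            # Assumed rule: after a determiner, prefer the noun reading.
            tag = "NOUN"
        elif "VERB" in tags:
            # Assumed fallback for the toy data: take the verb reading.
            tag = "VERB"
        else:
            tag = sorted(tags)[0]
        tagged.append((tok, tag))
        previous_tag = tag
    return tagged


def tag_text(text):
    """Run the three steps in sequence."""
    return disambiguate(lookup(tokenize(text)))


print(tag_text("a bathe of hem"))
print(tag_text("wasshe or bathe well þe heed"))
```

A real system would replace the hand-written rule in the third step with one of the rule-based or stochastic methods discussed in section 2.2.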
2.1 Linguistic difficulties

Spelling and word formation in the ME period were irregular due to a lack of standardisation in the language. Moreover, manuscripts were often compiled by different scribes or written by the same scribe but compiled from several different sources, many belonging to different dialects or even different languages, mainly Latin. The following section presents some of the linguistic problems that one encounters when dealing with ME texts.

2.1.1 Choosing the transcription

The first step in building a successful POS tagger for ME begins at the level of transcription. The type of transcription we are working with determines how feasible it is to tag the text at all, and whether this can be accomplished automatically or only manually. We provide three different models of transcription below in order to ascertain, according to their specific features, whether they would be compatible with a semi-automatic POS tagging system.

Let us begin by considering a graphetic diplomatic transcription, which maintains the text as originally written by the scribe insofar as it not only preserves the original spelling, emendations and other scribal mistakes, but also reproduces the abbreviation symbols without expanding them. This model will normally maintain the original punctuation as well. This type of transcription is completely incompatible with POS tagging, manual or automatic. First of all, it is graphetic and so reproduces every distinct letter type, resulting in a number of different graphs to represent the same letters. For example, in Fig. 1 below, we have an instance of two letter ⟨r⟩ shapes found within the same word. Due to the great number of symbols that the tagger would have to learn, it would be highly time-consuming and not at all practical.
Furthermore, if no expansions are provided for the many abbreviations that appear across the witnesses, POS tagging is an impossibility even in manual tagging: even if the tagger could be trained to understand the different symbols, this would only be possible if there were a one-to-one correspondence between the symbols and the letters they represented during the period, which, of course, is not the case. The same symbols are frequently found to stand for different letters, not only intertextually, but also intratextually. See, for instance, the cases of ⟨peper⟩ and ⟨parte⟩ (Fig. 2–3), where the groups ⟨ar⟩ and ⟨er⟩ have been abbreviated by means of the same symbol, a bar across the stem of letter ⟨p⟩. Last of all, if the punctuation remains the same as in the original MS, correct tokenization also proves impossible.

Fig. 1. f. 47v (Hunter 328)
Fig. 2. f. 59r (Wellcome 397)
Fig. 3. f. 62v (Wellcome 397)

Our second model of transcription proposes, again, a semi-diplomatic transcription, using a graphemic, rather than a graphetic, approach, i.e. not distinguishing individual letter types (for example, s longa as opposed to diamond-shaped and sigma-shaped s) but presenting most phonemes in the text via one and the same graph (in the above case, ⟨s⟩ for all instances). Moreover, abbreviations would also be expanded. However, punctuation would still remain the same as in the original source, and, for this reason, our second model also has to be rejected for the achievement of a successful POS tagging process.

Our third and final model is also a graphemic semi-diplomatic transcription like the one above, differing only in that sentential punctuation is now soft-marked according to some rules.
This feature is key to obtaining a model of transcription that can feasibly be processed by an automatic tagger, since having a more or less standard punctuation will allow a computer programme to identify sentence patterns with which to perform the automatic POS tagging process.

Note that all three models involve a (semi-)diplomatic transcription. Indeed, a general edition, wherein punctuation, and sometimes even spellings, are regularised/standardised to a degree (take for instance the Riverside Chaucer), would solve many problems instantly. However, we wish to work with diplomatic transcriptions, as we aim to provide the reader with versions of original sources as unbiased as possible, ones that are reliable for linguistic, codicological, palaeographical and historical research purposes.

2.1.2 Item/word recognition: tokenization

Across ME witnesses we find many instances of words that appear separated as if they were two different units when they are in fact just one, such as ⟨be fore⟩, ⟨with out⟩ and ⟨a boue⟩, below.

Fig. 4. f. 53v (Wellcome 397)
Fig. 5. f. 66v (Wellcome 397)
Fig. 6. f. 49r (Hunter 328)

We have just stated in 2.1.1 that we are to follow a semi-diplomatic transcription to reproduce the original source faithfully. Therefore, if the scribe wrote these words separately for any given reason, then we must respect this and reproduce it accordingly in the transcription. However, when this transcription is fed into a POS tagging program, these words would be considered as two different items. For example, ⟨be fore⟩ would appear as ⟨be⟩ and ⟨fore⟩ and be tagged as verb and preposition, respectively.

We can also find the opposite situation: words that appear written continuously as a single token when they are in fact two separate words. Note cases such as ⟨adragme⟩ and ⟨aman⟩ (Fig. 7–8), consisting of a determiner and a noun.
These words would be understood as one item and as a result fail to be recognised by an automatic POS tagger.

Fig. 7. f. 54r (Wellcome 397)
Fig. 8. f. 53v (Wellcome 397)

Another problem when dealing with ME texts is line-final word division. Sometimes we have a hyphen at the end of the line indicating that the word continues on the next line. Theoretically at least, we could train the tagger to recognise these hyphens, as they are accurately reproduced in the transcription. However, what happens when we have no hyphen indicating line-final word division, which is the unfortunate case more often than not?

The genitive morpheme is also troublesome. Take a noun phrase like ⟨bores grece⟩. Apart from its being a compound and, therefore, to be considered as one single unit (compounds will be discussed further below), how can we make the system understand that the first token is a noun in the genitive case and thus avoid the real danger that the system automatically interprets it as a noun in the plural? We use the apostrophe nowadays to tell the genitive singular ⟨’s⟩ from the plural ⟨(e)s⟩, and consequently most POS taggers for Present-Day English are trained to identify the ⟨’s⟩ morpheme as a separate unit. However, in the 15th century the genitive ending -es, which survived from the OE declension for singular nouns, was still very much in use. The apostrophe did not appear until the ⟨e⟩ was finally dropped, since in fact it was used to indicate this contraction (Cavella and Kernodle 2003: 2).
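One way of reconciling a faithful transcription with correct tokenization is to keep the scribal spacing in the transcription and add a separate retokenization layer before tagging. The sketch below is only an illustration of that idea: the two lookup tables are hypothetical and would have to be compiled by hand, and the function name is an assumption of the example. Each output item records both the unit to be tagged and the original token(s) it derives from, so that the scribal form is never lost.

```python
# Hypothetical tables, to be compiled by hand from the transcriptions.
SPLIT_WORDS = {
    ("be", "fore"): "before",
    ("with", "out"): "without",
    ("a", "boue"): "aboue",
}
FUSED_WORDS = {
    "adragme": ["a", "dragme"],
    "aman": ["a", "man"],
}


def retokenize(tokens):
    """Return (unit, source_tokens) pairs.

    Scribally split words are joined into one unit, fused sequences are
    divided into their parts, and everything else is passed through.
    """
    units = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SPLIT_WORDS:
            units.append((SPLIT_WORDS[pair], list(pair)))
            i += 2
        elif tokens[i] in FUSED_WORDS:
            for part in FUSED_WORDS[tokens[i]]:
                units.append((part, [tokens[i]]))
            i += 1
        else:
            units.append((tokens[i], [tokens[i]]))
            i += 1
    return units


print(retokenize(["be", "fore", "aman", "take", "adragme"]))
```

A table of this kind cannot decide by itself whether a given ⟨be fore⟩ is really the preposition or a genuine sequence of two words, so doubtful cases would still need to be checked by a human annotator.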
Last of all, Middle English included letterforms which are no longer extant in our contemporary alphabet: these include thorn ⟨þ⟩ and yogh ⟨ȝ⟩, together with their respective capital counterparts ⟨Þ⟩ and ⟨Ȝ⟩, and (since we work with semi-diplomatic transcription) we should probably include dotted ⟨ẏ⟩.³ So, the tagger will have to be trained to recognise these letterforms and, moreover, to interpret them as possible variants. Take, for instance, the following spellings of the definite article: ⟨þe⟩, ⟨the⟩ and ⟨ye⟩. The tagger will have to realise that all these different letterforms represent the same word and that they should accordingly acquire identical tags.

³ Note that for early ME texts at least ⟨ð⟩, ⟨æ⟩, ⟨Ð⟩ and ⟨Æ⟩ should be added to the inventory.

2.1.3 Word identification

Once all the tokens have been suitably established, the system must recognise them in order to be able to add their corresponding morphological tag(s). However, this is not a straightforward process, as not all the items will be recognised by the tagger. Dialectal variants, scribal errors, Roman numerals and terms belonging to other languages, mainly Latin, will be the main source of our problems.

A possible solution regarding dialectal variants and scribal errors would be to standardise and correct them, respectively. As we are dealing with semi-diplomatic transcription, this is naturally out of the question. The manual tagging process followed by the Corpus of Late Middle English Scientific Prose lemmatises the words according to the online version of the Middle English Dictionary (henceforth e-MED), but the lemmas are not the only information provided, since the original spellings are also maintained. So, a semi-automatic tagger for ME would need to be trained to recognise these variants and, furthermore, to identify them as belonging to the same lemma.
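The requirement that variants be linked to one lemma while the original spelling is preserved can be illustrated with a small lookup sketch. Everything in it is an assumption made for the example: the table entries, the lemma labels and the tag names are hypothetical, and in practice they would be derived from manually lemmatised texts. A spelling absent from the table comes back with no lemma and no tag.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical variant table: spelling -> (lemma, tag).
VARIANT_TABLE = {
    "þe": ("the", "DET"),
    "the": ("the", "DET"),
    "ye": ("the", "DET"),
    "peper": ("peper", "NOUN"),
    "piper": ("peper", "NOUN"),
}


@dataclass
class Entry:
    spelling: str          # original spelling, kept exactly as transcribed
    lemma: Optional[str]   # None when the variant is unknown
    tag: Optional[str]     # None when the variant is unknown


def annotate(token):
    """Look up a token case-insensitively and keep its original spelling."""
    lemma, tag = VARIANT_TABLE.get(token.lower(), (None, None))
    return Entry(spelling=token, lemma=lemma, tag=tag)


for form in ["Þe", "the", "ye", "piper", "grece"]:
    print(annotate(form))
```

A flat table of this kind treats every ⟨ye⟩ as the article, although the same spelling can also stand for the pronoun; resolving such cases requires the contextual disambiguation discussed elsewhere in this paper.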
However, the real problem would arise when variants never seen before appear in a text, as a tagger trained on a specific set would not be able to recognise them.

As for Roman numerals, they can easily be input into the system as indeed they already are in most Present day English taggers.

Concerning foreign terms (Latin, French, etc.), the basic problem is that we have a limited knowledge base. Our transcriptions, which have been manually annotated, have been lemmatised, as mentioned above, according to the entries recorded by the e-MED, but this source is of little use when it comes to such Latinate terms and other foreign words as were not considered borrowings by the editors of MED. Words not recorded in the e-MED have been tagged consulting other sources (see Moreno-Olalla & Miranda-García 2009: 137 for details). Furthermore, we can also rely sometimes on prior experiences, that is, if a particular word has already appeared in a previous text then we can tentatively assign it the same tag, but of course this practice is very limited. Moreover, if the tagger is solely trained on items that have already appeared then it is bound to encounter frequently new words which it is unable to identify.

All in all, these problems are tied to the same main concern, the existence of “unknown” words that will not be identified by the system. This is, in fact, a difficulty that automatic POS taggers designed for present day languages still have to contend with.

2.1.4 Tagging criteria

Before attempting to devise a semi-automatic and semi-intelligent POS tagging system a criterion must be established for the task. What information do we want our tagger to provide? And, how do we want this information organised? The manuscripts compiled in the Corpus of Late Middle English Scientific Prose have all been tagged manually according to the following criteria.
First of all, the transcriptions are loaded into a Microsoft Excel spreadsheet, so that all the words appear vertically ordered in the first column. Then, each word is annotated with its corresponding lemma and morphological information in the remaining columns. Every word is tagged with the same information: lemma, word class, accidence, folio, line, manuscript number and meaning. Each lemma will moreover appear with its specific word class attached in order to ensure disambiguation. The entries would look as presented in Fig. 9 below.

Our objective is to design a semi-automatic POS tagger that can provide the following information: (a) lemma (disambiguated according to its morphological category); (b) POS tag; and (c) accidence. In addition, we plan to offer some information on the dialectal provenance of each of the variants whenever this can be ascertained.

Moreover, our goal is not solely to provide tags at a simple word level. We also wish to take into consideration compound words, collocations and other phrases. So, our tagging system aims to perform POS tagging but also chunking to a certain extent, doubling up as a simple syntactic parser.

At word level, in virtually any text we will find both simple and compound nouns, and compounds such as ⟨enula campana⟩ and ⟨v levyd grase⟩ should be tagged as one token only. However, this will cause difficulties for the system, as the terms are divided in the transcription and the tagger will assume that they are separate items. For example, ⟨v levyd grase⟩ would be tagged independently as a numeral determiner (v), an adjective (levyd) and a noun (grase), when our objective is in fact to tag the whole chunk as one noun.

Fig. 9. System of tags

We also intend to identify common nouns and proper nouns, both simple, such as ⟨Galion⟩ (MS Hunter 497, f.
88v) and ⟨Eneas⟩ (MS Hunter 497, f. 90r), and compound ones such as ⟨kynges Rogere⟩ (MS Wellcome 397, f. 54r) and ⟨Christofer Rochester⟩ (MS Hunter 329, f. 30v). How will the automatic system recognise proper names? Normally, they appear capitalised in ME witnesses. However, we must be aware that we can have instances of proper names which appear entirely in lower-case letters, such as ⟨uirgil⟩ (MS Hunter 497, 90r) or ⟨ypocras⟩ (Hunter 497, 48r), and also instances of common nouns that appear capitalised in the middle of a sentence, as in ⟨Mynttis⟩ (MS Hunter 328, 56v) and ⟨Coriander⟩ (MS Hunter 328, 60v). Therefore, the system will not be able to rely on capitalisation in order to identify proper names.

Collocations and other phrases are divided into two types: (a) those wherein all the units remain together in a consecutive sequence, such as ⟨because of⟩, ⟨take from⟩, ⟨to and fro⟩ (Hunter 503), ⟨yn as moche as⟩ (Hunter 513a) or ⟨in respect of⟩ (Hunter 513a), among many others; and (b) those that appear divided, such as ⟨not only … but also⟩, ⟨whether … or⟩, ⟨if … than⟩ or ⟨neyther … ne⟩, among others. Below, we have examples of these “divided” phrases found within context:

not only puttyth oute sauerey hyr chylde whether yt be quyk or deed yf she ete sauerey. but also yf sauorey be under put to þe woman þat ys with chylde (MS Hunter 497, 28v)

whether yt be quyk or deed (Hunter 497, 28v)

if he parbrake malum significat . than serche þe wounde . & Chaffe þe brokyn bonys (Hunter 328, 64v)

Neyther þe rose coloure ne þe lylye may ouerpasse þe uiolet (Hunter 497, 15v)

These phrases should be tagged as a single item. However, how can we make the system recognise them as such? As we can see, we have the same problem that we find with compound nouns. Furthermore, phrases that appear divided, such as ⟨not only … but also⟩, pose even more of a challenge for the system since, as we have seen above, they can frequently be lines apart.
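The two types of phrase described above call for slightly different matching strategies, which the following sketch tries to make concrete. It is a minimal illustration, not a proposal for the final system: the phrase lists, the `max_gap` limit on the distance between the two halves of a divided phrase and the function names are all assumptions of the example. Contiguous phrases are matched as fixed token sequences, whereas divided phrases are matched as an opening part followed, within `max_gap` tokens, by a closing part.

```python
# Hypothetical phrase inventories, drawn from the examples in the text.
CONTIGUOUS = [
    ("because", "of"),
    ("to", "and", "fro"),
    ("yn", "as", "moche", "as"),
    ("in", "respect", "of"),
]
DISCONTINUOUS = [
    (("not", "only"), ("but", "also")),
    (("whether",), ("or",)),
    (("if",), ("than",)),
    (("neyther",), ("ne",)),
]


def find_from(tokens, part, start):
    """Return the index of the first occurrence of `part` at or after `start`."""
    n = len(part)
    for i in range(start, len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == tuple(part):
            return i
    return -1


def chunk(tokens, max_gap=40):
    """Return (label, spans) pairs; each span is a (start, end) index pair."""
    tokens = [t.lower() for t in tokens]
    chunks = []
    # Type (a): every occurrence of a fixed, consecutive sequence.
    for phrase in CONTIGUOUS:
        i = find_from(tokens, phrase, 0)
        while i != -1:
            chunks.append((" ".join(phrase), [(i, i + len(phrase))]))
            i = find_from(tokens, phrase, i + len(phrase))
    # Type (b): first occurrence of an opening part with a later closing part.
    for first, second in DISCONTINUOUS:
        i = find_from(tokens, first, 0)
        if i == -1:
            continue
        gap_start = i + len(first)
        j = find_from(tokens, second, gap_start)
        if j != -1 and j - gap_start <= max_gap:
            label = " ".join(first) + " ... " + " ".join(second)
            chunks.append((label, [(i, gap_start), (j, j + len(second))]))
    return chunks


line = "Neyther þe rose coloure ne þe lylye may ouerpasse þe uiolet"
print(chunk(line.split()))
```

Purely string-based matching of this kind will over-generate (not every ⟨or⟩ following a ⟨whether⟩ closes the same construction), so its output is better seen as a list of candidates for the disambiguation step than as a final analysis.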
2.1.5 Homonyms

Homonyms can only be disambiguated within context and, as a result, they are a source of problems for POS taggers, particularly since our target tagger also aims to provide the meaning of any given token. For example, PDE ⟨lap⟩ can denote both a circuit of a race track and the part of the body formed by the thighs when sitting down. In ME we also have homonymous words. The following sample pairs are entries taken from the e-MED:

lēchen, v (1) ‘to cut, slice’ vs. lēchen, v (2) ‘to cure, treat’
whītel, n (1) ‘a knife, dagger’ vs. whītel, n (2), ‘a blanket’

Note further that homonyms do not necessarily have to share the same POS: ⟨fair(e, n⟩ and ⟨fair(e, b⟩, both taken from the e-MED, are a noun meaning “an affair or business” and an adverb meaning “beautifully”, respectively.

2.1.6 Word order

Word order in Present Day English (henceforth PDE) is very fixed, which gives PDE automatic POS taggers a distinct advantage when it comes to item recognition. For instance, even if a word is unknown or ambiguous to the tagger, the system will frequently be able to tag the word successfully by analysing its immediate context, a task that is performed by many existing algorithms. On the other hand, in ME word order was not so rigid. For example, a number of adjectives can appear before or after the noun they are modifying, as in ⟨piper long⟩ (Hunter 328, f. 47v) or ⟨longe peper⟩ (Wellcome 397, f. 55r), which is problematic for devising a POS tagger. Or let us take the following phrase, “mellicratum cum piper long & Alys specibus” (Hunter 328, f. 47v), and imagine that the word ⟨long⟩ is unknown to our system. If our tagger is trained according to the “adjective + noun” pattern, which is the fixed pattern in PDE, then a phrase like this is bound to be tagged erroneously.
The context around the word ⟨long⟩ would be looked at, and as soon as the tagger realised that it had a noun to the left and the conjunction ⟨&⟩ to the right, the program would most likely assume (of course erroneously) that the unknown word is another noun, when it is in fact an adjective. We would then have to train the system to recognise different patterns for adjectives, and this would in turn cause ambiguity problems.

2.1.7 Punctuation

Tokenization relies heavily on sentential punctuation, as it allows a POS tagging system to recognise sentences and, as a result, to identify the items that compose them. However, ME punctuation poses a problem for this process due to its extreme haphazardness. The most common punctuation marks in ME include the period (.), the virgule (/) and the paragraph mark (¶), but, in any given witness, “the significance of a given mark varied almost as frequently as spelling did” (Petti 1977: 25). To make matters worse, “practice often differed from writer to writer” (Petti 1977: 25).⁴ Therefore, we cannot train our tagging system to recognise ME punctuation if there are no standard rules. For example, the virgule was frequently employed between words in the same way as our present-day comma, as in the following sequence: “Take turbite / clowes / armodactules / of eueryche x dragynes […]”. But it could also function as a full stop, indicating the end of a paragraph. In a similar vein, the period (.) could also function as a comma or as a full stop.

Another symbol found within our transcriptions is square brackets ([ ]). These have been added by the transcriber in order to re-introduce marginalia and other interlinear additions within the main body of the text, so they are not part of the original witness. However, as they are found within the transcription, and will therefore be fed into the system, they still have to be dealt with.
We have two options: (a) to delete them before introducing the transcription into the system; (b) to train the tagging system to ignore these symbols and solely acknowledge their contents, which will appear in superscript, as in “put it into a fayre vessel [of glasse]. & vse to drynk it often” (MS Hunter 328, f. 45v). Note incidentally that this example also illustrates the use of the period (.) as a present-day comma.

⁴ See Calle-Martín 2004, Calle-Martín and Miranda-García 2005 and Marqués-Aguado 2009 for further information regarding the nature of ME punctuation.

2.2 Computational limitations

In order to devise a successful semi-automatic POS tagging system for ME it is necessary to look at the existing systems and algorithms that are currently available for the task, irrespective of their target language, in order to establish those methods that are best suited to fulfil our needs as to the information we wish our tagger to provide.

2.2.1 Classification

POS taggers can be classified into four basic types: (a) rule-based; (b) stochastic or probabilistic; (c) hybrid; and (d) based on alignment and projection of parallel texts.

(a) Rule-based taggers

Rule-based taggers, such as TAGGIT, employ specific rules to eliminate ambiguity. These rules can be fed into the system by introducing a pretagged corpus, a “tagger dictionary” (Van Guilder 1995: 1), wherefrom the program can be trained, and by introducing rules (manually or automatically) by searching for generalised patterns of word order, the most frequent clause patterns including noun phrases, verb phrases and prepositional phrases. On the one hand, manually established rules require a high level of linguistic knowledge of the language in question and, moreover, a great deal of human effort, which is precisely what we are trying to reduce.
Furthermore, manually established rules limit the tagger to the language and/or domain for which it was designed in the first place, not allowing for accurate results when trying to tag texts of a different nature. On the other hand, automatic taggers learn the rules automatically, free of human intervention, through a training process on a previously tagged text.

(b) Stochastic or probabilistic taggers

Stochastic or probabilistic taggers, as their name implies, rely on probabilistic methods for disambiguation. Usually trained on a previously tagged text, although not necessarily (see Eric Brill 1995 for more information regarding this matter), the system chooses those tags with the highest frequency for the given word sequence. Simple stochastic taggers will assign tags relying solely on frequency, that is, an ambiguous word will be tagged depending on the most frequent tag it has throughout the trainer text, which can cause words to be tagged erroneously. To surmount this problem many different types of stochastic taggers have been devised, all employing different systems based on probability. Let us look at seven commonly used stochastic systems: Hidden Markov Models, Maximum Entropy Taggers, decision trees, sliding windows, Artificial Neural Networks, Support Vector Machines and memory-based learning.

Hidden Markov Models (henceforth HMM), used in taggers such as HunPos or TATOO, combine “tag sequence probabilities and word frequency measurements” (Altunyurt, Orhan and Güngör 2007: 66), that is, they tag a word by making assumptions based not only on the frequency of the tag in itself, but also on the frequency of the tag appearing with the previous tags. This way the context of the word is taken into account and a tag will be applied or not depending on its preceding and succeeding words. For instance, note the position of the word bathe in the following phrases found in MS Hunter 497: “a bathe of hem” (f.
6v) and “wasshe or bathe well þe heed” (f. 26r). Out of context bathe could be either a noun or a verb, so to resolve the ambiguity, an HMM would first of all calculate the probability of it being one or the other, depending on its frequency in the trainer text, and then it would look at the probability of these tags appearing with other tags. Since in “a bathe of hem” bathe is preceded by a determiner, the tagger will be able to correctly identify it as a noun, given that the structure determiner + noun is invariable across any text.

In turn, Maximum Entropy Taggers are statistical models based on mathematical formulas for automatic POS tagging. This model searches for the probability distribution of maximum entropy subject to given constraints and “combines diverse forms of contextual information in a principled manner, and does not impose any distributional assumptions on the training data” (Ratnaparkhi 1996: 133).

Decision Trees involve the creation of a decision tree generated from a previously tagged corpus that is used for the training process. The resulting tree will then be employed for the subsequent tagging of any text. Schmid claims that decision trees require a smaller training corpus than other methods, such as HMM, in order to obtain accurate results (Schmid 1994a: 48).

A Sliding Window is defined as “a system which assigns the part of speech of a word based on the information provided by a fixed window of words around it” (Sánchez-Villamil, Forcada and Carrasco 2004: 454), that is, although the window, or frame, is mobile in itself, the number of words that is scanned at a time by that sliding window is always fixed. Sánchez-Villamil et al. present a tool that allows for the system to be trained from a raw corpus, i.e.
unsupervised, without having to be tagged previously, and furthermore implement the tagger “exactly as a finite-state machine” (2004: 454).

Morphological tagging can also be modelled and resolved with Artificial Neural Networks (ANN). These networks “consist of a large number of simple processing units” which “are highly interconnected by directed weighted links”. Each unit has its own activation value, this activation being “propagated to other units” through the connections (all quotations from Schmid 1994b: 172). These networks can learn self-sufficiently by adapting the weight of their connections from a group of classified samples. This method has been claimed to “have shown performances comparable to that of Hidden Markov model systems or even better” (Lippmann 1989 cited in Schmid 1994b: 172).

The concept of Support Vector Machines (SVM) is defined by Pianta and Zanoli in the following manner: “Support Vector Machines are based on the Structural Risk Minimization strategy [7],5 which aims at finding a hypothesis H for which we can guarantee the lowest true error, that is the probability that H will make an error on an unseen and randomly selected test example” (Pianta & Zanoli 2007: 8). An SVM “performs classification tasks by constructing hyperplanes in a multidimensional space that separates cases of different class labels” (Electronic Statistics Textbook 2010). Murata, Ma and Isahara claim, however, that SVMs can implement the POS tagging learning process efficiently only when large corpora are not being used as supervised data (2001: 24).

5 From Vladimir N. Vapnik 1995: The Nature of Statistical Learning Theory. Springer.

Finally, the last stochastic method we will be dealing with is memory-based learning. This is a supervised inductive training method.
When tagging a text, both the tagged words of the training text and their context are stored in the system’s memory, and the words in our target text will be tagged according to their similarity with those words that have been stored in the memory. This system offers advantages such as the possibility of using a small tagged corpus for training, i.e. “incremental learning”, and it apparently yields “good results on unknown words without morphological analysis”, among others (Daelemans and Zavrel 1996: 25).

(c) Hybrid taggers

Hybrid taggers combine aspects of both stochastic and rule-based methods. Available hybrid taggers include, among others, systems based on transformation-based learning (henceforth TBL) such as the Brill Tagger,6 and those based on a combination of rules and HMM, such as CLAWS4. TBL “is an error-driven approach to induce the retagging rules from a training corpus” (Algahtani, Black and McNaught 2009: 67). These taggers are trained from previously tagged texts. This process, however, is carried out in two different stages. Firstly, the words in the sample text are tagged according to their most frequent tag, and secondly (in the rule-learning stage), the tagger applies a series of possible disambiguation rules and evaluates them whilst gauging their adequacy, which is expected to reduce the error rate. In this way the system can learn from its own errors and only learn those rules that are most effective. After the learning process has taken place, the texts are initially tagged without taking the rules into account; these are applied at a second stage to improve the first tagging.

(d) Taggers based on alignment and projection of parallel texts

This method, as seen above concerning the ME tagger developed at the University of Texas, relies on the existence of a text written in two different languages.
One of the texts is tagged, either manually or by means of any of the aforementioned automatic POS taggers, and then the other text is tagged by means of alignment and projection with the previously tagged text, thereby automatically acquiring the same tags. Note that for our present objective this method is unfeasible, as no contemporary tagged versions of the ME texts we are dealing with are currently in existence.

6 Note that TBL can be classified as a rule-based method (see Brill 1992) and also as a stochastic method (Charniak 1997). However, given that it combines aspects of both methods, this study has considered it to be of a hybrid nature.

2.2.2 Our beta tagger

Given the complexity of our objectives, the most adequate solution to fulfill our specific needs is to build a hybrid POS tagging system, wherein both rule-based and probabilistic methods are combined for the sake of higher accuracy. At present we have a beta version of our target ME POS tagger that is fully operative and is continuously being improved by testing different methods. However, the tagger is so far able to tag only items that have been previously trained into the system, and does not deal as yet with unknown tokens.

We distinguished above the three stages involved in the POS tagging process: (a) “tokenization”; (b) “ambiguity look-up”; and (c) “disambiguation”. Our current tagger is in the second phase of this process: it divides the text into tokens and then assigns to each of them all the tags that could possibly apply. Take the token ⟨powder⟩ for instance. Our tagger is already able to assign it the POS tags of noun and verb, given that it can function as both, as seen from the following examples: “tyll the powder be consumed” (Hunter 503, p. 35) vs. “and powder it and medyl it wyth suger captyn” (Hunter 503, p. 124).
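Purely as an illustration, the tokenization and ambiguity look-up stages just described might be sketched in Python as follows. The dictionary entries, the tag labels (NOUN, VERB, etc.) and the function names are hypothetical and are not taken from the actual tagger or from the Corpus of Late Middle English Scientific Prose; the sketch only assumes a tagger dictionary that maps each known token to the set of tags attested for it.

```python
import re

# Hypothetical toy tagger dictionary: token -> set of attested POS tags.
# Both the entries and the tag labels are assumptions made for illustration.
TAGGER_DICTIONARY = {
    "tyll": {"CONJ"},
    "the": {"DET"},
    "powder": {"NOUN", "VERB"},  # ambiguous: can function as noun or verb
    "be": {"VERB"},
    "consumed": {"VERB"},
    "and": {"CONJ"},
    "it": {"PRON"},
}


def tokenize(text):
    """Return lower-cased word tokens; spaces and punctuation are skipped."""
    # \w is Unicode-aware in Python 3, so ME characters such as þ are kept.
    return re.findall(r"\w+", text.lower())


def ambiguity_lookup(text, dictionary=TAGGER_DICTIONARY):
    """Pair every token with all candidate tags found in the dictionary.

    Unknown tokens receive an empty list; no disambiguation is attempted.
    """
    return [(token, sorted(dictionary.get(token, ()))) for token in tokenize(text)]


if __name__ == "__main__":
    for token, tags in ambiguity_lookup("tyll the powder be consumed"):
        print(token, tags)
```

Under these assumptions, ⟨powder⟩ in “tyll the powder be consumed” is paired with both the noun and the verb tag, whereas a token absent from the dictionary, such as ⟨medyl⟩, is paired with an empty list, so that both disambiguation and the treatment of unknown tokens are left to later stages.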
The next step is hence to achieve a successful disambiguation, as ambiguity remains the main computational challenge.

The central trainer for our system is our tagger dictionary, compiled from the morphological information contained in the manually tagged transcriptions belonging to the Corpus of Late Middle English Scientific Prose. When a text is fed into the system, it is first of all tokenized: tokens are acknowledged and supralinguistic elements such as spaces and punctuation marks are skipped for the moment. Then the system searches for these tokens in the database and those that coincide with words stored in the tagger dictionary are tagged automatically. Nothing really new here: this is how ordinary tokenizers work. However, in view of the aforementioned difficulties that the ME language poses for a successful tagging process, which cause so many words to remain unidentified, our system needs to be more complex. Therefore, the identification of the words that are unknown to the tagger becomes our second main computational challenge.

3 Solutions

3.1 Linguistic problems

Let us now move on to provide possible solutions to several of the aforementioned linguistic problems. Note that our linguistic problems are frequently solved by means of computational solutions. When a word is not identified automatically it will undergo several processes:

a) The first process surmounts our difficulties concerning line-final word division. Take, for instance, ⟨cle-pid⟩ or ⟨whi-ch⟩ when there is no hyphen indicating that these instances are just one word. The system performs the following procedure to identify them.
First of all, it searches for the first part (in the case of ⟨cle-pid⟩, it will search for ⟨cle⟩), and if it cannot find it, the system will automatically unite it with the following token and perform the search again, this time searching for ⟨clepid⟩; if found, it will be given the corresponding tag(s). Note that this solution only works for words which are divided into tokens that are unrecognisable to the system. For example, the word ⟨with out⟩, if divided, would fail to undergo this process, as ⟨with⟩ would be identified and tagged as a preposition and ⟨out⟩ would also be found and tagged as an adverb. To avoid this problem, before any word is tagged by default the system will look at the word that follows it, to see whether the two exist in conjunction in the tagger dictionary. This way, ⟨with out⟩ will be successfully tagged as one token. Note that whenever a longer form is available in the tagger dictionary it will be favoured and chosen as the valid tag even if individual tags are found for the shorter tokens.

b) The second process solves the instances in which two words appear united. Take ⟨aman⟩, a determiner plus a noun, which was one of the aforementioned examples. The system first searches for the word as a whole, and since it will not find it, it begins to break down the word into two parts, going through all the possible combinations and searching for both parts respectively in the tagger dictionary. Our example ⟨aman⟩ can theoretically be divided in three ways: a-man, am-an and ama-n. In this case, the first division is successful as the other two options are impossible. The system searches for ⟨a⟩ and then ⟨man⟩ and they acquire their respective tags, and what was once one token instantly becomes two.

c) For the system to recognise ME characters, such as ⟨þ⟩ and ⟨ȝ⟩, it is necessary to employ a character repertory that contains these characters.
This corpus complies with version 5.0 of the Unicode standard, as it is becoming the most widely used one in the world.

d) As for ME irregular capitalisation, identification will pose no problems if the proper names appear in the training text, as they will automatically receive their corresponding tag, regardless of whether they happen to appear capitalised in our target text or not: the system will invariably check the word we wish to tag both in upper-case and lower-case letters. That is, if our text presents a proper noun in lower-case that appears capitalised in our database, the system will first look for it in lower-case exactly as it is found in the text, and then search for it with a capital initial. Likewise, for a common noun that appears capitalised the system will search for its capitalised form and, when this is not found, a lower-case version will automatically be searched for. However, this method can be problematic when we are dealing with ME characters, as the Unicode standard can sometimes cause problems when converting lower-case into upper-case and conversely. Moreover, our problem still remains for unknown proper nouns. How can they be identified?

e) Finally, to surmount the lack of standardisation in ME punctuation we initially developed a system of symbols that were to be introduced at transcription level and which would allow the tagger to recognise sentences. Three symbols were employed: ⟨**⟩, ⟨@@⟩ and ⟨%⟩, each placed after the punctuation mark it qualifies. The two asterisks indicated that the system was to ignore the preceding punctuation mark. The two “at” symbols indicated that the mark we were dealing with should be acknowledged as valid also in PDE. Finally, the percent symbol indicated that the preceding mark of punctuation was added by the transcriber in order to provide a contemporary punctuation, irrespective of the value of the mark in the original MS.
An example of this system looks as follows: ⟨& make .** a plaster & ley it to þe dyssese .@@⟩. Here we are telling the program to ignore the first period, and acknowledge the last one. Using this system we could teach the POS tagger to identify sentences without disturbing the scribe’s original punctuation, as once the transcription has been fed into the system, the added symbols can be removed or made invisible.

A similar system of symbols was developed to lemmatise tokens that appeared divided, including compounds, collocations and other phrases. This was solved by placing ⟨ÑÑ⟩ in between the separated parts, this way telling the tagger that they were to be taken as one lemma. For example, ⟨withÑÑout⟩, ⟨enulaÑÑcampana⟩ and ⟨becauseÑÑof⟩. Furthermore, for those grammatical expressions that could appear lines apart (as in ⟨not only … but also⟩) not only would ⟨ÑÑ⟩ be used to unite each section, but arrows would also be added: > meaning that something follows; and < meaning that something precedes, as seen below. “notÑÑonly> puttyth oute sauerey hyr chylde whether yt be quyk or deed yf she ete sauerey.