01 Editorial.qxd


Ibérica 33 (2017): 279-288

ISSN: 1139-7241 / e-ISSN: 2340-2784

Abstract  

In this research note we report on the first results of  SCAP, the Spanish Corpus

Annotation Project, applied to tourism discourse. In particular, we present and

assess a new TreeTagger parameter set for Spanish (SCAP-TT), which has been

trained for the Part-of-Speech tagging (POS-tagging) and lemmatisation of

Spanish promotional tourism texts. Although SCAP-TT has been trained for

specialized tourism discourse, we also show promising results for the annotation

of  other text genres such as essays and literary texts.  

Keywords: POS-tagging, lemmatisation, Spanish, TreeTagger, tourism

discourse, SCAP.

Resumen  

S CA P- T T :  E l  e ti qu e ta do  g ra ma tic al  y  l a l e ma ti z ac ió n de l  d is c u rs o  tur ís t ic o

e s p a ño l ,  y  má s  a l l á   

En esta nota de investigación describimos los primeros resultados de SCAP, el

Spanish Corpus Annotation Project, aplicado al discurso turístico. Presentaremos y

evaluaremos una nueva versión para el español del etiquetador TreeTagger (SCAP-

TT), diseñado específicamente para el etiquetado y la lematización de textos

turísticos promocionales en español. Si bien SCAP-TT ha sido diseñado para el

discurso especializado del turismo, mostraremos también resultados muy

prometedores para la anotación de otros tipos de discursos, como ensayos y

textos literarios.  

Palabras clave: etiquetado gramatical, lematización, español, TreeTagger,

discurso turístico, SCAP-tur. 

SCAP-TT: Tagging and lemmatising

Spanish tourism discourse, and beyond  

Patrick Goethals, Els Lefever and Lieve Macken  

Ghent University (Belgium)  
patrick.goethals@ugent.be, Els.Lefever@ugent.be &
Lieve.Macken@ugent.be  

279


Ibérica 33 (2017): 279-288

P. GOEThAlS, E. lEFEvER & l. MACkEn

1. Introduction  

This research was motivated by two observations. The first of  them is that in

Spanish specialized discourse corpus compilation projects, POS- and lemma-

annotation are not yet self-evident features. Corpora often consist of  raw

text, allowing for word form-based queries, but not for more abstract POS-

or lemma-based queries. Regarding Spanish tourism discourse, for example,

the two main corpus projects, Linguaturismo (http:// www.linguaturismo.it)

and Cometval (http://www.uv.es/cometval) do not (yet) contain linguistic

annotations. This observation is not intended as a criticism towards these

specific projects, but rather as one example of  a broader dichotomy between

current practices in corpus and computational linguistics. 

The second observation is related to TreeTagger1 (TT, Schmid, 1994, 1995).

TT is a tool for automatic POS-tagging and lemmatisation which predicts

the most probable POS-tag for each word taking into account its inherent

formal characteristics and the surrounding POS-context. TT can be run

using the built-in parameters, but it also offers a training tool to generate new

parameter sets, which means that it can be adapted and improved depending

on the specific needs of  a corpus project. Although the main architecture is

language-independent, the output quality varies according to the language,

since the tool depends on language-specific input, such as a lexicon, a tag set,

a list of  multi-word items or a training corpus (for technical details, see

Schmid, 1994, 1995). It is generally accepted that the results for the Spanish

TreeTagger are not as good as for English, for example (Göhring, 2009).

Moreover, it should be noted that the adaptiveness of  TreeTagger appears to

be underused, at least for Spanish, since there are no newly trained and

publicly available parameter sets for Spanish.

Taking into consideration these observations, our aim is to use the inherent

adaptiveness of  TreeTagger and to make an improved parameter set for

Spanish. In order to stimulate the development of  annotated corpora, the

parameter set is made available at the project’s website (www.scap.ugent.be).

At the same website, readers will find further technical information, as well

as advanced tools and automated applications for further processing the TT-

output. In what follows, we will first briefly discuss the performance of  the

current Spanish TreeTagger parameter set (Standard-TT). Then, we will

describe the main decisions that were taken in the development of  a new

parameter set (SCAP-TT), and compare the results of  SCAP-TT with

Standard-TT. Finally, it is important to emphasize that in this research note,

280


we will not compare the results of  TreeTagger with those of  other tagging

tools, such as IULA (Martínez et al., 2010), GRAMPAL (Moreno & Goni,

1995) or FREELING (Carreras et al., 2004) (see e.g. Parra & Martínez, 2015

for a recent comparison).  

2. T r e e T a gg e r

At least three features determine the success of  a tagging tool among corpus

linguists: its user-friendliness, the accuracy of  the output and the granularity

(or level of  detail) of  the coding categories. As has been argued by several

authors (e.g. durán, 2010; Argüelles & Muñoz, 2012; Crespo & Frías, 2015),

TreeTagger scores high on user-friendliness. It is free and easily downloadable,

it can be run locally, it can be used with a high number of  languages and it

does not require advanced technical skills, especially since the publication of

graphic interfaces by duibhín2 and Anthony (TagAnt3). Although some of

these arguments might seem trivial for computational linguists, they are

essential for corpus linguists, and the fact that TreeTagger scores high on user-

friendliness basically guarantees that it will continue to be used in the future. 

however, with respect to accuracy, the results of  the current Spanish

Standard-TT are sub-optimal. Reports vary between 90% and 95% accuracy,

depending on the text type and on how accuracy is measured. In a recent

comparison of  several tools (TreeTagger, IULA, Freeling and IXA), Parra and

Martínez (2015) conclude that Standard-TT ends third out of  four both for

POS-tagging and lemmatisation. Also with respect to granularity, TT does

not score optimally, mainly because it does not include inflectional

information.

In this contribution, the focus is on improving the accuracy of  both POS-

tagging and lemmatisation. In a later stage, when more training data will be

available, we will attempt to also improve the granularity of  the coding scheme. 

3. Design of  the new parameter set  

In what follows we describe the main steps taken to generate a new

parameter set for Spanish. This information is not exhaustive: detailed and

explicit coding guidelines can be found at the project’s website

(www.scap.ugent.be).  

SCAP-TT: TAGGInG And lEMMATISInG SPAnISh TOuRISM dISCOuRSE 

Ibérica 33 (2017): 279-288 281


3.1. Tokenization and multi-word units 

A first decision concerns an optional TT-feature that affects the definition

of  multi-word units. TreeTagger defines multi-word units (e.g. “sobre~todo”)

during the tokenisation process, on the basis of  a closed list (approx. 300

items in Standard-TT). This procedure presents some inherent weaknesses.

First, the built-in list seems relatively arbitrary, including for example

“tal~vez”, “a~veces”, and “San~Pablo” but not “a~lo~mejor”,

“de~vez~en~cuando” or “Santa~María”. The end user can partly overcome

this problem by adding or omitting items. The second weakness, however, is

more problematic and is caused by the fact that multi-words are defined

before the contextual analysis takes place (i.e. during the tokenisation

process). This means that all instances of  these word sequences are a priori

defined as multi-words. In a sentence such as “sobre todo el territorio”,

“sobre todo” would be analysed as a multi-word unit, just like in “es, sobre

todo, muy inteligente”. Therefore, we took the decision not to use a list of

multi-word items. The end-user of  TreeTagger is still free to use (and modify)

such a list, but we do not recommend this, especially not with SCAP-TT,

since it is not trained this way.  

3.2. POS tag set: Accuracy and granularity 

The total tag set increased from 77 tags in Standard-TT to 110 in SCAP-TT.

Although the number of  tags is sometimes used as an indicator of

granularity, our main reason for remodelling some aspects of  the tag set was

to improve the accuracy. SCAP-TT scores at least as good, and perhaps

slightly better, on granularity than Standard-TT, but it still lacks inflectional

information.

In what follows, we discuss three decisions to illustrate the overall

philosophy. As said before, a detailed comparison between the two tag sets

can be found at the project’s website (www.scap.ugent.be).  

3.2.1. Past participles 

The classification of  adjectival and verbal uses of  past participles is a well-

known problem, which is handled differently by different taggers (Parra &

Martínez, 2015). The distinction causes frequent errors (e.g. when

“adecuada” in “una medida adecuada” is analysed as a verb and lemmatised

as “adecuar”), and infelicitous decisions (e.g. when “pasado” receives the

P. GOEThAlS, E. lEFEvER & l. MACkEn

Ibérica 33 (2017): 279-288282


same label in “los tiempos pasados” and in “ha pasado algo”). The

consequence of  this problematic labeling is that both categories of

adjectives and verbal participles are contaminated and should be entirely

corrected during post-edition, which is very time-consuming. As a pragmatic

solution, Parra and Martínez (2015) decide to classify all past participles as

verbal forms, and a similar option is taken in the EAGLES tagset.4 In fact,

the problem is also conceptual, since it seems artificial to divide what is a

continuum ranging from prototypical adjectival uses to prototypical verbal

uses in only two categories. Given both the pragmatic and conceptual

arguments, we decided to create three categories:  

(i) ADJ, for adjectives not derived from a participle;  

(ii) VPART, for participles in their most prototypical verbal

contexts, namely in combination with “haber” and in passive

constructions with “ser”;  

(iii) ADJV, for adjectives derived from a participle that do not occur

in combination with “haber” and passive “ser”.  

As a result, the categories AdJ and vPART are far more homogeneous, and

the end user can choose (i) whether it is necessary to check the intermediate

category of  AdJv or (ii) to add the whole category of  AdJv to AdJ or to

vPART.  

3.2.2. Enclitics  

TreeTagger does not separate verbal forms and enclitics, but assigns a specific

POS-tag to verbs containing one or several clitics. In SCAP-TT, this option

is further refined, assigning different labels to the four possibilities: 

(i) SE when the verb is followed by “se”(“arrepentirse”);  

(ii) CLI when the verb is followed by one pronoun (“comprarlo”); 

(iii) SECLI when the verb is followed by “se” and a pronoun

(“decírselo”);  

(iv) CLI2 when the verb is followed by two pronouns

(“comprármelo”).  

SCAP-TT: TAGGInG And lEMMATISInG SPAnISh TOuRISM dISCOuRSE 

Ibérica 33 (2017): 279-288 283


3.2.3. Articles  

In contrast with Standard-TT, which has only one label for articles, SCAP-TT

distinguishes between definite, indefinite and neutral (“lo más importante”)

articles.  

3.3. Lexicon  

SCAP-TT uses a lexicon of  almost 79,000 lemmas and 670,000 word form

– POS tag pairs. This lexicon combines various sources, but takes as a

starting point the stemming list made available by Boleslav Měchura5
(containing some 490,000 word form – lemma pairs). The latter list lacked

important information: it did not include invariable forms (e.g. adverbs),

word forms that coincide with the lemma, verbal forms with enclitics, and,

most importantly, information on the POS-category of  the word forms. In

the SCAP lexicon all this information was added.6

3.4. Training corpus  

The SCAP parameter set was trained on a manually annotated 200,000 word

corpus containing two types of  descriptive-promotional tourism texts: digital

descriptions of  tourism attractions published in Minube, an online 2.0 travel

guide, and brochures published by TurEspaña, the national tourism agency. 

4. Testing and results  

SCAP-TT was tested on three 5,000 words-corpora. The first testing corpus

belongs to the same series of  tourism brochures as those used in the training

corpus. There is, however, no textual overlap between testing and training

data. Additionally, we considered two other 5,000 words excerpts from the

essay Las venas abiertas (E. Galeano), and from the novel El Club Dumas (A.

Pérez Reverte). The testing data allow us to assess the added value of  SCAP-

TT, both in the specialized context for which it is trained, and in other

discourse domains.

Two preliminary observations should be made. First, it is important to note

that we applied an assessment procedure and not a gold standard procedure,

which means that we evaluate accuracy with respect to the internal logic of

the parameter set, and that tags are only counted as errors when they do not

correspond to what is expected from the tag set definitions. This avoids that

P. GOEThAlS, E. lEFEvER & l. MACkEn

Ibérica 33 (2017): 279-288284


the error analysis gets biased by judgments about the felicitousness of

specific coding decisions. For example, “nadie” is systematically tagged by

Standard-TT as a Quantifier, while it is a Pronoun in SCAP-TT. Although we

believe that the second label is more appropriate than the first one, we did

not count this as an error in the Standard-TT output. On the other hand,

when a participle following “haber” is tagged as vladj in Standard-TT and as

AdJv in SCAP-TT, this is considered correct in the former case, but

erroneous in the latter one, because in SCAP-TT we would expect to find

vlPART. Secondly, we will take into account the effect of  using the optional

TT-feature of  “Capitalization heuristics”. using this heuristics means that

the tagger seeks unknown capitalized words in the lower-case lemma list,

which may affect the ratio of  proper nouns in the tagging output, and also

the type of  errors that occur. Therefore, errors are subdivided in categories:

(a) erroneous POS-tags excluding proper nouns, (b) items which receive a

correct POS-tag but are not appropriately lemmatised, again excluding

proper nouns, (c) proper nouns which were not recognized as such, and (d)

items which were incorrectly labelled as proper nouns. Two totals are

provided: one excluding the proper nouns, and one overall total.  

The tourism testing corpus clearly shows that SCAP-TT improves POS-

tagging and lemmatisation substantially, especially when proper nouns are

excluded. The number of  errors is even reduced with more than 80% (30 vs.

215 or 37 vs. 271). In addition, not using the Capitalization heuristics reduces

the number of  missed proper nouns drastically (27 vs. 135), although it also

slightly increases the number of  general POS-errors (27 vs. 17) and the

number of  false proper nouns (11 vs. 5). These are important results, but the

question arises whether the improvements are only due to the specialized

training modalities. Yet, the results for the other testing corpora show that

SCAP-TT yields better results in literary texts and essays as well, although

the benefit is less pronounced than in the tourism domain. For example,

considering the <–Cap heuristics> modus, the total number of  accuracy

fails is reduced from 7,2% to 1,5% in the tourism testing corpus, from 6,2%

to 2,8% in the literary corpus and from 4,1% to 2,7% in the essay corpus.

We also notice that the most significant improvements in the literary and

essay corpus concern lemmatisation, and that the use of  capitalization

heuristics leads to considerable shifts in the results of  SCAP-TT. Based on

these results, the best strategy is to combine both outputs by replacing in the

<+Cap. heuristics> output those POS-labels in which the – capitalization

heuristics tags a proper noun.

SCAP-TT: TAGGInG And lEMMATISInG SPAnISh TOuRISM dISCOuRSE 

Ibérica 33 (2017): 279-288 285


5. Conclusion

We have shown that SCAP-TT considerably improves the tagging and

lemmatisation results of  the current Spanish TreeTagger, especially but not

P. GOEThAlS, E. lEFEvER & l. MACkEn

Ibérica 33 (2017): 279-288286

SCAP_TUR: TAGGING AND LEMMATISING SPANISH TOURISM DISCOURSE 

Ibérica 33 (2017): …-… 

 Standard-TT SCAP-TT 

 
+ Cap. 
Heuristics 

- Cap. 
Heuristics 

+ Cap. 
Heuristics 

- Cap. 
Heuristics 

Test 1: Tourism Brochure     

(a) POS-error (no proper nouns) 
133 

(2.7%) 
134 

(2.7%) 
17 

(0.3%) 
27 

(0.5%) 
(b) correct POS but erroneous 
or unknown lemma (no proper 
nouns) 

82 
(1.6%) 

83 
(1.7%) 

13 
(0.3%) 

10 
(0.2%) 

(c) missed proper nouns 
159 

(3.2%) 
141 

(2.8%) 
135 

(2.7%) 
27 

(0.5%) 

(d) false proper nouns 
3 

(0.1%) 
3 

(0.1%) 
5 

(0.1%) 
11 

(0.2%) 

TOTAL (no proper nouns) 
215 

(4.3%) 
271 

(4.3%) 
30 

(0.6%) 
37 

(0.7%) 

TOTAL 
377 

(7.5%) 
361 

(7.2%) 
170 

(3.4%) 
75 

(1.5%) 

Test 2: Literary Prose     

(a) POS-error (no proper nouns) 
132 

(2.6%) 
133 

(2.7%) 
58 

(1.2%) 
102 

(2.0%) 
(b) correct POS but erroneous 
or unknown lemma (no proper 
nouns) 

107 
(2.1%) 

108 
(2.2%) 

5 
(0.1%) 

5 
(0.1%) 

(c) missed proper nouns 
72 

(1.4%) 
68 

(1.4%) 
76 

(1.5%) 
29 

(0.6%) 

(d) false proper nouns 
2 

(0.0%) 
2 

(0.0%) 
2 

(0.0%) 
6 

(0.1%) 

TOTAL (no proper nouns) 
239 

(4.8%) 
241 

(4.8%) 
63 

(1.3%) 
107 

(2.1%) 

TOTAL 
313 

(6.3%) 
311 

(6.2%) 
141 

(2.8%) 
142 

(2.8%) 

Test 3: Essay     

(a) POS-error (no proper nouns) 
88 

(1.8%) 
91 

(1.8%) 
46 

(0.9%) 
54 

(1.1%) 
(b) correct POS but erroneous 
or unknown lemma (no proper 
nouns) 

73 
(1.5%) 

74 
(1.5%) 

7 
(0.1%) 

7 
(0.1%) 

(c) missed proper nouns 
68 

(1.4%) 
35 

(0.7%) 
107 

(2.1%) 
21 

(0.4%) 

(d) false proper nouns 
1 

(0.0%) 
4 

(0.1%) 
5 

(0.1%) 
53 

(1.1%) 

TOTAL (no proper nouns) 
161 

(3.2%) 
165 

(3.3%) 
53 

(1.1%) 
61 

(1.2%) 

TOTAL 
230 

(4.6%) 
204 

(4.1%) 
165 

(3.3%) 
135 

(2.7%) 

Table 1. Testing results of Standard-TT and SCAP-TT for three corpora. . 


exclusively for tourism discourse. We believe that this is an important

contribution since it may reinforce the use of  an already well accessible and

well-known tool and, as such, contribute to integrating POS-tagging and

lemmatisation into the current practice of  Spanish corpus researchers.

unsurprisingly, we have also found that the new tagger gives the best results

for the specific discourse domain for which it is trained. 

Acknowledgements

We wish to thank h. Schmid for the very helpful and quick answers to our

practical questions regarding the use of  the TreeTagger Training Tool. .

Article history:
Received 5 May 2016

Received in revised form 6 September 2016
Accepted 9 September 2016 

References

SCAP-TT: TAGGInG And lEMMATISInG SPAnISh TOuRISM dISCOuRSE 

Ibérica 33 (2017): 279-288 287

Argüelles Álvarez, I. & A. Muñoz Muñoz (2012).
“An insight into Twitter: A corpus based contrastive
study in English and Spanish”. Revista de
Lingüística y Lenguas Aplicadas 7: 37-50.

Carreras, X., I. Chao, L. Padró & M. Padró (2004).
“FreeLing: An open-source suite of language
analyzers” in Proceedings of _The Fourth
International Conference on Language Resources

and Evaluation, LREC 2014, 239-242. Lisbon:
European Language Resources Association
(ELRA). URL: http://www.lrec-conf.org/
proceedings/lrec2004/pdf/271.pdf [xx/xx/xxxx]

Crespo, M. & A. Frías (2015). “Stylistic authorship
comparison and attribution of Spanish news forum
messages based on the TreeTagger POS tagger”.
Procedia-Social and Behavioral Sciences 212:
198-204.

Durán Muñoz, I. (2010). “A corpus-based
ontoterminological tool for tourist translations”.
International Journal of Translation 22: 149-165.

Göhring, A. (2009). Spanish Expansion of a
Parallel Treebank. MaThesis. University of Zurich.

Martínez, H., J. Vivaldi & M. Villegas (2010). “Text

handling as a Web Service for the IULA processing
pipeline”. In Proceedings of LREC 2010:
Workshop on Web Services and Processing

Pipelines in HLT: Tool Evaluation, LR Production

and Validation, 22-29. Paris: European Language
Resources Association (ELRA). URL:
http://www.lrec-conf.org/proceedings/lrec2010/
workshops/W3.pdf [xx/xx/xxxx]

Moreno, A. & J.M. Goni (1995). “GRAMPAL: A
morphological processor for Spanish implemented
in PROLOG” in arXiv preprint cmp-lg/9507004.

Parra Escartín, C. & H. Martínez Alonso (2015).
“Choosing a Spanish Part-of-Speech tagger for a
lexically sensitive task”. Procesamiento del
Lenguaje Natural 54: 29-36.

Schmid, H. (1994). “Probabilistic part-of-speech
tagging using decision trees” in Proceedings of
The International Conference on New Methods in

Language Processing, 44-49.

Schmid, H. (1995). “Improvements in part-of-
speech tagging with an application to German” in
Proceedings of The ACL SIGDAT-Workshop, 1-9. 


Patrick Goethals is Associate Professor at the department of  Translation,

Interpreting and Communication, Ghent university (Belgium). his main

research interests are Spanish linguistics, corpus-based translation studies,

multilingual communication, and tourism communication. he has published

several articles in international journals such as Journal of  Pragmatics, Meta,

Linguistics, Ibérica and Multilingua. 

Els Lefever is Assistant Professor at the lT3 language and translation

technology team at Ghent university. She has a strong expertise in machine

learning of  natural language and multilingual natural language processing,

with a special interest for computational semantics, cross-lingual word sense

disambiguation and multilingual terminology extraction. She teaches

Terminology and Translation Technology, language Technology and digital

humanities courses. 

Lieve Macken is Assistant Professor at Ghent university with strong

expertise in multilingual language processing. Research interests are

computer-assisted translation, terminology extraction, human-computer

interaction in translation and machine translation. She is the operational

head of  the language technology section of  the department, where she also

teaches Translation Technology, Machine Translation, localisation and

Technical Translation. 

NOTES 

1 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/. 

2 http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm. 

3 http://www.laurenceanthony.net/software.html. 

4 http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html. 

5 http://www.lexiconista.com/datasets/lemmatization/.

P. GOEThAlS, E. lEFEvER & l. MACkEn

Ibérica 33 (2017): 279-288288