INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL ISSN 1841-9836, 12(2):217-226, April 2017. Detecting Bridge Anaphora D. Gîfu, L.I. Cioca Daniela Gîfu* 1. Romanian Academy - Iaşi branch Codrescu, 2 Iaşi, 700481, Romania 2. "Alexandru Ioan Cuza" University of Iaşi General Berthelot St., 16, Iaşi, 700483, România "Alexandru Ioan Cuza" University of Iaşi, România *Corresponding author: daniela.gifu@info.uaic.ro Lucian-Ionel Cioca "Lucian Blaga" University of Sibiu 10, Victoriei Bd., Sibiu, 550024, România lucian.cioca@ulbsibiu.ro Abstract: The paper presents one of most important issues in natural language processing (NLP), namely the automated recognition of semantic relations (in this case, bridge anaphora). In this sense, we propose to recognize automatically, as ac- curately as possible, this type of relations in a literary corpus (the novel Quo Vadis), knowing that the diversity and complexity of relations between entities is impressive. Furthermore, we defined and classified the bridge anaphora type relations based on annotation conventions. In order to achieve the main goal, we developed a com- putational instrument, BAT (Bridge Anaphora Tool), currently still in a test (and implicitly an improvable) version. This study is intended to help especially specialists and researchers in the field of natural language processing, linguists, but not only. Keywords: bridge anaphora, annotated novel, Bridge Anaphora Tool, testing corpus, corpus-driven, statistics. 1 Introduction The novelty of this study consists in the development of a web application for the automated identification of bridge anaphora type relations in a corpus from the literary area. In this case, the target is the Romanian version of the novel Quo Vadis, authored by the Nobel laureate Henryk Sienkiewicz [24]. Initially, a similar study carried out by the same team consisted in the supervised extraction of Bridge Anaphora type relations, using WEKA statistics [8]. Moreover, there was defined a set of annotation conventions for 11 bridge relations as a result of manual annotations made by a team of trained students in Computational Linguistics. The hypothesis of this paper is that the triggers have a fundamental role in the automated recognition of semantic relations generally and particularly of bridge anaphora relations. The paper is structured in 5 sections. After a brief introduction about the importance of this study, section 2 mentions some important works focused on bridging anaphora. Section 3 describes bridge anaphora relations in the context of semantic relations and section 4 describes a new tool functionality, called BAT (Bridge Anpahora Tool). The last section highlights con- clusions and mentions the future intentions, one of the main projects of Romanian researchers in NLP. Copyright © 2006-2017 by CCC Publications 218 D. Gîfu, L.I. Cioca 2 State of the art In our context, the semantic relations play a fundamental role in the information extraction process [1], regardless of the nature of the corpus [2, 9, 10]. Up to now, researchers in the NLP area have allocated a lot of time to identifying the best annotation conventions of semantic relations for various literary types [3, 4, 12] based on which the process of automated recognition of semantic relations was not only simplified, but the accuracy of results increased as well, for example in their unsupervised extraction [7]. One of the best known studies on the "bridging" concept originates with H.C. Herbert [11]. He starts from several scenarios in which an inference step is needed in order to understand the sense intended by the speaker and he states that the text itself does not offer the solution for solving the inference relation; the reader (or the computational instrument/machine) must use his/its knowledge on the anaphora and the antecedent in order to make a correct text interpretation. In the automated recognition of semantic relations, a special attention is granted to the anaphora resolution [23, 25], using statistic models [19, 22], something that we too exploited. NLP uses for recognizing entities and identifying relations in the text (bridging) systems based on manually created rules (see Hobb’s algorithm) [15], but also systems using statistical models that are in turn based on automated learning techniques in order to lessen the workload, models such as Conditional Random Fields (CRF) [26]. 3 Bridge anaphora in semantic relations context In order to better understand a content, we need thinking instruments, necessary for dis- covering new ideas or for clarifying the existing ones, illustrating the link between them. The semantic relations [16] describe these interactions, that are indispensable for interpreting texts. The properties of semantic relations were described in [17], this marking the relations between two entities (called poles) as open class. The application describes 10 types of bridge anaphora1. 3.1. A short introduction about semantic relations The semantic relations are represented as being distributions over several paragraphs [18]. In processing the natural language, the semantic relations play a fundamental role in the field of Information Extraction (IE), that targets the automated extraction of structured information referring to entities such as person names, localities etc. from semi-structured or unstructured texts. The ability to identify and understand these relations in a text can b useful in very many directions, such as: Machine Translation - MT; Computer Assisted Assessment - CAA; Clustering and so on. In order to create an instrument that can carry out, for example the automated translation, the interpretation of anaphora is also very important, especially in cases in which the translation is from a language in which the pronouns have different forms for each gender, into a language in which the pronoun has the same form regardless of gender [15, 22]. 1Class-of - relation between PERSON-CLASS & PERSON; Has-as-member - relation between PERSON- GROUP & PERSON; Has-as-part - relation between PERSON & PERSON-PART; Has-as-subgroup - relation between PERSON-GROUP & PERSON-GROUP; Has-name - relation between PERSON & PERSON-NAME; ISA - relation between PERSON & PERSON-CLASS; Member-of - relation between PERSON & PERSON- GROUP; Name-of - relation between PERSON-NAME & PERSON; Part-of - relation between PERSON-PART & PERSON; Subgroup-of - relation between PERSON-GROUP & PERSON-GROUP. Detecting Bridge Anaphora 219 3.2. About bridge anaphora Bridge anaphora [8] are referential semantic relations (beneath the co-referential or anaphoral ones) [5, 6] that includ linguistic expressions that give meaning to the analysed text (here, the narrative "thread" of the novel). Our documentation shows that the analysis of semantic relations is focused on structured corpuses such as: online newspapers, blogs, Wikipedia texts etc. [1]. A bridge anafora or "bridging" is a semantic relation that represents a link between the anaphora and the antecedent [11, 12]. These two elements will be mentioned in the following also as poles of a bridge-type semantic relation. In the next section we present the 10 types of bridge anaphora relations based on which the BAT was developed. An example of bridge-type semantic relation: Andrei este numit în diferite cercuri micuţul, din cauza înălţimii. —>(En.) Andrei is called in different circles the little guy, because his height. where: • Andrei is an antecedent; • micuţul —>(En.) the little guy is an anafor. 3.3. Bridge anaphora vs. anaphora A bridge anaphora type relation differs from an anaphorical relation firstly by the fact that it can be identified in the text using a trigger. This trigger can be a word or a group of words that has the property of indicating the presence in the text of the bridge anaphora relation, helping to identify it. In the following, we will exemplify the anaphorical relation and the bridge anaphora type relation in oder to clarify the difference between the two relations, both being referential type relations: - Anaphorical relation (coreferential) 1:[Marcus] era foarte supărat pentru toate cele întâmplate în ultima perioadă, însă 2:[el] nu avea de gând să renunţe. —>(En.) 1:[Marcus] was very upset about what happened lately, but 2:[he] was not going to give up. =>[2] anaphorical relation [1]; - Bridge anaphora type relation (below, the type class-of2) Cândva, 1:[Petronius] fusese guvernator în 2:[Bitinia]... —>(En.) Sometime 1:[Petronius] was governor in 2:[Bithynia]... =>[1] bridge anaphora type relation [2], while governor in is the trigger for this relation. This is a segmentation annotation in XML standoff format: Cândva , 2Class-of - is a bridge anaphora type relation linking a PERSON-CLASS type concept to a PERSON type instance. 220 D. Gîfu, L.I. Cioca Petronius fusese guvernator în Bitinia The anaphorical relations are a widely debated subject [12, 13, 14], proven by numerous specialty papers that present computational instruments for the automated identification of these relations, especially for the pronominal anaphora [15, 22]. This type of relation is much easier to identify in the text, as opposed to a bridge anaphora type semantic relation, because both poles of the the relation, the anaphora and the antecedent refer to the same entity [20]. In order to be able to automatically identify bridge type anaphorical relations, there is necessary a more complex mechanism, that would carry aut in a first phase a preprocessing of the text for its de-ambiguization that consists in segmentation, tokenization, lemmatization, part-of-speech tagging, name entity recognition, and anaphora resolution. 4 BAT - description Bridge Anaphora Tool is a computational instrument implemented in Java language, on the framework Java Server Faces and uses a series of libraries3. BAT is created for the automated recognition of bridge type semantic relations, more precisely of the 10 types of referential relations for which annotation conventions have been determined. The output XML file was used in the process of training and testing. We chose the novel Quo Vadis [24], given that it is a corpus translated into more than 40 de languages, having an impressive number of entities and semantic relations. Using the instrument PALinkA [21] 3see http://primefaces.org/ Detecting Bridge Anaphora 221 the entities and semantic relations were annotated manually. The annotator was already used successfully for annotating the novel Quo Vadis, a work presented in [3]. This web application (fig. 1) executes in a first phase the training process after which the automated recognition can be initiated. Figure 1: The interface of the computational tool In the following, we describe briefly the work methodology. For the trening process, following steps were conceived: - The option "YES" is selected for the relations that will be included in the training; Figure 2: BAT - selecting relations for training - The XML file is loaded from the application, using the button "Train relations", the XML is selected (the manually annotated corpus) after which the button "Process file" is pressed. 222 D. Gîfu, L.I. Cioca Figure 3: BAT - menu for introducing the corpus and for trening relations When the BAT identifies the tag , it carries out four steps: - it saves in the MySQL database the type of relation (it can be one of the 10 "class-of", "has-as-member", etc.) in the table "referential_type"; Figure 4: BAT - the table "referential_type" generated after the training - it saves the type of entity from the identified relation (in the example above we have TYPE="PERSON") in the table "entity_type"; Figure 5: BAT - the table "entity_type" generated after the training - it saves the word/words from the tag specific to the relation in the table "entity_words"; if there are two or more words, they are concatenated with the symbol "|"; Detecting Bridge Anaphora 223 Figure 6: BAT - the table "entity_words" generated after training - it saves the actual structure of a bridge type semantic relation, i.e. the "TYPE" and the words forming it, in XML they being identifiable with the elements "ID", "FROM" and "TO". Figure 7: BAT - the table "referential_entity" generated after training The processing of the XML file in the training phase of the BAT for one or more bridge type semantic relations can take from one minute to several hours, function of the number of relations existing in the annotated corpus. For the "part-of" relation, the training took 2.67 hours, being the most often encountered in the XML file, with a number of 1612 relations. 5 Statistics and interpretations Bridge Anaphora Tool used for training 66% of the corpus of Quo Vadis. Mitkov (1998) suggested, for measuring the performance of a computational instrument aim- ing at identifying anaphorical relations in the text, an equation that defines its success rate. The definition of the success rate is as follows: BAT success rate = 534 correctly identified relations / 1035 total existing relations = 61.5%. So at this moment, the BAT recognized correctly over 61% of the bridge anaphora type semantic relations that should have been identifying in the text, thus fulfilling the set goal. 224 D. Gîfu, L.I. Cioca Table 1: The results of recognizing the bridge type semantic relations with BAT Bridge Bridge Anaphora Bridge Anaphora Bridge Anaphora Anaphora number identified number identified number identified types with BAT in automatically in manually in driven corpus testing corpus testing corpus class-of 189 58 28 has-as-member 347 115 82 has-as-part 109 31 12 has-as-subgroup 176 55 22 has-name 28 9 7 isa 81 25 14 member-of 169 51 38 name-of 158 53 29 part-of 1612 530 249 subgroup-of 337 108 53 Total 3206 1035 534 We think that the variations of the values in the column "number of relations identified automatically by BAT in the testing corpus" are due also to the fact that the instrument searches "mechanically" in the preprocessed text rigid definitions of the relations. For example: entity of the type PERSON-NAME + PERSON =>relation "name-of". Moreover, there exist two relations that have the same definition, namely the relations: has- as-subgroup and subgroup-of being given by the entities of type PERSON-GROUP+PERSON- GROUP, the only difference between them being made by the triggers, during testing. Conclusions and future work This paper presents a methodology for the automated recognizing of 10 bridge anaphora (or bridging) type semantic relations, each having several particularities. The achieved results are promising, offering a base for future researches. We suggest using in parallel of machine learning models (Naïve Bayes and Support Vector Machines). The BAT instrument, developed for the automated recognition of Bridge Anaphora relations, will be improved through the addition of several triggers to the existing list, or in the situation in which there would be available even more data for training. BAT is far from being a perfect instrument, but it can be improved since it showed to be efficient at least for an experimental purpose for various applications in the NLP area. Acknowledgments We thank Alexandru Iliescu for developing the BAT instrument. We are also grateful to all colleagues from NLP-Group@UAIC-FII who developed the tools for natural language processing used in this research. Detecting Bridge Anaphora 225 Bibliography [1] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., Etzioni, O. (2007); Open in- formation extraction from the Web, IJCAI07 Proceedings of the 20th international joint conference on Artifical intelligence, 2670-2676. [2] Buraga, S.C., Cioca, M., Cioca, A. (2007); Grid-based decision support system used in disaster management, Studies in Informatics and Control, 16(3):283-296. [3] Cristea, D., Gîfu, D., Diac, P., Maraunduc, C., Bibiri, A., Scutelnicu, A., Colhon, M. (2014); Quo Vadis: A Corpus of Entities and Relations, Springer International Publishing Switzerland, 2014. [4] Cristea, D., Dima, G. E., Postolache, O. D., Mitkov, R. (2002); Handling complex anaphora resolution cases, Proceedings of the Discourse Anaphora and Anaphor Resolution Collo- quium, Lisbon, 2002, 1-6. [5] Branco, A., McEnery, T., Mitkov, R. (2005); Anaphora Processing: Linguistic, Cognitive and Computational Modelling, John Benjamins, 2005. [6] Cojocaru, L. (2000); The study of the anaphoric relationship on the Romanian corpora, Inference in Computational Semantics, at the Inference in Computational Semantics, ICoS- 2, Schloss Dagstuhl, Germany, July 29-30. [7] Conrath, J, Afantenos, S., Asher, N., Muller, P. (2014); Unsupervised extraction of seman- tic relations using discourse cues, International Conference on Computational Linguistics - COLING (Dublin, Ireland), 2184-2194. [8] Gîfu, D., Iliescu, A. (2014); Analysis of Bridge Anaphora across novel, Procedia- Social and Behavioral Sciences, 180: 1474-1480. [9] Gîfu, D.; Cioca, M. (2013); Online civic identity. Extraction of features, Procedia - Social and Behavioral Sciences, 76:366-371. [10] Gîfu, D.; Cioca, M. (2014) Detecting Emotions in Comments on Forums, International Journal of Computers Communications & Control, 9(6):694-702. [11] Herbert, H. C. (1975); Bridging. Proceedings of the 1975 Workshop on. Theoretical Issues in Natural Language. Processing, TINLAP ’75, 169-174. [12] Hendrickx, I., Clercq, O., Hoste, V. (2011); Analysis and reference resolution of bridge anaphora across different text genres, Anaphora Processing and Applications - 8th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2011, Faro, Portugal, LNAI 7099:1- 11. [13] Krahmer, E., Piwek, P. (1997); Varieties of anaphora, Proceedings of the 11th Amsterdam Colloquium, University of Amsterdam, 5-20. [14] Korzen, I., Buch-Kromann, M. (2006); Anaphoric relations ĂŽn the Copenhagen Depen- dency Treebanks, Proceedings of COLING-ACL 06, 3:83-98. [15] Lappin, S., Leass, H. J. (1994); An Algorithm for Pronominal Anaphora Resolution, Com- putational Linguistics, 20(4):535-561. 226 D. Gîfu, L.I. Cioca [16] Liang, T., Wu, D. (2010); Automatic Pronominal Anaphora Resolution ĂŽn English Texts, Computational Linguistics and Chinese Language Processing, 9(1):21-40. [17] Murphy, M. L. (2003); Semantic relations and the lexicon Antonymy, synonymy, and other paradigms, Cambridge University Press, Cambridge, UK, 2003. [18] Năstase, V., Nakov, P., Seaghdha, D. O., Szpakowicz, S. (2013); Semantic relations between nominals, California: Morgan & Claypool Publishers, 2013. [19] Niyu, G., Hale, J., Charniak, E. (1998); A Statistical Approach to Anaphora Resolution, Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 161-170. [20] Nedoluzhko, A., Mirovsky, J., Ocelak, R., Pergler, J. (2009); Extended Coreferential Rela- tions and Bridging Anaphora in the Prague Dependency Treebank, Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2009), Goa, India, 1- 16. [21] Orăşanu, C. (2003); PALinkA: A highly customisable tool for discourse annotation, Pro- ceedings of the 4 th SIGdial Workshop on Discourse and Dialogue, ACL’03, 1-5. [22] Rello, L., Ilisei, I. (2009); A comparative study of spanish zero pronoun distribution, Pro- ceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages (ISMTCL), 1-5. [23] Wasow, T. (1967); Anaphoric relations in English, Ph.D. dissertation, Massachusetts Insti- tute of Technology, 1967. [24] Sienkiewicz, H. (1991); Quo Vadis, translated in Romanian by Luca, R. and Linţă, E., Ed. Tezi, Bucharest. [25] Singh, S., Lakhmani, P., Mathur, P., Morwal, S. (2014); Analysis of Anaphora Resolution System for English Language, International Journal on Information Theory, 3(2):5157, DOI : 10.5121/ijit.2014.3205. [26] Z̆itnik, S, S̆ubelj, L, Bajec, M (2014); SkipCor: Skip-Mention Coreference Reso- lution Using Linear-Chain Conditional Random Fields, PLoS ONE, 9(6): e100101. doi:10.1371/journal.pone.0100101.