Engineering, Technology & Applied Science Research, Vol. 8, No. 3, 2018, 2914-2918
www.etasr.com

Using Association Rules to Enrich Arabic Ontology

Asma Ksiksi
Technologies of Information Laboratory (LR-SITI), National Engineering School of Tunis (ENIT), Tunis, Tunisia
asma.ksiksi@gmail.com

Hamid Amiri
Technologies of Information Laboratory (LR-SITI), National Engineering School of Tunis (ENIT), Tunis, Tunisia
hamidlamiri@gmail.com

Abstract—In this article, we propose the use of a minimal generic base of association rules between terms to automatically enrich an existing domain ontology. Initially, non-redundant association rules between terms are extracted from an Arabic corpus. Then, the candidate terms are matched by comparing the concepts of the initial ontology with the premises of the association rules, using three distance measures that we define.

Keywords-ontology; automatic enrichment; association rules

I. INTRODUCTION

An ontology is a tool for knowledge representation and reasoning that organizes a set of concepts in a specific field, together with the relations between these concepts [1-3]. Ontologies are regularly subject to updates and changes. Performing these updates manually is an expensive and time-consuming task, as it mobilizes domain experts to identify and classify new vocabulary items in the ontology. To accelerate this process of evolution and adaptation and to remove any form of subjectivity, recent research has focused on semi-automatic and automatic ontology enrichment techniques. The majority of approaches, often based on statistical or linguistic tools, focus on adding new concepts and/or relationships between them.
The ontology enrichment process can be divided into two stages: the search for new concepts and relations, and the placement of these concepts and relations within the ontology [3]. The general process is depicted in Figure 1. Several works have focused on this enrichment process, addressing one or more of its stages:
• Extraction of representative terms in a specialized field.
• Identification of lexical relations between terms.
• Placement of new terms in an existing ontology.

In these works, the term ontology takes several meanings, such as thesaurus, taxonomy or, more generally, controlled vocabulary.

Work on the extraction of candidate terms in the ontology enrichment process is based on statistical and syntactic methods. Statistical methods select terms according to their distribution in the corpus [1-3], as well as other measures such as mutual information, the probability that the word A appears given that the word B has appeared, or measures of the probability of occurrence of a set of terms [4-6]. These proposals make it possible to identify new ontology elements, but do not allow them to be placed in the ontology without human intervention. Syntactic methods aim at determining the grammatical function of a word or a group of words within a sentence. They are based on the hypothesis that grammatical dependencies reflect semantic dependencies. These techniques lead to the proposal of new concepts, linked by relations that are not yet semantically identified.

Regarding the identification of concepts and relationships and their placement in the ontology, the extraction of association rules (ARs) is one of the major techniques proposed by the data mining community. Many other works propose the use of the frequent correlations that can exist between the terms of a corpus. These approaches most often consist of extracting ARs between candidate terms previously identified by statistical or syntactic tools [7].
At the end of the process, the authors obtain a set of ARs, each describing the existence of a relationship between two concepts [8-10].

Fig. 1. General process of ontology enrichment.

In this paper, we propose a methodology for building a conceptual network formed by the combination of two types of knowledge: knowledge present in the initial domain-specific ontological structure and represented by semantic links, and knowledge derived from the minimal generic base of association rules (ARs) between terms, which essentially represents correlations assessed by statistical measures.

II. EXISTING TECHNIQUES FOR ONTOLOGY ENRICHMENT

A. Existing Approaches to Discover Candidate Concepts

We distinguish two types of methods for the discovery of candidate concepts:
• Statistical methods: they select the terms according to their distribution in the corpus [1-3], as well as other more complex measures such as mutual information, tf-idf, etc., or the use of statistical distributions of terms [4]. These proposals make it possible to identify new ontology elements, but do not make it possible to place them in the ontology without tedious human intervention [12].
• Syntactic methods: they aim to determine the grammatical function of a word or group of words within a sentence. They are based on the hypothesis that grammatical dependencies reflect semantic dependencies [13]. In a sentence, they define the verb (V) as the relation that links the subject (S) to the complement (C). They thus have the disadvantage of identifying only the relationships labeled by verbs. Some approaches also use syntactic patterns [12]. The extracted terms illustrate the new candidate concepts for enrichment, but also the existence of relations between them.
However, these relationships are not labeled semantically. Moreover, no measure that semantically evaluates the newly added relations is computed.

B. Existing Approaches to Concept Placement in the Ontology

After the discovery of the candidate terms, it is essential to detect the relations between these new terms and those which link them to the initial ontology. In [2], the authors propose a statistical approach based on the frequent co-occurrence of candidate terms with terms of the initial ontology. The major drawback of this work is that it does not allow the precise addition of new concepts and relations to the ontological structure [14]. Other approaches in the literature suggest using data mining techniques [10, 11, 13]. The work in [4, 15] is based on a classification method that matches the candidate terms contained in the texts to the concepts present in the ontology. The principle is similar to that of the approaches in [1, 15], which group terms together by a clustering method according to their number of occurrences within the corpus. However, these methods do not detect the relations between the candidate terms, so these new terms can be added only through human intervention.

In addition, several studies propose the use of frequent correlations that exist between the terms of a corpus. These approaches consist of extracting association rules [7] between candidate terms [8-10]. At the end of the search process, a set of ARs between terms is generated. Each rule expresses the existence of a relationship between two concepts of the domain. This enrichment process remains semi-automatic because, on the one hand, the number of derived ARs is very large and, on the other hand, human intervention is necessary to semantically define the discovered relations and to name them.

III. ASSOCIATION RULES

Association rule mining is a well-known knowledge discovery technique for finding associations between items in a transaction database. Its definition varies according to three main currents: the author in [16] defines rules of statistical implication to help educationalists find relationships between the acquisition of basic notions in class; the authors in [17] are more interested in an ordered representation of concepts with informative implications; and the authors in [18] favor the optimized extraction of ARs from large databases. Subsequently, these forms have been extended in several directions. Binary properties are no longer required: ARs can now be built over numerical properties [19, 20]. To limit the growth of rule extraction time, more efficient algorithms have been proposed [21]. The semantics of the rules have been refined through many quality indices [22], which help users choose the rules most appropriate to their needs. Navigation and querying with an appropriate language have been developed [23] to facilitate the exploration of this set of rules.

ARs present conditional relationships between the attributes of a database. They represent an implication of the form A→B, where A and B are itemsets. The set of items A is called the antecedent and B the consequent of the rule, which provides information about the existing relations between A and B. It expresses how objects or items are related to each other, and how they can be grouped together. The first step in association rule mining is to find the frequent itemsets, which are called candidates. A rule is then evaluated by two statistical measures called support and confidence. The support Sup(A→B) is defined as the relative frequency of transactions in the data set D that contain both itemsets A and B:
$$\mathrm{Sup}(A \rightarrow B) = \mathrm{Sup}(A \cup B) = \frac{|\{t \in D : A \subseteq t \text{ and } B \subseteq t\}|}{|D|} \qquad (1)$$

The confidence Conf(A→B) of a rule measures the reliability of the inference made by the rule:

$$\mathrm{Conf}(A \rightarrow B) = \frac{\mathrm{Sup}(A \cup B)}{\mathrm{Sup}(A)} \qquad (2)$$

Then, the important association rules are filtered from the candidate itemsets. A rule r is retained only if Sup(A→B)>minsup and Conf(A→B)>minconf, where minsup is the support threshold and minconf is the confidence threshold. These two values are specified by the user.

A. Process for Association Rules Extraction

The process of extracting association rules consists of several phases ranging from data selection and preparation to result interpretation (Figure 2). The main phases are:
• Data selection and preparation (cleaning): In this phase, the data used for the extraction of the association rules are selected from the database and transformed into an extraction context. This phase is necessary to be able to apply rule extraction algorithms to different kinds of data from different sources, to concentrate the search on the useful data and to minimize extraction time [24]. To obtain significant rules, the morphological analysis of each word must follow the order described in [25] and shown in Table I.
• Generation of association rules: This phase is carried out from the previously generated frequent itemsets. In general, the rules are generated directly, without access to the extraction context, so the execution time of this phase is low compared to the cost of extracting frequent itemsets.
• Visualization and interpretation: This phase consists of visualizing the association rules extracted from the context and interpreting them. Thus the domain expert can judge their relevance and usefulness.

TABLE I. REPRESENTATIVE SCHEMA OF AN ARABIC WORD STRUCTURE
Enclitic Suffix Schematic body Prefix Proclitic Base post Radical Near_base

Fig. 2. Process of Association Rules.

IV. PROPOSED APPROACH

This stage consists of matching the terms that appear in the premises of the candidate rules of the rule base to our initial ontology, denoted O. These terms are identified as candidate concepts for enrichment.
• Definition 1: An ontology is a quadruplet O=(CD, ≤C, R, ≤R) where CD is the set of concepts of the domain, ≤C is the partial order defined on CD, R is the set of relations defined on CD×CD and ≤R is the partial order relation defined on R. We consider that a formal extraction context is a triplet K=(D, T, R) where D represents a finite set of documents from the corpus C, T is a finite set of terms and R⊆D×T is a binary relation. Each pair (d,t)∈R means that the document d∈D contains the term t∈T.
• Definition 2: A termset is a non-empty set of terms denoted by (t1, t2 ... tk). An association rule R is evaluated by two statistical metrics, namely support and confidence [18]. The support of the association rule R: Ti→Tj, denoted by Supp(R), expresses the frequency with which the two termsets Ti and Tj co-occur in corpus C. The confidence of R, denoted by Conf(R), expresses the conditional probability that a document contains termset Tj, given that it contains termset Ti. An association rule is valid if its confidence is greater than or equal to the minimum confidence threshold, denoted minconf.

Fig. 3. Process of enrichment of Ontology by using Association Rules.

A. Extraction of the Ontology and Creation of the Generic Base

We use the GEN-MGB algorithm [26] to extract the minimal generic base (MGB) of non-redundant ARs.
This base is characterized by its significant compactness, i.e., it contains a minimal core of ARs from which all the redundant valid rules can be deduced by means of a valid and complete axiomatic system [26, 27]. Considering the text extraction context K, we adapt the definition of the MGB given in [26] to the problem of ontology enrichment. Recall that non-redundant ARs have only one domain term in the premise [28].

$$\mathrm{MGB} = \{\, R : t \rightarrow T_k \mid \mathrm{Conf}(R) \geq \mathit{minconf} \,\wedge\, T_k = \{t_1, \ldots, t_k\} \subseteq T \,\} \qquad (3)$$

We then use a semi-automatic tool such as Protégé-2000 [29] to construct the ontology O from the corpus CO. It is validated downstream by a domain expert. The semantic link between the concepts of O is evaluated with the similarity measure proposed in [30], which takes into account both the depth of the concepts in the concept hierarchy and the structure of the latter. Thus, the similarity between two concepts c1 and c2 of the ontology O is calculated as [30]:

$$\mathrm{SimWu}(c_1, c_2) = \frac{2 \times \mathrm{depth}(c)}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)} \qquad (4)$$

where depth(ci) is the depth level of the concept ci and c is the most specific concept that generalizes c1 and c2 in O.

B. Adopted Approach for the Ontology Enrichment

The enrichment process we propose is iterative and includes the following steps:

1) Calculation of the candidate concepts for the enrichment

For each concept ci of ontology O, we compute the set of candidate concepts to be connected to ci. This set includes the terms appearing in the conclusions of the valid association rules whose premise is ci, as well as those of the redundant rules [31]. According to the example shown in Figure 4, the candidate concepts for enrichment related to the concept c1 are {c10, c12, c5, c15}.

Fig. 4. Example of calculating candidate concepts.

2) Placement of the new concepts

This step consists of placing the candidate concepts while preserving the coherence of the concepts and pre-established relations in the initial ontology. This avoids adding relational redundancies when a concept is a candidate to be related to several concepts of ontology O [32]. Figure 5 shows the addition of the new concepts c10 and c11 and the displacement of c15 because Conf(c1⇒c15)>Conf(c7⇒c15).

3) Calculating the neighborhood of ci and distance measurements

We define the notion of neighborhood of a concept of the ontology O as:
• Definition 3: The neighborhood of a concept is the set of concepts connected to it in the ontology by one or more valid association rules [32].

The relations between ci and its neighbors are evaluated on the basis of a statistical metric that we call the measure of distance between ci and its neighborhood, denoted by DistO MGB. It is computed from the confidence of the association rules involved in the ontology enrichment and from the similarity measures calculated between the concepts of the initial ontological structure [33].

Fig. 5. Example of placement of candidate concepts.

The distance measure that we define is calculated according to three possible cases [34]:
• Case 1: If the two concepts ci and cj come from the base C, then DistO C(ci, cj)=Conf(R: ci⇒cj).
• Case 2: If the two concepts ci and cj belong initially to ontology O, then DistO C(ci, cj)=SimWu(ci, cj). The similarity between two concepts of ontology O is calculated as in (4).
• Case 3: If ci is a concept added to the ontology O and related to the concept ck of the initial ontology O such that DistO MGB(ck, ci)=Conf(R: ck⇒ci)=β, then any concept cx of the ontology O related to ck such that SimWu(ck, cx)=α is also related to ci. In this case, the distance measure is mixed, i.e., DistO MGB(ci, cx)=α×β.
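As a rough illustration of the three cases above, the following sketch computes the Wu-Palmer similarity of (4) and the resulting distance on a toy hierarchy. The concept names, the parent map `PARENT`, the confidence table `CONF`, the helper names and the convention that the root has depth 1 are all assumptions made for this example; they are not taken from the original method description.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical initial ontology O as a child -> parent map (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "c1": "root",
    "c2": "c1",
    "c3": "c1",
    "c7": "root",
}

# Hypothetical confidences Conf(premise => conclusion) taken from a rule base.
CONF: Dict[Tuple[str, str], float] = {
    ("c1", "c10"): 0.8,   # c10 is a new concept attached to c1
    ("c10", "c12"): 0.6,  # both c10 and c12 come from the rule base
}


def path_to_root(c: str) -> List[str]:
    """Return the concepts from c up to the root, c included."""
    path: List[str] = []
    node: Optional[str] = c
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(c: str) -> int:
    """Depth of c in O, assuming (for this sketch) that the root has depth 1."""
    return len(path_to_root(c))


def sim_wu(c1: str, c2: str) -> float:
    """Wu-Palmer similarity of equation (4) between two concepts of O."""
    above_c2 = set(path_to_root(c2))
    # Climbing from c1, the first shared ancestor is the most specific one.
    c = next(a for a in path_to_root(c1) if a in above_c2)
    return 2 * depth(c) / (depth(c1) + depth(c2))


def dist(ci: str, cj: str, ck: Optional[str] = None) -> float:
    """Distance between ci and cj following the three cases."""
    if ci in PARENT and cj in PARENT:
        return sim_wu(ci, cj)      # Case 2: both belong to the initial ontology
    if (ci, cj) in CONF:
        return CONF[(ci, cj)]      # Case 1: a rule of the base links them directly
    if ck is not None and (ck, ci) in CONF and cj in PARENT:
        beta = CONF[(ck, ci)]      # ci was attached to ck with confidence beta
        alpha = sim_wu(ck, cj)     # cj is related to ck inside O
        return alpha * beta        # Case 3: mixed measure
    raise ValueError("no case applies to this pair of concepts")


case1 = dist("c10", "c12")          # confidence of the rule c10 => c12
case2 = dist("c2", "c3")            # Wu-Palmer similarity inside O
case3 = dist("c10", "c2", ck="c1")  # SimWu(c1, c2) * Conf(c1 => c10)
```

Passing the anchor concept `ck` explicitly is a design choice of this sketch; an implementation could equally look it up from the rule that attached the new concept.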
The three cases are illustrated in Figure 3. Thanks to this enrichment technique we are able to add new concepts and relationships.

Fig. 6. Example of a figure caption.

V. CONCLUSION

Various ontology enrichment techniques have been proposed in the literature. Their limitations come from the fact that they do not support the entire enrichment process without the intervention of a domain expert. In this article, we presented an automatic ontology enrichment process based on a generic base of association rules. The originality of our approach is that it exploits the maximum number of concepts for enrichment without resorting to a priori knowledge. Its advantage is that it allows the learning of the distance represented by any relation of the enriched ontology.

REFERENCES

[1] E. Agirre, O. Ansa, E. Hovy, D. Martinez, “Enriching very large ontologies using the WWW”, ECAI 2000 Workshop on Ontology Learning, Berlin, Germany, August 2000
[2] A. Faatz, R. Steinmetz, “Ontology enrichment with texts from the WWW”, 2nd Semantic Web Mining Workshop at ECML/PKDD, WS’02, Helsinki, Finland, pp. 20-33, 2002
[3] V. Parekh, J. Gwo, T. Finin, “Mining Domain Specific Texts and Glossaries to Evaluate and Enrich Domain Ontologies”, Proceedings of the International Conference of Information and Knowledge Engineering, Las Vegas, USA, June 21, 2004
[4] K. Neshatian, M. R. Hejazi, “Text categorization and classification in terms of multi-attribute concepts for enriching existing ontologies”, Proceedings of the 2nd Workshop on Information Technology and its Disciplines, pp. 43-48, 2004
[5] P. Velardi, M. Missikoff, R. Basili, “Identification of relevant terms to support the construction of Domain Ontologies”, ACL-EACL Workshop on Human Language Technologies, Toulouse, France, July 2001
[6] A. Xu, S.-K. Park, S. D'Mello, E. Kim, Q. Wang, C. Pikielny, “Novel genes expressed in subsets of chemosensory sensilla on the front legs of male Drosophila melanogaster”, Cell and Tissue Research, Vol. 307, No. 3, pp. 381-392, 2002
[7] R. Srikant, R. Agrawal, “Mining generalized association rules”, Future Generation Computer Systems, Vol. 13, No. 2-3, pp. 161-180, 1997
[8] R. Bendaoud, “Construction et enrichissement d’une ontologie à partir d’un corpus de textes”, Actes des Rencontres des Jeunes Chercheurs en Recherche d’Information (RJCRI’06), Lyon, France, pp. 353-358, March, 2006 (in French)
[9] A. Maedche, S. Staab, “Mining ontologies from text”, Lecture Notes in Computer Science, Vol. 1937, pp. 189-202, Springer-Verlag, 2000
[10] G. Stumme, A. Hotho, B. Berendt, “Semantic web mining: State of the art and future directions”, Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 4, No. 2, pp. 124-143, 2006
[11] L. Jorio, L. Abrouk, C. Fiot, D. Hérin, M. Teisseire, “Enrichissement d’ontologie basé sur les motifs séquentiels”, Actes de la Plateforme AFIA 2007, Atelier Ontologies et gestion de l’hétérogénéité sémantique, 2007 (in French)
[12] A. Maedche, V. Pekar, S. Staab, “Ontology Learning Part One - On Discovering Taxonomic Relations from the Web”, in: Web Intelligence, pp. 301-319, Springer-Verlag, 2002
[13] N. Hernandez, J. Mothe, C. Chrisment, D. Egret, “Modeling context through domain ontologies”, Information Retrieval, Vol. 10, No. 2, pp. 143-172, 2007
[14] P. Cimiano, A. Hotho, G. Stumme, J. Tane, “Conceptual Knowledge Processing with Formal Concept Analysis and Ontologies”, Lecture Notes in Computer Science, Vol. 2961, pp. 189-207, Springer-Verlag, 2004
[15] E. Han, G. Karypis, “Centroid based document classification: Analysis and experimental results”, Lecture Notes in Computer Science, Vol. 1910, pp. 424-431, Springer-Verlag, 2000
[16] R. Gras, Contribution à l'étude expérimentale et à l'analyse de certaines acquisitions cognitives et de certains objectifs didactiques en mathématiques, Thèse d'Etat, Université de Rennes, 1979 (in French)
[17] J. L. Guigues, V. Duquenne, “Familles minimales d'implications informatives résultant d'un tableau de données binaires”, Mathématiques et Sciences Humaines, Vol. 95, pp. 5-18, 1986
[18] R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules between sets of items in large Databases”, Proceedings of ACM SIGMOD Conference, Washington, USA, pp. 207-216, May 25-28, 1993
[19] S. Guillaume, Traitement des données volumineuses, mesures et algorithmes d'extraction de RA et règles ordinales, PhD Thesis, Nantes, 2000
[20] M. Cadot, “RA et codage flou des données”, 11èmes Rencontres de la Société Francophone de Classification (SFC'04), Bordeaux, France, pp. 130-133, 2004 (in French)
[21] N. Pasquier, “Data Mining: Algorithmes d'Extraction et de Réduction des RA dans les Bases de Données”, PhD Thesis, Université Blaise Pascal-Clermont-Ferrand II, 2000 (in French)
[22] F. Guillet, “Mesure de qualité des connaissances en ECD”, Cours donné lors des journées de la conférence EGC 2004, Clermont-Ferrand, January 2004 (in French)
[23] M. Botta, J. F. Boulicaut, C. Masson, R. Meo, “A Comparison between Query Languages for the Extraction of Association Rules”, Lecture Notes in Computer Science, Vol. 2454, pp. 1-10, Springer-Verlag, 2002
[24] M. Jarrar, “Building a Formal Arabic Ontology”, Experts Meeting on Arabic Ontologies and Semantic Networks, Alecso, Arab League: Tunis, pp. 26-28, July 26-28, 2011
[25] F. Z. Belkredim, F. Meziane, “DEAR-ONTO: A Derivational Arabic Ontology Based on Verbs”, International Journal of Computer Processing of Languages, Vol. 21, No. 03, pp. 279-291, 2008
[26] C. C. Latiri, L. B. Ghezaïel, L. B. Ahmed, T. Tunsisie, “Fast-MGB: Nouvelle base générique minimale de règles associatives”, EGC’2006, Lille, France, pp. 217-222, January, 2006 (in French)
[27] C. L. Cherif, W. Bellagha, S. Ben Yahia, G. Guesmi, “VIE-MGB: A Visual Interactive Exploration of Minimal Generic Basis of Association Rules”, 3rd International Conference on Concept Lattices and their Applications (CLA’05), Olomouc, Czech Republic, pp. 179-196, September, 2005
[28] C. Fankam, OntoDB2: un système flexible et efficient de Base de Données à Base Ontologique pour le Web sémantique et les données techniques, PhD Thesis, ISAE-ENSMA Ecole Nationale Supérieure de Mécanique et d’Aérotechnique-Poitiers, 2009 (in French)
[29] N. F. Noy, R. W. Ferguson, M. A. Musen, “The Knowledge Model of Protégé-2000: Combining Interoperability and Flexibility”, Lecture Notes in Computer Science, Vol. 1937, pp. 17-32, Springer-Verlag, 2000
[30] Z. Wu, M. Palmer, “Verb semantics and lexical selection”, 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, USA, pp. 133-138, June 27-30, 1994
[31] T. R. Gruber, “The Role of Common Ontology in Achieving Sharable, Reusable Knowledge Bases”, Proceedings of the Second International Conference, Cambridge, pp. 601-602, Morgan Kaufmann, 1991
[32] D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach, “Scalable Semantic Web Data Management Using Vertical Partitioning”, 33rd International Conference on Very Large Data Bases, Vienna, Austria, pp. 411-422, September 23-27, 2007
[33] P. Gamallo, M. Gonzalez, A. Agustini, G. Lopes, V. S. de Lima, “Mapping Syntactic Dependencies onto Semantic Relations”, Proceedings of the ECAI 2002 Workshop on Machine Learning and Natural Language Processing for Ontology Engineering (OLT’2002), Lyon, France, pp. 15-22, 2002
[34] F. Cerbah, “Learning highly structured semantic repositories from relational databases”, Lecture Notes in Computer Science, Vol. 5021, pp. 777-781, Springer-Verlag, 2008