Paper—Employing Information Extraction for Building Mobile Applications

Employing Information Extraction for Building Mobile Applications

https://doi.org/10.3991/ijim.v11i2.6569

Daoud M. Daoud
Princess Sumaya University for Technology (PSUT), Jordan
d.daoud@psut.edu.jo

M. Samir Abou El-Seoud
The British University in Egypt (BUE), Egypt
selseoud@yahoo.com

Abstract—We describe an SMS-based information system called CATS, which allows posting and searching through free Arabic text using Information Extraction (IE) technology. We discuss the challenges of applying IE technology to unedited, real-world Arabic text. In addition, we describe the structure of the system and our approach to producing an open, robust system that can incorporate more sub-domains with minimum effort.

Keywords—Information Extraction, Arabic Language Processing, Classified Ads, Attribute Based Searching.

1 Introduction

Natural language is considered the simplest technique of human-machine interaction. It is suitable for naïve users who know the task domain well. However, building a robust commercial application that employs natural language requires a restricted domain in which we have control over linguistic and world knowledge.

Information Extraction (IE) is a comparatively new technology within the more general field of Natural Language Processing. IE is the process of identifying relevant information, where the criteria for relevance are predefined by the user in the form of a template that is to be filled [9]. Developments in the field of IE can be followed through the Message Understanding Conferences (MUCs). In this competition English has always been the sole target language, with the exception of MUC-6 (MET-1), where Spanish and Chinese were considered as well [10]. IE systems are usually designed for a specific domain, and the types of facts to be extracted are defined in advance [11].
Although most researchers believe that IE technology is promising and pertinent to a wide range of fields, much of the research has been directed toward news items found on the web. IE systems are a key factor in encouraging NLP researchers to move from small-scale systems and artificial data to large-scale systems operating on human language [7].

iJIM ‒ Vol. 11, No. 2, 2017

In this paper, we describe our efforts to employ IE technology in an SMS-based information system called CATS, which uses Arabic as the interaction language for connecting sellers and buyers through SMS in the classified ads domain.

2 Background

The Classified Ads through SMS (CATS) system is an SMS-based platform for selling and buying through classified ads. Users can send classified ads for the goods they would like to sell, and can search the platform for the goods they want, economically and while on the move. It provides a natural language interface through which users specify their requests by sending SMS text in Arabic to an assigned short number.

SMS, or Short Message Service, is becoming the most popular channel for exchanging information. The most important factor behind this enormous success is that it offers a simple, immediate, and confidential way to communicate. Moreover, it has played a major role in narrowing the digital gap caused by the low level of Internet penetration in some countries. As an example, SMS enables communication with more than 1.5 million Jordanian subscribers anywhere, anytime, and hence offers unmatched service coverage, beyond even that of the Internet, as mobile phone penetration is much higher than Internet usage.

In the same context, classified ads are an effective way of connecting buyers and sellers. Normally, they are concise, with a limited but specialized vocabulary. They are rich in proper nouns, nouns, hint words, and numerical values.
The CATS system can handle both unstructured free Arabic SMS texts and structured data stored in a relational database. When the CATS system receives a text, it extracts the relevant information and distinguishes between a “posting text” and a “search text.” Both are processed similarly, by filling previously designed templates. For a “posting text,” the filled template is stored in a database; for a “search text,” the template is used to build a query that retrieves information residing in the database.

The current version of the CATS system is in Arabic and is restricted to the classified ads domain. The cars and real estate sub-domains are implemented in this version. However, the system is structured to adapt easily to other sub-domains. Moreover, we plan to produce a multilingual version of the system.

3 Information Extraction and Arabic

Not all languages have received equal investment in linguistic resources and tool development [11]. As an example, most of the research published on IE discusses problems related to English, which is a resource-rich language. In the same context, the performance of some existing English-based IE systems is comparable to that of human experts. On the other hand, Natural Language Processing (NLP) for the Arabic language is still in its initial stage compared to the work on English [12].

http://www.i-jim.org

Regarding Information Extraction, Arabic was not one of the languages considered in the MUC events. However, Arabic is supported, along with English and Chinese, in the Automatic Content Extraction (ACE) program, which operates under the DARPA program in Translingual Information Detection, Extraction, and Summarization (TIDES). The ACE research objectives are the detection and characterization of Entities, Relations, and Events [13]. Annotation of named entities is the core of this project.
Full-scale Chinese annotation is well underway, while Arabic annotation is just beginning [13].

The common practice in automatic information extraction has been to use templates, which specify what information should be extracted. For example, a template for a car classified ad scenario might specify fields such as “Car Make”, “Car Model”, “Year”, “Color”, “Mileage”, “Price”, and “Phone”. The IE engine would then try to fill these fields, much like filling in a database record. This task (referred to as the Template Element task) has been examined in detail in the MUCs, aiming at accurately identifying names, dates, and organizations in a text.

Despite these developments, building a large-scale information system based on IE that supports Arabic poses new challenges that do not exist in English or other resource-rich languages:

• Classified ads are rich in proper names, which in Arabic are not distinguished by upper-case letters as they are in English. This makes them much harder to locate in Arabic text than in English text [16].
• Variations in spelling, caused by the complex orthography of Arabic, add further challenges to processing Arabic text. As an example, people tend to interchange the Alef forms “ا”, “أ”, and “إ” in their writing, as well as Ha’ “ه” and Ta’ “ة”, and Ya’ “ي” and Alef-Maqsoura “ى”. In Arabic, spaces are normally used to separate words. Most Arabic letters connect on both sides (cursive writing system), so their shapes differ depending on their position (first, middle, or last). But the letters “ا”, “د”, “ذ”, “ر”, “ز”, and “و” connect only from the right side, so their shapes do not change with their position in the word. After any of these letters, people tend either to insert a space or to drop it (e.g., “أبو بكر” or “أبوبكر” {Abu-Baker}).
• The inconsistent Arabic spelling of transliterated proper nouns is a major challenge. It appears frequently in classified ads, where many of the proper names (car make and model, for example) are transliterated from other languages. The phenomenon is noticeable in unedited and spontaneous classified ads, reflecting the cultural and educational background of the writer. As an example, the car make CITROEN can take several spellings in Arabic: {SATARWEN} “سترون”, {SA:TERWEN} “ساترون”, {SATERWE:N} “ستروين”, {SA:TERWE:N} “ساتروين”, {SE:TERWE:N} “سيتروين”.
• Arabic uses a diverse system of prefixes, suffixes, and pronouns that are attached to words, creating composite forms that further complicate text manipulation. For instance, articles such as “an” and “the” are not separate words as they are in languages like English but are attached to the words they modify, and a phrase such as “their two cars” is written as a single token. This can cause ambiguity of forms (especially when short vowels are omitted). As an example, the Arabic form “سترون” has two interpretations: “you will see” and CITROEN, the name of a car brand.

Therefore, Arabic’s rich morphology and complex orthography present challenges for analysis; text requires significant pre-processing before it can be accurately indexed, searched, or otherwise manipulated.

4 The CATS System

4.1 Overview

The CATS system is an information system that uses IE technology. The goal of this application is to enable SMS users to post or search for classified ads in Arabic. It has two main functions: submitting items for sale and answering users’ queries through natural language interaction.
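The kind of pre-processing that Section 3 calls for can be illustrated with a minimal Python sketch. It is not the CATS implementation: the folding table, the function names `match_key` and `lookup_make`, and the small `CAR_MAKES` lexicon are assumptions made for illustration, and the Arabic spellings of CITROEN are reconstructed from the transliterations given in Section 3. The sketch folds the commonly interchanged letters onto one representative and ignores spaces when comparing a token against known proper names.

```python
# Illustrative only: this is not the CATS pre-processing code. The folding
# choices and the tiny lexicon below are assumptions made for the example.

# Fold letters that writers commonly interchange onto one representative.
FOLD = str.maketrans({
    "أ": "ا",  # Alef with hamza above -> bare Alef
    "إ": "ا",  # Alef with hamza below -> bare Alef
    "ة": "ه",  # Ta' marbuta -> Ha'
    "ى": "ي",  # Alef-Maqsoura -> Ya'
})


def match_key(text: str) -> str:
    """Return a spelling-insensitive key: fold letters, then drop spaces."""
    return text.translate(FOLD).replace(" ", "")


# Hypothetical lexicon: every known spelling of a car make points to one
# canonical name. Keys are stored in folded form.
CAR_MAKES = {
    match_key(spelling): "CITROEN"
    for spelling in ["سترون", "ساترون", "ستروين", "ساتروين", "سيتروين"]
}


def lookup_make(token: str):
    """Map a raw token to a canonical car make, or None if it is unknown."""
    return CAR_MAKES.get(match_key(token))


# The two spellings of Abu-Baker (with and without the space) share a key.
print(match_key("أبو بكر") == match_key("أبوبكر"))
print(lookup_make("سيتروين"))
```

Dropping spaces is only reasonable here because the key is used for lexicon lookup on a candidate span, not for segmenting running text. A lookup of this kind also cannot resolve ambiguous forms such as “سترون” on its own; deciding between the verb and the car make still requires context.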
The system receives an entry in full text without a pre-specified layout, recognizes the relevant entries, and produces a logical representation for further processing. There are two types of user requests:

• Posting type, in which the user is a potential seller.
• Searching type, in which the user is a potential buyer.

Suppose a user wants to sell his car: he can simply type an SMS message and send it to a specified short number (Figure 1). Similarly, the user can express a search request and send it to the same number. As shown in Figure 2, the system is capable of performing attribute-based search on exact values (e.g., Japanese) and on specified ranges (e.g., the price should be less than 2500 JD). If the system finds records that match the request, the user gets a list of cars with contact numbers (Figure 2). Otherwise, the user gets a message that no match was found at this time, with the possibility of receiving results later on.

In addition, the CATS system implements a level-based strategy for searching. As an example, consider the query “I am looking for Clio 1999.” If the system fails to find any match, it looks for all cars manufactured by Renault. In the same way, if it again fails to find any match, it looks for any French car.

Finally, the current version of the CATS system includes the cars and real estate sub-domains. However, we took care in the design to make the system customizable, so that other sub-domains can be included with the least possible effort.

Fig. 1. Sending a selling classified ad: “For sale Nissan Laurel 1984 in good condition”

Fig. 2. Making a query and getting results through the CATS system: “A Japanese car is wanted, full checkup, full option, above 1985; the price should be less than 2500 JD.”

4.2 Template Design

Developing a high-quality system requires a systematic approach. We started development by collecting a corpus for each sub-domain. The corpus was collected from web sites that provide unedited Arabic classified ads services. With access to this corpus, we were able to study the patterns in use and even to anticipate patterns not seen in the corpus. More to the point, the corpus enabled us to characterize the lexicon, styles, and types of queries that interest users. We also decided what is and is not relevant to a particular domain. We then drew up our templates, which reflect our conceptual view of the relevant knowledge embedded in the free text of classified ads.

We adopted an object-attributes representation to model the templates. This representation acts as a medium between free text and the structured database, and it abstracts our conceptual view of a particular domain. For example, a flat has a location, consists of parts, has an area, has a number of bedrooms, has a floor number, has a price, and is either “for sale” or “wanted.” Figure 3 shows a main object with the value “flat” along with its attributes.

[Figure 3: main object “Flat” with attributes adstype (sale), location (Amman), consist (kitchen), area, bedrooms, price (40000), and floor (2)]

Fig. 3. An example of the object-attributes model

In the real estate sub-domain, the main objects are “flat”, “land”, “shop”, “building”, etc.
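The object-attributes representation can be sketched as a small data structure. The Python sketch below is an illustration under stated assumptions, not the CATS implementation: the class name `Template`, the helper `matches`, and the rule that a search template matches a stored ad when every attribute it fills agrees are all hypothetical, and the attribute values are illustrative ones that loosely follow the flat example.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """A main object (e.g. "flat") together with its filled attributes."""
    main_object: str
    attributes: dict = field(default_factory=dict)

    def facts(self):
        """Render each filled attribute as attribute(object, value)."""
        return [f"{name}({self.main_object}, {value})"
                for name, value in self.attributes.items()]


def matches(query: Template, ad: Template) -> bool:
    """Assumed rule: same main object, and every attribute the query fills
    (other than adstype) has the same value in the stored ad."""
    if query.main_object != ad.main_object:
        return False
    return all(ad.attributes.get(name) == value
               for name, value in query.attributes.items()
               if name != "adstype")


# A posting template (illustrative values) and a search template.
ad = Template("flat", {"adstype": "sale", "location": "Amman",
                       "price": 40000, "floor": 2})
query = Template("flat", {"adstype": "wanted", "location": "Amman"})

for fact in ad.facts():
    print(fact)
print(matches(query, ad))
```

Keeping the template as a plain mapping from attribute names to values means the same structure can serve both request types: a posting template can be stored as a record, while a search template lists only the attributes the buyer cares about.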
We also defined a set of specifiers for each sub-domain to give more detail or to place restrictions on attribute values. Specifically, they are used to normalize numerical attribute values and to capture the ranges used in attribute-based searching. As an example, in the expression

Price(car, 5000@less)

“@less” means that the price of the car should be less than 5000. In Figure 4, “@meter” means that the unit of measurement is the meter. Similarly, “@build” indicates the built area and “@space” indicates the surrounding area. When a numerical value is attached to “@thousand”, the number should be multiplied by 1000 for further processing.

“For sale a Villa "American Style" independent, never occupied, in Der Ghubar, built area is 560 m^2, surrounding area is 500 m^2, 5 bedrooms, the price is 170 thousand Dinar”

adstype(villa, sale)
feature(villa, american style)
feature(villa, new)
location(villa, Der Gubar(area