An Intelligent Method to Process Romanian Language Internet Reviews

Versavia-Maria Ancusa
Computer and Information Technology Department, Politehnica University of Timisoara, Piața Victoriei 2, Timișoara 300006, Romania
Phone: 0256 403 000
versavia.ancusa@cs.upt.ro

Olimpia Ban
Economics Department, Faculty of Economic Sciences, University of Oradea, Str. Universitatii 1, Oradea, Bihor, Romania
Phone: 0259 408 105
oban@uoradea.ro

Marian Cornea
Computer and Information Technology Department, Politehnica University of Timisoara, Piața Victoriei 2, Timișoara 300006, Romania
Phone: 0256 403 000
marian.cornea@student.upt.ro

Abstract
Internet reviews are a valuable mine of information; however, most research is oriented towards English-language reviews. Romanian-language reviews exhibit specific grammar rules, dialect challenges and polymorphism, which require customized processing methods. This paper offers a method for aggregating heterogeneous Romanian-language reviews into a homogeneous corpus, fit for further analysis.

Keywords: electronic Word of Mouth (eWOM), mesoscopic approach, natural language processing, complex networks, amfostacolo.ro

1. Introduction
Internet reviews can be seen as a modern communication form, adapted to today's highly interconnected digital world: distributed in a one-to-many relationship, encouraged by social conscience rather than by (direct) monetary gain, and therefore representing an electronic word of mouth (eWOM) paradigm (Yoo, Gretzel, & Zach, 2011). The main difference between this paradigm and traditional word of mouth lies in the impersonal communication format (Chung & Buhalis, 2008), with the person writing the review and the one reading it belonging to different social groups, having different status and education and, most importantly, never having met one another. In fact, some researchers (Ayeh, Au, & Law, 2013) consider that these differences make the message more convincing as a whole, while niche messages create a more personalized image of the product/service for their targets. This reflects the evolution of social networks (Barabasi, 2012), in which clusters naturally emerge, leading individual consumers to trust reviews from persons whose characteristics resemble their own, even though they do not actually know them (Litvin, Goldsmith, & Pan, 2008), (Park & Allen, 2013). Even more surprising is the way people evaluate the credibility of eWOM: not through a rigorous, conscious decision process, but mostly based on peripheral cues (Metzger, Flanagin, & Zwarun, 2003) and subsequent clustering of personal details (Park & Allen, 2013). Reliance on this globalized feedback method has reached the point where online reviews are considered more credible than traditional information sources (Dickinger, 2011).

The sheer magnitude of eWOM is reflected, from a Computer Science perspective, in the presence of massive user-generated data that can be aggregated and analysed using data mining techniques. eWOM can be analysed at an individual (influencer) or market level (Cheung & Thadani, 2010) or from an input-process-output perspective (Chan & Ngai, 2011), but in any case eWOM is represented as big data and is therefore subject to all the stages and challenges specific to big data analytics.
According to (Fisher, DeLine, Czerwinski, & Drucker, 2012), big data analytics has the following stages: the data acquisition phase, the modelling phase (which can be further split into choosing an architecture and moulding the data onto that architecture), the coding/debugging phase and, finally, the reflection phase. In itself, data analytics, whether big or small, is an "inherently exploratory" (Fisher, DeLine, Czerwinski, & Drucker, 2012) pursuit, which leads to rapid successions of analytics runs; these, in turn, can introduce errors that propagate swiftly, compromising the entire process. However, most errors in working with big data are introduced during the data sampling and data cleaning stages (Boyd & Crawford, 2011), which appear at the beginning of the analysis process. Practical surveys indicate that data cleaning tends to happen after a first iteration of the analytics process, when "a model looked odd" (Fisher, DeLine, Czerwinski, & Drucker, 2012).

While data sampling and cleaning have strong roots in Statistics (Kachigan, 1986), new challenges arise due to the vastly heterogeneous nature of the data. Classical statistical algorithms are not fit for the current context, in which data keeps evolving and taking new forms, with each form partially complementing the others and enriching the larger picture. Novel methods and methodologies need to be created in order to operate in such shifting circumstances. The analytics process can be applied either dynamically, running concurrently with (yet slightly behind) the data collection process, in which case the data can be expanded and refined on the fly, or statically, strictly after data collection, in which case no further fine-tuning is possible. Each analytics type (static/dynamic) raises different data cleaning issues: while the static option allows more complex, time-consuming, refined algorithms to be applied, the lack of further confirmation from the data sources regarding the correctness of the cleaning can lead to unnecessary data censoring. On the other hand, dynamic cleaning has obvious time constraints, but can be more easily improved over time. Data cleaning can be performed through a direct manipulation interface or through a script (Fisher, DeLine, Czerwinski, & Drucker, 2012). While either solution will produce results, the traceability offered by a script makes it the preferred option in practice.
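As a minimal illustration of this traceability argument, the following sketch (not the authors' implementation; the file names, column name and the single collapsing rule are assumptions) logs every transformation applied to the corpus, so that each change can later be audited or replayed.

```python
import csv
import logging
import re

# Every replacement is appended to a log file, which acts as the audit trail
# that makes script-based cleaning traceable.
logging.basicConfig(filename="cleaning_log.txt", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def clean_token(token: str) -> str:
    """One illustrative rule: collapse runs of three or more identical letters
    (e.g. 'buuuuun' -> 'bun')."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

def clean_review(text: str) -> str:
    cleaned = []
    for tok in text.split():
        new_tok = clean_token(tok)
        if new_tok != tok:
            logging.info("replaced %r with %r", tok, new_tok)
        cleaned.append(new_tok)
    return " ".join(cleaned)

if __name__ == "__main__":
    # 'reviews.csv' with a 'review' column stands in for the real review database.
    with open("reviews.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(clean_review(row["review"]))
```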
In most cases, the results of the analytics process must be conveyed to an audience that has little to no expertise in analytics and/or statistics (Fisher, DeLine, Czerwinski, & Drucker, 2012), so exchanging ideas in a simple, common and clear manner is essential. An easy way to do this is to represent the results through pictures, as visual representations tend to be the most common way of internal information processing (Goodale, 2014), (Healey & Enns, 1999). However, this raises another problem, namely the limitation of information visualization to a few million data points per screen (Shneiderman, 2008). On the other hand, data in itself is not enough, and not every part of it is equally important; interactions are what bring data to life, which indicates a new opportunity for visualization through complex networks (Barabasi, 2012).

The purpose of this paper is to present a hybrid methodology for building an eWOM data cleaning algorithm and to apply it to the Romanian language. The algorithm will be tested on the review database from amfostacolo.ro, with the stated purpose of building a complex network text representation of that database.

2. Text analysis
Many attempts have been made to model various human languages, leading to the emergence of three perspectives: the microscopic "collection of utterances" view, the macroscopic "set of grammar rules and a vocabulary" view and the mesoscopic hybrid "basic units and emergent interactions" view (Choudhury & Mukherjee, 2008). Complex networks, due to their inherent nature of portraying interactions, are especially fit to represent the mesoscopic view (Mihalcea & Radev, 2011). Referring strictly to linguistic networks, the two main uses for this particular representation are: (1) the discovery of languages' inherent properties and (2) any type of knowledge manipulation (machine translation, information retrieval, summarization systems, natural language patterns, etc.). While the networks presented in Table 1 are used to depict language in order to ascertain its properties, in different contexts and on different corpora, they can also constitute the basis of exploratory systems, such as machine translation, information retrieval and summarization systems (Choudhury & Mukherjee, 2008).

Table 1. Natural language processing networks

Network type | Graph type | Nodes | Edges present based on | Observed properties
Lexical network | Undirected | Words | Phonetic and semantic similarity | Exponential degree distribution, high clustering coefficient
Collocation network | Undirected | Words | Co-occurrence in similar contexts | Power-law degree distribution, presence of a core lexicon
Syntactic dependency network | Directed | Words / parts of speech | Grammatical (logical) relation | Disassortative mixing, hierarchical organization, small-world structure
Phonological network | Undirected | Sub-lexical units (e.g. phonemes, syllables) | Co-occurrence in similar contexts | Power law with an exponential cut-off towards the tail of the distribution, high clustering coefficient, strong patterns present

In natural language processing, it can be argued that context is crucial in determining the underlying value of a word (Turney & Pantel, 2010). The distributional hypothesis states that context is what defines the semantics of a word; therefore, similar contexts containing different words will give associative qualities to those differences. Context also includes the language in which the texts are written and need to be analysed. While most of the scientific literature focuses on the English language, there are some attempts to customize such work for other languages like Chinese, German, Spanish, Turkish, Arabic, Japanese, Polish and even Romanian (Li, Ye, Zhang, & Wang, 2011), (Feraru, Teodorescu, & Zbancioc, 2010). However, most of these works focus on an academic corpus, which provides limited relevance for dealing with eWOM, since the patterns of formal written language and of casually written/spoken language vary greatly. The research that deals with eWOM is not customized for Romanian, therefore presenting the opportunity to craft such an algorithm.
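Before moving to the experimental setup, a small illustration of the collocation (co-occurrence) network type from Table 1 may help: the sketch below builds such a network from a few tokenized sentences using networkx. The sample sentences and the window size are illustrative assumptions, not the paper's corpus or parameters.

```python
import networkx as nx

# Toy corpus; in practice the tokens would come from the cleaned review text.
sentences = [
    "camera a fost curata si mare",
    "camera a fost mica dar curata",
]

WINDOW = 2  # words within this distance of each other are treated as co-occurring
G = nx.Graph()

for sentence in sentences:
    words = sentence.split()
    for i, u in enumerate(words):
        for v in words[i + 1:i + 1 + WINDOW]:
            if u == v:
                continue
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1  # count repeated co-occurrences
            else:
                G.add_edge(u, v, weight=1)

# Degree distribution and clustering are among the properties listed in Table 1.
print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("average clustering:", nx.average_clustering(G))
```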
3. Experimental algorithm development
In order to create the algorithm, we decided on a mixed static-dynamic approach, based on a three-phase process. The process starts with collection and data aggregation, continues with the second stage, unification or basic processing, and ends with a last stage focused on the analysis (Figure 1). The analysis part is presented in a dedicated paper (Ban, Ancusa, Bogdan, & Tara, 2015), since it involves industry-specific aspects and further handling specific to the data's origin domain. This paper focuses on the data aggregation and cleaning part.

Figure 1. The eWOM aggregation, processing and analysis

The corpus used was an online review database consisting of 15,200 reviews written by 8,912 different authors, with a Gaussian age distribution (Figure 2), so as to represent a significant viewpoint on the modern Romanian language. The database structure included columns for the unique identification of the author, trip geographical details (place, quality, etc.), a written review describing the experience, and a trip satisfaction numerical score. The written quality of the reviews varied greatly, from very correct to very colloquial, almost to the point of losing readability. A first aggregation decision was to include even very shoddy and philistine expressions as part of the corpus, together with the cultured expression of the language, because of two factors: (1) the evolutionary aspect of the language manifesting itself and (2) further sentiment analysis must include all forms and nuances, because they are relevant in that context.

Figure 2. Age distribution of the reviews (number of unique reviews per age interval, in years)

The next step (partially based on (Fisher, DeLine, Czerwinski, & Drucker, 2012)) consisted of assigning meaning to the data values. Upon further analysis of the database we discovered five data categories:
1 – dictionary form words (e.g.: "am mers", "minunat", "mieunat")
2 – words with spelling mistakes (e.g.: "am mrs", "mniunat", "mienuat")
3 – common alliterations, new words, colloquial forms (e.g.: "merem", "miunat", "mai", "pt")
4 – connector words, exclamations (e.g.: "wow", "oooooaaaa", "cu")
5 – numerical values, punctuation (e.g.: "10", "!!!!!", "?!")

In order to achieve corpus consistency, each word was tested against two databases. The first database used was the Romanian language dictionary: if a match was found, the word was reduced to its basic declination (e.g.: "minunatul" → "minunat", "mergem" → "merge"). While this measure clearly affects the legibility of the written text, it is unimportant for a linguistic network, in which patterns and centrality matter more than logical text contiguity. Next, we had to build a special database for the non-dictionary words. This database was built dynamically, as our analysis progressed. Each non-dictionary word was added to the database together with a processing directive.
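A minimal sketch of this first, dictionary-based pass is given below, under the assumption that the Romanian dictionary is available as a mapping from inflected forms to base forms (the small inline dictionary is only a stand-in): dictionary words are reduced to their base form, while unknown words are set aside for the dynamic correction database described in Table 2.

```python
from typing import Optional

# Stand-in for the full Romanian language dictionary used in the paper,
# mapping inflected forms to their basic declination.
RO_DICTIONARY = {
    "minunat": "minunat",
    "minunatul": "minunat",
    "merge": "merge",
    "mergem": "merge",
    "am": "am",
    "mers": "mers",
}

def reduce_to_base_form(word: str) -> Optional[str]:
    """Return the dictionary base form, or None for non-dictionary words."""
    return RO_DICTIONARY.get(word.lower())

def first_pass(tokens: list[str]) -> tuple[list[str], list[str]]:
    """Split tokens into normalized dictionary words and unknown words;
    the unknown ones are later handled by the dynamic correction database."""
    normalized, unknown = [], []
    for tok in tokens:
        base = reduce_to_base_form(tok)
        if base is not None:
            normalized.append(base)
        else:
            unknown.append(tok)
    return normalized, unknown

print(first_pass(["minunatul", "mergem", "mniunat"]))
# (['minunat', 'merge'], ['mniunat'])
```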
The categories developed are presented in Table 2.

Table 2. Categories developed in the dynamic correction database

Category | Description | Action | Example
1 | Clear spelling mistakes, in which the word is easily recognizable | Replace the misspelled word with its dictionary form, for all further occurrences; prompt only once for a solution | "măncărică" → "mâncare"
2 | Spelling mistakes in which the word is not easily recognizable | Prompt the supervisor for a replacement option, showing the context * | "miunat" → "mieunat" or "minunat"
3 | Common alliterations | Replace the word with its dictionary form, for all further occurrences; prompt only once for a solution | "pt" → "pentru"
4 | Colloquial occurrences | Prompt the user for action: replace with the user-provided solution or delete, for all further occurrences | "măi" → delete; "buuuuun" → "bun"
5 | Foreign language imports | Prompt the user for action: replace with the user-provided solution, delete once, or delete for all further occurrences | "omg" → delete for all; "pls" → delete once; "pls" → "rog"
6 | Web-specific elements | Delete for all further occurrences * | "http://photos.adress" → delete for all

* Note: (1) This action could be further developed into a neural network with a training set consisting of this database. (2) To date, this action allows the use of wildcards such as * and ?.

Immediately after this phase we proceeded to remove inconsequential items: connector words (prepositions, auxiliary verbs), numerical values, exclamations and punctuation signs. Some of these items might already have been removed at the previous step, but, to make sure, we added all of them to the dynamic database, starting from their dictionary form, with the associated action "delete for all". To summarize, the meaning of the data values is reflected in four main categories, similar to the model presented in (Fisher, DeLine, Czerwinski, & Drucker, 2012), except for the missing values case:
1. Already "clean" data: dictionary form words
2. Corrupted data: common spelling mistakes
3. Evolved data: common alliterations, foreign language imports, colloquial occurrences
4. Ignored data: numbers, connectors, exclamations, punctuation, hyperlinks

By ignoring punctuation we took a calculated risk, since punctuation can change the meaning of an expression. The reason behind our decision was that, in the analytics part of the process, associations are more important than connotations, and one negative association due to sarcasm or irony could not outweigh the regular ones.

It is worth mentioning that the whole process underwent a second and a third iteration, after some unexpected results emerged during the analytics phase. The common thread behind these occurrences was the presence of unexpected patterns that passed the previously described filters. For example, the occurrence "om mers" was interpreted as clean data, since each word on its own represents a correct dictionary entry. However, in this case, the regional aspect of the language intervened, by substituting "am" with "om". While the previous iterations were based on dictionary and grammar rules, as described in Table 2, which automate the cleaning process, this regional aspect required special handling. Therefore, a new set of rules was manually created to handle such polymorphism occurrences.
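A minimal sketch of how the per-category actions from Table 2 might be applied to the token stream is given below; the rule entries, the action names and the plain input() prompt are illustrative assumptions rather than the authors' implementation, and only the "replace for all" and "delete for all" actions are shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    action: str                      # "replace_all" or "delete_all"
    replacement: Optional[str] = None

# Dynamic correction database, grown as the analysis progresses (cf. Table 2).
correction_db: dict[str, Rule] = {
    "pt": Rule("replace_all", "pentru"),   # category 3: common alliteration
    "măi": Rule("delete_all"),             # category 4: colloquial occurrence
    "omg": Rule("delete_all"),             # category 5: foreign language import
}

def apply_rules(tokens: list[str], dictionary: set[str]) -> list[str]:
    cleaned = []
    for tok in tokens:
        if tok in dictionary:              # already "clean" data
            cleaned.append(tok)
            continue
        rule = correction_db.get(tok)
        if rule is None:
            # Unknown non-dictionary word: prompt the supervisor once and
            # remember the decision for all further occurrences.
            answer = input(f"Replacement for '{tok}' (empty = delete): ").strip()
            rule = Rule("replace_all", answer) if answer else Rule("delete_all")
            correction_db[tok] = rule
        if rule.action == "replace_all" and rule.replacement:
            cleaned.append(rule.replacement)
        # "delete_all": the token is simply dropped
    return cleaned

print(apply_rules(["mers", "pt", "omg", "minunat"], {"mers", "minunat", "pentru"}))
# ['mers', 'pentru', 'minunat']
```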
During the third iteration, similar word forms representing different notions were tackled. Such is the case of the noun "zi" and the verb "zice", where a particular form of the verb coincides with the noun. In order to solve this problem, we implemented solutions on a case-by-case basis. In this particular case, the noun was replaced with its articulated form, "ziua", with user prompting required to vet each replacement. This solution needs further automation, as it is otherwise very time-consuming and error-prone. The rules from Table 2 are obviously incomplete, due to the nature of the Romanian eWOM, and every iteration required human intervention in the creation of new rules for solving specific problems. By no means do we consider the final result fully correct; however, it is sufficient for a mesoscopic complex network analysis, which tolerates errors as long as their number remains low.

After data cleaning, but before analysis, the final step is to create the co-occurrence complex network. The original network, without any processing, had 15,397,933 nodes, or singular occurrences. Through the data cleaning process, 4,011,135 deletions were made. After substitution and dictionary form reduction, this left only 82,422 unique nodes, a definite improvement over the original size (5.33%).

Figure 3. Data as a network – before (left) and after (right) data cleaning

Figure 3 depicts a part of the network before and after data cleaning, selected for legibility. The reduction is quite significant, and even more pronounced when considering graph density: initially, each node had on average approximately 95 edges, rendering visual analysis by a human researcher (almost) useless, while the final network is much "cleaner" and easier to work with.

4. Conclusion and future work
This paper presented a method used to prepare big data Romanian eWOM for analysis, taking into account language-specific aspects. While the method was further used to gather insights, it is not without drawbacks: it relies heavily on human input and decisions, and it needs to be constantly updated to keep pace with language evolution. Further research and work will focus on automating these stages, using machine learning algorithms that detect unexpected patterns and determine new rules for them, simplifying the researchers' work.

Acknowledgment: We would like to express our gratitude to Mr Cornel Bociort, the site developer of amfostacolo.ro, for his support of this research.

References
Ayeh, J., Au, N., & Law, R. (2013). Do We Believe in TripAdvisor? Examining Credibility Perceptions and Online Travelers' Attitude toward Using User-Generated Content. Journal of Travel Research, 52(4), 437-452.
Ban, O., Ancusa, V., Bogdan, V., & Tara, I. G. (2015). Empirical Social Research to Identify Clusters of Characteristics that Underlie the Online Evaluation of Accommodation Services. Revista de Cercetare si Interventie Sociala, 50, 293-308.
Barabasi, A. L. (2012). The network takeover. Nature Physics, 8, 14-16.
Boyd, D. & Crawford, K. (2011). Six Provocations for Big Data. A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society.
Chan, Y. Y. & Ngai, E. (2011). Conceptualising electronic word of mouth activity: An input-process-output perspective. Marketing Intelligence & Planning, 29(5), 488-516. Retrieved from http://dx.doi.org/10.1108/02634501111153692
Cheung, C. M. & Thadani, D. R. (2010). The Effectiveness of Electronic Word-of-Mouth Communication: A Literature Analysis. 23rd Bled eConference eTrust: Implications for the Individual, Enterprises and Society. Bled, Slovenia.
Choudhury, M. & Mukherjee, A. (2008). The Structure and Dynamics of Linguistic Networks. In Dynamics on and of Complex Networks: Applications to Biology, Computer Science, Economics and the Social Sciences (pp. 145-166). Boston, USA: Springer.
Chung, J. Y. & Buhalis, D. (2008). Web 2.0: A Study of Online Travel Community. Information and Communication Technologies in Tourism (pp. 70-81). Springer.
Dickinger, A. (2011). The Trustworthiness of Online Channels for Experience and Goal-Directed Search Tasks. Journal of Travel Research, 50(4), 378-391.
Feraru, S. M., Teodorescu, H. N., & Zbancioc, M. D. (2010). SRoL - Web-based Resources for Languages and Language Technology e-Learning. International Journal of Computers, Communications & Control, V(3), 301-313.
Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012, May-June). Interactions with Big Data Analytics. Interactions, 50-59.
Goodale, M. A. (2014). How (and why) the visual control of action differs from visual perception. Proceedings of the Royal Society B: Biological Sciences, 281(1785). doi:10.1098/rspb.2014.0337
Healey, C. G. & Enns, J. T. (1999). Large Datasets at a Glance: Combining Textures and Colors in Scientific Visualization. IEEE Transactions on Visualization and Computer Graphics, 5(2).
Kachigan, S. K. (1986). Statistical Analysis: An Interdisciplinary Introduction to Univariate & Multivariate Methods. New York: Radius Press.
Li, Y., Ye, Q., Zhang, Z., & Wang, T. (2011). Snippet-Based Unsupervised Approach for Sentiment Classification of Chinese Online Reviews. International Journal of Information Technology & Decision Making, 10, 1097-1110.
Litvin, S., Goldsmith, R., & Pan, B. (2008). Electronic word-of-mouth in hospitality and tourism management. Tourism Management, 29, 458-468.
Metzger, M. J., Flanagin, A. J., & Zwarun, L. (2003). College student web use, perceptions of information credibility, and verification behavior. Computers & Education, 41, 271-290.
Mihalcea, R. & Radev, D. (2011). Graph-based Natural Language Processing and Information Retrieval. Cambridge: Cambridge University Press.
Park, S. & Allen, J. (2013). Responding to Online Reviews: Problem Solving and Engagement in Hotels. Cornell Hospitality Quarterly, 54(1), 64-73.
Shneiderman, B. (2008). Extreme visualization: Squeezing a billion datapoints into a million pixels. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 3-12). New York.
Turney, P. D. & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141-188.
Yoo, K., Gretzel, U., & Zach, F. (2011). Travel Opinion Leaders and Seekers. Information and Communication Technologies in Tourism: Proceedings of the International Conference (pp. 525-535). New York: Springer.

Versavia-Maria ANCUSA (b. March 11, 1981) received her BSc in Computer Science (2004), MSc in Advanced Computer Systems (2005), and PhD in Computer Science (2009) from the "Politehnica" University of Timisoara. She is now a Senior Lecturer in the Department of Computers and Information Technology, Faculty of Automation and Computers, "Politehnica" University of Timisoara. Her research crosses several domains, including Computer Science, Network Science, Medicine, Marketing and Linguistics.
Olimpia BAN (b. February 23, 1978) received her Bachelor's degree in Finance from the University of Economic Sciences Oradea (1997) and her PhD in Economics (2005) from the West University of Timisoara. She is now a Professor at the University of Oradea, Faculty of Economic Sciences, Department of Economics. Her main research interests are Tourism and Marketing, especially in a national context, with educational and practical applications.

Marian-Gelu CORNEA (b. May 11, 1991) received his BSc in Computer Science (2013) from the "Politehnica" University of Timisoara. He is now pursuing his MSc in Advanced Computer Systems at the same university, while working for Autoliv in program development. His main research interests are Python Programming and Agile Development.