Kurdistan Journal of Applied Research (KJAR) | Print-ISSN: 2411-7684 – Electronic-ISSN: 2411-7706 | kjar.spu.edu.iq
Volume 2 | Issue 3 | August 2017 | DOI: 10.24017/science.2017.3.22

Reviews Sentiment Analysis for Collaborative Recommender System

Alia Karim Abdul Hassan
Computer Science Dept., University of Technology, Baghdad, Iraq
110018@uotechnology.edu.iq

Ahmed Bahaa aldeen abdulwahhab
Informatics Dept., Middle Technical University, Baghdad, Iraq
ahmed80.ab@gmail.com

Abstract: Recommender systems are now widely used to deliver services and information to users. A recommender system suffers from data sparsity and cold start problems because of insufficient user ratings or the absence of data about users or items. This research proposes a sentiment analysis system that works on user reviews as an additional source of information to tackle the data sparsity problem. The sentiment analysis system is implemented using NLP techniques with machine learning to predict a user's rating from his review. The model is evaluated on the Yelp restaurant data set, the IMDB reviews data set, and the Arabic qaym.com restaurant reviews data set under various classification models; the system was efficient in predicting ratings from reviews.

Keywords: recommender systems, sentiment analysis, opinion mining, natural language processing, text classification.

1. INTRODUCTION

Recommender systems collect information on users' preferences for particular items and try to predict user needs in the field of products and information services [1]. Recommender systems are constructed using two paradigms: content based recommender systems and collaborative filtering recommender systems. Content based recommender systems depend on item features: they recommend items whose features are similar to those of items the user chose before. Collaborative filtering (CF) recommender systems depend on the similarity between users. Collaborative filtering considers two users similar if they chose the same items before, and recommends to each of them the items chosen by the other [2].

A recommender system therefore needs data about users or items to work well, but most recommender systems suffer from two problems. The first is the cold start problem (new user cold start and new item cold start), which is caused by a lack of information about a new user or a new item added to the recommender system database. The second is data sparsity, which is caused by insufficient user ratings during the purchasing process in the recommender system. This has motivated many works that tackle these two problems [3] [4].

Beyond the traditional methods of building a recommender system, there is a growing paradigm that exploits unstructured text reviews as an added source of information about products and services [3]. User reviews and opinions can be useful in a recommender system, especially in the area of services such as restaurants, movies, hospitals, and doctors, more so than for measurable products [5] [6]. The contribution of this paradigm is threefold. First, it decreases the effect of the data sparsity problem in collaborative filtering systems by using the user review as additional information about an item. Second, it provides a way to extend recommender systems to services where a numerical rating could be impossible or insufficient. Third, this research handles both the English and Arabic languages for sentiment analysis.

2. LITERATURE REVIEW

Sentiment analysis (SA), or opinion mining, is the study of emotions toward an item or entity [7]. Sentiment analysis can be treated as a classification problem: it determines whether a sentence expresses a positive or a negative opinion [7] [8]. The most important application of sentiment analysis is product reviews. These reviews are important for business holders, who can make decisions according to users' opinions, and also for users, because products can be recommended to them according to the opinions of other users [8].
PMI-IR (pointwise mutual information and information retrieval) was proposed by Turney, who computed the sentiment orientation (SO) of a word relative to two reference adjectives (poor and excellent) and measured the synonymy of the word according to these two terms. PMI-IR was used to build a synonymy lexicon [9]. Qumsiyeh and Ng proposed a recommender system that provides recommendations about multimedia items by using features extracted from users' reviews of a specific movie or picture [10]. Kim proposed a movie search engine that depends on previous user ratings and reviews. When a user types a query, the query is expanded with terms selected from earlier reviews, but this approach leads to over-fitting and decreases diversity in the recommendation results [11]. Content based recommender systems also exploit user reviews to build review based product profiles [5]. Ganu et al. proposed a restaurant recommender system that clusters users based on their reviews and the items they liked [12].

In recent years, after the social network revolution, directions appeared to apply sentiment analysis to social network text posts. Molla performed sentiment analysis on posts about Samsung products collected from different accounts of the Samsung company [6]. Dmah H. and Xiao Zheng developed this direction and deployed sentiment analysis on Twitter posts of the trusted friends related to a particular person's account [13].

This work proposes a technique to alleviate the problem of data sparsity (poor user ratings) by training a naïve Bayes classifier to determine user preferences (like/dislike) from the review, depending on review features and the WordNet corpus.

3. Sentiment classification techniques

Sentiment classification methods can be classified into:-

3.1 machine learning techniques

This paradigm for sentiment analysis uses well-known machine learning algorithms to classify text according to linguistic features [7] [14].

3.1.1 Probabilistic classifiers

Probabilistic classifiers are based on hypothesizing a generative model, which comes from a product distribution over the original feature space. Two important probabilistic classifiers are:-

• Naïve Bayes classifier: - this classifier is founded on Bayes' theorem. The model computes the posterior probability of a class based on the distribution of words in a document. It needs bag of words feature extraction and ignores the position of a word in the document [7] [14].
• Maximum entropy classifier: - also known as a conditional exponential classifier. It encodes a labeled feature set as a vector, then calculates a weight for each feature; the weights are combined to determine the most likely class for the feature set.

3.1.1.2 Linear classifier

A linear classifier trains a model that separates objects into their classes with a line. Assume $\bar{X} = \{x_1, x_2, \dots, x_n\}$ is a vector of word frequencies and $\bar{A} = \{a_1, a_2, \dots, a_n\}$ is a vector of linear coefficients, one per feature. The linear classifier then predicts $P = \bar{A} \cdot \bar{X} + b$, where $b$ is a scalar and $P$ is the separating hyperplane between the classes. Linear classifiers are divided into:-

• Support vector machine: - a support vector machine (SVM) is a supervised classifier used for classification and regression analysis. The idea behind the SVM is to determine linear separators in the search space that best separate the different classes. SVM works well with text data because of the sparse nature of text: few features are irrelevant, but they tend to be correlated with one another and can be organized into linearly separable categories [7] [14] [15].
• Neural networks: - this paradigm is based on the neuron. A neural network (NN) consists of many neurons, and multilayer neural networks can be used for nonlinear boundaries.
The input to the neuron is X, a vector of word frequencies in the document, together with a set of weights paired with each neuron that is used to calculate the class prediction [14].

3.1.1.3 Decision tree classifier: - a decision tree divides the training data hierarchically: the data is partitioned based on a condition on a feature value. Here the division can be made by the presence or absence of a word in order to classify a document [14] [15].

3.1.1.4 Rule based classifier: - it depends on modeling the data with a set of rules. If a word in the text holds positive emotion, the text is considered positive, while if the word holds negative emotion, it is classified as negative [14].

3.1.2 Unsupervised learning

This machine learning approach has no target to learn from; it learns by observation. The most used approach in sentiment analysis and recommender systems is clustering, which operates by gathering similar items into one cluster based on Euclidean distance or other distance measurements. K-means, hierarchical clustering, Gaussian mixture models, and hidden Markov models are known clustering methods [14].

3.2 Lexicon based approach

This method uses opinion words: positive words specify a desired state, and negative words define an unacceptable state [7]. The lexicon based method is divided into two categories:-

• Dictionary based approach: - this mechanism starts from a collected set of opinion words with known orientation. The set can then be expanded by searching the WordNet corpus for synonyms and antonyms.
• Corpus based approach: - this method tries to measure sentiment orientation by finding co-occurrence patterns of words to determine their sentiment, and to obtain other opinion words that appear in a similar context [7] [14] [15].

4. The Proposed system

The proposed system consists of three components.
The first one is a web scraper, which is used to scrape users' reviews from web sites and social networks. The second component is responsible for analyzing user reviews and identifying positive and negative sentiments in the collected review data set. The third one is a collaborative filter that provides recommendations. This paper focuses on the second part, the sentiment analyzer for reviews; see the figure below:-

Figure 1. Overview of proposed system

The sentiment analysis component begins by scraping reviews from the web and preprocessing the review data set for the sentiment analysis operation. Preprocessing aims to strip from the review text the parts that decrease the efficiency of machine learning for sentiment analysis, such as stop words, to find the stem (root) of each word in each review text, and to save the resulting refined text in a corpus list. The preprocessing algorithm below illustrates this procedure:-

The resulting corpus of reviews is ready for the next step, which is building the bag of words and dividing the bag of words data set into a training set and a testing set. After that, a naïve Bayes model, or any other machine learning model, is trained to predict positive and negative sentiment from reviews, as in the algorithm below:-

This research also handles sentiment analysis for the Arabic language, which requires special processing because it is an oriental language. A word in the Arabic language contains the word's characters and vowel diacritical marks that specify the pronunciation of the characters that form the word.
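To make the distinction between base characters and vowel diacritical marks concrete, the following is a minimal sketch that separates the two in a vowelled Arabic word. It assumes Python and its standard `unicodedata` module, neither of which is named in the paper; the function name `split_diacritics` and the sample word are hypothetical, and the sketch is an illustration rather than the authors' implementation.

```python
import unicodedata


def split_diacritics(word):
    """Split a word into its base characters and its diacritical marks.

    Assumption: every Unicode nonspacing mark (general category 'Mn')
    is treated as a diacritical mark. The Arabic vowel marks such as
    Fatha, Damma, Kasra, Sukun and Shadda fall in this category.
    """
    base, marks = [], []
    for ch in word:
        if unicodedata.category(ch) == "Mn":
            marks.append(ch)
        else:
            base.append(ch)
    return "".join(base), marks


# Hypothetical sample: the word for "restaurant" written with vowel marks,
# given as escapes so the combining marks stay visible in the source.
vowelled = "\u0645\u064E\u0637\u0652\u0639\u064E\u0645"

base, marks = split_diacritics(vowelled)
mark_names = [unicodedata.name(m) for m in marks]

print(len(vowelled), len(base))  # stored length with and without marks
print(mark_names)                # official Unicode names of the marks
```

In this sample the same word is stored as seven code points with its marks and four without them, so a bag of words built on raw text would treat the vowelled and unvowelled spellings as different tokens.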
The existence of vowel diacritical marks may cause underfitting in the machine learning system, so these marks have to be removed in the normalization step of preprocessing, as shown in Algorithm 3. It begins by reading the review and unifying the characters that carry marks: the alef variants أ, إ, آ and ٱ become ا, ؤ becomes و, ى becomes ي, and ئ becomes ى. It then removes the noise (vowel diacritical marks): Tashdid (U+0651), Fatha (U+064E), Damma (U+064F), Tanwin Damm (U+064C), Kasra (U+0650), Tanwin Kasr (U+064D), and Sukun (U+0652). Preprocessing is then completed as for English, by removing stop words and stemming all words in the normalized review.

Algorithm 3:- Arabic reviews preprocessing
Input: n Arabic reviews data set
Output: corpus of normalized reviews
i = 0, n = number of reviews in dataset
1. While (i