INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL ISSN 1841-9836, 11(3):414-427, June 2016.

Efficient Opinion Summarization on Comments with Online-LDA

J. Ma, S. Luo, J. Yao, S. Cheng, X. Chen

Jun Ma, Senlin Luo
School of Information and Electronics
Beijing Institute of Technology
Beijing, China
{junma,luosenlin}@bit.edu.cn

Jianguo Yao*, Shuxin Cheng
School of Software
Shanghai Jiao Tong University
800 Dongchuan Road, Minhang, Shanghai 200240, China
{jianguo.yao,reallytrue1262}@sjtu.edu.cn
*Corresponding author: jianguo.yao@sjtu.edu.cn

Xi Chen
School of Computer Science
McGill University
Montreal, QC, Canada
xi.chen7@mail.mcgill.ca

Abstract: Customer reviews and comments on web pages are important information in our daily life. For example, we prefer to choose a hotel with positive comments from previous customers. As the huge amount of such information demonstrates the characteristics of big data, it places a heavy burden on the assimilation of customer-contributed opinions. To overcome this problem, we study an efficient opinion summarization approach for massive sets of user reviews and comments associated with an online resource, summarizing the opinions into two categories, i.e., positive and negative. In this paper, we propose a framework that: (1) overcomes the big data problem of online comments using the efficient online-LDA approach; (2) selects meaningful topics from the imbalanced data; (3) summarizes the opinion of comments with high precision and recall. This framework differs from much of the previous work, in which topics are pre-defined; here, topics are extracted and then selected for better opinion summarization. To evaluate the proposed framework, we perform experiments on a dataset of hotel reviews, chosen for the variety of topics it contains. The results show that our framework can achieve a significant performance improvement on opinion summarization.
Keywords: Opinion summarization, Latent Dirichlet Allocation (LDA), online-LDA, imbalanced data, big data.

Copyright © 2006-2016 by CCC Publications

1 Introduction

The rapid development of Web 2.0 applications has flooded the web with tremendous and diverse information. This information carries a wide variety of meanings that can hardly be grasped without summarization. Even worse, the data containing this information shows the characteristics of big data and challenges the efficiency of data processing. With more and more user-contributed reviews and comments on the Web, the corresponding websites become popular resources that reflect the attitudes and interests of users in a way that departs from advertisements and from the content of the underlying information resource itself.

Many techniques have been developed to extract concise information from these contents, such as sentiment classification, text summarization and topic modeling [3] [4] [7]. Nevertheless, the comments on the web are updated unceasingly, so it is hard to perform online opinion summarization with these techniques. Even though these comments are meant to be useful, the vast number of opinions summarized is still not easily digested and exploited by users. When we want to compare electronic products such as cell phones and laptops, common attributes of the products under consideration include ease of use, battery life, sound quality, add-ons, etc. Actually, on most eCommerce websites, these attributes are pre-defined topics/features and mainly describe hardware performance. Take laptops, for example: because of the system's original configuration, user experiences can be completely different even with the same hardware. And after-sales service is also a major concern of users, which can only be reflected in the comments.
Thus the pre-defined topics do not demonstrate much diversity across different products. Users' comments are a valuable information resource that needs to be summarized. On tripadvisor1, in order to make comparing hotels easy, a scalar rating mechanism is built into the website for users. But scalar ratings, e.g. scores between 1 and 5, are not very helpful for hotel managers or tourists because the numeric value does not convey the subjective opinions that come from customer experiences. Also, these scalar ratings are not comparable: for example, when a 3-star hotel receives a high score from 10 tourists while a 4-star hotel receives a medium score from 1 tourist, that does not imply that the former is better than the latter. In this situation, how to obtain valuable information from users' comments becomes more important. Furthermore, personal experiences of each hotel cannot be totally the same.

Consider the two typical hotel comments shown in Figure 1. These two comments discuss several different topics of the hotel, such as price, room, food, etc. Some topics, such as room and breakfast, appear in both comments. Apparently, hotel comments show more diverse topics than electronics product comments. It is impossible to list all the topics tourists may share, so extracting meaningful topics from the comments is not an easy task.

Hotel comments also show a very interesting phenomenon of imbalance. A hotel with many comments is popular, and the tourists posting those comments are more likely to share their good experiences with others, so positive comments far outnumber negative comments. The situation of a less popular hotel is quite different: fewer comments will be posted if the tourists had a bad experience. This imbalanced data is a major problem for summarization in the form of binary classification. In our framework, we use an online topic-extraction scheme to cope with the big data problem.
Online inference is employed to analyze the huge number of comments arriving in stream form. A key advantage is that online LDA can process massive collections without heavy computational cost or large memory requirements. Due to the imbalance of the hotel comments, meaningful topic selection is another challenging problem for opinion summarization. In our framework, topic selection is carried out with multiple considerations. Comparing three ROC-based topic selection methods, FAST [19] is the best at handling the extracted topics under imbalance, and it requires relatively little computation. Furthermore, better opinion summarization is obtained, with redundant topics filtered out and accurate classification. In our evaluation, we observe that our framework avoids several problems faced by supervised classification approaches.

The aim of the present work is to study the manner in which hotel comments can be summarized into positive and negative opinions with meaningful selected topics, so that the obtained summary can be used in real life. Our main contributions are summarized as follows:

1www.tripadvisor.com

Location is everything for this place. The view from the balcony made the room. Room was a standard hotel/motel room. It was clean and nice and quiet. My kids enjoyed the clubhouse which had ping pong, air hockey, and other fun stuff. The grounds were nice and we walked to the river. Places to sit so you can enjoy the view and the sound of the river. Very pleasant.

Price was right, room was very nice, free breakfast outstanding! Not just a typical continental breakfast, mind you, delicious HOT food. We even had a view of the Snoopy Rock from our balcony. We were able to walk to all our destinations. I would imagine it's the slow time of the year in February, but that worked out well for us. Sedona is now one of my favorite places to visit.
Figure 1: Different Topics on Hotel Comments

• We present a framework of comments summarization in which online variational methods are used to handle huge amounts of comments from the web, coping with big data.

• We address the problem of data imbalance in hotel comments. Different from existing works on pre-defined topics, topic selection is performed with consideration of the many positive and few negative comments.

• The ratable topics can serve as a form of summary, and opinion summarization is performed on these topics for easy digestion and exploitation. The experiments are conducted on comments crawled from tripadvisor. Several metrics are used for the evaluation, and experimental results show that our proposed framework summarizes the comments well.

The rest of the paper is organized as follows. Section II surveys existing studies on comments summarization with topic models. In Section III, we propose a framework for opinion summarization and discuss the topics/features involved in this task and the challenges it implies, in comparison to other LDA-based text summarization. In Section IV, we propose a different approach to analyze data imbalance. The evaluation results using several metrics are reported in Section V. In Section VI, we offer insights on the challenges of opinion summarization and point out clear directions in which further improvements can be made.

2 Related Work

We first review the research related to topic modeling. We then give a brief overview of opinion summarization using other techniques, and finally we discuss the difference between our framework and sentiment classification.

Topic modeling. Topic Sentiment Model (TSM) [17] is based on the pLSI model [7], which is used to extract the topics.
They divide the topics into three sub-topics, neutral, positive and negative; the generative process of a document first chooses a sub-topic, then chooses topics within it. Y. Lu et al. [13] used a two-step strategy to integrate opinions. The first step divides the opinion documents into expert opinions and ordinary opinions. They call it semi-supervised pLSA because the topics are found from expert opinions in the second step and then used as defined aspects to cluster ordinary opinions. Latent Dirichlet Allocation (LDA) [3] is another representative topic model, which provides a basis for textual-level summarization in an unsupervised way. The supervised latent Dirichlet allocation (sLDA) model [4] accommodates a response variable to make the LDA model work in a supervised setting, facilitating classification. The multi-grain LDA (MG-LDA) model [10] extends the LDA model to induce multi-grain topics. The main idea of the model is to find ratable aspects within texts on a given topic and use this rating information to identify more coherent aspects. The Labeled LDA [5] model is a supervised model with the ability to perform k-class classification. The joint sentiment/topic (JST) model [11] is a four-layer probabilistic model extending the three-layer hierarchical LDA, which can perform sentiment classification in a fully unsupervised way. Z. Ma et al. [21] proposed two topic models, MSTM and EXTM, to extract topics from documents and their comments respectively, then select representative comments from comment clusters. Opinion summarization with LDA-related models is multi-faceted and very involved, and these approaches can have scalability issues.

Comments summarization. Comments summarization involves two major steps: topic identification and classification. Generally, existing research classifies the comments according to their polarity, positive or negative [1] [2].
This kind of comments summarization gives a very general notion of what users feel about a product. The accuracy of the classification heavily depends on the identified topics and the distance measure. LDA models are one way of identifying topics; NLP-based techniques are another way to identify topics in text [13] [14] [15]. The work in [16] employed pointwise mutual information and cosine distance as distance measures to perform the binary classification and found that the latter leads to better accuracy. Our proposed framework differs from plain LDA in its use of online inference to handle big data, and we also address the imbalance problem of hotel comments. Our work aims at improving the accuracy and scalability of the opinion summarization model and inferring meaningful topics for better summarization of the comments.

3 A Better Way to Extract the Topics

In this section, we present the framework of our model. A brief analysis of hotel comments is given. Then we compare different approaches to topic extraction and highlight the advantage of the online-inference LDA model for comment topic extraction.

3.1 The Framework

Imagine booking a hotel on the web. We cannot review every comment on each hotel, and furthermore we cannot track sentiment changes over the long term. So how can we manage and digest the large amount of information other tourists provide? The LDA model is a well-known algorithm for discovering the main themes of large collections of unstructured documents. Our framework combines the LDA model with ROC-based topic selection.

Opinion summarization includes three steps. The first step is topic extraction; online inference for the LDA model is used to improve its scalability. The potential topics of the comments may not be evaluated properly because the topic number k of LDA is pre-defined by the user.
This means that not all the extracted topics are meaningful for opinion summarization, or good for classifying positive versus negative. So we perform topic selection (the extracted topics are the features for classification) in a second step. The third step is opinion summarization, i.e. binary classification. As we can see in Figure 2, the collected comments show the characteristic of imbalance (we detail the reason in the next section), which poses a severe problem for classification. ROC-based topic selection is used in our framework for better classification. The relevant algorithm is described in Section 3.

Figure 2: Framework of our model

3.2 Topic Extraction on Comments

There are several probabilistic models to extract topics, such as the unigram model, the mixture of unigrams model, the pLSI model and the LDA model. The fundamental idea of these models is that the comments analyzed are considered to have one or more pre-defined topics; the difference is that each model rests on different statistical assumptions. The probabilistic latent semantic indexing (pLSI) model introduces two probability layers to relax the constraints on the number of topics and the mixture weights of each topic. The probability of a comment is:

p(d, w_n) = p(d) Σ_z p(w_n|z) p(z|d).   (1)

But these topic mixtures are learned only for the training comments and cannot be used for previously unseen comments. Furthermore, pLSI is prone to overfitting in training, so it is not a well-defined generative model either. Latent Dirichlet Allocation (LDA) is an extension of pLSI which introduces a Dirichlet prior on topic proportions, denoted here as θ. The generative process includes two steps: first choose θ from the Dirichlet prior on topics, then choose each word from p(w_n|θ, β).
This process yields a continuous mixture distribution:

p(c_i|α, β) = ∫ p(θ|α) ( Π_{n=1}^{N} p(w_n|θ, β) ) dθ,   (2)

where p(θ|α) gives the mixture weights on topics. In this work, LDA is used to extract topics from the hotel comments collection. A unified topic model is trained on the integrated content by combining the multiple text fields within each comment. Given a comments collection C = {c_1, c_2, ..., c_N}, where N denotes the number of comments, each comment c_i is assigned a distribution over K topics learned from the collection, where K denotes the pre-defined topic number.

Hotel comments are updated frequently on their pages. A model using a supervised learning algorithm cannot generalize well to profiles of new comments. The LDA model can be used for new comments and can characterize them in an unsupervised way in terms of the estimated posterior distribution. Usually this posterior cannot be computed directly [3], and is mostly approximated using Markov Chain Monte Carlo (MCMC) methods or variational inference. A particular MCMC method, the Gibbs sampling algorithm [18], is widely used for LDA-based modeling of comment collections. The applicability of Gibbs sampling depends on the ease with which the sampling process creates separate variables for each piece of observed data and fixes the variables in question to their observed values, rather than sampling from those variables. Gibbs sampling generates a Markov chain of variables, each of which is correlated with nearby variables. Each step of the Gibbs sampling procedure replaces one of the variables with a value drawn from the distribution of that variable conditioned on the values of the remaining variables. Thus the algorithm converges slowly when handling high-dimensional data.

The variational method, as we know, is a deterministic alternative to sampling-based algorithms.
The only assumption made by the variational method is the factorization between hidden variables and visible variables. Thus, the inference problem is transformed into an optimization problem, as the equation shows:

L(w, φ, γ, λ) ≜ E_q[log p(w, z, θ, β|α, η)] − E_q[log q(z, θ, β)]   (3)

where φ and γ are the variational parameters of z and θ, and λ is the parameter of the topics β. Variational inference may converge faster than Gibbs sampling. However, it still requires a full pass through the entire collection at each iteration. It can therefore be time and memory consuming when applied to large, streaming comment collections. Hoffman et al. [8] proposed a much faster online algorithm for the variational inference of LDA. Here a fully factorized variational distribution is used, and the lower bound decomposes over documents as

L ≜ Σ_d ℓ(n_d, φ_d, γ_d, λ),   (4)

The online variational inference seeks the best setting of the topics λ. After estimating γ(n_d, λ) and φ(n_d, λ) on the comments seen so far, λ is set to maximize

L(n, λ) ≜ Σ_d ℓ(n_d, γ(n_d, λ), φ(n_d, λ), λ),   (5)

The convergence of the online inference has been analyzed and shown to be much faster than that of other variational methods.

The hotel comments on tripadvisor keep increasing dramatically, as we have seen. Scalability is an unavoidable challenge for processing the data set in real time. Online variational inference for LDA is much more useful in dealing with a high volume of data, and it can handily analyze massive collections of comments. Moreover, online LDA need not store or collect the comments locally; each can arrive in a stream and be discarded after one look. Refer to [8] for a detailed analysis of online variational inference for LDA.

4 Opinion Summarization from Topic Selection

In this section we briefly illustrate the data imbalance problem and two topic selection methods; then we describe the algorithm used in our framework for opinion summarization with topic selection from imbalanced data.
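As background for what follows, the online update of the topic parameter λ described in the previous section can be sketched in a few lines. This is a toy illustration of the Hoffman-style blending step λ ← (1 − ρ_t)λ + ρ_t λ̃ with the decaying step size ρ_t = (τ0 + t)^(−κ); the function name, constants and toy dimensions are ours, not the authors' implementation:

```python
def online_lambda_update(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    """Blend the current topic parameter lambda with the estimate
    lam_hat computed from one mini-batch of comments, using the
    decaying step size rho_t = (tau0 + t) ** (-kappa)."""
    rho = (tau0 + t) ** (-kappa)
    return [[(1.0 - rho) * l + rho * h for l, h in zip(row_l, row_h)]
            for row_l, row_h in zip(lam, lam_hat)]

# Toy run: 2 topics x 3 vocabulary words. At t = 0 with tau0 = 1,
# rho = 1, so lambda is replaced entirely by the mini-batch estimate.
lam = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
lam_hat = [[2.0, 0.5, 1.5], [0.5, 2.0, 1.0]]
lam = online_lambda_update(lam, lam_hat, t=0)
```

As t grows, ρ_t shrinks, so later mini-batches perturb λ less and less; this is what lets the comments arrive in a stream and be discarded after one look.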
Following the topic extraction described in the previous section, we explore the impact of the topics on summarization performance. Intuitively, opinion summarization can be different from the summarization of factual data, as comments regarded as informative from the factual point of view may contain little or no sentiment; eventually, they are useless from the sentimental point of view. The main question we address at this point is: how can we determine which extracted topics are informative for opinion summarization?

The comments collection we crawled from tripadvisor demonstrates the imbalance problem of many positive and few negative comments. The data imbalance presents a unique challenge to classifying the comments from the extracted topics. Precision and recall are widely used measurements of classification performance. The precision for a class is the number of true positives divided by the total number of elements labeled as belonging to the positive class. Recall is the number of true positives divided by the total number of elements that actually belong to the positive class. Considering two topic sets for text classification, the first topic set may yield higher precision, but lower recall, than the second. By varying the decision threshold, the second topic set may produce higher precision and lower recall than the first. Thus, a single threshold cannot tell us which extracted topic set is better; topic selection needs serious consideration.

Commonly, there are two methods to select topics: the first ranks topics in descending order by a related criterion, such as ROC, then chooses the top, say, l topics. The second is more complicated, computing cross-correlation coefficients between topics. Scatter matrices belong to the first method:
J3 = trace{S_w^{-1} S_b}   (6)

where S_w = Σ_{i=1}^{c} P_i S_i, P_i is the a priori probability of class i, S_i is the covariance matrix of class i, S_w is the within-class scatter matrix, and S_b is the between-class scatter matrix.

The cross-correlation coefficient is the second method of topic selection. Let i_1 be the best topic selected using the first method. Then

i_2 = argmax_j {a_1 R_j − a_2 |ρ_{i_1,j}|},  j ≠ i_1   (7)

This equation considers the cross-correlation ρ_{i_1,j} between the best topic and each topic j ≠ i_1. The rest of the topics are ranked according to

i_k = argmax_j {a_1 R_j − (a_2 / (k−1)) Σ_{r=1}^{k−1} |ρ_{i_r,j}|},  j ≠ i_r, r = 1, 2, ..., k−1   (8)

These two methods are designed for well-balanced data, and if the data dimension is high, the effectiveness of the topic selection becomes a severe problem for classification. Most comments we crawled from tripadvisor are around 150 words long. In generalizing the comments with the LDA model, the pre-defined topic number k is set to 20 or 30 rather than 100 because of the relatively short length of each comment. Even this moderate number may produce high-dimension problems for topic selection; the detailed analysis is presented in Section V. Meanwhile, computational cost is another bottleneck: even though we perform topic extraction with online-LDA, topic selection remains time-consuming with these two methods because of the matrix inversion involved.

So in the proposed framework, we use FAST [19] to perform the topic selection for opinion summarization. The topic selection metric is based on an ROC curve generated from optimal simple linear discriminants. The topics with the highest AUC (Area Under Curve) are then selected as the most relevant. This method is designed for topic selection in imbalanced data classification.
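To illustrate the idea behind this kind of ROC-based selection, the following sketch scores one topic by the area under a binned ROC curve. The function name, the equal-frequency binning scheme and the toy data are ours, a simplified stand-in for the actual FAST algorithm of [19]:

```python
def fast_style_auc(scores, labels, n_bins=10):
    """Score one topic's per-comment values with a binned ROC curve.
    Thresholds are taken from equal-frequency bins (the FAST idea:
    more thresholds where the data is dense), then the area under
    the resulting ROC curve is returned."""
    pairs = sorted(zip(scores, labels))          # sort by topic score
    n = len(pairs)
    step = max(1, n // n_bins)
    thresholds = [pairs[i][0] for i in range(0, n, step)]
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for th in sorted(thresholds, reverse=True):
        tp = sum(1 for s, y in pairs if s >= th and y == 1)
        fp = sum(1 for s, y in pairs if s >= th and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    points.sort()
    # Trapezoidal area under the ROC curve.
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy data: a topic whose weight separates positive from negative well.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
auc = fast_style_auc(scores, labels, n_bins=4)   # perfectly separated: 1.0
```

Topics would then be ranked by this AUC and the top ones kept; because no matrix inverse is involved, the cost stays low even for many topics.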
An ROC curve is a criterion for ranking the topics. FAST employs a new threshold determination method which fixes the number of points falling in each bin to obtain the thresholds for the ROC, where a bin is an interval of the data. We use more bins in high-density data areas and fewer bins in sparse areas, each bin containing the same number of data points. Thus more thresholds, computed from the bins, are placed in dense areas for the calculation of the ROC; conversely, fewer thresholds are placed in sparse areas. This procedure can be described by the following pseudo-code in Algorithm 1:

Algorithm 1 Pseudo-code of the threshold procedure.
  K: number of bins
  N: number of comments
  T: number of topics
  InBin = 0 to N with a step size of N/K
  for i = 1 to T do
    Sort T_i (T_i is the vector of values for the ith topic)
    for j = 1 to K do
      Bottom = round(InBin(j)) + 1
      Top = round(InBin(j+1))
      Threshold = mean(T_i(Bottom to Top))
      Classify T_i at Threshold
    end for
    Calculate the AUC (Area Under Curve)
  end for

The detailed analysis of the algorithm is in [19]. The benefit is not merely better-selected topics for classification; the computational cost of the algorithm is also relatively low because no matrix inverse is calculated. Because the area under the ROC curve is a strong predictor of performance, especially for imbalanced classification problems, we can use this score for topic selection: we choose the topics with the highest areas under the curve because they have the best predictive power for the comment collection.

5 Experiments

The experiments evaluate the model that produces opinion summaries of comments, in the context of which we assess the best manner of using opinion summarization for users to digest quickly. In this section, we present and discuss the experimental results of topic selection and opinion summarization on the hotel comments dataset.

5.1 Dataset

We crawled 250,004 hotel comments from tripadvisor over a one-month period (Nov 2012 to Dec 2012).
The comments in the dataset are labeled according to a 5-point 'star' scale expressing the polarity of the reviewer's opinion (1 and 2 corresponding to negative comments, 4 and 5 corresponding to positive comments). Since we summarize the opinions into two classes of sentiment (positive and negative), the neutral comments (scale 3) are excluded from the comments collection. Figure 3 shows the statistical information about the comments collection. Most comments are within a length of 150 words.

Figure 3: The statistical information of the comments collection

5.2 Topic Selection

The first experiment evaluates the generalization performance of the online-LDA model. As we pointed out in Section III, LDA with online inference can handle massive datasets much faster than other methods such as batch variational inference and Gibbs sampling. We need to verify that there is no degradation of generalization performance when using online inference for LDA. We compared the online-LDA model with the pLSI and LDA models described in Section 3. In this experiment, we used all 250,004 comments crawled from tripadvisor. We held out 10% of the collection for test purposes and trained the models on the remaining 90%. We found α = 50/T and β = 0.01 to work well with the hotel comments collection for the LDA and online-LDA models. Perplexity [3] is used as the measurement for the evaluation of the models: it is the standard metric and measures the model's ability to generalize to unseen data; lower perplexity indicates higher likelihood and better model performance. We trained the three models using EM with the stopping criterion that the average change in expected log likelihood is less than 0.001%. Figure 4 presents the perplexity for each model in terms of the comments analyzed.
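For reference, held-out perplexity is computed directly from the total held-out log likelihood. A minimal helper (illustrative, not the authors' evaluation code):

```python
import math

def perplexity(total_log_likelihood, total_tokens):
    """Held-out perplexity: exp(-log p(held-out words) / word count).
    Lower is better; a uniform model over a V-word vocabulary has
    perplexity exactly V."""
    return math.exp(-total_log_likelihood / total_tokens)

# Sanity check with a uniform model over a 1000-word vocabulary:
# each of 500 held-out tokens has probability 1/1000.
V, n_tokens = 1000, 500
loglik = n_tokens * math.log(1.0 / V)
p = perplexity(loglik, n_tokens)   # recovers V = 1000
```

In practice the log likelihood of the held-out comments is itself only approximated (via the variational bound), but the conversion to perplexity is exactly this formula.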
The three models were trained on the crawled comments without looking at the same comment twice. It can be seen that the online-LDA model has a lower perplexity than the pLSI and LDA models after analyzing the same number of comments. This advantage comes from the fact that online variational inference converges much faster than the variational Bayes used in LDA [8]. After analyzing the whole comments collection, both LDA models reached the same level of perplexity, about 1700, while the perplexity of pLSI is 1900. The generalization performance of online-LDA is thus as good as that of LDA, with the extra advantage of much faster fitting to the comments. The results show that online-LDA is better adapted to incoming comments in an online environment.

Figure 4: The perplexity results for pLSI, LDA and Online-LDA.

Figure 5: BER with k = 20 using SVM.

Besides fast topic extraction, our summarization framework employs a lower-cost topic selection method to cope with the imbalanced nature of hotel comments. The next experiment evaluates the performance of topic selection. The balanced error rate (BER) is the main judging criterion for the topic selection [20]. BER is the average of the error rates on positive comments and negative comments; if the two classes are balanced, the BER equals the ordinary error rate [6]. We evaluated the performance of the topics selected in our framework (FAST) in comparison with the topics selected by scatter matrices (SM) and the cross-correlation coefficient (CCC).
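Concretely, the precision, recall and BER used in this evaluation can be computed as follows (a minimal helper with toy data of ours, mirroring the imbalance of the real collection):

```python
def binary_metrics(y_true, y_pred):
    """Precision and recall for the positive class, plus the balanced
    error rate (BER): the mean of the two per-class error rates, which
    stays informative even when one class dominates."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    ber = 0.5 * (fn / (tp + fn) + fp / (fp + tn))
    return precision, recall, ber

# Imbalanced toy sample: 8 positive, 2 negative comments.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
prec, rec, ber = binary_metrics(y_true, y_pred)
```

Note how a single misclassified negative comment costs 0.25 of BER here while barely moving precision and recall; this is why BER is the judging criterion under imbalance.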
The main concern in our framework is the performance of the topic selection metric, so we simply choose the popular SVM classifier to evaluate performance, without a detailed analysis of the difference from other classifiers. Table 1 describes the comments used in the BER evaluation.

Table 1: Comments used in evaluation of BER
                      Number     Ratio
Positive Comments     180,023    95.5%
Negative Comments     8,053      4.5%

This comments collection demonstrates a strongly imbalanced nature: the negative comments are less than 5% of the total. From the previous analysis of the data set, we know that most comments are shorter than 260 words. So we set k = 20 and 30, respectively, for the online-LDA model, and the extracted topics are then used for BER evaluation with the methods described in our work. Figures 5 and 6 show the resulting performance in terms of BER. We can see that the BER changes dramatically with the number of topics. We observe that BER decreases as the number of topics increases while it is less than 9, and then the BER settles at relatively stable values of 0.15, 0.1 and 0.08 for SM, CCC and FAST, respectively. The explanation for this behavior is that the redundant topics have little impact on the performance of the classifier. This robustness to redundant topics might be useful for classification. But our goal is comment summarization rather than classification, and redundant topics can bury informative topics and make the summary hard for users to exploit. The topics selected using FAST significantly outperformed the SM and CCC topics, with lower BER when using the SVM classifier. Several experimental results reveal that the lowest BER comes from the 9 selected topics. I. Titov et al. [10] showed that 9 topics out of 45 LDA topics correspond to ratable aspects.
It is quite an interesting discovery that the topics we selected are the same as the ratable aspects defined by a manual analysis of the documents. The more topics used in classification, the more freedom we have to distinguish the polarity of the comments at a finer granularity, but the performance of the classification stabilizes at a certain level; the optimal topics are the selected topics at which this level emerges. This suggests that a topic extracted using LDA is not by itself a sufficiently representative indicator of a comment's importance for summarization purposes. Thus, using ROC-based topic selection, which has proven useful for opinion summarization, can yield better results.

Figure 6: BER with k = 30 using SVM.

Figure 7: Performance of Summarization in terms of ROC.

5.3 Opinion Summarization

As discussed before, a significant advantage of our framework over existing models in topic selection and classification is the lower computational cost of topic extraction and topic selection on imbalanced comments. We only consider the positive and negative comments in the given data set, with the neutral comments being ignored, for two main reasons. Firstly, hotel comment opinion summarization in our case is effectively a binary classification problem, i.e. comments are classified as either positive or negative, without the alternative of neutral. Secondly, the selected topics contribute only positive and negative words, and consequently they have much more influence on the summary results for the positive and negative comments in the given data set.
Furthermore, classification with so few negative comments closely resembles outlier detection. As a result, we choose to evaluate the overall performance of the opinion summarization indirectly through outlier detection based on the selected topics. More specifically, we apply a one-class SVM classifier to the comments with the selected topics:

f(x) = sign((ω · Φ(x)) − ρ). (9)

The parameters ω and ρ are obtained by solving the quadratic programming problem of the ν-SVM. Classification with this method is computationally simple and does not require significant memory.

The main problem we encountered is that a lexicon is needed first in our proposed summarization framework. In our case, however, comments are often composed of ungrammatical sentences and, additionally, contain a high number of unusual combinations of escape characters (corresponding to the vivid sentiment expressed), which makes the comments much noisier and harder to process than the standard data sets traditionally used for summarization evaluation. Nevertheless, the online-LDA, being a generative model, proved to be quite robust to variations in the input data and, most importantly, to a change of domain (micro-blogs etc.) [8]. There are 10,314 words in the lexicon for our experiments.

Table 2: Performance of Opinion Summarization

                   Positive Comments   Negative Comments
Topics             Detection Rate      Detection Rate
Selected topics    89%                 91%
20 topics          80%                 77%
30 topics          80%                 79%

The results of the opinion summarization are shown in Table 2. The first thing to note in Table 2 is that the opinion summarization model does a much better job of classifying the

Figure 8: Time performance of the topic extraction.
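The outlier-detection evaluation above, based on the ν-SVM decision rule of Eq. (9), can be sketched with scikit-learn's `OneClassSVM` (a hypothetical stand-in for the paper's implementation, on synthetic topic vectors rather than the real comment data):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative stand-ins for topic-proportion vectors of comments.
# The one-class SVM is fit only on the majority (positive) class;
# negative comments are then flagged as outliers, mirroring
# f(x) = sign((w . phi(x)) - rho) of Eq. (9).
rng = np.random.default_rng(42)
positive = rng.normal(loc=0.0, scale=1.0, size=(500, 9))  # 9 selected topics
negative = rng.normal(loc=4.0, scale=1.0, size=(25, 9))   # rare, shifted class

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(positive)                    # one-class: trained on positives only

pred_pos = clf.predict(positive)     # +1 = inlier  (positive comment)
pred_neg = clf.predict(negative)     # -1 = outlier (negative comment)
print((pred_pos == 1).mean(), (pred_neg == -1).mean())
```

The ν parameter bounds the fraction of training points treated as outliers, which is why training only on the abundant positive class sidesteps the shortage of annotated negative examples.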
Figure 9: Time performance of the classification.

comments according to their polarity than the plain LDA model, whose main problem is a relatively low precision. The main reason for this is the insufficient number of annotated negative examples available when performing the topic selection. The results show that the model is capable of reliably identifying negative comments (Figure 7). It can also be observed that there is a considerable improvement in classification accuracy after performing the topic selection with FAST, a 5.3% improvement for our framework.

We evaluate the topic extraction time by extracting 10, 20, 30 and 40 topics with LDA and online-LDA respectively. The results show that the online-LDA model outperforms the LDA model (Figure 8).

We also evaluate the framework's overall time performance. We extract 10, 20, 30 and 40 topics with online-LDA, and then perform the topic selection on two topic sets, of 20 and 30 topics. We classify the comments in two ways: first with the original topics, and second with the selected topics, averaging the time of the second classification over the two selected topic sets. Topic selection adds time before classification, but the classification time itself does not increase dramatically thanks to the lower dimensionality of the data. The results show an extra 28 seconds in comparison with 10 topics (Figure 9), due to the cost of the topic selection. We believe, however, that for the best summarization performance a period of 28 seconds is low enough when handling over 200,000 comments, so our results indicate an acceptable time performance penalty.

6 Conclusion

In this paper, we have presented a new framework for the summarization of hotel comments.
The most practical use of opinion summarization is in web applications, yet most existing approaches to opinion summarization give little consideration to the scalability of their models. Scalability is a central concern in our proposed framework: the online-LDA model is used to extract topics from the huge and ever-growing collection of comments, keeping the same generalization performance as the LDA model at a lower computational cost. We also address the imbalance problem of the comments, using the FAST topic selection method for better classification performance; the selected topics are informative and make the comments easy for the user to digest. There are several directions we plan to investigate in the future. One is the selection of the best comments when the aim is to condense the comments collection. Another is regression analysis of the ratable aspects for certain kinds of reviews.

Bibliography

[1] A. Devitt and K. Ahmad (2007); Sentiment polarity identification in financial news: A cohesion-based approach, In ACL'07, Prague, Czech Republic, June 2007, 1-8.

[2] B. Pang, L. Lee and S. Vaithyanathan (2002); Thumbs up?: sentiment classification using machine learning techniques, EMNLP'02: Proc. of the ACL'02 Conference on Empirical Methods in Natural Language Processing, Morristown, NJ, USA, 10: 79-86.

[3] D.M. Blei, A. Ng and M. Jordan (2003); Latent Dirichlet Allocation, Journal of Machine Learning Research, January 2003, 3: 993-1022.

[4] D.M. Blei and J.D. McAuliffe (2007); Supervised topic models, In NIPS'07, Vancouver, B.C., Canada, 1-8.

[5] D. Ramage, D. Hall, R. Nallapati and C.D. Manning (2009); Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora, In EMNLP'09: Proc. of the 2009 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2009.

[6] D.M.W.
Powers (2001); Evaluation: Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation, Journal of Machine Learning Technologies, 2(1): 37-63.

[7] T. Hofmann (1999); Probabilistic latent semantic indexing, In SIGIR'99: Proc. of the 22nd Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, New York, NY, USA, 50-57.

[8] M.D. Hoffman, D.M. Blei and F. Bach (2010); Online learning for Latent Dirichlet Allocation, In NIPS'10, Vancouver, B.C., Canada.

[9] H. Wang, Y. Lu and C.X. Zhai (2011); Latent Aspect Rating Analysis without Aspect Keyword Supervision, KDD'11: Proc. of the 17th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, California, USA, 618-626.

[10] I. Titov and R. McDonald (2008); A Joint Model of Text and Aspect Ratings for Sentiment Summarization, Proc. of ACL'08, Columbus, Ohio, USA, 308-316.

[11] C. Lin and Y. He (2009); Joint Sentiment/Topic Model for Sentiment Analysis, CIKM'09: Proc. of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, 375-384.

[12] L.-W. Ku, Y.-T. Liang and H.-H. Chen (2006); Opinion extraction, summarization and tracking in news and blog corpora, Proc. of AAAI-CAAW'06, the Spring Symposium on Computational Approaches to Analyzing Weblogs, Stanford, California, USA, 1-8.

[13] Y. Lu, C. Zhai and N. Sundaresan (2009); Rated aspect summarization of short comments, WWW'09: Proc. of the 18th International Conference on World Wide Web, ACM, NY, USA, 131-140.

[14] P.D. Turney (2002); Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, ACL'02: Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, 417-424.

[15] P.D. Turney and M.L.
Littman (2003); Measuring praise and criticism: Inference of semantic orientation from association, ACM Trans. Inf. Syst., 21(4): 315-346.

[16] P. Stenetorp, S. Pyysalo, G. Topic, S. Ananiadou and J. Tsujii (2012); BRAT: a web-based tool for NLP-assisted text annotation, EACL'12: Proc. of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 102-107.

[17] Q. Mei, X. Ling, M. Wondra, H. Su and C. Zhai (2007); Topic sentiment mixture: modeling facets and opinions in weblogs, WWW'07: Proc. of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, 171-180.

[18] B. Walsh (2002); Markov chain Monte Carlo and Gibbs sampling, Lecture notes for EEB 596z, 2002.

[19] X. Chen and M. Wasikowski (2008); FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems, Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 124-132.

[20] Y.W. Chen and C.J. Lin (2015); Combining SVMs with various feature selection strategies, Available: www.csie.ntu.edu.tw/~cjlin/papers/features.pdf.

[21] Z. Ma, A. Sun, Q. Yuan and G. Cong (2012); Topic-Driven reader comments summarization, CIKM'12, Maui, HI, USA, 265-274.