Bulletin of Social Informatics Theory and Application, ISSN 2614-0047
Vol. 1, No. 1, March 2017, pp. 26-33
https://doi.org/10.31763/businta.v1i1.36

Recommendation system for web article based on association rules and topic modelling

Guntur Budi Herwanto a,1,*, Annisa Maulida Ningtyas b,2
a Department of Computer Science and Electrical, Gadjah Mada University, Yogyakarta, Indonesia
b Department of Computer and Informatics Engineering, Jakarta State Polytechnic, Depok, Indonesia
1 gunturbudi@ugm.ac.id *; 2 annisa.m.n@gmail.com
* corresponding author

ARTICLE INFO

Article history: Received December 14, 2016; Revised January 10, 2017; Accepted February 1, 2017
Keywords: Apriori; Association rule; LDA; Recommendation; Topic modelling

ABSTRACT

The World Wide Web is now the primary source for information discovery. Users visit websites that provide information and browse for particular information according to their topics of interest. While navigating, visitors often have to jump between menus to find the right content. A recommendation system can help visitors find the right content immediately. In this study, we propose a two-level recommendation system based on association rules and topic similarity. We generate the association rules by applying the Apriori algorithm. The dataset for association rule mining is a set of topic sessions, built by combining the results of sessionization and topic modeling. The topic similarity, in turn, is obtained by comparing the topic proportions of web articles, which are inferred with Latent Dirichlet Allocation (LDA). The results show that in our dataset there are not many interesting topic relations within one session. This limitation can be addressed by the second level of recommendation, which looks for articles with similar topics.

This is an open access article under the CC-BY-SA license (http://creativecommons.org/licenses/by-sa/4.0/).

1. Introduction

Currently, the World Wide Web (WWW) serves as a knowledge base for a vast amount of information from around the world. Many industries have started to exploit the web, including the tourism industry. Gretzel states that the internet is the primary source of information in the tourism domain [1]. With the increasing amount of information about tourism destinations on the internet, travelers no longer depend on travel agents [2]. Tourists prefer to seek information over the internet, whether it is an itinerary or individual reviews of each place. However, the internet can be overwhelming for the novice traveler because of the variety of information sources, and the information returned is often not quite what they want. A recommendation system can be a tool to resolve this issue and provide useful information that helps users make their choice [3]. To make good recommendations, the system should be able to identify user interests based on other users who have the same preferences [4].

The analysis of users' browsing patterns can provide valuable information to the website owner. Such analysis can be done by applying data mining techniques to web data, which is referred to as web mining [5]. Web mining is classified into three categories, namely web usage mining, web content mining, and web structure mining [6].

Web usage mining is the process of extracting information about how users use a website [7]. The objective is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a website [8]. One way to achieve this goal is association rule mining. Web usage association rule mining has long been a traditional data mining method for automatically extracting potentially interesting information about the behavior of website visitors from web usage log files [9]. The most commonly used algorithm for association mining is Apriori [10]. While this algorithm can discover meaningful association patterns, it suffers from long execution times and from generating too many items because of URL variation [11]. This issue can be resolved by utilizing content information from the web content mining process.

Web content mining is the process of extracting information from texts, images, and other content [7]. Mostly, web content can be treated as text data or documents. Analyses such as topic modeling can reveal the themes of the documents [12]. Thematic information obtained from the content may also improve the understanding of the patterns generated by web usage mining [13].

In this study, we propose a two-level recommendation using association rule mining and topic modelling. We also aim to mitigate the twin problems of Apriori by using topic modeling to reduce the variation of the itemsets: an itemset that is usually formed as a set of URLs becomes a set of topics. In addition to reducing the number of generated association rules, this combination can make the rules more meaningful because of the thematic information they contain.

This paper is organized into 5 sections. Section 1 describes the background, current techniques, and the case study. Section 2 describes related work on web mining. Section 3 describes the proposed system. Section 4 shows the results and research findings. Finally, Section 5 summarizes the conclusions.

2. Related Work

The growth of users and content on the WWW offers enormous potential for website owners to look for interesting user behavior in their industry niche, including tourism [1]. User behavior can be studied by analyzing the usage data obtained from the web server. Examining user actions while interacting with a website can offer insights that lead to customization and personalization of a user's web experience [14].

Web usage mining can provide online recommendations effectively [15]. Rao proposes an architecture for online recommendation in a web usage mining system (OLRWMS) [15]. The architecture aims to improve classification accuracy through the interaction between classification, evaluation, the current user activities, and the user profile in its online phase. Another recommendation system, proposed by Desyaputri [18], uses three levels of recommendation based on association rule discovery, news articles in the same category, and similarity between news articles. By combining collaborative filtering and content-based filtering, the experimental results show that the technique produces reliable news recommendations.

Association rule mining, especially the Apriori algorithm, has been studied to uncover potential user browsing behaviors and to create recommendations [7], [11], [16]. Rawat customizes the Apriori algorithm into a custom-built Apriori [7] with the goal of effective pattern analysis. The author found that analyzing web logs not only provides interesting patterns but also helps in creating an adaptive website. More [16] tried to uncover the disadvantages of the Apriori algorithm by comparing it with another algorithm; the results show that Apriori has a longer execution time than the other algorithm.
Lazcorreta analyzes the process of discovering association rules in large repositories and of transforming them into user-adapted recommendations with a two-step modified Apriori technique [11]. The results show that the approach can provide better recommendation services by analyzing the behavior of a single user relative to all other users of a web-based information system.

Web log data is the primary source for analyzing user behavior in web usage mining. Such models can be extended by adding web content as a source, especially for websites with extensive content such as news portals. The combination of both sources can significantly improve the recommendation of news articles [17]. Semantic analysis [18] or topic modeling [19] is often used to analyze the semantic meaning of the content, so that it can be combined with log data to enrich the meaning of the patterns.

In this research, we aim to combine association rule mining with topic modeling. We use Apriori as the algorithm for association rule mining and enrich the association rules with Latent Dirichlet Allocation (LDA) [12]. Apriori allows us to discover the associations among user clicks. Meanwhile, LDA makes it possible to identify the topics behind what the user clicks. By combining the two, we can produce good recommendations for future users.

3. Proposed System

Our recommendation system consists of an offline phase and an online phase. The offline phase generates the knowledge base that is used as the basis for recommendations in the online phase. There are two levels of knowledge base that are used as the basis for the recommendation.
The first one is the association rules and the second is the similarity of content based on topic modeling.

Fig. 1. Offline phase

Web content and web server logs are the primary sources for this recommendation system. These two sources are used to generate the first knowledge base, the association rules. First, the web server logs are transformed into URL sessions. Then, the web content is processed with topic modeling to obtain the topic model. The results from both sources are combined to generate the session topics. Each session topic becomes an input transaction for association rule mining, and the output of association rule mining becomes the first knowledge base. The second knowledge base is generated purely from the topic similarity of the documents, inferred from the result of topic modeling. The workflow of the offline process is shown in Fig. 1.

Fig. 2. Online phase

The online phase then uses these knowledge bases as the reference for the recommendation. When a user starts to navigate the website, a set of topic items is formed. Association rules have priority for the recommendation: the matching rule with the highest support and confidence becomes the recommendation. If no rule matches, then similar documents based on the topic distribution are selected as the recommendation. The workflow of the online process is shown in Fig. 2.

3.1. Web Log Data Preprocessing

The interaction between users and the website is recorded in web server logs. The log file has several fields relevant for analysis, such as the identity of the user and information on the accessed pages. Before the log files are ready for analysis, data cleaning is needed.
There are four steps in the cleaning process. The first is to eliminate requests with an error status code (4xx and 5xx). The second is to remove requests for resources that are not related to user activity, such as images, videos, styles (CSS), and scripts (JavaScript). The third is to eliminate requests for pages that are not directly related to content, such as the contact, about, and sitemap pages. The last is to eliminate requests made by robots, crawlers, or spiders.

After the log data are cleaned, users and sessions need to be identified. A user can be recognized by an identical IP address and user agent [20]. Once the users are found, we need to identify their sessions. The simplest way is to use a timeout; most commercial services use 30 minutes as the default timeout [21]. The output of this process is a URL session of length L, represented in Definition (1).

$$S_s = (\upsilon_s^1, \upsilon_s^2, \ldots, \upsilon_s^L) \qquad (1)$$

where the session $S_s$ consists of the URLs that belong to session $s$, with length $L$.

3.2. Topic Modelling

Topic modeling is a statistical method for discovering patterns and themes in a corpus of documents [2], [12]. In this study, we used Latent Dirichlet Allocation (LDA), introduced by Blei [12]. LDA allows us to identify topics in web content. In this research, the web content was acquired by recursively scraping all addresses of the website. The content needs to be transformed into an LDA corpus; this transformation consists of tokenization, stopword removal, and conversion into a bag of words. The output of LDA is a model that contains the topics with their word probabilities. This model can be used to find the document topic probability vector, which represents the distribution of topics in each document. The document topic probability vector is described in Definition (2).
$$DT_i = (P_{\upsilon_i T_0}, P_{\upsilon_i T_1}, \ldots, P_{\upsilon_i T_k}) \qquad (2)$$

where $DT_i$ is the document topic probability vector of document $i$ and each element $P_{\upsilon_i T_k}$ is the affinity of topic $k$ to document $i$. This vector can be used to measure the similarity between documents based on their topic distributions, and it serves as the knowledge base for the second level of recommendation.

3.3. Session and Topic Model Combination

The output of the sessionization process is a series of URLs. Each URL included in the analysis points to an article. In gudegnet, there are more than 9,000 articles, each with its own topics. From the previous process we have a document topic probability vector for every article, so we can take the topic with the highest probability. This topic then replaces the URL in the session, turning the URL session into a session topic. Session topics are represented in Definition (3).

$$ST_s = (T_s^1, T_s^2, \ldots, T_s^L) \qquad (3)$$

where the session topic $ST_s$ consists of the topics that belong to session $s$, with length $L$.

3.4. Association Rule Mining

The session topic from (3) contains a list of topics that were accessed together in a session. It can therefore show the degree of interest of a user in certain topics [2]. To uncover interesting relationships between the topics, we used association rule mining, a fundamental data mining task. Association rules can be used to find all co-occurrence relationships and web usage patterns [8]. We used the best-known association rule mining algorithm, the Apriori algorithm proposed in [10]. These relationships can be expressed as association rules.

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items, where each item $i$ is a topic. Let $T = (t_1, t_2, \ldots, t_n)$ be a set of transactions, i.e., a set of session topics ($ST_s$), where each session topic is a set of topics such that $t_i \subseteq I$.
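As an illustration of how the transactions $T$ could be built from URL sessions (Section 3.3), the following is a minimal Python sketch. It assumes that every article URL already has a document topic probability vector; the function names, the URLs, and the three-topic vectors are hypothetical and are not taken from the gudegnet data.

```python
from typing import Dict, List, Sequence, Set


def dominant_topic(doc_topic: Sequence[float]) -> int:
    """Return the index of the topic with the highest probability."""
    return max(range(len(doc_topic)), key=lambda k: doc_topic[k])


def to_session_topics(
    url_sessions: List[List[str]],
    doc_topics: Dict[str, Sequence[float]],
) -> List[Set[int]]:
    """Replace each URL by its dominant topic and keep each session as a set.

    URLs without a topic vector (for example non-article pages) are skipped,
    and sessions that end up empty are dropped.
    """
    transactions = []
    for session in url_sessions:
        topics = {
            dominant_topic(doc_topics[url])
            for url in session
            if url in doc_topics
        }
        if topics:
            transactions.append(topics)
    return transactions


# Hypothetical data: three articles described by three topics (indices 0..2).
doc_topics = {
    "/beach": [0.1, 0.2, 0.7],
    "/hotel": [0.8, 0.1, 0.1],
    "/tour": [0.2, 0.1, 0.7],
}
url_sessions = [["/beach", "/hotel"], ["/beach", "/tour"], ["/unknown"]]

transactions = to_session_topics(url_sessions, doc_topics)
```

Storing each session topic as a set mirrors the requirement $t_i \subseteq I$: a topic visited several times in one session counts only once in the transaction.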
Support and confidence are used as the metrics for a rule. The main purpose of association rule mining is to discover all association rules in $T$ that have support and confidence greater than or equal to the minimum support and minimum confidence given by the user [8]. An association rule can be represented as $X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$.

3.5. Recommendation System

In the online phase, recommendations are generated from the two levels of the knowledge base: association rules and topic similarity.

1) Recommendation Based on Association Rules

An example of an association rule is:

Culinary → Accommodation [support = 20%, confidence = 80%]

The rule says that in 20% of sessions visitors read both a culinary article and an accommodation article, and that sessions which include a culinary article also include an accommodation article 80% of the time. The left part of the rule is known as the antecedent and the right part as the consequent. These co-occurrence relationships are used as recommendations for web articles. During a user's browsing session, the system matches the session against the antecedents, and if a matching rule is found, the system recommends the consequent item.

2) Recommendation Based on Topic Similarity

It is possible that, during a user's browsing, the session does not match any association rule. A second knowledge base is therefore needed to ensure that a recommendation can still be made. We used document similarity based on the document topic vector defined in (2). The intuition behind LDA is that a document is a mixture of multiple topics with different probabilities [12]. Given these probability vectors, we can measure the similarity of documents using cosine similarity.

4. Research Findings

We collected one month of user requests from the gudegnet web server log. It contained 15,795,173 requests, which data cleaning reduced to 212,694 requests.
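The cleaning and sessionization steps of Section 3.1 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the pipeline used on the gudegnet logs: the `Request` record, the file-extension and bot patterns, the non-content paths, and the sample log are all hypothetical, and a session is assumed to end when the gap between two consecutive requests of the same user exceeds the 30-minute timeout.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Tuple


class Request(NamedTuple):
    ip: str
    user_agent: str
    time: datetime
    url: str
    status: int


# Assumed patterns; a real log would need site-specific rules.
STATIC = re.compile(r"\.(png|jpe?g|gif|ico|mp4|css|js)$", re.IGNORECASE)
NON_CONTENT = {"/contact", "/about", "/sitemap"}
BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)
TIMEOUT = timedelta(minutes=30)


def is_clean(r: Request) -> bool:
    """Apply the four cleaning steps to a single request."""
    return (
        r.status < 400                        # step 1: drop 4xx and 5xx
        and not STATIC.search(r.url)          # step 2: drop images, CSS, scripts
        and r.url not in NON_CONTENT          # step 3: drop non-content pages
        and not BOT.search(r.user_agent)      # step 4: drop robots and crawlers
    )


def sessionize(requests: List[Request]) -> List[List[str]]:
    """Group clean requests by (IP, user agent) and split them on the timeout."""
    by_user: Dict[Tuple[str, str], List[Request]] = defaultdict(list)
    for r in filter(is_clean, requests):
        by_user[(r.ip, r.user_agent)].append(r)

    sessions: List[List[str]] = []
    for reqs in by_user.values():
        reqs.sort(key=lambda r: r.time)
        current = [reqs[0].url]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur.time - prev.time > TIMEOUT:
                sessions.append(current)      # gap too long: close the session
                current = []
            current.append(cur.url)
        sessions.append(current)
    return sessions


# Hypothetical log: one human visitor, one bot, one failed request.
t0 = datetime(2017, 1, 1, 9, 0)
log = [
    Request("1.1.1.1", "Firefox", t0, "/a", 200),
    Request("1.1.1.1", "Firefox", t0 + timedelta(minutes=5), "/style.css", 200),
    Request("1.1.1.1", "Firefox", t0 + timedelta(minutes=10), "/b", 200),
    Request("1.1.1.1", "Firefox", t0 + timedelta(minutes=50), "/c", 200),
    Request("2.2.2.2", "Googlebot", t0, "/a", 200),
    Request("3.3.3.3", "Chrome", t0, "/missing", 404),
]
sessions = sessionize(log)
```

In this toy log the visitor's requests at 09:00 and 09:10 form one session, while the request at 09:50 starts a new one because the gap exceeds 30 minutes.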
Then we performed user identification and sessionization, which gave 148,666 sessions with an average length of 1.4 URLs per session. This shows that the bounce rate in gudegnet is very high. We decided to remove the sessions that contain only one URL. After this elimination, we had 30,032 sessions with 3.1 URLs per session.

On the content side, we scraped 12,908 articles from gudegnet, with 10,791 in Bahasa Indonesia and 4,893 in English. We performed topic modeling on each language, so we obtained 2 topic models, each containing 10 topics. Then, we computed the document topic probability for each article based on its language. An example of the topic model in Bahasa Indonesia is shown in Table 1.

Table 1. Topic model

Topic | Words
1 | 0.022*desa + 0.014*pasar + 0.014*wisata + 0.008*restaurant + 0.008*rw + 0.007*menikmati + 0.006*sekar + 0.006*kerajinan + 0.006*pengunjung + 0.005*lokasi
2 | 0.015*indonesia + 0.014*mandiri + 0.013*sleman + 0.008*lingkungan + 0.007*kantor + 0.007*buku + 0.007*bantul + 0.007*bank + 0.006*bni + 0.006*perusahaan
3 | 0.039*atm + 0.025*seni + 0.019*indonesia + 0.018*tunai + 0.016*bca + 0.013*pameran + 0.011*tersedia + 0.011*gallery + 0.010*art + 0.008*karya
4 | 0.020*buka + 0.019*gudeg + 0.018*ayam + 0.017*warung + 0.016*menu + 0.015*nasi + 0.014*goreng + 0.012*restoran + 0.011*soto + 0.010*pariwisata
5 | 0.022*candi + 0.020*villa + 0.018*hotel + 0.012*fasilitas + 0.010*uang + 0.009*terletak + 0.009*restoran + 0.009*bangunan + 0.009*museum + 0.006*parkir
6 | 0.034*tour + 0.030*travel + 0.019*wisata + 0.015*pantai + 0.014*paket + 0.014*borobudur + 0.012*mobil + 0.010*parangtritis + 0.010*jam + 0.010*harga
7 | 0.036*rt + 0.033*smp + 0.018*negeri + 0.009*bantul + 0.008*harjo + 0.006*jepang + 0.006*cv + 0.006*umbul + 0.005*mangkubumi + 0.005*giwangan
8 | 0.020*pendidikan + 0.017*rumah + 0.015*pelayanan + 0.012*tk + 0.011*masyarakat + 0.011*sakit + 0.011*daerah + 0.011*kesehatan + 0.009*circle + 0.009*dokter
9 | 0.015*upacara + 0.014*masjid + 0.011*dusun + 0.009*desa + 0.008*gunungan + 0.007*jawa + 0.007*gunung + 0.007*sultan + 0.006*kyai + 0.006*makam
10 | 0.079*informasi + 0.054*kota + 0.051*terbaru + 0.029*istimewa + 0.027*kontak + 0.027*perkembangan + 0.026*gudang + 0.026*detail + 0.026*simak + 0.026*tertera

We combine the URL sessions and topics into session topics. Within a session, it is highly likely that a user accesses the same topic throughout the navigation. The objective of this association rule analysis is to find different topics that are frequently accessed together. We therefore keep only the distinct topics in every session and eliminate the itemsets that have only 1 topic. This process resulted in 7,619 session topics, which became the input for association rule generation with Apriori.

We ran Apriori with minimum support 0.01 and minimum confidence 0.6. With these parameters, there are 144 frequent itemsets and 11 rules. A sample of the frequent itemsets is shown below. The numbers are the topic numbers from Table 1.

(3, 1, 6), support: 0.025
(10, 1, 6), support: 0.029
(1, 6), support: 0.098
(10, 6), support: 0.103

Based on the frequent itemsets, the association rules can be generated. A sample of the generated association rules is shown below.

(1, 8, 3) ⇒ (6), confidence: 0.698
(10, 8, 6) ⇒ (1), confidence: 0.699
(1, 7, 3) ⇒ (6), confidence: 0.734
(10, 1, 8) ⇒ (6), confidence: 0.744
(10, 1, 3) ⇒ (6), confidence: 0.772

For example, when a user whose session already covers topics 1 and 3 navigates to an article whose most probable topic is topic 10, the last rule applies and the system recommends several articles on topic 6. The support for this rule is very low, around 1%, but the confidence is fairly high at 77%.
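To make the metrics and the two-level lookup of Section 3.5 concrete, the following Python sketch computes support and confidence over a set of session topics and then recommends either the consequent of a matching rule or, as a fallback, the most similar documents by cosine similarity. It is a sketch under assumptions rather than the system's implementation: the function names and `demo_transactions` are hypothetical, rules are ranked by confidence only (the sample rules above list no individual support), and the topic vectors are the Table 2 values rounded to four decimals.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Rule = Tuple[FrozenSet[int], int, float]  # (antecedent, consequent, confidence)


def support(itemset: Set[int], transactions: List[Set[int]]) -> float:
    """Fraction of transactions that contain every topic in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x: Set[int], y: Set[int], transactions: List[Set[int]]) -> float:
    """support(X union Y) / support(X); 0.0 when X never occurs."""
    sx = support(x, transactions)
    return support(x | y, transactions) / sx if sx else 0.0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two topic probability vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def recommend(
    session_topics: Set[int],
    current_doc: str,
    rules: List[Rule],
    doc_topics: Dict[str, Sequence[float]],
    top_n: int = 3,
) -> Tuple[str, list]:
    """Level 1: best matching rule. Level 2: most similar documents."""
    matches = [r for r in rules if r[0] <= session_topics]
    if matches:
        best = max(matches, key=lambda r: r[2])   # assumed ranking: confidence
        return "rule", [best[1]]
    target = doc_topics[current_doc]
    others = [d for d in doc_topics if d != current_doc]
    others.sort(key=lambda d: cosine(target, doc_topics[d]), reverse=True)
    return "similarity", others[:top_n]


# Hypothetical transactions used only to exercise support and confidence.
demo_transactions = [{1, 6}, {1, 6, 10}, {1, 3}, {6, 10}]

# Two of the sample rules reported above.
rules = [
    (frozenset({1, 8, 3}), 6, 0.698),
    (frozenset({10, 1, 3}), 6, 0.772),
]

# Topic vectors from Table 2 (topics 1..10), rounded to four decimals.
doc_topics = {
    "Pantai Drini": [0.0532, 0.0458, 0.1442, 0.1987, 0, 0.1957, 0, 0, 0.3595, 0],
    "DA Transport": [0, 0, 0, 0, 0.0449, 0.7108, 0.2397, 0, 0, 0],
    "Gulai Kepala Ikan Pak Untung": [0.1447, 0, 0, 0.6054, 0, 0, 0.1227, 0.1085, 0, 0],
}

by_rule = recommend({1, 3, 10}, "Pantai Drini", rules, doc_topics)
by_similarity = recommend({4}, "Pantai Drini", rules, doc_topics, top_n=1)
```

Here `by_rule` comes from the first level because the session covers topics 10, 1, and 3, while `by_similarity` falls through to the second level because no antecedent is contained in the session.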
The second level of recommendation is based on the topic similarity between articles. We apply the model from Table 1 to all web articles to create the document topic vectors of Definition (2). Examples of document topic probabilities are shown in Table 2.

Table 2. Document topic

Document | Topic probability
Pantai Drini | Topic 1*0.0532096046423 + Topic 2*0.0457563449665 + Topic 3*0.144193425948 + Topic 4*0.198742440219 + Topic 5*0 + Topic 6*0.195690258319 + Topic 7*0 + Topic 8*0 + Topic 9*0.359529963764 + Topic 10*0
DA Transport | Topic 1*0 + Topic 2*0 + Topic 3*0 + Topic 4*0 + Topic 5*0.044931817055 + Topic 6*0.71084746703 + Topic 7*0.239733171757 + Topic 8*0 + Topic 9*0 + Topic 10*0
Gulai Kepala Ikan Pak Untung | Topic 1*0.144676810114 + Topic 2*0 + Topic 3*0 + Topic 4*0.605434609718 + Topic 5*0 + Topic 6*0 + Topic 7*0.122715187125 + Topic 8*0.10850285919 + Topic 9*0 + Topic 10*0

We can obtain the topic similarity by comparing the probability vectors of the documents. The top 3 similar items for the articles in Table 2 are shown in Table 3.

Table 3. Document similarity

Document | Similar document | Similarity
Pantai Drini | Pantai Ngobaran | 0.93888807
Pantai Drini | Goa Rancang Kencono | 0.91741419
Pantai Drini | Pantai Sadeng | 0.90404648
DA Transport | Windu Rent Car | 0.99549359
DA Transport | AB Yogya Transport | 0.99484509
DA Transport | Nusa Santana Prima Tour dan Travel | 0.9756009
Gulai Kepala Ikan Pak Untung | Warung Bu Ageng | 0.97529632
Gulai Kepala Ikan Pak Untung | Lusidus Vegetarian | 0.97240818
Gulai Kepala Ikan Pak Untung | Keripik Belut | 0.97150409

When a user navigates to "Pantai Drini", the system gives the top 3 recommendations based on similarity: "Pantai Ngobaran", "Goa Rancang Kencono", and "Pantai Sadeng", each with a similarity over 90%.

5. Conclusion

In this study, we proposed a two-level recommendation based on association rules and topic similarity.
We used Apriori as the association rule mining algorithm and LDA as the topic modeling algorithm. We implemented this technique specifically for a website with a large amount of content, such as the city directory used in this study. The results show that only 11 association rules were generated, all with confidence below 80%, which indicates that there are not many associations between different topics. On the other hand, we obtained much better results at the second level, the topic similarity: in the examples shown, the top three recommended documents have over 90% similarity.

Acknowledgment

The authors would like to thank PT Citraweb Indonesia, owner of gudegnet, for providing the data and for their support.

References

[1] U. Gretzel, “Intelligent systems in tourism: A social science perspective,” Ann. Tour. Res., vol. 38, no. 3, pp. 757–779, 2011.
[2] O. Arbelaitz, I. Gurrutxaga, A. Lojo, J. Muguerza, J. M. Pérez, and I. Perona, “Web usage and content mining to extract knowledge for modelling the users of the Bidasoa Turismo website and to adapt it,” Expert Syst. Appl., vol. 40, no. 18, pp. 7478–7491, 2013.
[3] D. Buhalis and R. Law, “Progress in information technology and tourism management: 20 years on and 10 years after the Internet—The state of eTourism research,” Tour. Manag., vol. 29, no. 4, pp. 609–623, 2008.
[4] B. Pan and D. R. Fesenmaier, “Travel Information Search on the Internet: A Preliminary Analysis,” 2003.
[5] J. Borges and M. Levene, “Data mining of user navigation patterns,” in International Workshop on Web Usage Analysis and User Profiling, 1999, pp. 92–112.
[6] A. S. Lalani, “Data Mining of Web Access Logs,” Royal Melbourne Institute of Technology, 2003.
[7] S. S. Rawat and L. Rajamani, “Discovering potential user browsing behaviors using custom-built apriori algorithm,” Int. J. Comput. Sci. Inf. Technol., vol. 2, no. 4, pp. 28–37, 2010.
[8] B. Liu, Web data mining: Exploring hyperlinks, contents, and usage data, 2nd ed. Chicago: Springer, 2011.
[9] M. Dimitrijevic and Z. Bosnjak, “Pruning statistically insignificant association rules in the presence of high-confidence rules in web usage data,” in 18th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems - KES2014, 2014, pp. 271–280.
[10] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. 20th int. conf. very large data bases, VLDB, 1994, pp. 487–499.
[11] E. Lazcorreta, F. Botella, and A. Fernández-Caballero, “Towards personalized recommendation by two-step modified Apriori data mining algorithm,” Expert Syst. Appl., vol. 35, no. 3, pp. 1422–1429, 2008.
[12] D. M. Blei, “Probabilistic topic models,” Commun. ACM, vol. 55, no. 4, pp. 77–84, 2012.
[13] O. Arbelaitz, I. Gurrutxaga, A. Lojo, J. Muguerza, J. M. Pérez, and I. Perona, “Enhancing a web usage mining based tourism website adaptation with content information,” in KDIR, 2012, pp. 287–292.
[14] J. Vellingiri and S. C. Pandian, “A survey on web usage mining,” Glob. J. Comput. Sci. Technol., vol. 11, no. 4, pp. 66–72, 2011.
[15] T. S. Rao, Y. Ravindra, U. S. Kumar, S. Sandeep, and K. Srikanth, “An Effecive Framework for Identifying Personalized Web Recommender System by Applying Web Usage Mining,” Int. J. Eng. Res. Appl., vol. 2, no. 3, pp. 307–312, 2012.
[16] N. More and N. P. More, “Recommendation of books using improved apriori algorithm,” Int. J. Innov. Res. Sci. Technol, vol. 1, no. 4, pp. 80–82, 2014.
[17] H. S. Husin, J. A. Thom, and X. Zhang, “News recommendation based on web usage and web content mining,” in Data Engineering Workshops (ICDEW), 2013 IEEE 29th International Conference, 2013, pp. 326–329.
[18] D. M. Desyaputri, A. Erwin, M. Galinium, and D. Nugrahadi, “News recommendation in Indonesian language based on user click behavior,” in Information Technology and Electrical Engineering (ICITEE), 2013, pp. 164–169.
[19] N. Dave, K. Potts, V. Dinh, and H. U. Asuncion, “Combining association mining with topic modeling to discover more file relationships,” Int. J. Adv. Softw., vol. 7, no. 3&4, 2014.
[20] V. Chitraa and D. A. S. Thanamani, “A novel technique for sessions identification in web usage mining preprocessing,” Int. J. Comput. Appl., vol. 34, no. 9, 2011.
[21] R. Cooley, B. Mobasher, and J. Srivastava, “Data preparation for mining world wide web browsing patterns,” Knowl. Inf. Syst., vol. 1, no. 1, pp. 5–32, 1999.