Bulletin of Social Informatics Theory and Application, ISSN 2614-0047, Vol. 5, No. 2, September 2021, pp. 124-132. https://doi.org/10.31763/businta.v5i2.455

Topic modeling using latent dirichlet allocation (LDA) on Twitter data with Indonesia keyword

Edi Surya Negara 1,*, Dendi Triadi 2
1,2 Data Science Interdisciplinary Research Center, Computer Science, Universitas Bina Darma, Palembang, Indonesia
1 e.s.negara@binadarma.ac.id *; 2 dendi.triadi@binadarma.ac.id
* corresponding author

1. Introduction

Social media is one of the factors driving change in people's social interactions. Through social media, people obtain a great deal of information, because the reach of social media is unlimited. Social media exists in many different forms, including social networks, internet forums, weblogs, social blogs, microblogging, wikis, podcasts, images, videos, ratings, and social bookmarking [1]. Among these, Twitter is one of the most popular. Twitter serves as a means of communication for exchanging information about events in the real world; short messages on Twitter generally reflect events experienced by users in real time [2]. The huge volume of Twitter data can be used as a data source for online news websites. However, the data extracted from Twitter covers a wide variety of topics, which makes identifying the topics in the collected data set difficult and time-consuming if done manually by humans, while the data is potentially needed to provide information as quickly as possible [3]. Summarization is a frequent text mining activity that relies heavily on information extraction: the process of automatically finding the most salient features of one or more documents and representing them coherently. It has recently attracted significant interest as a result of the massive growth of unstructured data, such as scholarly papers and clinical data in the biomedical arena [4].
Article history: Received August 4, 2021; Revised August 20, 2021; Accepted September 2, 2021.

Keywords: Classification; Text Mining; News Documents; Natural Language Processing; Latent Dirichlet Allocation (LDA).

Abstract: Digital transformation causes an increase in the volume of information in the form of text, such as news. On social media, a great deal of news is uploaded very quickly, and Twitter is one such platform. Twitter is a social media service with a very large number of users, which makes its data very large; this data can be used as a news source for an online news web. However, the data extracted from Twitter covers a wide variety of topics, which makes identifying the topics in the collected data set difficult and time-consuming if done manually by humans, while the data is potentially needed to provide information as quickly as possible. This study aims to classify topics in data taken from Twitter automatically, so that classifying the collected news becomes more effective and efficient and does not take as much time as manual classification. The research was conducted using the Latent Dirichlet Allocation (LDA) method. The documents to be classified are Indonesian news documents, which are classified into topics to be determined. The research concluded that the number of topics formed from 9094 tweets was 10. This is an open access article under the CC-BY-SA license (http://creativecommons.org/licenses/by-sa/4.0/).

A large volume of news stories presents a potential difficulty in the work of automatic classification. How to classify English news stories has been extensively examined; this differs from the automatic categorization of Indonesian news articles, where the implemented classification methods have been confined to classical methods such as Naive Bayes and Support Vector Machine (SVM). Both techniques rigidly classify each document into a single topic. As a result, we employ one of the topic modeling techniques, in which a document is represented as a distribution over topics, each of which is represented by a set of words; the technique used is Latent Dirichlet Allocation. A previous experimental investigation used 10-fold cross-validation and several parameters, including the number of topics (5, 10, and 15) and both LDA hyperparameters (0.001, 0.01, and 0.1); the best overall accuracy was approximately 70% when categorizing Indonesian news stories into five categories: economy, tourism, crime, sport, and politics [5]. Classification using independent variables has been widely implemented with the Naive Bayes method, but its results depend on the features used and break down when a conditional probability is zero. With the SVM method, many texts cannot be classified correctly due to the high-dimensional characteristics of text; the method is rigid, and its performance depends on the choice of kernel function, causing data sparseness problems [6]. This article presents text classification with a topic modeling approach, with the aim of classifying news topics for online news webs using the Latent Dirichlet Allocation (LDA) method. The news texts used in this study are the results of crawling Twitter data through the Application Programming Interface (API) provided by Twitter, producing text data sets based on updates uploaded by Twitter users.
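The crawl-and-store step just described (tweets fetched via the Twitter API, saved as JSON, then flattened to CSV for processing) can be sketched with the standard library alone. The sample record below is fabricated for illustration; in a real run such records would come from the Twitter search API:

```python
import csv
import io
import json

# Fabricated example record; real records come from the Twitter search API.
raw_json = '''[{"created_at": "Mon Aug 02 10:00:00 +0000 2021",
                "id_str": "142", "text": "Berita terbaru dari Indonesia",
                "user": {"screen_name": "contoh"}, "retweeted": false}]'''

# Subset of the tweet fields named in the paper (created_at, id_str, text, ...)
FIELDS = ["created_at", "id_str", "text", "user", "retweeted"]

def tweets_to_csv(tweets: list[dict]) -> str:
    """Flatten a list of tweet dicts into CSV text, keeping selected fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for t in tweets:
        row = {f: t.get(f, "") for f in FIELDS}
        # The user field is nested; keep only the screen name here.
        row["user"] = t.get("user", {}).get("screen_name", "")
        writer.writerow(row)
    return buf.getvalue()

print(tweets_to_csv(json.loads(raw_json)))
```

The same flattening applies record by record to a full crawl file before the pre-processing stage.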
The crawling of Twitter data using the Application Programming Interface was successful and produced informative data [7]. Topic modeling research using the Latent Dirichlet Allocation (LDA) method has been carried out by several previous researchers. This article presents topic modeling using the keyword "Indonesia", crawling the fields create_at, id, id_str, text, entity, metadata, source, user, geo, coordinates, and retweet, which will be used for online news web development with information topic modeling, sentiment analysis, and geolocation graph mapping.

2. Literature Review

2.1. Social Media

Social media is a type of media comprised of three components: the information infrastructure and tools for producing and disseminating media content; the content itself, in digital form, such as personal communications, news, ideas, and cultural items; and the individuals, organizations, and industries that make and consume that content [8]. Social media and other online media are places for finding and disseminating information, accessible to anyone who wants to find information easily via the internet, whether on a smartphone or a computer/laptop. Social media is defined as an online information technology tool that enables users to connect easily via the internet in the form of text messages, audio, video, and photographs [9].

2.2. Twitter

Twitter is an online social networking and microblogging website that enables users to send and read text-based messages, referred to as tweets, of up to 140 characters (raised to 280 characters on November 7, 2017). Jack Dorsey founded Twitter in March 2006, and the social networking platform launched in July of that year. Since its inception, Twitter has grown to become one of the top ten most visited websites on the Internet and has been termed the "internet's short message service."
Since its launch in 2006, Twitter has grown to be one of the most popular social media platforms in the world. Indonesians have used Twitter since the service's launch and are among its most productive users: Indonesia has approximately 29 million Twitter users, and in 2014 it ranked sixth in number of tweets. A great deal of information can be gleaned from Twitter, which is becoming increasingly popular and is used for a variety of purposes, including public, government, and business purposes. Twitter users share many different types of tweets, and users can express their opinions on shared tweets. The term "topic" refers to the substance of numerous tweets covering the same subject. By running topic analysis on these tweets, one can determine the primary themes being discussed at the moment [10].

2.3. Topic Modelling

The notion of topic modeling involves three entities: "word," "document," and "corpora." A "word" is the fundamental unit of discrete data in a text, an item of vocabulary indexed for each unique word in the document. A "document" is an array of N words. "Corpora" is the plural of corpus, a collection of M documents. A "topic" refers to a distribution over a specific vocabulary of words. Put simply, each document in the corpus contains its own proportions of topics based on the words it contains [11]. The basic idea of topic modeling is that a topic consists of certain words that make up that topic, and a document may be composed of several topics with associated probabilities.
While documents are something humans can observe, the topics, the topic distribution per document, and the labeling of each topic are hidden structures; topic modeling therefore helps to find the topics and words contained in a document [12]. The concept of topic modeling is illustrated in Fig. 1.

Fig. 1. Topic Modeling Concept

2.4. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic modeling and analysis method currently in great demand for analyzing large document collections. LDA can be used to summarize, cluster, connect, and process very large data sets, because it produces a weighted list of topics for each document [13]. The Dirichlet distribution is used to obtain the topic distribution for each text, and the LDA generative process uses the Dirichlet draws to allocate the words of a document to different topics. From a human perspective, documents are visible elements, whereas topics, their distribution within documents, and the assignment of words to topics are hidden (latent) structures; hence the name Latent Dirichlet Allocation (LDA) [11]. LDA provides flexibility in organizing, understanding, searching, and summarizing electronic archives, and it has been implemented successfully in text processing and information retrieval. Blei represents the LDA method visually as a probabilistic model, as shown in Fig. 2.

3. Method

The method used in this study is a text mining approach with the main objective of classifying text based on topic modeling.
Iteratively, the research steps are arranged according to the framework shown in the research-stages subsection. In summary, the development of topic modeling methods began with the Latent Semantic Indexing (LSI) algorithm, which attempted to address dimension-reduction issues with the tf-idf scheme. The LSI method was then developed using a maximum likelihood or Bayesian approach, dubbed the Probabilistic LSI (PLSI) method, which incorporates the concept of probability [14]. Although PLSI is useful for probabilistic topic modeling, it does not produce a fully probabilistic model at the document level. The Latent Dirichlet Allocation method emerged to solve this problem [11]. While LDA performs effectively at the document level for documents containing many subjects, it is not optimal for topic modeling on datasets with compact language, such as Twitter data; for such cases, LDA was developed into Twitter-LDA for modeling topics on Twitter datasets [15].

Fig. 2. Topic modeling visualization with LDA method [11]

3.1. Research Stages

This research follows the main framework of text mining, which is divided into four stages, as shown in Fig. 3.

Fig. 3. Research stages

As Fig. 3 shows, the research begins by preparing the data; once the data is prepared, it proceeds to the topic modeling stage with the Latent Dirichlet Allocation method, and after the data is processed with this method, the results are visualized.
Once visualization is complete, the data obtained enters the results and discussion stage [16].

3.2. Modeling Topics with Latent Dirichlet Allocation

Topic modeling with Latent Dirichlet Allocation (LDA) consists of several stages, namely: document, pre-processing, topic modeling, Latent Dirichlet Allocation, and output document; these stages are shown in Fig. 4. In general, text classification based on topic modeling with the LDA algorithm is carried out in five main stages. The first stage is preparing a document or dataset that follows the data processing format. The second stage is pre-processing, in which the data is cleaned. The third stage is modeling, with the aim of classifying text by preparing training data as the main model for topic classification. The fourth stage runs the Latent Dirichlet Allocation algorithm to calculate and execute on the prepared test data. The fifth stage is the result, or output, of the resulting topic model [17].

Fig. 4. Topic modeling stages with Latent Dirichlet Allocation (LDA)

3.3. Corpus Pre-Processing

Topic modeling with LDA requires steps to prepare the data so that it can be processed in the next stage; this is called the corpus pre-processing stage [18]. The sub-activities of the corpus pre-processing stage are shown in Fig. 5.

Fig. 5. Sub-activities of the corpus pre-processing stage [19]

The input document is first lowercased, so that words that differ only in capitalization are treated as the same word. Next, the tokenization process separates each sentence into single word pieces, or tokens.
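A minimal sketch of the lowercasing, tokenization, and stopword-removal steps of this pre-processing stage (pure Python; the regular expression and the short Indonesian stopword list below are illustrative assumptions, not taken from the paper):

```python
import re

# Illustrative (hypothetical) Indonesian stopword list; a real pipeline
# would use a full list, and a stemmer such as Sastrawi for root words.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk"}

def preprocess(tweet: str) -> list[str]:
    """Lowercase, tokenize, and drop stopwords from one tweet."""
    lowered = tweet.lower()                  # case folding
    tokens = re.findall(r"[a-z]+", lowered)  # keep alphabetic tokens only
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Indonesia Juara dan Bangga di Asia"))
# → ['indonesia', 'juara', 'bangga', 'asia']
```

Each cleaned token list then feeds the topic modeling stage as one document.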
Stopwords are common words that carry little meaning and usually occur in large numbers [20]. Given their high frequency in a document, eliminating stopwords is very important, because otherwise the topics would not be interpreted properly. The stemming process changes a word into its root form: it removes all affixes, consisting of prefixes, infixes, suffixes, and combinations of prefixes and suffixes on derived words, because the text data needs to be reduced to base words so that no words are treated as different merely because of affixes [21].

3.4. Modeling Topics

The Latent Dirichlet Allocation stage aims to ensure that the topic model generated from the document is correct, both the topics themselves and the words within each topic. In the topic model validation stage, the correctness of the topics is assessed using the perplexity method and the level of coherence [22]. The LDA model is defined by the joint distribution shown in (1):

P(W, Z, \theta, \varphi; \alpha, \beta) = \prod_{k=1}^{K} P(\varphi_k; \beta) \prod_{j=1}^{M} P(\theta_j; \alpha) \prod_{t=1}^{N} P(Z_{j,t} \mid \theta_j)\, P(W_{j,t} \mid \varphi_{Z_{j,t}})   (1)

Note:
M : the number of documents
N : the number of words in a given document
K : the number of topics
α : Dirichlet prior parameter on the per-document topic distribution
β : Dirichlet prior parameter on the per-topic word distribution
θ_j : topic distribution for document j
φ_k : word distribution for topic k
Z_{j,t} : topic assigned to the t-th word in document j
W_{j,t} : the t-th word in document j

4. Results and Discussion

This section discusses the results of the Twitter text mining trial for topic classification on the online news web, using a text mining approach with Latent Dirichlet Allocation (LDA).
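The joint distribution in (1) is intractable to maximize directly; in practice it is approximated, commonly by collapsed Gibbs sampling. The following pure-Python sketch illustrates this on a toy corpus (the corpus, K = 2, the priors, and the iteration count are illustrative assumptions; real implementations such as gensim's LdaModel are far more efficient):

```python
import random
from collections import defaultdict

random.seed(0)

# Toy corpus: two "sport" documents and two "finance" documents.
docs = [["bola", "gol", "liga"], ["pasar", "saham", "rupiah"],
        ["gol", "liga", "bola"], ["saham", "rupiah", "pasar"]]
K, alpha, beta = 2, 0.1, 0.01                  # topics and Dirichlet priors
V = len({w for d in docs for w in d})          # vocabulary size

# Count tables: document-topic, topic-word, topic totals.
ndk = [[0] * K for _ in docs]
nkw = [defaultdict(int) for _ in range(K)]
nk = [0] * K
z = []                                         # topic assignment per word
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        k = random.randrange(K)                # random initial topic
        zd.append(k); ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    z.append(zd)

for _ in range(200):                           # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                        # remove current assignment
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # P(k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + Vβ)
            weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k; ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# Per-document topic distribution θ_d, in the notation of (1).
theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + K * alpha) for t in range(K)]
         for d in range(len(docs))]
print(theta)
```

After the sweeps, the two sport documents concentrate on one topic and the two finance documents on the other, which is the separation of hidden structures that (1) formalizes.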
The trial was carried out with the aim of facilitating decision making in choosing topics for news on the online news web [23].

4.1. Implementation

This research was carried out in several stages. The first stage is crawling Twitter data, using the keyword "Indonesia" as the hashtag; the crawl results are then converted from JSON to CSV format before the next stage. The next stage is topic modeling with the Latent Dirichlet Allocation method: the data passes through text pre-processing, consisting of two actions, lowercasing and tokenizing, before finally arriving at the Latent Dirichlet Allocation (LDA) stage. The results of the LDA stage are visualized as a word cloud to make them easier to understand. With visualization, it is hoped that decision makers can more easily choose which topic to cover when producing news for the online news web [24].

4.2. Crawling Implementation

The Twitter data crawl uses the Twitter search API as the source of the processed text. The crawling is implemented in the Python programming language by running prepared source code; it requires a consumer key and consumer secret as authentication to access the Twitter stream provided by the search API [25]. To download tweets, the keyword "Indonesia" is written in the source code. When run, the code downloads tweet data, saves it in JSON form, and converts it into CSV form for the next process. The data obtained is complete information about each tweet, in the form of created_at, id, id_str, text, entities, metadata, source, user, geo, coordinates, retweeted, and other related fields, together with the full profile of the tweet's author, saved in JSON form. Fig. 6 shows a small excerpt of the 9094 Twitter records obtained in JSON form.

Fig. 6. Crawling results in JSON form

4.3. Implementation of Latent Dirichlet Allocation

At this stage the Latent Dirichlet Allocation process is invoked. The process is unsupervised learning, in which the algorithm looks for previously undetected patterns in a data set without pre-existing labels and with minimal human supervision. The researcher then interprets the LDA results and concludes which category each topic belongs to. Ten topics are formed from the processed documents. To get maximum results, the model is trained on the vector data produced from the input documents. Fig. 7 presents the result of the topic extraction source code.

Fig. 7. Results of topic calling

4.4. Visualization

Visualization is done using a word cloud: the words that occur are depicted in a visualization in which the size of each word depends on its frequency in the text. The results of the previous process are displayed directly as a word cloud, as shown in Fig. 8.

Fig. 8. Display of word cloud

5. Conclusion

From the analysis and testing described above, the resulting data shows that Latent Dirichlet Allocation can be used for text mining.
The crawling process via the Twitter API produced 9094 documents. After cleaning with text pre-processing, the 9094 tweets were reduced to 2909, yielding 10 main topics. From this topic modeling process, it can be concluded that Latent Dirichlet Allocation can be used for text mining.

Acknowledgment

We thank the Data Science Interdisciplinary Research Center, Universitas Bina Darma, for providing laboratory facilities for this research.

References

[1] A. M. Kaplan and M. Haenlein, "Users of the world, unite! The challenges and opportunities of Social Media," Bus. Horiz., vol. 53, no. 1, pp. 59–68, 2010, doi: 10.1016/j.bushor.2009.09.003.
[2] H. Becker, M. Naaman, and L. Gravano, "Beyond Trending Topics: Real-World Event Identification on Twitter," 2011.
[3] S. Hong, "Online news on Twitter: Newspapers' social media adoption and their online readership," Inf. Econ. Policy, vol. 24, no. 1, pp. 69–74, Mar. 2012, doi: 10.1016/j.infoecopol.2012.01.004.
[4] M. Allahyari et al., "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques," arXiv preprint arXiv:1707.02919, Jul. 2017. [Online]. Available: http://arxiv.org/abs/1707.02919.
[5] R. Kusumaningrum, M. I. A. Wiedjayanto, and S. Adhy, "Classification of Indonesian news articles based on Latent Dirichlet Allocation," in 2016 International Conference on Data and Software Engineering (ICoDSE), 2016, pp. 1–5.
[6] P. Haffner, G. Tur, and J. H. Wright, "Optimizing SVMs for complex call classification," in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), 2003, vol. 1, pp. I-632–I-635, doi: 10.1109/ICASSP.2003.1198860.
[7] E. S. Negara, D. Triadi, and R. Andryani, "Topic Modelling Twitter Data with Latent Dirichlet Allocation Method," in 2019 International Conference on Electrical Engineering and Computer Science (ICECOS), Oct. 2019, pp. 386–390, doi: 10.1109/ICECOS47637.2019.8984523.
[8] P. N. Howard and M. R. Parks, "Social Media and Political Change: Capacity, Constraint, and Consequence," J. Commun., vol. 62, no. 2, pp. 359–362, Apr. 2012, doi: 10.1111/j.1460-2466.2012.01626.x.
[9] M. O. Odewole, "The Role of a Librarian in Using Social Media Tools to Promote the Research Output of His/Her Clienteles," J. Educ. Pract., vol. 8, no. 27, 2017.
[10] A. Ju, S. H. Jeong, and H. I. Chyi, "Will Social Media Save Newspapers?," Journal. Pract., vol. 8, no. 1, pp. 1–17, Jan. 2014, doi: 10.1080/17512786.2013.794022.
[11] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003, doi: 10.5555/944919.944937.
[12] D. M. Blei, "Probabilistic topic models," Commun. ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012, doi: 10.1145/2133806.2133826.
[13] J. C. Campbell, A. Hindle, and E. Stroulia, "Latent Dirichlet Allocation," in The Art and Science of Analyzing Software Data, Elsevier, 2015, pp. 139–159.
[14] L. Chen, N. Tokuda, and A. Nagai, "A new differential LSI space-based probabilistic document classifier," Inf. Process. Lett., vol. 88, no. 5, pp. 203–212, Dec. 2003, doi: 10.1016/j.ipl.2003.09.002.
[15] W. X. Zhao et al., "Comparing Twitter and Traditional Media Using Topic Models," in Advances in Information Retrieval, Berlin, Heidelberg: Springer, 2011, pp. 338–349.
[16] K. L. Sumathy and M. Chidambaram, "Text Mining: Concepts, Applications, Tools and Issues - An Overview," Int. J. Comput. Appl., vol. 80, no. 4, pp. 29–32, Oct. 2013, doi: 10.5120/13851-1685.
[17] Z. A. Guven, B. Diri, and T. Cakaloglu, "Classification of Turkish Tweet emotions by n-stage Latent Dirichlet Allocation," in 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), Apr. 2018, pp. 1–4, doi: 10.1109/EBBT.2018.8391454.
[18] Y. Lu, Q. Mei, and C. Zhai, "Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA," Inf. Retr. Boston, vol. 14, no. 2, pp. 178–203, Apr. 2011, doi: 10.1007/s10791-010-9141-9.
[19] J. Ucherek et al., "Auto-Suggestive Real-Time Classification of Driller Memos into Activity Codes Using Natural Language Processing," Feb. 2020, doi: 10.2118/199593-MS.
[20] V. Jha, N. Manjunath, P. D. Shenoy, and K. R. Venugopal, "HSRA: Hindi stopword removal algorithm," in 2016 International Conference on Microelectronics, Computing and Communications (MicroCom), Jan. 2016, pp. 1–5, doi: 10.1109/MicroCom.2016.7522593.
[21] J. R. Méndez, E. L. Iglesias, F. Fdez-Riverola, F. Díaz, and J. M. Corchado, "Analyzing the impact of corpus preprocessing on anti-spam filtering software," Res. Comput. Sci., vol. 17, pp. 129–138, 2005.
[22] D. Maier et al., "Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology," Commun. Methods Meas., vol. 12, no. 2–3, pp. 93–118, Apr. 2018, doi: 10.1080/19312458.2018.1430754.
[23] N. S. Purohit, A. B. Angadi, M. Bhat, and K. C. Gull, "Crawling through web to extract the data from Social networking site - Twitter," in 2015 National Conference on Parallel Computing Technologies (PARCOMPTECH), Feb. 2015, pp. 1–6, doi: 10.1109/PARCOMPTECH.2015.7084522.
[24] R. Kusumaningrum, S. Adhy, and S. Suryono, "WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classification Based on Latent Dirichlet Allocation," TELKOMNIKA (Telecommunication Comput. Electron. Control), vol. 16, no. 4, p. 1752, Aug. 2018, doi: 10.12928/telkomnika.v16i4.8194.
[25] J.-H. Lee, "Building an SNS Crawling System Using Python," J. Korea Ind. Inf. Syst. Res., vol. 23, no. 5, pp. 61–76, 2018, doi: 10.9723/jksiis.2018.23.5.061.