Punlikasi Jurnal JAETS (Eng)


Journal of Applied Engineering and Technological Science 
                       Vol 4(1) 2022 : 390-399                                             

 
390 

 
TWITTER SOCIAL MEDIA CONVERSION TOPIC TRENDING ANALYSIS 

USING LATENT DIRICHLET ALLOCATION ALGORITHM 

 
Musliadi KH

1
, Hazriani Zainuddin

2
, Yuyun Wabula

3 

Department of Computer System, STMIK Handayani Makassar, Makassar, Indonesia 123 

musliadi.esqway165@yahoo.co.id 1, hazriani@handayani.ac.id 2, 

yuyunwabula@handayani.ac.id 3 

 
Received : 28 October 2022, Revised: 06 December 2022, Accepted : 06 December 2022 

*Corresponding Author 

 
ABSTRACT  

In Indonesia, Twitter is one of the most widely used social media platforms. Because of the diverse and 

frequently shifting message patterns on this social media, it is extremely challenging and time-consuming 

to manually identify topics from a collection of messages. Topic modeling is one method for obtaining 

information from social media. The model and visualization of the results of modeling topics that are 

discussed on social media by the Makassar community are the goals of this study. The Latent Dirichlet 

Allocation (LDA) algorithm is used to model and display the results of this study. The modeling results 

indicate that the eighth topic is the most frequently used word in a conversation. In the meantime, the 7th 

and 6th topics emerged as the conversation's core based on the spread of the words with the highest term 

frequency. The study's findings led the researchers to the conclusion that in the Makassar community's 

social media discussions, capitalization and visualization using the LDA method produced the words with 

the highest trend and the topic with the highest term frequency. 

 
Keywords : Topic Analysis, LDA, Trending Twitter Topics, Twitter Conversation Topics 

 
1. Introduction  

The pace at which internet technology is developing has changed the way people live their 

lives. The way people communicate with one another in their day-to-day activities has changed 

as a result of the development of internet technology. While internet technology was initially 

thought to be complicated, it has since become something that the majority of people are familiar 

with (Yatabe et al., 2021). 

Twitter is a smartphone application that has an impact on social interaction and culture. that 

Twitter is one of the social media platforms that can connect individuals from all over the world. 

The community's use of social media has a direct impact, both positive and negative. People who 

use social media frequently at certain times may disrupt their daily activities. For instance: 

Suddenly receiving a message from another person while working, which the recipient reads and 

responds to, can obviously disrupt their work (Ayora et al., 2021). 

Twitter and other social media conversations among members of the community can provide 

data that can be used to examine how information changes over time. Analyses based on this 

information can make predictions about the Makassar community's events (Fraiwan, 2022). 

Topic modeling is a clustering method which is included in unsupervised learning. No labels 

are used in unsupervised learning for an object. In unsupervised learning, there are three types of 

clusters that can be used to model data, namely hard clustering, hierarchical clustering, and 

soft/fuzzy clustering. To model the topic, you can use the soft/fuzzy clustering category, where 

each object can have more than one cluster with a certain level. Topic modeling with soft/fuzzy 

clustering category can use Latent Dirichlet Allocation (LDA) technique or algorithm. LDA is a 

method used to analyze very large documents. In addition to document analysis, LDA can also be 

used to summarize grouping, linking and processing data (Chauhan & Shah, 2021; Gurcan et al, 

2021; Sharma & Sharma, 2022). 

An observational study on the use of social media in determining the trend of discussion 

topics for the Makassar people from 12 to 27 September 2020 using the LDA algorithm is required 

based on the description of the problem's background. The study's title is research “Twitter Social 

Media Conversion Topic Trending Analysis Using Latent Dirichlet Allocation Algorithm”. 


KH et al…                                    Vol 4(1) 2022 : 390-399 

391 

 
2. Literature Review 

2.1 Grow of Internet Technology 

The development of Information and Communication Technology has an impact and 

influence on community culture, both positive and negative impacts. The aspect of life that is 

most affected is in terms of cultural aspects (Irgashevich et al., 2022). From time to time, the 

development of communication technology continues to increase so that it affects the way humans 

communicate. From the results of research conducted by (Cook & Sayeski, 2022), it shows that, 

there is an influence of the use of Smartphone technology on the social interaction of adolescents 

in high school. 

2.2 Social Media 

Social media is digital media or the internet that has the potential as a medium for community 

empowerment. The presence of social media followed by the growing number of users every day 

provides interesting facts about how influential the internet is for life (Valkenburg et al., 2022). 

Social media changes people's attitudes and behavior a lot because social media is used to 

create lies in society. Twitter is one of the social media that functions to find old friends which is 

applied by sending photos, videos, playing games, discussing, and much more (Singh et al., 2022). 

This social media was first founded by Mark Zuckerberg with his roommates and fellow Harvard 

University students, namely Eduardo, Saverin, Danrew McCollum, Dustin Moskovits and Chris 

Hughes (Haupt, 2021). 

2.3 Data Mining 

Data Mining is a scientific discipline that aims to find, explore, or mine knowledge from data 

or information. Data mining is an analytical step to find new knowledge from a database or 

knowledge discovery in a database, where knowledge can be in the form of valid data or 

relationships between data. Data mining can be applied in various fields that have a number of 

data so that data mining can be interpreted as a mixture of statistics, artificial intelligence, and 

database research. The application of data mining in various fields will certainly employ one or 

more computer learning techniques in analyzing or extracting knowledge automatically (Regin et 

al., 2021; Ageed et al., 2021; Oatley, 2022; Haoxiang & Smys, 2021). 

Data mining has an important function to help obtain useful information in increasing 

knowledge for users. Basically, data mining has six functions which refer to Larose quoted, 

namely (Rusydiyah et al., 2021; Ewieda et al., 2021): 

a. Description; aims to identify patterns that appear repeatedly in data and change these patterns 
into rules and criteria that are easy to understand so that they can be easily and effectively 

understood by the application domain so as to increase the level of knowledge in the system. 

This method is a data mining method that is needed by postprocessing techniques in validating 

and explaining the results of the data mining process. Postprocessing is a process used to 

ensure valid and useful results for use by interested parties. 

b. Prediction; This method is used to predict what will happen in the future in a certain time based 
on examples of processed data. 

c. Estimates; this method is similar to prediction, which distinguishes only the variable that is 
the target of the estimate is more in the numerical direction than in the categorical direction. 

The records used to perform estimates must be complete and provide the value of the target 

variable as the predicted value. Then, a review of the estimated value of the target variable is 

made based on the value of the predictive variable. 

d. Classification; This method is a method used to describe and distinguish data into certain 
classes. This method performs the process of checking the characteristics of the object and 

then entering the object into one of the predefined classes. 

e. Clustering; This method is a method of grouping data into the same object class without paying 
attention to certain data classes. Cluster is a collection of records that have similarities with 

each other and have dissimilarities with records in other clusters. The purpose of this process 

is to produce groupings of objects that are similar to each other. The greater the similarity of 

objects grouped in a cluster, the greater the difference between each cluster and the better the 

quality results from cluster analysis. 


KH et al…                                    Vol 4(1) 2022 : 390-399 

392 

 
f. Association; The task of the association method in data mining is to find attributes that appear 
at a time. In the business world, this method is more often called a shopping basket analysis 

(market basket analysis). 

2.4 Text Mining 

Text mining is a process of exploring and analyzing large amounts of unstructured text data 

assisted by software in order to identify concepts, patterns, topics, keywords, and other attributes 

in the data. Text mining usually involves the process of structuring text input, where the input text 

is usually parsed together with the addition of linguistic features and deletion of words which then 

inserts them into the database and then derives patterns in structured data and finally evaluates 

and interprets the output. Text mining capabilities incorporated into AI chatbots and virtual agents 

are increasingly being used by companies in providing automated responses to customers as part 

of their marketing, sales, and customer service operations (Kumar et al., 2021; Hudaefi et al., 

2021; Carracedo et al., 2021). In general, the stages carried out in text mining can be drawn as 

follows: 

a. Tokenizing 
The tokenizing stage is the stage of cutting the input string based on each word that composes 

it from the data source used. An example of the input string truncation stage is as follows: 
 

Fig. 1. Tokenization Stage 

 
b. Filtering 
The next step after tokenizing is taking important words from the token results and discarding 

unimportant words and storing important words. The removal of unimportant words can use the 

stop list algorithm or the word list algorithm. An example of the filtering stage can be seen in the 

following figure: 
 

Fig. 2. Filter Stage 

 
c. Stemming 
The stemming stage is the stage carried out to find the root word of each filtered word. An 

example of the stemming process is more or less as follows: 
 

Fig. 3. Stemming Stage 
 

d. Tagging 
The tagging stage is the process of finding the initial or root form of each past word or word 

from stemming results. The results of the tagging process from the data taken from stemming are 

more or less as follows: 


KH et al…                                    Vol 4(1) 2022 : 390-399 

393 

 
Fig. 4. Tagging Stage 

 
e. Analyzing 
The analyzing stage is the stage carried out to determine how far the connection between 

words between existing documents is. An example of the results of the analyzing stage is more or 

less like the following picture: 
 

Fig. 5. Analysis Stage 

 
2.5 Latent Dirichlet Allocation (LDA) 

Latent Dirichlet Allocation (LDA) is a generative probabilistic model of the corpus. The 

basic idea of LDA is that each document is represented as a random mixture of hidden topics, 

where each topic is characterized by a distribution through the words in it (Gupta & Katarya, 

202). Blei represents the LDA method as a probabilistic model visually (Habibi et al., 2021; Ning 

et al., 2022) : 
  

Fig. 6. Visualization of LDA according to Blei 

 
Visualization of LDA according to Blei has three levels of LDA representation. Parameters 

and are corpus level parameters which are assumed to be sampled once in the process of producing 

a corpus. Variable is document level variable (M). Variables Zdn and Wdn are word level 

variables (N) and are sampled once for each word in a document. The parameter is used to 

determine the distribution of topics in a document, the greater the value of the greater the mix of 

topics in a document. On the other hand, the smaller the value of the smaller the mix of topics in 

a document. The parameter is used to determine the distribution of words in a topic, the greater 

the value of the more words there are in a topic. On the other hand, the smaller the value of the 

smaller the number of words in a topic. This variable represents the distribution of topics in a 

document, the greater the value of the more topics there are in the document. Conversely, the 

smaller the value of the fewer topics there are in the document. 

In general, the way LDA works is by entering a document set and some specified parameters. 

Then the LDA process is carried out to produce a model consisting of weights that can be 

normalized to probability. Probability appears in 2 types: (a) the probability that a certain 


KH et al…                                    Vol 4(1) 2022 : 390-399 

394 

 
document produces a certain topic in a position and (b) the probability that a certain topic produces 

a certain word from a collection of vocabulary. 

2.6 Python 

Python is a multipurpose interpretive programming language with a design philosophy that 

focuses on code readability. Python is an open source programming language. Python itself was 

launched in the community since 1991 by Guido van Rossum under the name of the Python 

Software Foundation vendor. Python is claimed to be a programming language that combines 

capabilities, capabilities with a clear code syntax and is equipped with a complete and 

comprehensive library. 

 
3. Research Methods 

Research methodology is a procedure or method used in conducting research along with 

steps that are systematically arranged to solve the problem being studied using a certain scientific 

basis. The research methodology framework used can be seen in Figure 7 below: 
 

Fig. 7. Research methodology framework 

 
4. Results and Discussions  

The application of the Latent Dirichlet Allocation (LDA) algorithm to community twitter 

data on 12-27 September 2020 was carried out by following the research methodology framework 

with the following results: 

a. Data collection 
The results of data collection carried out on 12-27 September 2020 obtained data of 224,515 

records from 29,668 users. Sample data from the results of data collection carried out can be seen 

in Table 1. 
Table 1 – Sample Data. 

Nama User Username Text Date Time Location 

Agnesia 

Hartono 

agnesia_harton

o 

Keberhasilan adalah 

buah dari kerja keras 

+ pantang menyerah, 

Bukan dari sekedar 

mimpi 

2020-11-14 13.49.29 Makassar, 

Indonesia 


KH et al…                                    Vol 4(1) 2022 : 390-399 

395 

 
Andi 

Muhamma

d Irham 

Andirham Risih juga di telpon 

terus.. 

2020-11-14 13.49.21 Makassar 

Online 

Shop 

Makassar 

TerkiniGaul Hey\(â€™âˆ‡â€™)/ 

ranni24_ AyoBantu 

Retweet & Cekidot 

TokoTamz 

pinBB:2BB19B17 

Jual Sepatu Baju 

Aksesoris Cewek di 

Makassar 

2020-11-25 16.44.19 Kota 

Makassar 

Makassar 

Event 

#Makassar

Event 

MakassarAcar

a 

Hey(ã•£^â–¿^)ã•£ 

Pengurus_masjid 

AyoBantu Retweet & 

Cekidot TokoTamz 

pinBB:2BB19B17 

Jual Sepatu Baju 

Aksesoris Cewek di 

Makassar 

25/11/2020 16.44.18 Makassar 

 
b. Data Cleaning 
Data cleaning aims to remove parts of the data that are not used in the analysis process 

carried out to model the data. The focus of data analysis will only use data in the Text column 

which is the status of each user. Based on the results of data collection, in the data there are several 

columns that are not needed in the analysis process, then these columns will be cleaned. 
 

Fig. 8. Data Cleaning 

 
c. Data Preposing 
The implementation of data preposing is carried out before analyzing the topic using LDA 

with the aim of structuring, tidying, and preparing the data before the data is analyzed. Data 

preprocessing is done sequentially, namely removing punctuation marks, removing numbers 

between spaces, case folding and removing stopwords. After preposing the data, punctuation 

marks, numbers between spaces, case folding and stopwords such as the word "yang" are 

removed. 
 

Fig. 9. Before Preprocessing Data 

 
Fig. 10. After Preprocessing Data 

d. Wordcloud 


KH et al…                                    Vol 4(1) 2022 : 390-399 

396 

 
After preposing the data, punctuation marks, numbers between spaces, case folding and 

stop words in the data are removed. To see the results of the data preposing process, the data is 

displayed in the form of a wordcloud visual representation of the most common words, so the 

results are as follows. 
 

Fig. 11. Wordcloud Visual Representation 

 
e. LDA Topic Modeling 
Before doing topic modeling with LDA, it is necessary to determine the number of topic 

models. Determination of the number of topic models is done by looking at the coherence score. 

The coherence score is a measure used to evaluate topic modeling. Good modeling will produce 

topics with high topic coherence scores (Lestari, 2019) 
Table 2 – Coherence Score 

Topic Alpha Beta Corehence 

9 0.90 0.90 0.6821430186753112 

10 0.90 0.90 0.6813112648237454 

8 0.90 0.90 0.6485276773680742 

9 0.90 0.90 0.6211660697995236 

10 0.90 0.90 0.6185417503103652 

4 asymmetric 00.01 0.6071873122344424 

8 0.90 0.90 0.6031553436565154 

2 00.01 00.01 0.5954692961848775 

4 asymmetric 00.01 0.5910008238074393 

5 asymmetric 0.90 0.5908647353392914 

 
The coherence score generated in determining the number of topics is on topic 9 with a 

coherence score of 0.6821430186753112. The greater the coherence score obtained, the better the 

interpretation of the modeling topic will produce (Lestari, 2019). Based on the best results on the 

coherence score, this study used 9 (nine) topics. 
 

Fig. 12. Display coherence value 

 
Table 3 - Topic Division and Its Coherence Value. 

Topic Terms 

T0 
0.016*"mas" + 0.016*"banget" + 0.016*"gaa" + 0.016*"abis" + 0.016*"seneng" + 

0.016*"cust" + 0.016*"hasilnya" + 0.016*"adek" + 0.016*"trus" + 0.016*"ngejoki" 

T1 

0.033*"bales" + 0.017*"dendam" + 0.017*"chat" + 0.017*"sengaja" + 

0.017*"draw" + 0.017*"tiddies" + 0.017*"adding" + 0.017*"bneran" + 

0.017*"learn" + 0.017*"curut" 

T2 

0.017*"door" + 0.017*"kendaraan" + 0.017*"balas" + 0.017*"visor" + 

0.017*"brader" + 0.017*"bupati" + 0.009*"trus" + 0.009*"orang" + 

0.009*"lagunya" + 0.009*"ovo" 


KH et al…                                    Vol 4(1) 2022 : 390-399 

397 

 
T3 

0.024*"pagi" + 0.024*"indah" + 0.024*"bales" + 0.024*"hidup" + 0.013*"orang" + 

0.013*"melepas" + 0.013*"cab" + 0.013*"eksport" + 0.013*"pangan" + 

0.013*"soppeng" 

T4 

0.034*"berhenti" + 0.023*"pijat" + 0.023*"rakyat" + 0.012*"badan" + 0.012*"real" 

+ 0.012*"bersahabat" + 0.012*"privasi" + 0.012*"fresh" + 0.012*"sehat" + 

0.012*"terjamin" 

T5 
0.044*"lagu" + 0.030*"enak" + 0.016*"pas" + 0.016*"army" + 0.016*"banget" + 

0.016*"jujur" + 0.016*"muncul" + 0.016*"kurus" + 0.016*"kek" + 0.016*"obatnya" 

T6 

0.023*"ways" + 0.020*"makassar" + 0.019*"cekidot" + 0.019*"aksesoris" + 

0.019*"las" + 0.019*"ayobantu" + 0.019*"pinbb" + 0.018*"bb" + 0.018*"cewek" + 

0.018*"baju" 

T7 

0.020*"tokotamz" + 0.020*"jual" + 0.019*"retweet" + 0.019*"sepatu" + 

0.019*"baju" + 0.019*"cewek" + 0.019*"bb" + 0.019*"pinbb" + 0.019*"ayobantu" 

+ 0.018*"aksesoris" 

T8 

0.043*"orang" + 0.015*"wheels" + 0.015*"album" + 0.015*"mood" + 

0.015*"niggas" + 0.015*"cintailah" + 0.015*"st" + 0.015*"thursday" + 

0.015*"percayai" + 0.015*"back" 

 
Each topic generated shows the cohorence value of each word in each topic with a different 

value. Based on these results, we can see some examples of words with the highest 

cohorence/probability values, namely: “0.044_lagu” on topic T5, “0.043_orang” on topic T8, 

“0.034_berhenti” on topic T4, and on T1 there is the word “bales” with cohorence value 0.034. 

 
f. Topic Modeling Visualization 
The modeling visualization in the research after completing the modeling using LDA is 

saved in the form of pyLDAvis which can form a visualization of each topic and the most 

frequently occurring words. 
 

Fig. 13. pyLDAvis visualization 
 

The pyLDAvis visualization results display 30 important words that appear in the corpus 

and display the dominant words discussed from 9 topics. In the right panel, the visualization 

displays the terms song, bales, makassar, you, sell, accessories and other words. 

In addition to displaying 30 important words on all topics, the visualization results also 

display 30 important words from each topic. One topic with another topic may have the same 

word so that it overlaps with each other. For example, Topic 2 and Topic 5. The two topics overlap 

each other, because Topic 2 has words that are also found in Topic 5 or vice versa. 


KH et al…                                    Vol 4(1) 2022 : 390-399 

398 

 
Based on the topic distribution data based on the cohorence value and the visualization 

results show the T8 topic to be the topic of conversation with the highest cohorence value because 

of the appearance of the word "People" in the visualization results. 

Meanwhile, based on the results of matching the visualization with the words in the 

"Terms" column, the trending topics of conversation gathered on one topic, namely topics T7 and 

T6. In the T7 topic there are the words "tokotamz", "selling", "retweet", "shoes", "clothes", "girls", 

"bb", "pinbb", "ayobantu", and the word "accessories" which is a word that there are 30 important 

words, while on topic T6 there are the words "ways", "makassar", "cekidot", "accessories", 

"ayobantu", "pinbb", "bb", "girls" and the word "clothes". 

 
5. Conclusion  

Based on the results of the study, it was concluded that the capitalization and visualization 

with the LDA method produced the words with the highest trend and the topic with the highest 

term frequency in the discussion of the Makassar community on social media was on topic 8. 

women's accessories. 

 
References 

Ageed, Z. S., Zeebaree, S. R., Sadeeq, M. M., Kak, S. F., Yahia, H. S., Mahmood, M. R., & 

Ibrahim, I. M. (2021). Comprehensive survey of big data mining approaches in cloud 

systems. Qubahan Academic Journal, 1(2), 29-38. 

Ayora, V., Horita, F., & Kamienski, C. (2021, January). Profiling Online Social Network 

Platforms: Twitter vs. Instagram. In Proceedings of the 54th Hawaii International 

Conference on System Sciences (p. 2792). 

Carracedo, P., Puertas, R., & Marti, L. (2021). Research lines on the impact of the COVID-19 

pandemic on business. A text mining analysis. Journal of Business Research, 132, 586-

593. 

Chauhan, U., & Shah, A. (2021). Topic modeling using latent Dirichlet allocation: A survey. ACM 

Computing Surveys (CSUR), 54(7), 1-35. 

Ewieda, M., Shaaban, E. M., & Roushdy, M. (2021). Customer Retention: Detecting Churners in 

Telecoms Industry using Data Mining Techniques. International Journal of Advanced 

Computer Science and Applications, 12(3). 

Fraiwan, M. (2022). Identification of markers and artificial intelligence-based classification of 

radical Twitter data. Applied Computing and Informatics.  

Gupta, A., & Katarya, R. (2021). PAN-LDA: A latent Dirichlet allocation based novel feature 

extraction model for COVID-19 data using machine learning. Computers in biology and 

medicine, 138, 104920. 

Gurcan, F., Ozyurt, O., & Cagitay, N. E. (2021). Investigation of emerging trends in the e-learning 

field using latent Dirichlet allocation. International Review of Research in Open and 

Distributed Learning, 22(2), 1-18. 

Haoxiang, W., & Smys, S. (2021). Big data analysis and perturbation using data mining algorithm. 

Journal of Soft Computing Paradigm (JSCP), 3(01), 19-28. 

Haupt, J. (2021). Facebook futures: Mark Zuckerberg’s discursive construction of a better world. 

New Media & Society, 23(2), 237-257. 

Hudaefi, F. A., Caraka, R. E., & Wahid, H. (2021). Zakat administration in times of COVID-19 

pandemic in Indonesia: a knowledge discovery via text mining. International Journal of 

Islamic and Middle Eastern Finance and Management. 

Irgashevich, S. T., Odilovich, O. A., & Mamadaliyevich, G. E. (2022). Internet Technologies In 

The Tourism Industry. Web of Scientist: International Scientific Research Journal, 3(9), 

57-64. 

Kumar, S., Kar, A. K., & Ilavarasan, P. V. (2021). Applications of text mining in services 

management: A systematic literature review. International Journal of Information 

Management Data Insights, 1(1), 100008. 

Ning, W., Liu, J., & Xiong, H. (2022). Knowledge discovery using an enhanced latent Dirichlet 

allocation-based clustering method for solving on-site assembly problems. Robotics and 

Computer-Integrated Manufacturing, 73, 102246. 


KH et al…                                    Vol 4(1) 2022 : 390-399 

399 

 
Oatley, G. C. (2022). Themes in data mining, big data, and crime analytics. Wiley 

Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(2), e1432. 

Regin, R., Rajest, S. S., & Singh, B. (2021). Spatial data mining methods databases and statistics 

point of views. Innovations in Information and Communication Technology Series, 103-

109. 

Sharma, C., & Sharma, S. (2022). Latent DIRICHLET allocation (LDA) based information 

modelling on BLOCKCHAIN technology: a review of trends and research patterns used in 

integration. Multimedia Tools and Applications, 1-27. 

Valkenburg, P. M., Meier, A., & Beyens, I. (2022). Social media use and its impact on adolescent 

mental health: An umbrella review of the evidence. Current opinion in psychology, 44, 58-

68. 

Yatabe, J., Yatabe, M. S., & Ichihara, A. (2021). The current state and future of internet 

technology-based hypertension management in Japan. Hypertension Research, 44(3), 276-

285.