Journal of Applied Engineering and Technological Science Vol 4(2) 2023: 855-863 855 TWITTER DATA ANALYSIS AND TEXT NORMALIZATION IN COLLECTING STANDARD WORD Arif Ridho Lubis1*, Mahyuddin K M Nasution2 Department of Computer Engineering and Informatics, Politeknik Negeri Medan, Indonesia1 Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Indonesia2 arifridho@polmed.ac.id Received : 18 April 2023, Revised: 07 May 2023, Accepted : 08 May 2023 *Corresponding Author ABSTRACT This study discusses the Twitter data analysis and text normalization in standard word collection. Twitter is one of the most important data sources in social data analysis. However, the text contained on Twitter is often unstructured, resulting in difficulties in collecting standard words. Therefore, in this research, we analyze Twitter data and normalize text to produce standard words that can be used in social data analysis. The purpose of this research is to improve the quality of data collection on standard words on social media from Twitter and facilitate the analysis of social data that is more accurate and valid. The method used is natural language processing techniques using classification algorithms and text normalization techniques. The result of this study is a set of standard words that can be used for social data analysis with a total of 11430 words, then 4075 words with structural or formal words and 7355 informal words. Informal words are corrected by trusted sources to create a corpus of formal and informal words obtained from social media tweet data @fullSenyum. The contribution to this research is that the method developed can improve the quality of social data collection from Twitter by ensuring the words used are standard and accurate and the text normalization method used in this study can be used as a reference for text normalization in other social data, thus facilitating collection. and better-quality social data analysis. This research can assist researchers or practitioners in understanding natural language processing techniques and their application in social data analysis. This research is expected to assist in collecting social data more effectively and efficiently. Keywords: Natural Language processing, Word, Formal, Analysis, Twitter 1. Introduction The development of social media has now become a trend in terms of communication between on line users (Anandhaun et al., 2018; Zeng et al., 2010), social media is an online media with a gathering of users who can easily communicate, share and participate one another (Schreck & Keim, 2012; Middleton et al., 2013; A.R. Lubis et al., 2019). Data in social media represents research and manageable challenges in natural language processing (Young et al., 2018; Liang & Dai, 2013). Twitter is a widely popular social media platform around the world, where users can post short messages known as "tweets"(Lubis, Prayudani, Lubis, et al., 2022; Lubis, Prayudani, Nugroho, et al., 2022). Due to the brevity of tweets and the fact that they sometimes do not follow proper grammar or spelling rules, text normalization is required to collect correct and consistent words(A.R. Lubis et al., 2019; Arif Ridho Lubis et al., 2020). In this context, text normalization refers to the process of converting non-standard text to standard text by correcting spelling, grammar, and other errors(Neto et al., 2020; Dirkson et al., 2019). The results of text normalization can help improve the accuracy of collecting correct and consistent words for further analysis. The data obtained from social media was still unstructured which still needed to be improved (H. Zheng et al., 2020; X. Zheng et al., 2015). Several studies had been carried out on social media data in many languages such as Indian (Tanna et al., 2020; Roshini et al., 2019; Kumar et al., 2021) Chinese (Xuanyuan et al., 2021; Liu & Chen, 2019). The conducted research focused on improving the technique of the preprocessing process and the completion of non- standard and unstructured words due to the use of words and phrases in communication in Indonesian-language social media (Chen et al., 2020; Alhaj et al., 2022), Preprocessing is a stage in the natural language preprocessing method intended for documents in the form of text (Pano & Kashef, 2020; Villavicencio et al., 2021). The goal was to prepare data or text obtained from Lubis & Nasution … Vol 4(2) 2023 : 855-863 856 unstructured social media into good data and could be easily processed for further processing(Jimenez-Marquez et al., 2019;Shu et al., 2017; Iskandar & Marjuki, 2022). In the preprocessing technique there were several processes such as parsing, case folding, tokenizing, stemming, filtering/stop words, normalization (Chen et al., 2020; Baccouche et al., 2020; Sarimole & Fadillah, 2022). In the text normalization stage, it was very important to be able to help parse Indonesian language that could understand lexical meaning well, performance in processing structural and unstructured words could be improved if the preprocessing stages were carried out properly, especially normalization for unstructured words (Izonin et al., 2022; Basan et al., 2022). Research by (Göker & Can, 2018) In this study, two approaches were carried out for the Turkish text i.e. the contextual normalization approach and the sequence-to-sequence normalization approach using a neural encoder model. Other studies (Nguyen et al., 2017) carried out normalization which could extract information related to data accurately. Based on the conclusions and objectives of previous research, there are still weaknesses in normalizing text, so this study will apply text normalization which will use Indonesian language data originating from social media Twitter, which has unstructured and non-standard words, normalization techniques used to carry out text normalization process so that data can be processed and analyzed further. This normalization stage is adjusted to the data that has been obtained from social media (Jose & Raj, 2014;bin Sazali & Idris, 2022). The nature of the data obtained from social media is that there are users who generally communicate and express expressions (Sebastian & Nugraha, 2019), for example they say "kota in sunggh sjuk setiap hari", in the tweet snippets obtained by naked eye you can see several errors like "in" which should "ini", "sunggh" that should be "sungguh", and "sejk" harus "sejuk". If the words are processed and analyzed, such as sentiment, and classification, the results obtained are not accurate, so text processing analysis is needed. The purpose of this study is to use a dictionary-based normalization approach so that it can have superior differences from previous research. This study divides the process into 3 parts, namely text normalization, statistical words or lexicon, and non-standard words that appear in tweet data. In related research, standard Indonesian is specifically discussed. This study also provides slang and formal words to understand the characteristics of the tweet data. 2. Literature Review Many researchers had applied the model of deep learning to many cases such as (Maghfur et al., 2021) Text-to-Speech (TTS) is widely used for both academic/non-commercial and industrial/commercial purposes. However, in some cases, text normalization is added to improve TTS performance. In this study, a rule-based approach is proposed to create a normalized Indonesian text dataset that has raw text and spoken form to improve Indonesian TTS performance. This approach shows good performance for text normalization for Indonesian TTS with a Word Error Rate (WER) of 0.0805. Another study conducted by (Khan & Lee, 2021) concluded that In this research, it is proposed to develop an application called Textual Variations Handler (TVH), which is a generic application that works in a variation-independent manner to handle various types of noise in textual data originating from various social media (SM) applications to improve text analysis. The aim of this research is to introduce a hybrid normalization technique that is effective in ensuring that information obtained from noisy text data can be utilized in the desired form. This study integrates the TVH application with a deep- learning state-of-the-art (SOTA) based text analysis method to improve its performance in analyzing noisy SM text data. The simulation results show that the proposed scheme is promising in terms of precision, recall, accuracy, and F1 scores in the analysis of informal texts on social media. This research (Sebastian & Nugraha, 2019), this research normalized the Indonesian language with data having several words from data consisted of unstructured sentences and non- standard words. The normalization method was carried out to analyze data for the next process. This research (Rahman et al., 2019) analyzed the data of Indonesian tweets consisting of unstructured text in order to complete the word processing process and clean up tweet data from unstructured text. In processing text by using case folding, filtering stages, tokenizing. Then the normalization process was carried out so that words having excess letters, abbreviation, slang and all documents were converted into standard words and if there were words without meaning, they Lubis & Nasution … Vol 4(2) 2023 : 855-863 857 were deleted. Other research conducted by (Javaloy & García-Mateos, 2020) concluded that this study proposes a new method for the encoder in the encoder-decoder architecture called Causal Feature Extractor (CFE) for text normalization as the first step in a text-to-speech system ( TTS). CFE is compared to other encoding methods and shows better results in terms of accuracy, number of parameters, convergence time, and the use of the attention matrix based on the attention mechanism. The proposed method is general in nature and can be applied to various input types such as text, audio and images. This research (Sebastian & Nugraha, 2019) collected data in conducting research in the field of text mining. Data collection consists of many languages which were then processed to obtain normalized data in the word processing. Abbreviated words were a problem in text mining which resulted in the system not being able to process the text optimally due to differences in the meaning of the abbreviated words. The purpose of this study was to develop and obtain a set of Indonesian abbreviations. This research applied Crowdsourcing method in developing the dataset. Based on previous research that has been described there are deficiencies that can be resolved in this study which are listed in Table 1 below: Table 1 - Examples of the Use of Normal and Normal Words No Researcher Title Lack 1 (Maghfur et al., 2021) Text Normalization for Indonesian Text-to-Speech (TTS) using Rule-Based Approach: A Dataset and Preliminary Study This research is rule-based only and does not involve machine learning, so it may not be able to handle some of the more complex cases of text normalization. 2 (Khan & Lee, 2021) Enhancement of Text Analysis Using Context-Aware Normalization of Social Media Informal Text There is no comparison with other text normalization methods that have been used in previous studies. This study only compares the performance of the proposed method with the same method without normalization. 3 (Rahman et al., 2019) Normalization of Unstructured Indonesian Tweet Text For Presidential Candidates Sentiment Analysis The text normalization dataset created in this study has not been verified considering regional variations in Indonesian and may need to be expanded. 4 (Javaloy & García- Mateos, 2020) Text normalization using encoder–decoder networks based on the causal feature extractor This study only discusses text normalization in English. Thus, it is not yet known how effective the proposed method is when applied to other natural language processing (NLP) problems. 5 (Gunawan et al., 2019) Normalization of abbreviation and acronym on Microtext in Bahasa Indonesia by using dictionary-based and longest common subsequence (LCS) This research has a weakness in filtering text, so it is necessary to add data and normalize it 6 (Kusumawardani et al., 2018) Context-sensitive normalization of social media text in bahasa Indonesia based on neural word embeddings This research only uses 1000 dictionary data so it needs to be added so that word tokens can represent all text data The solution offered was the use of statistical matching translation to carry out the process of normalizing Indonesian text by utilizing translation at the phrase and character level in text data. In this case, using an external corpus for the data to be used in the normalization model. Indonesian language has a colloquial structure that could be found in the tweet data. In some cases, the everyday language words contained in the tweet data were still difficult to understand. The following examples of categories and samples could be seen in Table 2. On the linguistic side, you could also use the lexicon to search and observe slang in social media. In general, a comparison of the appearance of slang words from tweet data on social media was then carried out. Table 2 - Examples of the Use of Normal and Normal Words No Non-Formal formal 1 sjuk Sejuk 2 Dtng datang 3 Brlri berlari 4 Plng pulang 5 skolah sekolah Lubis & Nasution … Vol 4(2) 2023 : 855-863 858 3. Research Methods The quantitative research method will assist in systematically and objectively collecting data from Twitter. The obtained data will be processed using data analysis techniques such as descriptive statistics and text classification. Additionally, this research will also employ text normalization methods to correct spelling and grammar errors in the text obtained from Twitter. Text normalization methods can include the removal of punctuation marks, the use of word- breaking algorithms, and the combination of separated words. This study will use a sample of data from the Twitter account @fullsenyum, which will be taken through a random sampling process. The data will then be processed using data analysis techniques and text normalization methods. Figure 1 shows the research flowchart. Fig. 1. Distribution of formal word Frequency Based on Figure 1, there are two stages in collecting standardized words as follows: a) Collecting Data To obtain tweet data from the @fullsenyum community on the Twitter social media platform, this study used crawling techniques. The collected tweet data covers the period from 2018 to 2021, and a total of 11,430 tweets were successfully gathered. Table 3 shows the results of the crawling process. b) Preprocessing The preprocessing stage was needed in this research because the tweet data obtained from Twitter social media had unstructured data so that preprocessing was carried out to remove punctuation marks, emoticons, and make changes to uppercase letters to lowercase letters and eliminate words considered not important. The following was the result of preprocessing data. Table 3 - The Preprocessing Result No Tweet 1 bener2 perjuangan, ngoding di hp dia.. KEREN! Ekspektasi: Banyak harta Kenyataan: Banyak pikiran 2 anak perempuan milik ayahnya sampai ia menikah, tetapi anak laki-laki milik ibunya sampai ia mati 3 Udah seneng banget punya sahabat cowok. Eh, dianya malah nembak Golongan orang yg kalo udah ga respect, udah ga mau kenal lagi. 4 Mau ngumpulin orang yang makin hari makin males buka WhatsApp. 5 my mood can change from 6 take me back to the days where i sleep without over thinking. 7 "kok bisa putus sih?" "ya bisa" 8 Gimana ya kalo aku bukan kriteria yang diinginkan keluarganya Lubis & Nasution … Vol 4(2) 2023 : 855-863 859 9 ni gw dah ngerjain tugas mati2an dari sd ampe kuliah awas ae gw gede ga tajir 10 hidup sudah menyebalkan, gausah -penting banget ok 10000 …….. 11429 Makin kesini makin nguras mental banget ya. 11430 sebaik baiknya mood booster dan support system adalah uang, duit, cuan, dan money c) Method In this study, the text normalization method was used, which is the process of converting non-standardized text into standardized text, so that it is easier for computers to understand and process. There are several techniques commonly used in text normalization, such as removing non-alphanumeric characters, replacing slang words with standard words, adjusting the use of capital and small letters, and handling abbreviations and abbreviations. More sophisticated methods of text normalization use natural language processing (NLP) technologies and machine learning to recognize and correct errors in text, such as misspellings or incorrect use of words in certain contexts 4. Results and Discussions This study built a lexicon by processing the words contained in the tweet data, totaling 11,430 data obtained from social media Twitter on the @fullsenyum account which had a public figure account. The tweet data obtained were preprocessed with several steps to improve the sentence structure of the tweet data. After the preprocessing stage, the daily Indonesian language lexicon produced contained 4075 formal words and 7355 informal words. Most of them were informal words in Indonesian. Each record had 2 columns: • Non-formal words: words with non-formal meaning • Formal: a formal word suited the Indonesian dictionary Table 4 presented basic information about total words, total formal words and total informal words. There were 7355 informal words then there were 4075 formal words. Table 5 showed ten examples of informal words and then changes were made to informal words so that the quality of the data was better. Among 4075 unique formal words, 1,159 (67%) words occurred only once. Furthermore, Figure 1 showed that the distribution of words in the data to the occurrence of formal words applied Zipf law. Table 4 - Formal and nonformal words Total Data Formal word Non-Formal Word 11430 4075 7355 Table 5 - Example Of informal words and then changes were made to informal words Non-Formal Word URL formal word Formal Word yajangan https://kbbi.web.id/jangan jangan temanyang https://kbbi.web.id/teman teman cuplikanya https://lambeturah.id/arti-kata-cuplikan-adalah/ cuplikan enih https://www.litbang.pertanian.go.id/info-aktual/3962/ benih setidanya https://kbbi.web.id/tidak setidaknya tonight https://www.babla.co.id/bahasa-inggris-bahasa-indonesia/tonight malam waroeng https://kbbi.web.id/warung warung dhuafa https://dompetdhuafa.org/id/berita/detail/pengertian-dhuafa- menurut-islam duafa megamal http://p2k.unkris.ac.id/id3/2-3065-2962/Mega-Mall-Batam- Center_69203_p2k-unkris.html tempat guestar https://tr-ex.me/terjemahan/bahasa+inggris- bahasa+indonesia/guest+star tamu From the analyzed tweet data, this study found that all the words contained in the @fullsenyum community tweet data had bad structures so after preprocessing, 10221 words were obtained in the Indonesian Dictionary. This study comprehended only the types of AOA words contained in the data, this study presented 10 examples in table 6. Most of the informal words that were not listed in the Indonesian Dictionary had word excess or lack of words. Thus, it was difficult to represent the word and difficult to process https://kbbi.web.id/jangan https://kbbi.web.id/teman https://lambeturah.id/arti-kata-cuplikan-adalah/ https://www.litbang.pertanian.go.id/info-aktual/3962/ https://kbbi.web.id/tidak https://www.babla.co.id/bahasa-inggris-bahasa-indonesia/tonight https://kbbi.web.id/warung https://dompetdhuafa.org/id/berita/detail/pengertian-dhuafa-menurut-islam https://dompetdhuafa.org/id/berita/detail/pengertian-dhuafa-menurut-islam http://p2k.unkris.ac.id/id3/2-3065-2962/Mega-Mall-Batam-Center_69203_p2k-unkris.html http://p2k.unkris.ac.id/id3/2-3065-2962/Mega-Mall-Batam-Center_69203_p2k-unkris.html https://tr-ex.me/terjemahan/bahasa+inggris-bahasa+indonesia/guest+star https://tr-ex.me/terjemahan/bahasa+inggris-bahasa+indonesia/guest+star Lubis & Nasution … Vol 4(2) 2023 : 855-863 860 Table 6- Example Of not listed analyzed words KBBI Non-Formal Word Formal Sentence advantages and disadvantages yajangan jangan ya temanyang teman yang cuplikanya cuplikan ya enih benih E setidanya setidaknya k tonight malam malam waroeng warung e dhuafa duafa duafa megamal tempat megalmal guestar tamu Guestar From this study could see the distribution of words in the data shown in Figure 2, formal words tend to be more structured and had longer words than non-formal words with the average number of characters per word being 7 for the first and 6 for the last. Fig. 2. The number of normalized characters for non-formal and formal words. At this stage analyzed the frequency of words contained in the tweet data that had been preprocessed. Frequency was done to see how many repetitions of words in the data. At this stage it focused on making changes manually to words with poor structures including excess words, word deficiencies and errors in the preprocessing process. Table 7 showed examples of words with bad structure, which were then corrected manually so that they could be detected by the Indonesian dictionary. Table 7 - Correction of Unstructured non-formal words Non-Formal Word URL formal word Formal Word yajangan https://kbbi.web.id/jangan jangan temanyang https://kbbi.web.id/teman teman cuplikanya https://lambeturah.id/arti-kata-cuplikan-adalah/ cuplikan enih https://www.litbang.pertanian.go.id/info-aktual/3962/ benih setidanya https://kbbi.web.id/tidak setidaknya tonight https://www.babla.co.id/bahasa-inggris-bahasa-indonesia/tonight malam waroeng https://kbbi.web.id/warung warung dhuafa https://dompetdhuafa.org/id/berita/detail/pengertian-dhuafa- menurut-islam duafa megamal http://p2k.unkris.ac.id/id3/2-3065-2962/Mega-Mall-Batam- Center_69203_p2k-unkris.html tempat Based on Table 7, the results of the 11430 Tweet data were then preprocessed which resulted in data with a total of 10,221 then separated standard and non-standard sentences according to the data of the Indonesian language dictionary obtained by the author. The obtained results were 4075 standard words and non-standard words. The standard number was 4778, after a lexical analysis had been carried out that informal words had poor word structures. Thus, to get words with good structures, the author conducted a manual word identification according to the sources that had been given. Words with poor structure contained excessive words, repeated words and no space words. https://kbbi.web.id/jangan https://kbbi.web.id/teman https://lambeturah.id/arti-kata-cuplikan-adalah/ https://www.litbang.pertanian.go.id/info-aktual/3962/ https://kbbi.web.id/tidak https://www.babla.co.id/bahasa-inggris-bahasa-indonesia/tonight https://kbbi.web.id/warung https://dompetdhuafa.org/id/berita/detail/pengertian-dhuafa-menurut-islam https://dompetdhuafa.org/id/berita/detail/pengertian-dhuafa-menurut-islam http://p2k.unkris.ac.id/id3/2-3065-2962/Mega-Mall-Batam-Center_69203_p2k-unkris.html http://p2k.unkris.ac.id/id3/2-3065-2962/Mega-Mall-Batam-Center_69203_p2k-unkris.html Lubis & Nasution … Vol 4(2) 2023 : 855-863 861 5. Conclusion In this study, the researcher presented a lexicon of formal words contained in the Twitter social media data with the normalized @fullsenyum account. The resulting corpus data was useful for natural language preprocessing processes or tasks in Indonesian. In this case, the Indonesian language corpus data were added to the data obtained from Twitter. the data were obtained available on GitHub under the MIT license. In this study there were several techniques in making formal Indonesian word corpus data as a dictionary for the stages in text normalization, and as a data collection as a model of natural language preprocessing but this corpus data could be used in developing science in the field of natural language preprocessing in detail. This research should be developed more widely to utilize tweet data with many characteristics of the sentences contained in the tweet data. Many studies had been carried out in the field of natural language preprocessing. This research was expected to improve the performance of social media analysis, especially Twitter in Indonesia This research can contribute to the development of a more standardized and structured Indonesian lexicon. In this context, the text normalization method used in this research can be the basis for collecting standard words that are often used in Indonesian based on Twitter data. Another implication is that this research can assist in the development of a more effective and accurate text mining system for analyzing social media data, especially in the Indonesian context. References Alhaj, Y. A., Dahou, A., Al-qaness, M. A. A., Abualigah, L., Abbasi, A. A., Almaweri, N. A. O., Elaziz, M. A., & Damaševičius, R. (2022). A novel text classification technique using improved particle swarm optimization: A case study of Arabic language. Future Internet, 14(7), 194. Anandhan, A., Shuib, L., Ismail, M. A., & Mujtaba, G. (2018). Social media recommender systems: review and open research issues. IEEE Access, 6, 15608–15628. Baccouche, A., Ahmed, S., Sierra-Sosa, D., & Elmaghraby, A. (2020). Malicious text identification: deep learning from public comments and emails. Information, 11(6), 312. Basan, E., Basan, A., Nekrasov, A., Fidge, C., Abramov, E., & Basyuk, A. (2022). A Data Normalization Technique for Detecting Cyber Attacks on UAVs. Drones, 6(9), 1–21. https://doi.org/10.3390/drones6090245 bin Sazali, M. A. H., & Idris, N. B. (2022). Neural Machine Translation for Malay Text Normalization using Synthetic Dataset. 2022 10th International Conference on Information and Communication Technology (ICoICT), 386–390. Chen, W., Xu, Z., Zheng, X., Yu, Q., & Luo, Y. (2020). Research on sentiment classification of online travel review text. Applied Sciences, 10(15), 5275. Dirkson, A., Verberne, S., Sarker, A., & Kraaij, W. (2019). Data-driven lexical normalization for medical social media. Multimodal Technologies and Interaction, 3(3), 60. Göker, S., & Can, B. (2018). Neural text normalization for turkish social media. 2018 3rd International Conference on Computer Science and Engineering (UBMK), 161–166. Gunawan, D., Saniyah, Z., & Hizriadi, A. (2019). Normalization of abbreviation and acronym on Microtext in Bahasa Indonesia by using dictionary-based and longest common subsequence (LCS). Procedia Computer Science, 161, 553–559. Iskandar, D., & Marjuki, M. (2022). Classification of Melinjo Fruit Levels Using Skin Color Detection With RGB and HSV. Journal of Applied Engineering and Technological Science (JAETS), 4(1), 123–130. https://doi.org/10.37385/jaets.v4i1.958 Izonin, I., Tkachenko, R., Shakhovska, N., Ilchyshyn, B., & Singh, K. K. (2022). A Two-Step Data Normalization Approach for Improving Classification Accuracy in the Medical Diagnosis Domain. Mathematics, 10(11), 1–18. https://doi.org/10.3390/math10111942 Javaloy, A., & García-Mateos, G. (2020). Text normalization using encoder–decoder networks based on the causal feature extractor. Applied Sciences, 10(13), 4551. Jimenez-Marquez, J. L., Gonzalez-Carrasco, I., Lopez-Cuadrado, J. L., & Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1–12. Lubis & Nasution … Vol 4(2) 2023 : 855-863 862 Jose, G., & Raj, N. S. (2014). Noisy SMS text normalization model. International Conference for Convergence for Technology-2014, 1–6. Khan, J., & Lee, S. (2021). Enhancement of Text Analysis Using Context-Aware Normalization of Social Media Informal Text. Applied Sciences, 11(17), 8172. Kumar, A., Tyagi, V., & Das, S. (2021). Deep Learning for Hate Speech Detection in social media. 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), 1–4. Kusumawardani, R. P., Priansya, S., & Atletiko, F. J. (2018). Context-sensitive normalization of social media text in bahasa Indonesia based on neural word embeddings. Procedia Computer Science, 144, 105–117. https://doi.org/10.1016/j.procs.2018.10.510 Liang, P.-W., & Dai, B.-R. (2013). Opinion mining on social media data. 2013 IEEE 14th International Conference on Mobile Data Management, 2, 91–96. Liu, K., & Chen, L. (2019). Medical social media text classification integrating consumer health terminology. IEEE Access, 7, 78185–78193. Lubis, A.R., Lubis, M., & Azhar, C. D. (2019). The effect of social media to the sustainability of short message service (SMS) and phone call. Procedia Computer Science, 161. https://doi.org/10.1016/j.procs.2019.11.172 Lubis, Arif Ridho, Nasution, M. K. M., Sitompul, O. S., & Zamzami, E. M. (2023). A new approach to achieve the users’ habitual opportunities on social media. IAES International Journal of Artificial Intelligence, 12(1), 41–47. https://doi.org/10.11591/ijai.v12.i1.pp41- 47 Lubis, Arif Ridho, Prayudani, S., Lubis, M., & Nugroho, O. (2022). Sentiment Analysis on Online Learning During the Covid-19 Pandemic Based on Opinions on Twitter using KNN Method. 2022 1st International Conference on Information System & Information Technology (ICISIT), 106–111. Lubis, Arif Ridho, Prayudani, S., Nugroho, O., Lase, Y. Y., & Lubis, M. (2022). Comparison of Model in Predicting Customer Churn Based on Users’ habits on E-Commerce. 2022 5th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 300–305. Lubis, Arif Ridho, Utara, U. S., Sitompul, O. S., Utara, U. S., Nasution, M. K. M., Utara, U. S., Zamzami, E. M., & Utara, U. S. (2020). Obtaining Value From The Constraints in Finding User Habitual Words. 8–11. Maghfur, N. M., Ibrohim, M. O., Fahmi, J., Putera, A. S., & Riandi, O. (2021). Text Normalization for Indonesian Text-to-Speech (TTS) using Rule-Based Approach: A Dataset and Preliminary Study. 2021 4th International Conference of Computer and Informatics Engineering (IC2IE), 129–134. Middleton, S. E., Middleton, L., & Modafferi, S. (2013). Real-time crisis mapping of natural disasters using social media. IEEE Intelligent Systems, 29(2), 9–17. Neto, A. F. de S., Bezerra, B. L. D., & Toselli, A. H. (2020). Towards the natural language processing as spelling correction for offline handwritten text recognition systems. Applied Sciences, 10(21), 7711. Nguyen, L. H., Salopek, A., Zhao, L., & Jin, F. (2017). A natural language normalization approach to enhance social media text reasoning. 2017 IEEE International Conference on Big Data (Big Data), 2019–2026. Pano, T., & Kashef, R. (2020). A complete VADER-based sentiment analysis of bitcoin (BTC) tweets during the era of COVID-19. Big Data and Cognitive Computing, 4(4), 33. Rahman, T., Agustin, F. E. M., & Rozy, N. F. (2019). Normalization of Unstructured Indonesian Tweet Text For Presidential Candidates Sentiment Analysis. 2019 7th International Conference on Cyber and IT Service Management (CITSM), 7, 1–6. Roshini, T., Sireesha, P. V., Parasa, D., & Bano, S. (2019). Social media survey using decision tree and Naive Bayes classification. 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), 265–270. Sarimole, F. M., & Fadillah, M. I. (2022). Classification Of Guarantee Fruit Murability Based on HSV Image With K-Nearest Neighbor. Journal of Applied Engineering and Technological Science (JAETS), 4(1), 48–57. https://doi.org/10.37385/jaets.v4i1.929 Lubis & Nasution … Vol 4(2) 2023 : 855-863 863 Schreck, T., & Keim, D. (2012). Visual analysis of social media data. Computer, 46(5), 68–75. Sebastian, D., & Nugraha, K. A. (2019). Text normalization for Indonesian abbreviated word using crowdsourcing method. 2019 International Conference on Information and Communications Technology, ICOIACT 2019, 529–532. https://doi.org/10.1109/ICOIACT46704.2019.8938463 Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22–36. Tanna, D., Dudhane, M., Sardar, A., Deshpande, K., & Deshmukh, N. (2020). Sentiment analysis on social media for emotion classification. 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), 911–915. Villavicencio, C., Macrohon, J. J., Inbaraj, X. A., Jeng, J.-H., & Hsieh, J.-G. (2021). Twitter sentiment analysis towards covid-19 vaccines in the Philippines using naïve bayes. Information, 12(5), 204. Xuanyuan, M., Xiao, L., & Duan, M. (2021). Sentiment classification algorithm based on multi- modal social media text information. IEEE Access, 9, 33410–33418. Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing. Ieee Computational IntelligenCe Magazine, 13(3), 55–75. Zeng, D., Chen, H., Lusch, R., & Li, S. (2010). Social Media Analytics and Intelligence. DEcEMbEr. Zheng, H., Lin, F., Feng, X., & Chen, Y. (2020). A hybrid deep learning model with attention- based conv-LSTM networks for short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems, 22(11), 6910–6920. Zheng, X., Chen, W., Wang, P., Shen, D., Chen, S., Wang, X., Zhang, Q., & Yang, L. (2015). Big data for social transportation. IEEE Transactions on Intelligent Transportation Systems, 17(3), 620–630.