JURNAL RISET INFORMATIKA, Vol. 5, No. 3, June 2023 | P-ISSN: 2656-1743 | E-ISSN: 2656-1735 | DOI: https://doi.org/10.34288/jri.v5i3.556
Accredited rank 4 (SINTA 4) based on the decision of DITJEN DIKTIRISTEK No. 230/E/KPT/2023

CLICKBAIT DETECTION IN INDONESIA HEADLINE NEWS USING INDOBERT AND ROBERTA

Muhammad Edo Syahputra 1*), Ade Putera Kemala 2), Dimas Ramdhan 3)
1*), 2) Computer Science Department; 3) Data Science / School of Computer Science
Bina Nusantara University, Jakarta, Indonesia
muhammad.syahputra002@binus.ac.id 1), ade.kemala@binus.ac.id 2), dimas.ramdhan@binus.ac.id 3)
(*) Corresponding Author

Abstract

This paper explores clickbait detection using Transformer models, specifically IndoBERT and RoBERTa. The objective is to improve clickbait detection accuracy by applying balancing and augmentation techniques to the dataset. The experiments demonstrate the benefit of balancing techniques in improving model performance. Data augmentation also improved the performance of RoBERTa; for IndoBERT, however, it slightly decreased performance. These findings underline the importance of considering model selection and dataset characteristics when applying augmentation. Based on the results, IndoBERT trained on the balanced distribution outperformed the previous study and the other models used in this research. This study used three dataset distribution settings: unbalanced, balanced, and augmented, with 8513, 6632, and 15503 samples in total, respectively. By incorporating balancing and augmentation techniques, this research surpasses previous studies, reaching a 95% F1-score with the balanced distribution and thereby contributing to the advancement of clickbait detection accuracy. However, the augmentation method in this study only improved the RoBERTa model, and performance might be boosted further by gathering more varied data. This work highlights the value of leveraging pre-trained Transformer models and specific dataset-handling techniques. The implications include the necessity of dataset balancing for accurate detection and the varying impact of augmentation on different models. These insights aid researchers and practitioners in making informed decisions for clickbait detection tasks, benefiting content moderation, online user experience, and information reliability. The study emphasizes the significance of utilizing state-of-the-art models and tailored approaches to improve clickbait detection performance.

Keywords: Clickbait Detection; Transformer; Deep Learning; Data Augmentation

Abstrak

This paper explores clickbait detection using Transformer models, specifically IndoBERT and RoBERTa. The objective is to improve clickbait detection accuracy by applying balancing and augmentation techniques to the dataset. The research shows the benefit of balancing techniques in improving model performance. In addition, data augmentation also improved the performance of RoBERTa, whereas IndoBERT showed a slight decrease in performance. These findings underline the importance of considering model selection and dataset characteristics when applying augmentation. Based on these results, IndoBERT with a balanced distribution outperformed the previous study as well as the other models used in this research.
This study used three dataset settings for the experiments: an unbalanced distribution with 8513 samples in total, a balanced distribution with 6632 samples, and augmented data with 15503 samples in total. Furthermore, by combining balancing and augmentation techniques, this research surpasses the previous study, obtaining 95% accuracy for the IndoBERT model as validated with the F1-score. However, based on the experiments, the augmentation method did not improve the IndoBERT model; conversely, augmentation was quite effective for the RoBERTa model. In addition, adding a larger and more varied amount of data could significantly improve model performance. This work highlights the value of leveraging pre-trained Transformer models and specific dataset-handling techniques. The implications include the necessity of dataset balancing for accurate detection and the varying impact of augmentation on different models. These insights help researchers and practitioners make informed decisions for clickbait detection tasks, benefiting content moderation, online user experience, and information reliability. The study emphasizes the importance of utilizing state-of-the-art models and tailored approaches to improve clickbait detection performance.

Kata kunci: Clickbait Detection; Transformer; Deep Learning; Data Augmentation

INTRODUCTION

Clickbait has become a mainstream approach in online media, in which the headline is not aligned with the content. The headline typically uses catchy or exaggerated words to attract the reader's attention. This approach has a considerable negative impact on society (Bondielli & Marcelloni, 2019; Zhou et al., 2022). According to a survey (Bondielli & Marcelloni, 2019), the clickbait approach can potentially lead to a polarized society. Social media and its recommendation algorithms can amplify the distribution of information, including news, which can shape public discourse (Shu et al., 2017). This phenomenon appears on social media in the United States, the UK, and arguably worldwide.

Many studies have explored methods to tackle the clickbait problem. Early approaches using Machine Learning classifiers have been explored (Abbas et al., 2019; Chakraborty et al., n.d.; Manjesh et al., 2018; Potthast et al., 2016). However, those studies still lack performance in capturing meaning within the context of headline news (Zheng et al., 2021). Recent studies using the Deep Learning approach have shown promising results (Agrawal, n.d.; Kim, 2014; Zhou et al., 2022). An empirical study (Aju et al., 2022) compared the performance of Machine Learning and Deep Learning approaches; the results show that BERT provides the best performance, outperforming the Machine Learning methods by 10% in accuracy. Similarly, in (Oliva et al., 2022), a Deep Learning technique based on information divergence measures, proposed to address limited dataset availability in clickbait detection, outperformed the Machine Learning approach in accuracy. However, there is still a gap to be filled in Indonesian clickbait detection,
since there are still only a few studies focusing on Bahasa Indonesia that use the Deep Learning approach. Moreover, the limited availability of datasets in Bahasa Indonesia is another gap, both for this task and for Deep Learning research in general. One disadvantage of the Deep Learning approach is that it takes more time to train, since a large dataset is needed to extract and learn features (Sirusstara et al., 2022a). Therefore, this research aims to explore and leverage pre-trained models to build a classifier for Indonesian headline news, in comparison with the previous research (Sirusstara et al., 2022a) that used the same dataset from (Hadiyat, 2019). In addition, this research applies balancing and augmentation to the dataset to improve model accuracy, which was not done in the previous study.

RESEARCH METHODS

The methodology proposed in this research consists of three phases: obtaining the dataset, pre-processing, and training deep learning models.

Dataset

This research obtained the dataset from (William & Sari, 2020). The dataset contains more than 15000 Indonesian news headlines from 12 news publishers. Every headline was annotated as clickbait or non-clickbait by three annotators, with agreement measured using Fleiss' kappa (Fleiss & Cohen, 1973). Based on the annotation results, the dataset has several versions. The version used in this research is the one with the highest inter-annotator agreement, a Fleiss' kappa score of 0.42, with the label distribution shown in Table 1.

Table 1. Dataset Label Distribution
Class           Total Count
Clickbait       3316
Non-clickbait   5297
Total           8513

Pre-processing

Considering that the headline news dataset is relatively small, this research used augmentation techniques, namely EDA (Wei & Zou, n.d.) and bootstrapping (Stine, 2016), to balance and enlarge the distribution. Since Deep Learning techniques require a large amount of data to achieve good results, how the dataset is constructed will most likely affect the model's performance dramatically. In addition, data augmentation has been proven effective in improving performance in many tasks, including computer vision (Perez & Wang, 2017) and speech recognition (Park et al., 2019). Data augmentation is a technique that uses the available data to synthesize new, similar data. It therefore allows researchers to overcome the scarcity of available data. Furthermore, adding more variety to the data also helps avoid overfitting.

Bootstrapping

In this research, a simple balancing method is applied using resampling. In this method, random resampling of the data points is carried out following the bootstrapping procedure (Stine, 2016) so that both classes have the same number of samples. Table 2 shows the dataset distribution after resampling.

Table 2. Dataset Label Distribution After Resampling
Class           Total Count
Clickbait       3316
Non-clickbait   3316
Total           6632

EDA

Another data augmentation approach used in this research is EDA (Easy Data Augmentation). In general, EDA consists of four operations: Synonym Replacement (SR), Random Insertion (RI), Random Swap (RS), and Random Deletion (RD). SR randomly selects n words (excluding stop words) and replaces each of them with a randomly chosen synonym. RI inserts a synonym of a random word at a random position in the sentence, repeated n times. RS randomly selects two words in the sentence and swaps their positions, also repeated n times. Finally, RD removes each word in the sentence with probability p. Table 3 shows the dataset distribution after EDA.

Table 3. Dataset Label Distribution After Augmentation
Class           Total Count
Clickbait       9539
Non-clickbait   5964
Total           15503
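To make the balancing step more concrete, the snippet below is a minimal sketch of how the majority class could be resampled so that both classes match the counts reported in Table 2. It is illustrative only and not the authors' code: the file name click_id_headlines.csv, the column names title and label, and the use of sklearn.utils.resample are assumptions introduced here for the example.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical CSV export of the CLICK-ID headlines with columns
# "title" (headline text) and "label" (1 = clickbait, 0 = non-clickbait).
df = pd.read_csv("click_id_headlines.csv")

clickbait = df[df["label"] == 1]        # 3316 rows in the paper's dataset
non_clickbait = df[df["label"] == 0]    # 5297 rows in the paper's dataset

# Resample the majority class so both classes have the same number of rows,
# matching the balanced distribution in Table 2 (3316 per class).
non_clickbait_resampled = resample(
    non_clickbait,
    replace=False,              # draw without replacement when downsampling
    n_samples=len(clickbait),
    random_state=42,
)

# Combine and shuffle the balanced dataframe.
balanced_df = pd.concat([clickbait, non_clickbait_resampled]).sample(
    frac=1.0, random_state=42
)
print(balanced_df["label"].value_counts())
```

Setting replace=True and increasing n_samples would instead oversample the minority class with replacement, which is the other common reading of a bootstrapping-based balancing step.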
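The four EDA operations described above are also simple to sketch. The following is a simplified, hypothetical Python illustration; the tiny synonym dictionary merely stands in for the Indonesian WordNet lookup used in this study, and all function names and example words are made up for illustration.

```python
import random

def synonyms(word):
    # Placeholder for an Indonesian WordNet lookup; a small hand-written
    # dictionary stands in for it in this sketch.
    lexicon = {"heboh": ["viral", "gempar"], "rahasia": ["misteri"]}
    return lexicon.get(word, [])

def synonym_replacement(words, n):
    # Replace up to n randomly chosen words with one of their synonyms.
    new_words = words[:]
    candidates = [w for w in set(words) if synonyms(w)]
    random.shuffle(candidates)
    for word in candidates[:n]:
        replacement = random.choice(synonyms(word))
        new_words = [replacement if w == word else w for w in new_words]
    return new_words

def random_insertion(words, n):
    # Insert a synonym of a random word at a random position, n times.
    new_words = words[:]
    for _ in range(n):
        candidates = [w for w in new_words if synonyms(w)]
        if not candidates:
            break
        new_words.insert(random.randrange(len(new_words) + 1),
                         random.choice(synonyms(random.choice(candidates))))
    return new_words

def random_swap(words, n):
    # Swap the positions of two randomly chosen words, n times.
    new_words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    # Delete each word independently with probability p.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

headline = "heboh rahasia artis ini akhirnya terungkap".split()
print(" ".join(synonym_replacement(headline, n=2)))
print(" ".join(random_swap(headline, n=1)))
```

In practice, the number of operations n and the deletion probability p are tunable; as noted later, this study used the default EDA parameters.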
Deep Learning Technique

This paper focuses on deep learning algorithms, specifically large pre-trained Indonesian BERT models (Koto et al., 2020; Wilie et al., 2020), referred to as IndoBERT. IndoBERT uses the same architecture as the original BERT but with different pre-training datasets, which consist of unlabeled text corpora. BERT is built on a multi-head attention mechanism (Vaswani et al., n.d.) and has been proven to outperform other methods on many NLP benchmarks; in other words, BERT-style models are currently the state of the art in NLP research.

(Koto et al., 2020) proposed an IndoBERT that was pre-trained mostly on news datasets such as Kompas, Tempo, Liputan6, and Wikipedia, among others. In total, this IndoBERT was trained on over 220M words, purely as a masked language model, using the Huggingface framework. The authors followed the default BERT-base configuration, with 12 hidden layers, 12 attention heads, and feed-forward hidden layers of 3,072 dimensions, and used an Adam optimizer with a linear scheduler during pre-training. Their paper also shows the model's strong performance on many tasks, such as summarization, sentiment analysis, and named entity recognition. Unlike (Koto et al., 2020), (Wilie et al., 2020) proposed an IndoBERT with a larger and more general pre-training dataset: the model was trained on over 4 billion Indonesian words from sources such as Wikipedia, web articles, and Twitter. Therefore, this research implemented the model from (Wilie et al., 2020). However, the model from (Koto et al., 2020) was also considered in order to see the impact of a pre-training dataset dominated by news text, which matches the headline-news domain of this research.

For training both IndoBERT models, this research used the Adam optimizer with a learning rate of 2e-5, a batch size of 64, and 10 epochs on a single NVIDIA RTX 3080 Ti GPU. Furthermore, this research added more than 3000 news headlines to the dataset, trained both models with the same configuration, and then examined the results for further analysis.
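As an illustration of this training setup, the sketch below fine-tunes a pre-trained checkpoint with the Huggingface Trainer using the hyperparameters reported above (learning rate 2e-5, batch size 64, 10 epochs). It is a sketch under assumptions rather than the authors' code: the column names title and label, the max_length value, the output directory, and the use of Trainer (which applies AdamW rather than plain Adam by default) are choices made here for illustration, and balanced_df refers to the dataframe produced in the earlier resampling sketch.

```python
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "indobenchmark/indobert-base-p1"  # or "cahya/roberta-base-indonesian-522M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# 8:2 train/test split of the balanced (or augmented) dataframe from pre-processing.
train_df, test_df = train_test_split(
    balanced_df, test_size=0.2, stratify=balanced_df["label"], random_state=42
)

def tokenize(batch):
    # Headlines are short, so a small max_length is assumed to be sufficient.
    return tokenizer(batch["title"], truncation=True, padding="max_length", max_length=64)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="clickbait-indobert",    # hypothetical output directory
    learning_rate=2e-5,                 # as reported in the paper
    per_device_train_batch_size=64,     # as reported in the paper
    num_train_epochs=10,                # as reported in the paper
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=test_ds, tokenizer=tokenizer)
trainer.train()
```

The same configuration can be reused for both checkpoints, so switching models only requires changing MODEL_NAME.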
RESULTS AND DISCUSSION

Model Comparison

Table 4 shows the experimental results of IndoBERT (Wilie et al., 2020) and of the RoBERTa model published on Hugging Face by Cahya, which was pre-trained on Indonesian data. Overall, the IndoBERT architecture trained on the balanced dataset without augmentation outperformed the other models with a 95% F1-score. This also outperformed the previous research using XLM-RoBERTa, which reported a 91% F1-score.

Table 4. Experiment Results (train:test split of 8:2)
Model                                Dataset               Train    Test     Average Precision  Average Recall  F1-score
indobenchmark/indobert-base-p1       Imbalance             0.99     0.93     0.96               0.86            0.94
indobenchmark/indobert-base-p1       Balance               0.9882   0.9396   0.9259             0.9176          0.9508
indobenchmark/indobert-base-p1       Imbalance + Augment   0.9964   0.98     0.907              0.87            0.93
indobenchmark/indobert-base-p1       Balance + Augment     0.9952   0.936    0.904              0.934           0.94
cahya/roberta-base-indonesian-522M   Imbalance             0.98     0.92     0.87               0.92            0.93
cahya/roberta-base-indonesian-522M   Balance               0.9839   0.9036   0.84               0.9182          0.91
cahya/roberta-base-indonesian-522M   Imbalance + Augment   0.9925   0.7766   0.6935             0.7589          0.8119
cahya/roberta-base-indonesian-522M   Balance + Augment     0.993    0.9164   0.8764             0.9131          0.9308

This study also experimented with the same model as the previous study, adding the balancing and augmentation methods to the dataset. The augmentation method combined with a balanced dataset improved the accuracy by 4%; thus, this study's model performed better than the previous one. The experiment results also show that a balanced data distribution improves the accuracy of both models, whereas the augmentation method only boosts the performance of the RoBERTa model.

Data Distribution Results

Table 4 also presents the experiment results grouped by dataset setting. The results reveal distinct trends for IndoBERT and RoBERTa. The experiments using a balanced distribution showed improved accuracy for both IndoBERT and RoBERTa. In contrast, the augmentation method only enhanced the performance of the RoBERTa model while slightly decreasing the performance of IndoBERT.

The EDA augmentation method dramatically increases the dataset size, as Table 3 illustrates. This method leverages the Indonesian WordNet, using operations such as synonym replacement, insertion, swap, and deletion, which introduce noise into the training data. Although the EDA method uses simple operations, the monolingual Indonesian WordNet is well suited to our dataset, which consists of Indonesian news headlines. The size of the augmented dataset could be tuned and made equal for each augmentation method; however, this paper used the default parameters, thus producing a different amount of augmented data.

Comparison with Previous Study

Table 5 compares the results of this study with the best results from the previous study. It shows the performance of the IndoBERT and RoBERTa models with the EDA augmentation method, together with the previous research's best methods, including the RoBERTa and XLM-RoBERTa models without data augmentation. The results indicate that, when combined with a balanced data distribution and the augmentation method, the models in this study outperformed the previous research in all testing scenarios.

Table 5. Model Comparison with Previous Study
Model                                   Augmentation Type  Precision  Recall  F1-score
IndoBERT (This study)                   normal             0.96       0.86    0.94
IndoBERT (This study)                   EDA                0.92       0.91    0.95
RoBERTa (Sirusstara et al., 2022b)      normal             0.87       0.874   0.8738
RoBERTa                                 EDA                0.87       0.91    0.93
XLM-RoBERTa (Sirusstara et al., 2022b)  normal             0.91       0.91    0.91
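For reference, scores of the kind reported in Tables 4 and 5 can be computed from a fine-tuned model's predictions on the held-out 20% split. The sketch below continues from the training sketch above; macro averaging is an assumption made here, since the averaging scheme behind the reported average precision, recall, and F1-score is not stated.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Predict on the 20% test split (trainer and test_ds come from the
# fine-tuning sketch above).
pred_output = trainer.predict(test_ds)
y_pred = np.argmax(pred_output.predictions, axis=-1)
y_true = pred_output.label_ids

# Macro averaging is assumed; the paper does not specify the scheme.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
```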
CONCLUSIONS AND SUGGESTIONS

This research employed two model architectures and three dataset settings: imbalanced, balanced, and augmented. Despite the use of the EDA augmentation method, IndoBERT demonstrated superior performance, with a 95% F1-score, when the distribution was balanced. As observed in the experiments, augmentation improved accuracy for the RoBERTa model but not for IndoBERT. To further advance clickbait detection, exploring additional deep learning architectures would be beneficial for future research. Furthermore, collecting more diverse datasets could enhance performance, and investigating the parameters of the augmentation method could be another avenue for future investigation.

REFERENCES

Abbas, M., Ali Memon, K., & Aleem Jamali, A. (2019). Multinomial Naive Bayes Classification Model for Sentiment Analysis. IJCSNS International Journal of Computer Science and Network Security, 19(3), 62.

Agrawal, A. (n.d.). Clickbait Detection using Deep Learning. Retrieved September 21, 2022, from https://www.reddit.com/r/news

Aju, D., Kumar, K. A., & Lal, A. M. (2022). Exploring News-Feed Credibility using Emerging Machine Learning and Deep Learning Models. Journal of Engineering Science and Technology Review, 15(3), 31–37. https://doi.org/10.25103/JESTR.153.04

Bondielli, A., & Marcelloni, F. (2019). A survey on fake news and rumour detection techniques. Information Sciences, 497, 38–55. https://doi.org/10.1016/J.INS.2019.05.035

Chakraborty, A., Paranjape, B., Kakarla, S., & Ganguly, N. (n.d.). Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media.

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3), 613–619. https://doi.org/10.1177/001316447303300309

Hadiyat, Y. D. (2019). Clickbait on Indonesia Online Media. Journal Pekommas, 4(1), 1. https://doi.org/10.30818/jpkm.2019.2040101

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 1746–1751. https://doi.org/10.3115/V1/D14-1181

Koto, F., Rahimi, A., Lau, J. H., & Baldwin, T. (2020). IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. 757–770. https://doi.org/10.18653/V1/2020.COLING-MAIN.66

Manjesh, S., Kanakagiri, T., Vaishak, P., Chettiar, V., & Shobha, G. (2018). Clickbait Pattern Detection and Classification of News Headlines Using Natural Language Processing. 2nd International Conference on Computational Systems and Information Technology for Sustainable Solutions, CSITSS 2017. https://doi.org/10.1109/CSITSS.2017.8447715

Oliva, C., Palacio-Marín, I., Lago-Fernández, L. F., & Arroyo, D. (2022). Rumor and clickbait detection by combining information divergence measures and deep learning techniques. 1–6. https://doi.org/10.1145/3538969.3543791

Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. https://doi.org/10.21437/Interspeech.2019-2680

Perez, L., & Wang, J. (2017). The Effectiveness of Data Augmentation in Image Classification using Deep Learning.

Potthast, M., Köpsel, S., Stein, B., & Hagen, M. (2016). Clickbait Detection.
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9626, 810–817. https://doi.org/10.1007/978-3-319-30671-1_72

Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake News Detection on Social Media. ACM SIGKDD Explorations Newsletter, 19(1), 22–36. https://doi.org/10.1145/3137597.3137600

Sirusstara, J., Alexander, N., Alfarisy, A., Achmad, S., & Sutoyo, R. (2022a). Clickbait Headline Detection in Indonesian News Sites using Robustly Optimized BERT Pre-training Approach (RoBERTa). 2022 3rd International Conference on Artificial Intelligence and Data Sciences: Championing Innovations in Artificial Intelligence and Data Sciences for Sustainable Future, AiDAS 2022 - Proceedings, 248–253. https://doi.org/10.1109/AiDAS56890.2022.9918678

Sirusstara, J., Alexander, N., Alfarisy, A., Achmad, S., & Sutoyo, R. (2022b). Clickbait Headline Detection in Indonesian News Sites using Robustly Optimized BERT Pre-training Approach (RoBERTa). 2022 3rd International Conference on Artificial Intelligence and Data Sciences: Championing Innovations in Artificial Intelligence and Data Sciences for Sustainable Future, AiDAS 2022 - Proceedings, 248–253. https://doi.org/10.1109/AIDAS56890.2022.9918678

Stine, R. (2016). An Introduction to Bootstrap Methods. Sociological Methods & Research, 18(2–3), 243–291. https://doi.org/10.1177/0049124189018002003

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (n.d.). Attention Is All You Need.

Wei, J., & Zou, K. (n.d.). EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. 6382–6388. Retrieved September 23, 2022, from http://github.

Wilie, B., Vincentio, K., Indra Winata, G., Cahyawijaya, S., Li, X., Lim, Z. Y., Soleman, S., Mahendra, R., Fung, P., Bahar, S., & Purwarianti, A. (2020). IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding (pp. 843–857). https://aclanthology.org/2020.aacl-main.85

William, A., & Sari, Y. (2020). CLICK-ID: A novel dataset for Indonesian clickbait headlines. Data in Brief, 32, 106231. https://doi.org/10.1016/J.DIB.2020.106231

Zheng, J., Yu, K., & Wu, X. (2021). A deep model based on Lure and Similarity for Adaptive Clickbait Detection. Knowledge-Based Systems, 214, 106714. https://doi.org/10.1016/J.KNOSYS.2020.106714

Zhou, M., Xu, W., Zhang, W., & Jiang, Q. (2022). Leverage knowledge graph and GCN for fine-grained-level clickbait detection. World Wide Web, 25(3), 1243–1258. https://doi.org/10.1007/S11280-022-01032-3