LONTAR KOMPUTER VOL. 13, NO. 1 APRIL 2022 p-ISSN 2088-1541
DOI: 10.24843/LKJITI.2022.v13.i01.p05 e-ISSN 2541-5832
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021

Spam Comments Detection on Instagram Using Machine Learning and Deep Learning Methods

Antonius Rachmat Chrismantoa1, Afiahayatib2, Yunita Sarib3, Anny Kartika Sarib4, Yohanes Suyantob5

aFakultas Teknologi Informasi, Universitas Kristen Duta Wacana, Yogyakarta, Indonesia
1anton@ti.ukdw.ac.id (Corresponding author)
bDepartment of Computer Science and Electronics, Universitas Gadjah Mada, Yogyakarta, Indonesia
2afia@ugm.ac.id
3yunita.sari@ugm.ac.id
4a_kartikasari@ugm.ac.id
5yanto@ugm.ac.id

Abstract

The more popular a public figure is on Instagram (IG), the more followers they attract. When a public figure posts something, many other users comment on the post. Not all of these comments are relevant to the post; irrelevant comments, such as advertising, links, or clickbait, are usually called spam comments. Spam comments interfere with the flow of information and may mislead readers. This research compares machine learning (ML) and deep learning (DL) classification methods on our collected Indonesian IG spam comment dataset. The research was conducted in the following steps: dataset preparation, pre-processing, simple normalization, feature generation using TF-IDF and word embedding, application of ML and DL classification methods, performance evaluation, and comparison. The authors compare the accuracy, F1, precision, and recall of the ML and DL results. The experiments show that the ML and DL methods do not differ significantly: the Linear SVM, Extra Tree (ET), Logistic Regression, and Stochastic Gradient Descent algorithms reach an accuracy of 0.93, while the best DL method reaches an accuracy of 0.94 using the SimpleTransformers BERT architecture.

Keywords: Spam Comments Detection, Instagram (IG), Deep Learning, Machine Learning

1. Introduction

Social media allows users to carry out activities much as they do in the real world: relating to one another, making friends, conveying ideas and aspirations, commenting on each other, collaborating, and more. Users can also transact, raise funds, follow and be followed, and run promotions or campaigns. These advanced features have made social media very popular. Facebook (FB), YouTube (YT), TikTok, Instagram (IG), and Twitter (TW) are among the most popular social media platforms globally and in Indonesia [1], each with many registered users and advanced features. Public figures, such as politicians, actors, and artists, use these platforms to increase their popularity on the Internet. Actors and artists (from now on called artists) usually use social media to promote their activities, increase their popularity, and interact with their fans. The more popular an artist is, the more followers they have, and the more content they can deliver. TW, YT, and IG are popular platforms that contain much spam, both spammers and spam content, and they are widely used in spam detection research. The more famous an artist is, the higher the volume of spam comments on each post [2][3], and the more famous the artist
and the more followers they have, the more likely they are to receive spam comments on their social media accounts [2]. Spam comments disrupt the flow of information from a particular post or status [4]. This paper uses IG as a case study because of its vast amount of spam comment data. Posts and comments on IG have the following characteristics: informal language, many emoticons/emojis, many abbreviations and typos, frequent code-mixed language, and comments of varying length (usually one to three sentences, with about five words per sentence). IG also has a reply-response structure with no hierarchy, and users can only refer to each other with @ mentions [5].

Spam comments on IG can be handled with several techniques, but most are manual. IG users can delete spam comments by hand, which takes much time and requires checking every comment. IG also provides a feature to report a comment as spam, but it must be done one comment at a time. Another way to minimize spam comments is to make the IG account private, which is difficult for artist/public-figure accounts because other users could then no longer follow them immediately. A final option is to activate the IG feature that automatically blocks comments containing certain words (the user must enter the keywords they consider spam). Keyword blocking has the disadvantage that it is only available in a few languages, such as English, and to date cannot be applied to Indonesian [6].

Spam comment detection on IG has been researched previously, mostly in English and sometimes in other languages, including Indonesian [7], [8], [9], [10], [11], [5]. In previous research, the authors developed a spam comment detection service for IG based on a REST web service [12] and implemented it in a Firefox extension used by ordinary users in the real world [13]. The actual implementation in a browser extension showed that the accuracy was not good. Several reasons explain the low accuracy: 1) IG comments are very unstructured, even abnormal; 2) many comments contain numerous symbols and emojis/emoticons; 3) there are many typos, uncommon abbreviations, slang words, and code-mixing (mixed languages); 4) some comments are deliberately disguised to evade spam detection, such as writing the letter 'V' as the characters "\ /", so the system cannot read the original character; and 5) the system cannot capture the semantic relationship between a post and its comments. Another problem is that some posts contain only images and no text caption at all. Spam detection in social media is a vast research area and requires many complementary methods. Various machine learning methods, combined with natural language processing (NLP), can be used as social media spam comment detection techniques.
In the article [14], the 11 best classification methods are used and compared, namely Gradient Boosting Decision Trees (GBDT), Random Forest (RF), Extreme Learning Machine (ELM), Support Vector Machine (SVM), C4.5, Sparse Representation based Classification (SRC), KNN, Logistic Regression (LR), AdaBoost (AB), Naïve Bayes (NB), and Feed Forward Neural Network (FFNN). The studies in [14] indicate that GBDT performs almost the same as, and in prediction exceeds, SVM and RF, and is the fastest algorithm during prediction. ELM, GBDT, RF, SVM, and C4.5 have adequate accuracy, but their performance varies widely across datasets. The FFNN method has the worst accuracy but the second-fastest prediction efficiency after GBDT. SRC shows good accuracy but is the slowest in both training and testing [14].

Deep learning has recently relied on well-known methods such as the Convolutional Neural Network (CNN) for image classification and the RNN (LSTM) for text classification. For example, CNN was used in signature detection and gave an excellent accuracy of 99.4% [15]. CNN has also been used on EEG data to detect excessive daytime sleepiness, with good results [16]. In contrast to CNN, which is widely used for image processing, spam detection mostly uses RNNs and their variants. Spam detection has been applied to SMS [17], using the UCI SMS dataset and a CNN based on hand-engineered features. SMS spam detection has also been performed with RNN-LSTM [18], [19] and then compared with machine learning methods.

Spam in social media can appear as spammers or as spam content. Spam content appears on social media such as TW [20], FB, and IG. Researchers have also detected spam posts by spammers on IG in English [7]. They used the Random Forest (RF) machine learning method, prioritizing special hand-engineered features on a dataset of 1983 post contents and 953808 media posts. The hand-engineered features are: whether other accounts are mentioned, the number of hashtags, the use of hashtags unrelated to the post, repeated words, specific spam keywords defined by the researchers, and whether the post image contains a watermark. Based on these hand-engineered features, the detection results are relatively high, reaching 96.27% with k-fold validation (k = 10). The weakness of [7] is its use of hand-engineered features, which require human labor for the extraction.

Research [21] differs from [7] because it uses the Indonesian language, not English, and detects not spam posts but spam comments. The dataset comes from posts of Indonesian public figures. Unlike this research, the spam comments in [21] have clear promotional objectives (such as advertising merchandise). The features used are a combination of hand-engineered features, keyword features, and the text features themselves. The hand-engineered features are the comment's length, the number of capital letters, and emojis. The text features used are 1) Bag of Words (BOW), 2) TF-IDF, and 3) Fasttext, used in various combinations. The classification methods used are NB, SVM, and XGBoost. Using all features (1, 2, and 3) resulted in an F1 of 0.96.
That study states that the features used are highly dependent on the dataset and cannot be generalized to new data, especially the keywords retrieved using regular expressions.

Not much research has been done on detecting spam comments on IG in the Indonesian language. One study used the NB classification algorithm and reached an accuracy of 72% [8]. In contrast, Complementary Naïve Bayes (CNB), an NB variant suited to unbalanced datasets (where non-spam comments outnumber spam comments), achieved better accuracy [9]: the CNB algorithm produced 92% accuracy, while the SVM algorithm reached 87%.

The pre-processing of text data is almost identical across classification studies, including those on IG datasets. All pre-processing for detecting spam posts or comments is done with NLP methods, and pre-processing the text has a significant impact on the subsequent stages of feature generation and selection [22]–[24]. Common pre-processing techniques are tokenization or n-gram tokenization (splitting sentences into words), case folding, stop word removal, POS tagging, normalization, spelling correction, and stemming. The least effective pre-processing technique is stemming [24].

The authors have also conducted various studies related to this topic. The first covered the collection of the 2017 IG dataset and the use of NB [6], SVM [10], KNN [4], and DW-KNN [11], implemented with RapidMiner and PHP. The two best methods in our previous studies were KNN and DW-KNN. However, their accuracy is still not good enough and could be improved with appropriate deep learning methods.

This research compares the performance of several machine learning methods based on [14] and several deep learning methods on an IG comment dataset obtained from 10 artists with more than 10 million followers [25]. It aims to contribute experimental results and comparisons of accuracy, precision, recall, and F1 on the Indonesian IG spam comment dataset. To our knowledge, such a comparison has not been made before for the Indonesian IG spam case. These results form the initial part of more in-depth research and analysis to improve detection and to search for gaps and improvements in detecting spam comments on IG, with its various unique characteristics.

2. The Research Methods

The primary process of this research is carried out in five major steps: 1) data gathering; 2) data cleaning, pre-processing, and normalization; 3) implementation of spam comment detection using machine learning; 4) implementation of spam comment detection using deep learning; and 5) comparison of results, discussion, and analysis. The details of each step are described in the following subsections.

2.1. Data Gathering

The primary data source is Instagram. The dataset is built from Indonesian artists' posts and comments in the Indonesian language, from accounts with more than 10 million followers, collected in 2017 [6]. The profile of the dataset before cleaning can be seen in Table 1.
Table 1. Instagram Annotated Dataset Before Cleaning, Pre-processing, and Normalization

No. | Artist                | Class name and total
1.  | Ayu Tingting          | Spam (1262), Not Spam (584)
2.  | Julia Perez           | Spam (1362), Not Spam (739)
3.  | Nagita Slavina        | Spam (1435), Not Spam (610)
4.  | Syahrini              | Spam (922), Not Spam (448)
5.  | Laudya Cinthia Bella  | Spam (902), Not Spam (688)
6.  | Prili Latuconsina     | Spam (437), Not Spam (1091)
7.  | Chelsea Olivia        | Spam (1625), Not Spam (293)
8.  | Luna Maya             | Spam (965), Not Spam (275)
9.  | Raisa                 | Spam (666), Not Spam (621)
10. | Agnes Monica          | Spam (1143), Not Spam (940)

Total Spam: 10,719; Total Not Spam: 6,288; General Total: 17,007

The dataset can be accessed from https://ig-repo.fti.ukdw.ac.id/ in JSON, XML, or plain text (non-Unicode) format. After cleaning, the dataset profile consists of 5187 not-spam and 9313 spam comments, for a total of 14500 data. The dataset's characteristics are: 1) it contains duplicated letters and punctuation; 2) there are many Unicode symbols; 3) it contains emoticons/emojis; 4) there are many non-standard abbreviations; 5) it has many misspelled words (typos); 6) it contains custom symbols; and 7) it contains code-mixed language (a mixture of Indonesian and other languages).

2.2. Data Cleaning, Pre-Processing, and Normalization

This research used the whole dataset for the experiments. The 14500 data are split into training and testing sets using an 80:20 (Pareto) ratio: 11600 data for training and 2900 for testing, with K-fold validation for the ML methods and a random split for the DL methods. The data pre-processing and normalization step cleans and prepares the dataset for the next stage. The pre-processing steps are 1) case folding, 2) tokenization, 3) punctuation removal, 4) emoji removal, 5) double-character removal (e.g., sayaaaa!! (in English: me), cobaa.... (in English: try)), 6) stop word removal, and 7) stemming using the Python Sastrawi library. In case folding, all comments are changed to lowercase. Tokenization breaks the sentence text into word tokens. The next step removes all punctuation marks and normalizes each word's letters. Emojis are also removed from the data because an emoji is categorized as a symbol, not text. Stop word removal removes unimportant words using a stop word list. Stemming changes a word into its root form if it is not already one; it aims to reduce the number of distinct tokens that appear. A minimal code sketch of this pre-processing chain is given below.
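As a concrete illustration, the following is a minimal sketch of steps 1-7 in Python, assuming the PySastrawi and NLTK packages; the emoji pattern, the example comment, and the exact ordering of the string-level steps are illustrative rather than the authors' original code.

```python
import re
import string

from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()          # Indonesian stemmer (step 7)
stopwords = set(StopWordRemoverFactory().get_stop_words())

# Rough emoji/symbol ranges; the paper treats emojis as symbols, not text.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(comment):
    text = comment.lower()                                             # 1) case folding
    text = EMOJI_RE.sub(" ", text)                                     # 4) emoji removal
    text = text.translate(str.maketrans("", "", string.punctuation))   # 3) punctuation removal
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                         # 5) "sayaaaa" -> "sayaa"
    tokens = word_tokenize(text)                                       # 2) tokenization
    tokens = [t for t in tokens if t not in stopwords]                 # 6) stop word removal
    return [stemmer.stem(t) for t in tokens]                           # 7) Sastrawi stemming

print(preprocess("Cobaa liat ini sayaaaa!!"))  # illustrative comment
```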
Simple normalization is also performed to reduce typos (writing errors). Several techniques can be used to handle typos, such as dictionary-based, edit distance, similarity key, rule-based, and probabilistic methods [43], [44]. In this research, the authors use normalization steps modified from [45], as follows (a sketch of the per-token loop in step 8 is given after the list):

1. The system accepts the dataset input in CSV format.
2. The system loads the KBBI dictionary into memory.
3. The system loads the abbreviation dictionary into memory.
4. The system prepares a result.txt file for the final normalized dataset and an evaluation.txt file that stores tokens replaced by spelling correction, so their accuracy can be evaluated.
5. For each row in the dataset, pre-processing is carried out: lowering the case, basic normalization, and reducing adjacent repeated letters to at most two letters.
6. The system prepares a modified Peter Norvig-based spelling-correction function that uses a word embedding trained on Wikipedia.
7. The system tokenizes the sentences in the dataset with the NLTK library.
8. For each word/token in the tokenization result, the following process is carried out:
   a) Handle the letter x representing "-nya", e.g., "katax" = "katanya" ("they said" in English);
   b) Handle the digit 2, which marks repeated words, e.g., "kupu2" = "kupu-kupu" (butterflies);
   c) Check whether the token is in the KBBI dictionary. If it is, it is considered valid and saved to result_token;
   d) If it is not in the KBBI, check the abbreviation/'alay' dictionary of non-standard words; if found there, save the replacement to result_token;
   e) If it is not found there either, proceed to stemming;
   f) If the stemming result differs from the input token, proceed to the spelling-correction process using the Peter Norvig algorithm based on Wikipedia's Word2Vec;
   g) If the spelling-correction result is still the same as the previous token, which means the correction was not successful, the token is stored in result_token as is;
   h) If the correction result differs from the previous token, a correction has been made, and the result is rechecked against the KBBI dictionary;
   i) If it is found in the KBBI dictionary and its word class matches the original token, the correction is considered successful, result_token = final_correction_result, and the replacement is also recorded in the evaluation.txt file;
   j) Otherwise, the token is left as is.
9. Each processed row of the dataset is stored in a Python list (and written to result.txt per row) until all rows have been processed.
10. Finally, Python saves the result.txt and evaluation.txt files to disk, and the process is declared complete.
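The per-token loop of step 8 can be sketched as follows. This is an illustration only, not the authors' exact code: kbbi, slang_dict, stemmer, and correct stand in for the loaded KBBI dictionary, the abbreviation/'alay' dictionary, the Sastrawi stemmer, and the Word2Vec-backed Norvig corrector, and the word-class check of step 8i is omitted.

```python
def normalize_token(token, kbbi, slang_dict, stemmer, correct):
    """Roughly step 8; logging to evaluation.txt and the word-class
    check of step 8i are omitted for brevity."""
    # a) a trailing 'x' stands for "-nya": "katax" -> "katanya"
    if token.endswith("x"):
        token = token[:-1] + "nya"
    # b) a trailing '2' marks a repeated word: "kupu2" -> "kupu-kupu"
    if token.endswith("2"):
        token = token[:-1] + "-" + token[:-1]
    # c) a token found in KBBI is already valid
    if token in kbbi:
        return token
    # d) otherwise, try the abbreviation/'alay' dictionary of non-standard words
    if token in slang_dict:
        return slang_dict[token]
    # e-f) otherwise stem; if stemming changed the token, try spelling correction
    if stemmer.stem(token) != token:
        corrected = correct(token)  # Norvig-style, backed by Wikipedia Word2Vec
        # h-i) accept the correction only if it lands in the KBBI dictionary
        if corrected != token and corrected in kbbi:
            return corrected
    # g/j) correction failed or token unknown: keep the token as is
    return token

# Example: normalize_token("kupu2", kbbi, slang_dict, stemmer, correct) -> "kupu-kupu"
```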
2.3. Implementation of the Machine Learning Algorithm

Machine learning methods come in two types: supervised learning and unsupervised learning. Detection/classification problems fall into the supervised learning category, although some references note that they can also be handled with semi-supervised or weakly supervised models. Weak supervision uses the idea that, with only a few labels in the dataset, a classifier can be trained and its outputs used to label the remaining unlabeled data [26], [27]. This research uses the methods applied in the article [14] (the NB, SVM, KNN, AdaBoost, DT, RF, XGBoost, and LR methods) and compares their performance on IG spam comment detection with the IG dataset.

The first method, Naïve Bayes (NB), is based on Bayes' theorem with the naïve assumption of conditional independence between every pair of features given the class [28]. Bayes' theorem, where y is the class and x_1 to x_n are the features, is formulated in Formula (1):

    P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}    (1)

Assuming naïve conditional independence, as in Formula (2):

    P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y)    (2)

NB predicts whether x is categorized as class y based on the highest posterior probability over all data, as in Formula (3):

    P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}    (3)

Since P(x_1, \ldots, x_n) is constant, Formula (3) can be simplified into Formula (4) and Formula (5):

    P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)    (4)

    \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)    (5)

The SVM method is a well-known method that is very good at classifying two (binary) classes. It remains effective when the data dimensionality is high, including when the number of dimensions is greater than the number of samples. It is memory efficient and offers many kernel tricks for various cases [29]. The SVM algorithm is a classifier based on Vapnik's supervised learning model from 1992. Given training data with x attributes (vectors of x dimensions), SVM searches for a hyperplane of x-1 dimensions that separates the training data by category or class. The hyperplane is found by maximizing the distance between classes (the margin). SVM can guarantee high generalizability for future data [30]. SVM solves the optimization problem in Formula (6) [31]. Suppose the training data are labeled tuples (x_i, y_i), where i = 1, 2, ..., n, n is the number of training examples, x_i is the attribute set of the i-th example, and y_i is the class of the i-th training example:

    \min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \xi_i    (6)

subject to the constraints in Formula (7):

    y_i (w^{T} \phi(x_i) + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0    (7)

KNN is a supervised learning method in which new data are classified based on the majority category of their k nearest neighbors. The KNN algorithm uses neighborhood classification to predict new data. KNN for text classification is shown in [32], with an average accuracy reaching 95%. The principle of k-NN is to find the closest distances between the data to be evaluated and their nearest neighbors in the training data, where k is the number of nearest neighbors. The steps of the KNN algorithm are: 1) determine k; 2) calculate the distance (similarity) between the new data and the other data; 3) sort the distances and keep the k closest; and 4) assign the majority class among those nearest neighbors. The (two-dimensional Euclidean) distance formula can be seen in Formula (8):

    d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}    (8)

Gradient boosting algorithms are used for regression and classification problems. There are three elements in gradient boosting: the loss function, the weak learner, and the additive model. The loss function is highly dependent on the dataset, the weak learner makes predictions, and the additive model minimizes the loss function by adding weak learners. The AdaBoost (Adaptive Boosting) algorithm is a meta-algorithm that first fits a classifier on the original dataset and then fits additional copies on the same dataset, re-weighting incorrectly classified data so that subsequent classifiers handle them better [33]. Another boosting algorithm is XGBoost (eXtreme Gradient Boosting). This algorithm combines models with low accuracy to create a model with higher accuracy. XGBoost is based on decision trees and was developed by Tianqi Chen. Because XGBoost was born as a library, it has been implemented in many programming languages, such as C++, Python, R, Julia, and Java. The models supported by XGBoost are the regular Gradient Boosting model, Stochastic Gradient Boosting, and Regularized Gradient Boosting with L1 and L2 regularization [34].

The Extra Tree (ET) algorithm is also developed from decision trees and, like Random Forest, is built as an ensemble. The extra-tree classifier, like RF, makes randomized decisions and randomizes specific data subsets to minimize over-fitting and over-learning [35]. Parameters that can be tuned include the number of trees, the number of features, and the minimum number of samples per split [36]. Several of the best machine learning classification methods are applied to the dataset at this stage: 1) NB, 2) CNB, 3) Linear SVM, 4) SVM with Radial Basis Function (RBF), 5) KNN, 6) AdaBoost, 7) DT, 8) RF, 9) eXtreme Gradient Boosting (XGBoost), 10) Stochastic Gradient Descent (SGD), 11) Extra Tree, and 12) Multi-Layer Perceptron (MLP). The methods are implemented using Python and the scikit-learn library.
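A minimal scikit-learn sketch of these classifiers over TF-IDF features is shown below. The parameter values follow Table 3 in Section 3.1; the toy comments and the loop structure are illustrative, not the authors' original script.

```python
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# Toy rows; the real input is the normalized comment dataset (1 = spam).
comments = ["cek promo di profil aku ya", "follow balik ya kak",
            "selamat ya kak sukses selalu", "fotonya bagus banget"]
labels = [1, 1, 0, 0]

# 1-gram TF-IDF features, as used for all ML runs in this paper.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(comments)

# Parameter values follow Table 3; anything unlisted is scikit-learn's default.
classifiers = {
    "NB": MultinomialNB(),
    "CNB": ComplementNB(),
    "Linear SVM": LinearSVC(random_state=42, tol=1e-5),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "LR": LogisticRegression(solver="saga", max_iter=100),
    "SGD": SGDClassifier(max_iter=1000, tol=1e-3),
    "Extra Tree": ExtraTreesClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(objective="binary:hinge"),
}
for name, clf in classifiers.items():
    clf.fit(X, labels)
    print(name, clf.predict(vectorizer.transform(["ada giveaway cek profil"])))
```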
2.4. Implementation of the Deep Learning Algorithm

This research also uses deep learning methods suitable for sequential text data processing, alongside the shallow machine learning methods. The deep learning methods used in this research are the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (Bi-LSTM), and Transformers (using the SimpleTransformers library). Text classification with deep learning accepts input in the form of a word embedding, which can be Word2Vec by Google [37], Fasttext by Facebook [38], or GloVe [39]. After the word embedding is created through a training process on the dataset itself, it is used as the input of the classification architecture's embedding layer (a gensim training sketch follows at the end of this subsection).

The RNN architecture is a deep learning architecture that processes sequential data over time steps. RNN is suitable for text data or other time series, such as forecasting/prediction [40]. For text data in particular, the RNN architecture processes a sequence of sentences word by word (token by token). RNNs have several drawbacks: 1) the operations are sequential, so processing cannot be parallelized; 2) vanishing gradients may occur; and 3) the training process takes a long time [39].

The LSTM architecture tries to overcome the vanishing-gradient weakness of RNNs [41]. LSTM is usually used for text processing and time series data, e.g., for predicting sea level [42]. LSTM uses several gates in its architecture: an input gate, a forget gate, and an output gate. The LSTM architecture avoids vanishing gradients and lets the system forget less important information. Some LSTM variants are the GRU (Gated Recurrent Unit) [43] and Bi-LSTM [44]. A Bi-LSTM uses two LSTM layers, one receiving the input in the forward direction and the other in the reverse direction; this effectively increases the information and processing context available to the network. The GRU is a variant of the standard LSTM with modified gates: it has two gates (reset and update), while the LSTM has three (input, output, and forget).

RNN, LSTM, GRU, and Bi-LSTM, as discussed, still share weaknesses: 1) they cannot be processed in parallel; 2) there is always a chance that a vanishing gradient will occur; and 3) the training process is slower [45]. Google Brain created a new architecture called the Transformer to overcome these problems. The Transformer architecture relies only on the attention mechanism [46]. Compared with LSTM, the Transformer trains faster, has no vanishing gradients, and can process sequences in parallel. The Transformer achieves state-of-the-art (SOTA) results in Neural Machine Translation (NMT) [45].
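Since the recurrent models above consume embeddings trained on the dataset itself, a minimal gensim sketch, with the embedding hyperparameters later listed in Table 6 (300 dimensions, min count 1, 100 iterations), could look as follows; the toy token lists are illustrative.

```python
from gensim.models import FastText, Word2Vec

# tokenized_comments: lists of tokens from the pre-processing step (toy rows here).
tokenized_comments = [["cek", "promo", "profil"], ["selamat", "sukses", "kak"]]

# Hyperparameters per Table 6: 300-dimensional vectors, min_count 1, 100 epochs.
w2v = Word2Vec(sentences=tokenized_comments, vector_size=300,
               min_count=1, epochs=100)
ft = FastText(sentences=tokenized_comments, vector_size=300,
              min_count=1, epochs=100)

print(w2v.wv["cek"].shape)        # a single 300-dimensional word vector
embedding_matrix = w2v.wv.vectors  # used to initialize the Keras Embedding layer
```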
2.5. Performance Evaluation

The last stage of this research is performance evaluation, to compare the ML and DL methods on spam comment detection with the IG dataset. The evaluation is based on the confusion matrix, as shown in Table 2. A confusion matrix (CM) is a simple matrix for evaluating classification performance by a machine/computer. CM is a table with a minimum of four combinations of machine-predicted and actual values, and it supports binary as well as multi-class classification.

Table 2. Confusion Matrix

                          Predicted Negative      Predicted Positive
Actual Negative           True Negative (TN)      False Positive (FP)
Actual Positive           False Negative (FN)     True Positive (TP)

Where:
- True negative (TN) = the number of negative data correctly categorized as the negative class
- False positive (FP) = the number of negative data incorrectly categorized as the positive class
- False negative (FN) = the number of positive data incorrectly categorized as the negative class
- True positive (TP) = the number of positive data correctly categorized as the positive class

Further calculations can be carried out from the confusion matrix in Table 2 to get accuracy, recall, precision, and F-measure with Formula (9) to Formula (12):

    Accuracy = \frac{TN + TP}{TN + FP + FN + TP}    (9)

    Recall \; (Sensitivity) = \frac{TP}{TP + FN}    (10)

    Precision = \frac{TP}{TP + FP}    (11)

    F1 \; Score = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}    (12)
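As a quick numerical check of Formulas (9) to (12), the following sketch computes the four scores from a scikit-learn confusion matrix on toy labels (illustrative only):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1]   # 1 = spam, 0 = not spam (toy labels)
y_pred = [1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", (tn + tp) / (tn + fp + fn + tp))   # Formula (9)
print("recall   :", tp / (tp + fn))                    # Formula (10)
print("precision:", tp / (tp + fp))                    # Formula (11)
print("f1       :", 2 * tp / (2 * tp + fp + fn))       # Formula (12)

# The scikit-learn helpers give the same numbers:
print(accuracy_score(y_true, y_pred), recall_score(y_true, y_pred),
      precision_score(y_true, y_pred), f1_score(y_true, y_pred))
```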
3. Result and Discussion

3.1. Result of the Machine Learning Methods

The ML experiments use the configurations in Table 3. Every method is run as a script in batch mode. The dataset is split 80:20 into train and test sets. The ML features are normalized TF-IDF vectors with 1-gram tokens. The experiments show that the best performance, an accuracy and F1 score of 0.93, is achieved by Linear SVM, Logistic Regression, SGD, and Extra Tree (see details in Table 4).

Table 3. The Machine Learning Experiment Parameters

Parameter                     | Value
Python libraries              | TensorFlow, scikit-learn, pandas, NumPy, matplotlib, seaborn, gensim, tqdm, simpletransformers, nltk, string, itertools, xgboost
MultinomialNB, ComplementNB() | default
LinearSVC()                   | random_state=42, tol=1e-5
SVM()                         | C=1.0, gamma='auto'
KNN                           | n_neighbors=3
AdaBoost                      | n_estimators=100, random_state=42
DecisionTree                  | random_state=42
RandomForest                  | max_depth=2, random_state=42
Logistic Regression           | multi_class='multinomial', solver='saga', max_iter=100
eXtreme Gradient Boosting     | objective='binary:hinge'
Stochastic Gradient Descent   | max_iter=1000, tol=1e-3
ExtraTree                     | n_estimators=100, random_state=42
Multi Layer Perceptron        | random_state=42, max_iter=300

Table 4. Machine Learning Methods Performance Results

Method / Algorithm        | Accuracy | Precision | Recall | F1 Score
NB                        | 0.91     | 0.89      | 0.89   | 0.89
CNB                       | 0.92     | 0.89      | 0.92   | 0.90
Linear SVM                | 0.93     | 0.92      | 0.93   | 0.93
SVM Radial Basis Function | 0.64     | 0.32      | 0.50   | 0.39
KNN                       | 0.82     | 0.85      | 0.76   | 0.78
AdaBoost                  | 0.90     | 0.89      | 0.92   | 0.90
Decision Tree             | 0.91     | 0.90      | 0.92   | 0.91
RF                        | 0.64     | 0.32      | 0.50   | 0.39
Logistic Regression       | 0.93     | 0.92      | 0.94   | 0.93
XGBoost                   | 0.89     | 0.88      | 0.91   | 0.89
SGD                       | 0.93     | 0.92      | 0.94   | 0.93
Extra Tree                | 0.93     | 0.92      | 0.93   | 0.93
Multi Layer Perceptron    | 0.91     | 0.90      | 0.90   | 0.90

3.2. Result of the Deep Learning Methods

The deep learning evaluation uses the scenarios in Table 5, divided into six configurations to see the effect of applying stemming, different word embedding vectors, and NLP's state-of-the-art Google Transformer. The Transformer implementation in this research uses the SimpleTransformers Python library, configured according to the scenarios in Table 5 (a sketch of the SimpleTransformers setup is given after Table 11).

Table 5. Deep Learning Scenario Configurations

Configuration Name | Parameters
Configuration 1    | stop words removal, stemming, and Fasttext word embedding
Configuration 2    | stop words removal, without stemming, and Fasttext embedding
Configuration 3    | stop words removal, stemming, and Word2Vec embedding
Configuration 4    | stop words removal, without stemming, and Word2Vec embedding
Configuration 5    | stop words removal and Transformer
Configuration 6    | stop words removal, stemming, and Transformer

The architecture used in this experiment consists of an input layer of 128 dimensions, followed by an embedding layer initialized with the word embedding (built from Word2Vec vectors trained on the dataset, with 300 dimensions). Right after the embedding layer, the model stacks SimpleRNN Keras layers for the RNN model and standard LSTM layers for the LSTM model. The GRU layer is the standard GRU Keras layer, while for the Bi-LSTM model a Bidirectional LSTM layer with return_sequences enabled is used. The stacked model is followed by Dense Keras layers with the relu activation function. The last Dense layer makes the classification decision with the sigmoid activation function, since this is a binary classification problem ('spam' and 'not spam' classes). A minimal Keras sketch of this stack follows.
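The sketch below assumes TensorFlow 2.x. The dimensions and training settings come from Table 6; the recurrent and Dense layer widths (128 and 64) are illustrative, since the paper does not fix them.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_FEATURES, EMBED_DIM, SEQ_LEN = 10000, 300, 128  # per Table 6

# Stand-in for the Word2Vec/Fasttext matrix trained on the dataset (see the
# gensim sketch in Section 2.4); random values keep this sketch self-contained.
embedding_matrix = np.random.rand(MAX_FEATURES, EMBED_DIM)

model = models.Sequential([
    layers.Embedding(MAX_FEATURES, EMBED_DIM, weights=[embedding_matrix],
                     input_length=SEQ_LEN),
    layers.LSTM(128, return_sequences=True),  # swap in SimpleRNN, GRU, or
    layers.LSTM(128),                         # layers.Bidirectional(layers.LSTM(...))
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # binary decision: spam vs not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss")
# Training per Table 6: 80/20 split, 50 epochs, batch size 32, early stopping.
# model.fit(X_train, y_train, validation_split=0.2, epochs=50,
#           batch_size=32, callbacks=[early_stop])
```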
All the detailed hyperparameter configurations of the DL layers can be seen in Table 6, while the accuracy, loss, precision, recall, AUC, and F1 results are reported in Tables 7 to 11.

Table 6. The Hyperparameters in the Experiments

Embedding layer:
  min count: 1
  size (dimension): 300
  iteration: 100
  max features: 10000
  input length: 128
  input dimension: 128
SimpleRNN, LSTM, GRU layers:
  return sequences: true
Dense layer:
  activation: relu and sigmoid
  input: length of training data (11600); input 30%
Model:
  optimizer: adam
  loss: binary cross-entropy
  evaluation metrics: accuracy, precision, recall, area under the curve (AUC), and F-1
  early stopping: val_loss (minimal)
  training/validation split: 80% / 20% (11600 data / 2900 data)
  epochs: 50
  batch size: 32
Computer/software specs:
  processor: Core i5
  RAM: 16 GB
  TensorFlow: 2.3
  GPU: NVIDIA 2 GB

Table 7. Result of Performance Evaluation using Deep Learning Methods (Configuration 1)

DL Method | Acc  | Loss | Prec | Recall | AUC  | F1 Score
RNN       | 0.63 | 0.65 | 0.63 | 0.99   | 0.50 | 0.77
LSTM      | 0.91 | 0.31 | 0.95 | 0.90   | 0.94 | 0.93
Bi-LSTM   | 0.90 | 0.30 | 0.92 | 0.92   | 0.95 | 0.92
GRU       | 0.91 | 0.40 | 0.96 | 0.89   | 0.94 | 0.92

Table 8. Result of Performance Evaluation using Deep Learning Methods (Configuration 2)

DL Method | Acc  | Loss | Prec | Recall | AUC  | F1 Score
RNN       | 0.63 | 0.63 | 0.63 | 1.00   | 0.50 | 0.70
LSTM      | 0.89 | 0.30 | 0.94 | 0.89   | 0.94 | 0.91
Bi-LSTM   | 0.90 | 0.40 | 0.93 | 0.91   | 0.94 | 0.92
GRU       | 0.89 | 0.38 | 0.91 | 0.91   | 0.93 | 0.91

Table 9. Result of Performance Evaluation using Deep Learning Methods (Configuration 3)

DL Method | Acc  | Loss | Prec | Recall | AUC  | F1 Score
RNN       | 0.90 | 0.40 | 0.93 | 0.90   | 0.94 | 0.92
LSTM      | 0.90 | 0.40 | 0.93 | 0.92   | 0.94 | 0.93
Bi-LSTM   | 0.90 | 0.40 | 0.93 | 0.92   | 0.94 | 0.92
GRU       | 0.90 | 0.40 | 0.93 | 0.91   | 0.94 | 0.92

Table 10. Result of Performance Evaluation using Deep Learning Methods (Configuration 4)

DL Method | Acc  | Loss | Prec | Recall | AUC  | F1 Score
RNN       | 0.90 | 0.30 | 0.95 | 0.89   | 0.94 | 0.92
LSTM      | 0.90 | 0.40 | 0.95 | 0.89   | 0.94 | 0.92
Bi-LSTM   | 0.90 | 0.40 | 0.93 | 0.91   | 0.95 | 0.92
GRU       | 0.90 | 0.40 | 0.93 | 0.91   | 0.95 | 0.92

Table 11. Result of Performance Evaluation using Deep Learning Methods in Configurations 5 and 6

Transformer DL Variant                                                    | Acc  | Loss | Prec | Recall | AUC  | F1 Score
SimpleTrans BERT: cahya/bert-base-indonesian-522M (Configuration 5)       | 0.94 | 0.15 | 0.97 | 0.93   | 0.94 | 0.96
SimpleTrans RoBERTa: cahya/roberta-base-indonesian-522M (Configuration 5) | 0.93 | 0.17 | 0.96 | 0.92   | 0.92 | 0.95
SimpleTrans BERT: cahya/bert-base-indonesian-522M (Configuration 6)       | 0.94 | 0.16 | 0.97 | 0.96   | 0.94 | 0.96
SimpleTrans RoBERTa: cahya/roberta-base-indonesian-522M (Configuration 6) | 0.93 | 0.17 | 0.92 | 0.96   | 0.95 | 0.94
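For configurations 5 and 6, a minimal sketch of the SimpleTransformers setup is shown below. The checkpoint names come from Table 11; the training arguments and toy rows are assumptions, since the paper does not reproduce its exact SimpleTransformers configuration.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy rows; the real input is the normalized comment dataset (1 = spam).
train_df = pd.DataFrame({"text": ["cek promo di profil aku ya",
                                  "selamat ya kak sukses selalu"],
                         "labels": [1, 0]})

# For the RoBERTa variant, use ClassificationModel("roberta",
# "cahya/roberta-base-indonesian-522M", ...) instead.
model = ClassificationModel(
    "bert", "cahya/bert-base-indonesian-522M", num_labels=2,
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
    use_cuda=False,  # set True on the NVIDIA GPU used in the experiments
)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["follow balik ya ada giveaway"])
```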
Tables 4 and 7 to 11 show that the implemented ML and DL algorithms achieve reasonably good accuracy. Almost all methods are above 88%; the exceptions are SimpleRNN and, on the ML side, the RBF SVM and RF. The ML methods achieve an accuracy of 0.93 with Linear SVM, Logistic Regression, SGD, and Extra Tree. Bi-LSTM and GRU only achieve an accuracy of 0.90 and an F1 score of 0.92, but the Transformer (SimpleTransformers) method outperforms the others. The results of ML and DL are not significantly different, but the deep learning methods are still better on all performance measures: accuracy, precision, recall, and F1 score. The best deep learning result is obtained by the Transformer (BERT-based) with an accuracy of 0.94. Based on Table 11, configuration 6 (stop words removal and stemming) keeps the accuracy the same, but the recall rises, reaching 0.96. The RNN scenarios that use Fasttext in the embedding layer result in low accuracy, only 63%. This result needs to be investigated further; however, even there the system classifies the spam class better than the non-spam class, with higher accuracy, recall, and F1 score. The difference in accuracy between the balanced and unbalanced datasets is just 0.05.

The elapsed training time for ML and DL varies from 5 minutes to 8 hours. The DL methods have long training times because of their computational complexity, while the ML methods are fast. Most of the ML methods are quite old compared to their newer peers, the Extra Tree and eXtreme Gradient Boosting methods. The MLP method, although included among the ML methods, is a basic form of DL, so its training also takes a very long time compared to the other ML methods. The Transformer model has the longest training and evaluation time, but its performance is the best. The ML algorithms have acceptable average training times with good results (on average, 0.86).

The limitation of this study is that all experiments used only the comment text and did not use emojis. In these experiments, comment text is treated as stand-alone data, not related to the posting data, and the post text has not been used to view the context of comments on a particular post. In future work, emojis will be explored, and the combined use of post and comment text as a single data unit will be carried out. A comment can only be called spam (irrelevant) relative to the post data if the detection process is carried out in the post's context; in further research, the spam detection process will therefore be treated as a classification sub-task called sentence-pair classification to capture that context.

4. Conclusion

This research has analyzed the importance of detecting spam content on social media, focusing on comments on Instagram. The spam comments in question are comments unrelated to the post status. The research experimented with applying ML and DL methods to detect spam comments using the 2017 IG dataset. The accuracy of both ML and DL is in the range of 0.89 to 0.94. The best machine learning methods are Linear SVM, Extra Tree, Logistic Regression, and SGD, with an accuracy of 0.93, while the best deep learning architecture reaches an accuracy of 0.94 using SimpleTransformers BERT (cahya/bert-base-indonesian-522M). The limitation of this study is that all experiments used only the comment text, treated as stand-alone data unrelated to the posting data, without emojis or the post text that would give the comments context. Future work will develop a deep learning architecture for spam comment detection using sentence-pair classification between post and comment, together with emoji features, which have rarely been used in the detection/classification of text on social media.

References

[1] Databooks, "Ini Media Sosial Paling Populer Sepanjang April 2020," Databooks, 2020. https://databoks.katadata.co.id/datapublish/2020/05/25/ini-media-sosial-paling-populer-sepanjang-april-2020 (accessed Nov. 04, 2020).
[2] S. Aiyar and N. P. Shetty, "N-Gram Assisted Youtube Spam Comment Detection," Procedia Computer Science, vol. 132, pp. 174–182, 2018, doi: 10.1016/j.procs.2018.05.181.
[3] A. R. Chrismanto, A. K. Sari, and Y. Suyanto, "Critical Evaluation on Spam Content Detection in Social Media," Journal of Theoretical and Applied Information Technology (JATIT), vol. 100, no. 8, pp. 2642–2667, 2022. [Online]. Available: http://www.jatit.org/volumes/Vol100No8/29Vol100No8.pdf
[4] A. Chrismanto and Y. Lukito, "Klasifikasi Komentar Spam Pada Instagram Berbahasa Indonesia Menggunakan K-NN," in Seminar Nasional Teknologi Informasi Kesehatan (SNATIK), 2017, pp. 298–306.
[5] F. Prabowo and A. Purwarianti, "Instagram online shop's comment classification using statistical approach," in Proceedings - 2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE), 2018, pp. 282–287, doi: 10.1109/ICITISEE.2017.8285512.
[6] A. Chrismanto and Y. Lukito, "Deteksi Komentar Spam Bahasa Indonesia Pada Instagram Menggunakan Naive Bayes," Jurnal Ultima, vol. 9, no. 1, pp. 50–58, 2017, doi: 10.31937/ti.v9i1.564.
[7] W. Zhang and H.-M. Sun, "Instagram Spam Detection," in 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), Jan. 2017, pp. 227–228, doi: 10.1109/PRDC.2017.43.
[8] B. Priyoko and A. Yaqin, "Implementation of naive bayes algorithm for spam comments classification on Instagram," in 2019 International Conference on Information and Communications Technology (ICOIACT), 2019, pp. 508–513, doi: 10.1109/ICOIACT46704.2019.8938575.
[9] N. A. Haqimi, N. Rokhman, and S. Priyanta, "Detection of Spam Comments on Instagram Using Complementary Naïve Bayes," IJCCS (Indonesian Journal of Computing and Cybernetics Systems), vol. 13, no. 3, p. 263, Jul. 2019, doi: 10.22146/ijccs.47046.
[10] A. Chrismanto and Y. Lukito, "Identifikasi Komentar Spam Pada Instagram," Lontar Komputer: Jurnal Ilmiah Teknologi Informasi, vol. 8, no. 3, p. 219, 2017, doi: 10.24843/lkjiti.2017.v08.i03.p08.
[11] A. Chrismanto, Y. Lukito, and A. Susilo, "Implementasi Distance Weighted K-Nearest Neighbor Untuk Klasifikasi Spam dan Non-Spam Pada Komentar Instagram," Jurnal Edukasi dan Penelitian Informatika, vol. 6, no. 2, p. 236, 2020, doi: 10.26418/jp.v6i2.39996.
[12] A. Chrismanto, W. Raharjo, and Y. Lukito, "Design and Development of REST-Based Instagram Spam Detector for Indonesian Language," in Proceedings - 2018 International Seminar on Application for Technology of Information and Communication: Creative Technology for Human Life (iSemantic), Sep. 2018, pp. 345–350, doi: 10.1109/ISEMANTIC.2018.8549725.
[13] A. R. Chrismanto, W. Sudiarto, and Y. Lukito, "Integration of REST-Based Web Service and Browser Extension for Instagram Spam Detection," International Journal of Advanced Computer Science and Applications, vol. 9, no. 12, 2018, doi: 10.14569/IJACSA.2018.091253.
[14] C. Zhang, C. Liu, X. Zhang, and G. Almpanidis, "An up-to-date comparison of state-of-the-art classification algorithms," Expert Systems with Applications, vol. 82, pp. 128–150, 2017, doi: 10.1016/j.eswa.2017.04.003.
[15] M. P. Nugraha, A. Nurhadiyatna, and D. M. S. Arsa, "Offline Signature Identification Using Deep Learning and Euclidean Distance," Lontar Komputer: Jurnal Ilmiah Teknologi Informasi, vol. 12, no. 2, pp. 102–111, Aug. 2021, doi: 10.24843/LKJITI.2021.V12.I02.P04.
[16] I. P. A. E. D. Udayana, M. Sudarma, and N. W. S. Ariyani, "Detecting Excessive Daytime Sleepiness With CNN And Commercial Grade EEG," Lontar Komputer: Jurnal Ilmiah Teknologi Informasi, vol. 12, no. 3, pp. 186–195, Nov. 2021, doi: 10.24843/LKJITI.2021.V12.I03.P06.
[17] P. K. Roy, J. P. Singh, and S. Banerjee, "Deep learning to filter SMS Spam," Future Generation Computer Systems, vol. 102, pp. 524–533, 2020, doi: 10.1016/j.future.2019.09.001.
[18] S. Dutta, T. Saha, S. Banerjee, and S. K. Naskar, "Text normalization in code-mixed social media text," in 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), 2015, pp. 378–382, doi: 10.1109/ReTIS.2015.7232908.
[19] A. Chandra and S. K. Khatri, "Spam SMS Filtering using Recurrent Neural Network and Long Short Term Memory," in 2019 4th International Conference on Information Systems and Computer Networks (ISCON), 2019, pp. 118–122, doi: 10.1109/ISCON47742.2019.9036269.
[20] T. Wu, S. Wen, Y. Xiang, and W. Zhou, "Twitter spam detection: Survey of new approaches and comparative study," Computers & Security, vol. 76, pp. 265–284, Jul. 2018, doi: 10.1016/j.cose.2017.11.013.
[21] A. A. Septiandri and O. Wibisono, "Detecting spam comments on Indonesia's Instagram posts," Journal of Physics: Conference Series, vol. 801, no. 012069, pp. 1–7, 2017, doi: 10.1088/1742-6596/755/1/011001.
[22] R. Wongso, F. A. Luwinda, B. C. Trisnajaya, O. Rusli, and Rudy, "News Article Text Classification in Indonesian Language," Procedia Computer Science, vol. 116, pp. 137–143, 2017, doi: 10.1016/j.procs.2017.10.039.
[23] F. Z. Ruskanda, "Study on the Effect of Preprocessing Methods for Spam Email Detection," Indonesian Journal on Computing (Indo-JC), vol. 4, no. 1, p. 109, 2019, doi: 10.21108/indojc.2019.4.1.284.
[24] W. Etaiwi and G. Naymat, "The Impact of applying Different Preprocessing Steps on Review Spam Detection," Procedia Computer Science, vol. 113, pp. 273–279, 2017, doi: 10.1016/j.procs.2017.08.368.
[25] C. Mus, "10+ Akun Instagram Dengan Followers Terbanyak Di Indonesia," musdeoranje.net, 2015. http://www.musdeoranje.net/2016/08/akun-instagram-dengan-followers-terbanyak-di-indonesia.html (accessed Oct. 13, 2021).
[26] D. Mekala and J. Shang, "Contextualized Weak Supervision for Text Classification," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 323–333, doi: 10.18653/v1/2020.acl-main.30.
[27] K. Hammar, S. Jaradat, N. Dokoohaki, and M. Matskin, "Deep Text Mining of Instagram Data without Strong Supervision," in 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 158–165, doi: 10.1109/WI.2018.00-94.
[28] H. Zhang, "The Optimality of Naive Bayes," in Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, 2004, pp. 562–567. [Online]. Available: http://www.aaai.org/Library/FLAIRS/2004/flairs04-097.php
[29] Scikit-learn, "1.4. Support Vector Machines — scikit-learn documentation," 2021. https://scikit-learn.org/stable/modules/svm.html (accessed Nov. 19, 2020).
[30] Suyanto, Data mining untuk klasifikasi dan klasterisasi data, 1st ed. Bandung: Informatika, 2017.
[31] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011.
[32] P. Soucy and G. W. Mineau, "A simple KNN algorithm for text categorization," in Proceedings - IEEE International Conference on Data Mining (ICDM), 2001, pp. 647–648, doi: 10.1109/icdm.2001.989592.
[33] Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997, doi: 10.1006/jcss.1997.1504.
[34] J. Brownlee, "A Gentle Introduction to XGBoost for Applied Machine Learning," Machine Learning Mastery, 2018. https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/ (accessed Dec. 16, 2020).
[35] N. Bhandari, "ExtraTreesClassifier. How does ExtraTreesClassifier reduce… ," Medium, 2016. https://medium.com/@namanbhandari/extratreesclassifier-8e7fc0502c7 (accessed Dec. 16, 2020).
[36] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Machine Learning, vol. 63, pp. 3–42, 2006, doi: 10.1007/s10994-006-6226-1.
[37] R. N. Waykole and A. D. Thakare, "A Review of Feature Extraction Methods for Text Classification," International Journal of Advance Engineering and Research Development, vol. 5, no. 04, pp. 351–354, 2018.
[38] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning word vectors for 157 languages," in LREC 2018 - 11th International Conference on Language Resources and Evaluation, 2018, pp. 3483–3487.
[39] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," in IJCAI International Joint Conference on Artificial Intelligence, 2016, pp. 2873–2879.
[40] Y. Lukito and A. Chrismanto, "Recurrent neural networks model for WiFi-based indoor positioning system," in 2017 International Conference on Smart Cities, Automation & Intelligent Computing Systems (ICON-SONICS), Nov. 2017, pp. 121–125, doi: 10.1109/ICON-SONICS.2017.8267833.
[41] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735.
[42] A. W. Ramadhan, D. Adytia, D. Saepudin, S. Husrin, and A. Adiwijaya, "Forecasting of Sea Level Time Series using RNN and LSTM Case Study in Sunda Strait," Lontar Komputer: Jurnal Ilmiah Teknologi Informasi, vol. 12, no. 3, p. 130, 2021, doi: 10.24843/lkjiti.2021.v12.i03.p01.
[43] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP 2014 - Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734, doi: 10.3115/v1/d14-1179.
[44] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997, doi: 10.1109/78.650093.
[45] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 2017-December, 2017, pp. 5999–6009.
[46] D. Bahdanau, K. H. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations (ICLR 2015), Conference Track Proceedings, 2015, pp. 1–15.