Engineering, Technology & Applied Science Research, Vol. 12, No. 5, 2022, 9364-9371
www.etasr.com

Data Mining Regarding Cyberbullying in the Arabic Language on Instagram Using KNIME and Orange Tools

Shumaa Saeed Alzahrani
Computer Science and Engineering Department, College of Computers and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia
shayma.s.s.alzahrani@gmail.com

Received: 20 July 2022 | Revised: 15 August 2022 | Accepted: 22 August 2022

Abstract-This paper deals with data mining of verbal bullying by Instagram users. It tracks people who repeatedly behave abusively and may cause harm to other persons or groups. In this work, a dataset of verbal bullying in the Arabic language was extracted from Instagram comments, and the entries were classified as regular verbal bullying and suspicious verbal bullying. The KNIME and Orange open-source data mining tools were utilized to discover comments that involved verbal bullying on Instagram and to delete such comments automatically and immediately as users post them. The classification algorithms Rule-Based in KNIME and Select Rows in Orange were used.

Keywords-KNIME tool; Orange tool; Instagram; data mining; Instagram comments; cyberbullying; verbal bullying

I. INTRODUCTION

Social media platforms have developed rapidly during the last few years. One of them is Instagram. It is free, contains online data, and provides an easy form of communication. Users can talk in private and upload, tag, like, comment on, and share posts. Machine learning powers the app. Instagram's feed ranking is constantly adapting and improving based on new data [1]. The Instagram algorithm predicts how much you care about a post. This way, one can find and classify bullies based on trending [1].
Also, people who use offensive or abusive words can be identified with data mining tools.

A. How the Instagram Algorithm Works

Six key factors influence the Instagram algorithm for feed posts: interest, relationship, timeliness, frequency, following, and usage. The Instagram algorithm is constantly changing. The more the Instagram algorithm "likes" a post, the higher it will appear in your feed. This ranking is based on "past behavior on similar content and possibly machine vision analyzing the post's actual content." What you see in your Instagram feed is a mixture of all your behaviors on Instagram [1]: the people you communicate with, the stories you watch, the individuals you are tagged with, and the topics you comment on and like. Comments, likes, reshares, and views are the most critical engagements for feed ranking, which is useful to know when you prepare content and captions [1].

B. Comments of All Lengths Count as Engagement

The Instagram algorithm counts even comments that are fewer than 3 words long. Instagram comments have become essential sources of knowledge for making fast and informed decisions and understanding how people behave in the real world. National and human rights organizations are now tracking social media [2].

C. Motivation

Bullying includes repetitive and violent physical, verbal, or emotional actions. In this paper, the target is verbal bullying, which includes calling names, mocking, taunting, threatening, or verbally assaulting. Bullying can make you feel powerless, ashamed, depressed, or even suicidal. Detecting bullying can assist authorities in taking appropriate action by copying data, deleting them from public comments, or imposing fines. Using social media sites as data providers may be an effective mechanism for protecting ourselves. Due to evolving technology, bullying is no longer confined to schoolyards or street corners but can happen at home via phone calls, texts, emails, and social media.
Cyberbullies stalk, attack, or humiliate victims using digital technology. Cyberbullying, unlike conventional bullying, does not involve face-to-face contact and is not limited to a few people at a time. It also does not require physical strength or many bullies. The embarrassment can be shared with hundreds or thousands of people online with just a few clicks. Cyberbullying may involve sending threatening or degrading messages via text, email, social media posts, or instant messaging, as well as breaking into an email account or stealing someone's online identity. Some cyberbullies may set up a website or a social media account to harass a victim. The approaches used to cyberbully are as diverse and creative as the technologies accessible to bullies.

The effects of cyberbullying and traditional bullying are similar. They make victims feel angry, hurt, scared, powerless, hopeless, lonely, embarrassed, and guilty. A victim's mental health is likely to deteriorate, and the victim is more likely to experience mental health issues like low self-esteem, depression, PTSD, or anxiety. Because most cyberbullying on Instagram is anonymous, the victims do not know who is targeting them, which can make them feel even more threatened, and it can embolden bullies, who think that because they are anonymous online, they are less likely to be exposed. Because cyberbullies cannot see the victim's reaction, they will sometimes go further with their harassment or mockery than they would if the victims were personally present.

Corresponding author: Shumaa Saeed Alzahrani

Arabic speakers post both formal and informal comments on social media.
The formal form of Arabic is Modern Standard Arabic (MSA), while the informal form is Dialectal Arabic (DA), the regional spoken varieties used for everyday contact in Arab countries. Compared to the most common languages, such as English, dealing with Arabic text poses substantial challenges. Arabic has a wide range of grammatical forms, word synonyms, and meanings, which depend on factors such as word order and diacritics. There is an additional difficulty when dealing with dialect or colloquial language, which is commonly used in Instagram comments.

D. Research Questions

The main goal of this paper is to demonstrate the ability to detect verbal cyberbullying from social media comments, with Instagram being used as a case study. To identify bullies' comments, suspicious verbal cyberbullying terms, and verbal cyberbullying words, the following are the key research questions and sub-questions I hope to answer:
• RQ1: How can an Arabic dataset of verbal cyberbullying be built?
  • RQ1.1: How can an Arabic dataset involving verbal cyberbullying in different dialects be built?
  • RQ1.2: How can a verbal cyberbullying dataset be appropriate for this study?
  • RQ1.3: How can the reliability of dataset classes be ensured?
• RQ2: How is verbal cyberbullying distinguished from non-related event comments?
  • RQ2.1: What is the most effective tool for detecting verbal cyberbullying?
  • RQ2.2: What are the best machine learning tools that improve the performance of the approaches?

E. Research Goals

This section addresses the paper's research objectives, followed by a discussion of the targets and their justifications. The following are the main objectives of this paper:
• To create an Arabic Instagram dataset of verbal cyberbullying comments that identifies known and suspicious verbal cyberbullying words in comments in Arabic.
• To determine the best method for detecting verbal cyberbullying by comparing the results of the KNIME and Orange tools to discover the most successful supervised learning strategy.
Most relevant studies have created Instagram datasets for testing verbal cyberbullying detection approaches. Many datasets have been built for commonly used languages such as English, but Arabic has received less attention. To the best of my knowledge, no one has performed or identified verbal cyberbullying detection in Arabic. As a result of the increasing demand for Arabic datasets, I created a dataset specifically to assess my targeted verbal cyberbullying detection system.

Because unsupervised methods are typically ineffective at detecting verbal cyberbullying, researchers must supervise their approaches. Most unsupervised approaches use burst detection, which compares constructed words to verbal bullying word frequencies in comments. The burst behavior of specific words may not be verbal cyberbullying. For instance, not all sentences in Arabic that include the word "animal" (حيوان) are verbal cyberbullying. Table I shows examples of verbal cyberbullying-related comments and non-related verbal cyberbullying comments.

TABLE I. INSTAGRAM COMMENT EXAMPLES
Comment type | English translation | Arabic comment
Non-related verbal cyberbullying comment | "Mara [a type of rodent] animal" | "حيوان المارا"
Related verbal cyberbullying comment | "An animal, may God suffice us of him" | "حيوان حسبي الله عليه"

Both comments in Table I use the same word. This word may indicate bursts, but it is not always indicative of verbal cyberbullying. Compared to unsupervised approaches, the detection domain of supervised approaches is small.

II. RELATED WORK

Most published studies concentrate on cyberbullying identification strategies for commonly used languages like English, with Arabic gaining less attention. The first part of this section is an overview of the most influential cyberbullying identification studies in English social media or SNS. The following subsection presents studies on cyberbullying identification in Arabic social media or SNS.
In terms of cyberbullying detection, I have divided the reviewed papers into supervised and unsupervised approaches. Authors in [11] proposed a solution to dispose of verbal cyberbullying. They suggested using a new feature selection technique for the nearest neighbor classifier, which involves summarizing the original training documents using a measure of sentence importance. The two measures of sentence similarity used in their method for summarizing a single document were the frequency of the terms in a sentence and the similarity of that sentence to other sentences. After the researchers ranked all sentences, they chose the best-ranking sentences for a summary (within a threshold limitation). The researchers took every document's summary from the corpus and entered it into a new document used for summarization evaluation.

In [12], the effort focused on classifying documents, a supervised learning technique. Text preprocessing, feature extraction, and classification are the phases that make up the document categorization process. The study evaluated the performance of two classifiers (KNN and Naive Bayes) and specific feature selection strategies, used alone or in combination, in terms of accuracy, average precision, precision, and recall. The researchers trained each experiment's classifiers using a custom dataset. The results showed that the Naive Bayes classifier outperformed the other classifier in several instances.

In [13], the authors conducted a comparative study using the performance evaluation metrics of accuracy, precision, and F-measure. Three algorithms were used for cyberbullying classification, i.e., Naive Bayes, SVM, and C4.5.

A. Event Detection in Arabic Social Media

In [6], the researchers built a text corpus focusing on two common Arabic dialects on Twitter.
They proposed a 3-level hierarchical annotation schema for hate and offensive language characterization. For hate speech, their emphasis was on 4 types: religion, ethnicity, nationality, and gender. For offensive speech, they focused on posts containing unacceptable language or general profanity. Based on machine learning (SVM, Naive Bayes, logistic regression) and deep learning (CNN, LSTM, and GRU), they trained numerous 2-class, 3-class, and 6-class hate speech classifiers using a panoply of feature extraction techniques, including unigram, word, and character n-grams, word embeddings (random, skip-gram, CBOW, and fastText), and contextual word embeddings (multilingual BERT). The researchers observed that deep learning was superior to machine learning across the 3 classification tasks. Among the deep neural networks, the CNN+mBERT model outperformed all the other learned models across the 3 prediction tasks, with 87.05% for the 2-class task, 78.99% for the 3-class task, and 75.51% for the 6-class task.

In [7], the researchers presented a scheme to detect cyberbullying messages in Arabic social media streams (Twitter and YouTube). The detection algorithm used a corpus of the offensive words most used among Arab youth. The proposed scheme involved the following steps: (i) data cleaning and preprocessing, (ii) extracting bullying keywords and attributing weights, (iii) detecting cyberbullying comments, and (iv) calculating the bullying strength and classifying the comments. This scheme only focused on labeling the comments as bullying or non-bullying and on decision making.

B. Existing Arabic Datasets

Authors in [3] published one of the first studies on Arabic abusive language identification, which included the development of an Arabic dataset of abusive comments. The dataset contained 1,100 tweets gathered from controversial accounts and hashtags. Three annotators classified the dataset as pornographic, offensive, or clean.
The authors used a pattern-based Twitter search to build a seed word list of 228 obscene Arabic words for classification. Then, based on their use of obscene words in the seed list, they separated Twitter users into clean and obscene. They also compiled a longer list of potentially obscene terms by removing only unigram and bigram words. They assessed the utility of both lists as features for categorizing tweets as obscene or clean. Experiments revealed that combining the seed word list with the extended list generated the best F1 score of 60%.

Authors in [4] presented a more detailed dataset for religious hate speech in dialectal Arabic. The dataset comprised 6,600 tweets collected using religion-related keywords. The researchers used crowdsourcing to classify the tweets as hateful or not and whether they targeted religious groups. They looked at various features, including lexicon-based and n-gram features, as well as standard machine learning algorithms. They also used neural networks like GRU and LSTM to evaluate their theories. Their experiments showed that GRU had the highest prediction accuracy of 77%.

Authors in [5] created the Levantine Hate Speech and Abusive Behavior (L-HSAB) Twitter dataset, which included 5,812 tweets categorized as normal, hateful, or abusive. The researchers divided the learning tasks into 2-class (abusive, normal) and 3-class (abusive, hateful, normal) classification tasks for model validation. They tried various n-gram ranges, such as unigrams, bigrams, and trigrams, as features. They compared SVM and Naive Bayes classification performance and found that Naive Bayes outperformed SVM with an F1 score of 89.6% for 2-class classification and 74.4% for 3-class classification.

III. ARABIC CYBERBULLYING DATASET

One of the difficulties in detecting Arabic cyberbullying is the small number of Arabic datasets. To detect cyberbullying, I wanted to create a dataset with comments written in MSA, Saudi, and other dialects.
In this section, the dataset of Arabic comments from Instagram that I made to detect cyberbullying is presented, and the research question and sub-questions RQ1, RQ1.1, RQ1.2, and RQ1.3 are answered.

A. Verbal Cyberbullying Instagram Comment Collection

Two key steps were followed to establish a dataset to detect cyberbullying. In the first step, comments were manually collected by keywords. In the second step, the collected comments were filtered into two classes: bully or positive. The other way was by verbal keywords. The comments were manually collected from the Instagram website through the author's personal account. Instagram was chosen because it has a high volume of cyberbullying comments, although it is not open source. Thus, I created the dataset in an Excel file to use in the KNIME and Orange tools. Comments were collected for a period of 11 months between January and November of 2021. By using search terms for cyberbullying, a keyword list was prepared, for example:
• { ي -ي غبي -ياأغبياء-ياغبيه-ياغبي-إغبياء-أغبياء-اغبياء - غبية - غبيه -غبي ي أغبياء - ي اغبياء- غبيه }. These words mean "stupid" in different ways in the Arabic language.
• { أنت -انت حمار -يا حمارة-ياحمارة -يا حماره-ياحماره- يا حمار - ياحمار النك حمار -انتي حمارة-انتي حماره –حماره ي -ي حمار - إنت حمار - حمار }. These words mean "donkey" in different ways in the Arabic language.

Cyberbullying comments were collected based on a set of keyword lists related to calling names, mocking, taunting, threatening, or verbally assaulting, in singular and plural forms. Approximately 1,500 comments were collected for cyberbullying, with about 1,000 comments filtered by 1,857 keywords. Figure 1 shows a part of this dataset. Most comments were collected in 6 months. They were saved in an Excel spreadsheet, which is an excellent choice for storing Instagram data in tables. In addition, because Excel is a spreadsheet that consists of field and value pairs, storing comments is easy.

Fig. 1. A section of table comments and verbal bullying keywords.

IV. PROPOSED APPROACHES

The goal was to examine cyberbullying comments using keywords and categorize them into two types: cyberbullying (known and suspicious) and non-cyberbullying. I used the KNIME and Orange tools to evaluate two different methods. Both tools separate cyberbullying comments from non-cyberbullying comments by running workflows that use many nodes to classify data. The two tools were evaluated and compared. In this section, the research question and sub-questions RQ2, RQ2.1, and RQ2.2 are addressed.

A. Utilized Methodologies

In cyberbullying detection, classification is usually used for specific cyberbullying detection, while clustering is generally used for unspecific cyberbullying detection. Two methods were assessed to detect cyberbullying on Instagram. The comments were manually gathered and the dataset was created. The suggested methods aimed to identify a specific form of cyberbullying. As a result, only supervised learning methods were used. Because social media features such as follower counts, mention counts, and message lengths do not apply to the cyberbullying detection task, I focused the cyberbullying detection task on the textual content of the comments. The first method used the KNIME tool to identify cyberbullying and non-cyberbullying comments. The second method used the Orange tool. The two methods were compared to see if breaking down the issue of cyberbullying identification into two phases improved its effectiveness.
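As an illustration of restricting detection to the textual content of comments, the following minimal Python sketch flags comments that contain keyword variants. It is not part of the KNIME or Orange workflows used in this paper. The function names (`normalize`, `find_keywords`), the short keyword sample, and the normalization rules (removing diacritics and tatweel, unifying alef forms, mapping ة to ه and ى to ي) are assumptions chosen for illustration only.

```python
import re

# Hypothetical sample of keywords; the paper's full list is much longer.
KEYWORDS = ["غبي", "غبية", "أغبياء", "حمار", "حمارة"]

# Arabic short-vowel marks (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef written with hamza or madda.
_ALEF_FORMS = re.compile("[أإآ]")


def normalize(text):
    """Reduce common spelling variation (assumed rules, not from the paper)."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_FORMS.sub("ا", text)
    # Treat teh marbuta as heh and alef maksura as yeh.
    return text.replace("ة", "ه").replace("ى", "ي")


def find_keywords(comment, keywords=KEYWORDS):
    """Return the keywords whose normalized form occurs in the comment.

    Substring matching is used so that attached forms such as a
    vocative prefix written without a space are still found.
    """
    normalized_comment = normalize(comment)
    return [kw for kw in keywords if normalize(kw) in normalized_comment]


if __name__ == "__main__":
    for comment in ["ياغبيه", "صباح الخير"]:
        hits = find_keywords(comment)
        label = "candidate" if hits else "no keyword"
        print(comment, label, hits)
```

A keyword hit only marks a comment as a candidate: as Table I shows, the same word can occur in comments that are not verbal cyberbullying, so a labeled class is still needed to confirm the result.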
To reduce the negative impact of noisy and informal comment text on both proposed methods, it is recommended to write the keywords in several ways, including informal chat-style spellings, so that relevant comments are detected more effectively.

B. Data Mining Tools

In this paper, cyberbullying is detected through data mining using the open-source tools KNIME and Orange. These tools exist because of the massive amounts of data now available, for which traditional statistical methods are no longer useful. In the late 1980s, much research appeared to solve these problems, searching for solutions that combined several disciplines, including statistics, databases, artificial intelligence, distinguishing different models, and analog computing. Then, data mining and knowledge discovery emerged, which proved to be successful solutions for analyzing vast amounts of data by transforming them from accumulated and incomprehensible data into valuable information that could be exploited and used [8]. Data mining is the process of analyzing data from different perspectives, drawing relationships between them, and summarizing them into useful information.

1) KNIME Tool

KNIME makes understanding data and developing data science workflows and reusable components accessible by being intuitive, transparent, and constantly incorporating new technologies [9].

2) Orange Tool

The Orange tool is open-source machine learning and data visualization software. With a comprehensive and diverse toolbox, it builds data analysis workflows visually [10]. The data visualizations that are part of Orange help find hidden data patterns, provide intuition behind data analysis procedures, and support collaboration between data scientists and domain experts. Scatter plots, box plots, and histograms are among the visualization widgets available, as are model-specific visualizations such as dendrograms, silhouette plots, and tree visualizations.
Many other visualization tools, such as network visualizations, word clouds, and geographical maps, are available as add-ons. Interactive visualizations allow exploratory data analysis. A user can pick interesting data subsets directly from plots, graphs, and data tables and mine them in downstream widgets. For instance, a user can cross-validate logistic regression on a dataset and map some misclassifications to the two-dimensional projection. It is simple to transform Orange into a tool that allows domain experts to explore their data, even if they have little experience with statistics or machine learning.

C. Cyberbullying Approaches

1) KNIME Tool

a) Phase 1

The workflow in Figure 2 represents the data flow between different nodes, starting with Excel Reader, then moving on to Tika Language Detector (to recognize the languages used), Column Filter, Filter Apply Row Splitter, Row Filter, Rule-based Row Filter, and finally Excel Writer.

Fig. 2. The workflow of extracting bullying terms by rule.

The working of nodes 128, 127, and 281 has been explained above.
• Column Filter: This node allows columns to be filtered from the input table while the remaining columns are passed to the output table. Within the dialog, columns can be moved between the Include and Exclude lists.
• Filter Apply Row Splitter: This node splits the input according to the filter definitions, either given in the input table itself or optionally as an additional model input. Filter definitions are only applied if an additional model is provided as input. If the input contains a filter defined on a column not present in the input table, the node will not fail but will display a warning message.
• Row Filter (Figure 3): This node allows for row filtering according to specific criteria.
It can include or exclude certain ranges (by row number), rows with a particular row ID, or rows with a specific value in a selectable column (attribute). The node does not change the domain of the data table. In other words, the upper and lower bounds or the possible values in the table spec are not adapted, even if one of the bounds or values is fully filtered out. Figure 3 shows the configuration dialog of the node.

Fig. 3. Row Filter output.

Fig. 4. Rule-based Row Filter output.

• Rule-based Row Filter (Figure 4): This node takes a list of user-defined rules and tries to match them to each row in the input table. The row is selected for inclusion if the first matching rule has a TRUE outcome. Otherwise (i.e., if the first matching rule yields FALSE), it will be excluded. If no rule matches, the row will be excluded. Inclusion and exclusion may be inverted (see the options in Figure 4).

In the dialog in Figure 4, I used the simple rule $Col0$ IN ("Bully") => TRUE to extract only the cyberbullying comments, with the comments classified as bullying (1) or positive (0). Figure 5 shows a part of the cyberbullying comment results using the Bully term rule, which extracted 999 comments out of 1,500.

Fig. 5. Part of cyberbullying comments results by Bully term.

b) Phase 2

The workflow in Figure 6 represents the data flow used to extract cyberbullying comments by verbal bullying (VB) keywords.

Fig. 6. The workflow used to extract cyberbullying by VB Keywords.

Fig. 7. Results of extracting cyberbullying comments by VB Keywords.

The results in Figure 7 represent 1,061 cyberbullying comments, 64 of them being positive. The KNIME tool is inaccurate because it does not handle Arabic as well as it handles English. Phase 1 in KNIME has 999 correct results out of 1,500.
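The first-match behavior of the Rule-based Row Filter described above can be sketched in plain Python. This is an illustrative emulation under stated assumptions, not KNIME code: the function `rule_based_row_filter`, the row dictionaries, and the sample comment values are hypothetical, and only the column name `Col0` and the class values `Bully` and `Positive` come from the paper.

```python
def rule_based_row_filter(rows, rules, include_on_true=True):
    """Emulate first-match rule filtering over a list of row dicts.

    rules is an ordered list of (predicate, outcome) pairs. The first
    predicate that matches a row decides its outcome; a row that matches
    no rule is treated as FALSE. With include_on_true=False the
    selection is inverted.
    """
    selected = []
    for row in rows:
        outcome = False  # default when no rule matches
        for predicate, result in rules:
            if predicate(row):
                outcome = result
                break  # only the first matching rule counts
        if outcome == include_on_true:
            selected.append(row)
    return selected


# Hypothetical rows standing in for the Excel dataset.
rows = [
    {"Col0": "Bully", "Comment": "comment 1"},
    {"Col0": "Positive", "Comment": "comment 2"},
    {"Col0": "Bully", "Comment": "comment 3"},
]

# Counterpart of the rule: $Col0$ IN ("Bully") => TRUE
rules = [(lambda row: row["Col0"] in ("Bully",), True)]

bully_rows = rule_based_row_filter(rows, rules)
other_rows = rule_based_row_filter(rows, rules, include_on_true=False)
print(len(bully_rows), "bully rows,", len(other_rows), "other rows")
```

Because rows that match no rule are treated as FALSE, the inverted call returns every row not labeled Bully, which mirrors the include/exclude option mentioned for the node.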
In Phase 2, there are 61 wrong comments.

2) Orange Tool

a) Phase 1

The workflow in Figure 8 represents extracting cyberbullying comments by using the Bully term through a categorical target role in the File node description. The file classifies bully and positive comments. The results are fairly accurate. There were 999 cyberbullying comments out of 1,498, and the workflow ignored the rest.

Fig. 8. Workflow for classifying cyberbullying and non-cyberbullying comments by the Bully term.

The Select Rows dialog in Figure 9 represents the condition pattern, in which the classification is a column in the original file, together with the type of condition. Bully is one of the values in the classification column. The other value is Positive, for non-cyberbullying comments.

Fig. 9. Select Rows dialog with condition pattern.

Figure 10 displays the results of extracting 999 cyberbullying comments by the Bully term out of 1,500 comments.

Fig. 10. Data Table node results for extracting 999 cyberbullying comments by Bully term.

b) Phase 2

I changed the Select Rows node description in Figure 8. I used the condition in the Select Rows node that the comment must contain one of the following verbal bullying terms (اوعى تسدى المصريين ،ايران ماتنالم فيكم عيل، استوت مثل المومياء، اسلوبها زج، اسلوبج الخايس، اعصابك التنجلطين ههههههه،اعطتها اكبر من حجمها، افلس اخالقيا من بعد ماافلس فنيا ، اقلبي وجهك ، اذا تسئل على صراويل العراقين اسئل امك عنها اكيد محتفضه فيهم للذكره), as shown in Figure 11.

Fig. 11. Extracting cyberbullying comments with more than one VB keyword.

Fig. 12. Result of extracting cyberbullying comments with more than one VB keyword.

The output in Figure 12 represents the 10 rows selected by the condition from the original file of 1,500 comments and classified as cyberbullying.

c) Phase 3

I changed the Select Rows node description in Figure 8. I used the condition in the Select Rows node that the comment must contain the cyberbullying term (تبقون مذلولين), as shown in Figure 13. The output in Figure 14 represents one row in the original file of 1,500 comments classified as cyberbullying.

Fig. 13. Select Rows with condition comment contains VB keywords.

Fig. 14. Workflow result.

Phase 1 in the Orange tool gives a correct result. Phase 2 extracts cyberbullying comments by just one cyberbullying term. In Phase 3, the correct answer can be shown only when there are fewer than 100 cyberbullying terms.

D. Comparison and Evaluation

In this paper, the Orange and KNIME tools were used to classify Instagram comments under the cyberbullying and non-cyberbullying categories. Two data mining methods were applied: the first used two classes, Bully and Positive, and the second used VB keywords. Each tool has its advantages and disadvantages. The advantages of the KNIME tool are that it can deal with an extensive dataset and with different types of files, offers a huge variety of components, has conditions that are easy to code and make the tool more flexible, and has excellent performance. The disadvantages are the need to use a special node to define the language, accurate results for the English language but inaccurate results for the Arabic language, the inability to use the Remove Punctuation node to get the correct result in the Arabic language, and unclear descriptions of how to use the tool. The advantages of the Orange tool are that there is no need to define the language, results are accurate when there are fewer than 100 VB keywords, the nodes are easy to work with, and the concepts of the tool are easy to understand.
The disadvantages are that it cannot deal with a large dataset, every single VB keyword must be chosen every time, and it cannot make a condition on comments containing more than one VB keyword.

TABLE II. COMPARISON BETWEEN KNIME AND ORANGE
Tool | Algorithm | Phase 1 | Phase 2 | Phase 3 | Notes
KNIME | Rule-based | 999 of 1,500 | 1,061 of 1,500 | - | Phase 2 has wrong results.
Orange | Select Rows | 999 of 1,500 | Cyberbullying terms < 100 of 1,500 | 1 of 1,500 | Phase 2 allows small data. Phase 3 allows one condition at a time.

E. Conclusion

The Orange and KNIME tools were used in this paper to data mine cyberbullying comments and distinguish them from non-cyberbullying comments on Instagram. I extracted cyberbullying in two ways, one using VB keywords and the other classifying comments as Bully or Positive. In KNIME, I got inaccurate results with the large dataset, while in Orange, I got accurate results with fewer than 100 VB keywords. The results in both tools in the second way were accurate.

V. CONCLUSIONS AND FUTURE WORK

The driving question behind this study is "How can you detect cyberbullying in social media?" In this paper, emphasis was given to detecting name calling, mocking, taunting, threatening, or verbal abuse on Instagram. I addressed a complex problem in Arabic social media and carried out the key research goals. To assess cyberbullying detection methods, I created a cyberbullying dataset that included written comments in MSA, Saudi, and other Arabic dialects. I created the dataset taking cyberbullying into consideration. I evaluated two supervised learning approaches to detect cyberbullying: the KNIME tool and the Orange tool. The keywords were the same in both approaches, as they are on social media. Both methods produced positive evaluation outcomes. Regarding detecting cyberbullying, the Orange tool outperformed the KNIME tool. To address the issue of cyberbullying detection tasks, I suggest using the KNIME tool with raw data.

Regarding future work on cyberbullying detection, the following two directions are suggested for further investigation:
• Dataset expansion. A second version of the cyberbullying keyword dataset can be published by extracting extra samples and performing a labeling process (known verbal, suspicious verbal, and non-cyberbullying).
• Cyberbullying detection in audio files using the KNIME tool with the created cyberbullying keywords dataset.

The utilized cyberbullying detection approach will be used in the future on the Twitter platform with its open-source API.

REFERENCES

[1] J. Warren, "This Is How the Instagram Algorithm Works in 2022," Later.com, Jun. 21, 2022. https://later.com/blog/how-instagram-algorithm-works/.
[2] F. Chen and D. B. Neill, "Human Rights Event Detection from Heterogeneous Social Media Graphs," Big Data, vol. 3, no. 1, pp. 34–40, Mar. 2015, https://doi.org/10.1089/big.2014.0072.
[3] H. Mubarak, K. Darwish, and W. Magdy, "Abusive Language Detection on Arabic Social Media," in Proceedings of the First Workshop on Abusive Language Online, Vancouver, BC, Canada, Dec. 2017, pp. 52–56, https://doi.org/10.18653/v1/W17-3008.
[4] K. E Abdelfatah, G. Terejanu, and A. A Alhelbawy, "Unsupervised Detection of Violent Content in Arabic Social Media," in Computer Science & Information Technology (CS & IT), Mar. 2017, pp. 01–07, https://doi.org/10.5121/csit.2017.70401.
[5] L. Kaati, E. Omer, N. Prucha, and A. Shrestha, "Detecting Multipliers of Jihadism on Twitter," in 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Aug. 2015, pp. 954–960, https://doi.org/10.1109/ICDMW.2015.9.
[6] S. Alsafari, S. Sadaoui, and M.
Mouhoub, "Hate and offensive speech detection on Arabic social media," Online Social Networks and Media, vol. 19, Sep. 2020, Art. no. 100096, https://doi.org/10.1016/j.osnem.2020.100096.
[7] D. Mouheb, R. Ismail, S. A. Qaraghuli, Z. A. Aghbari, and I. Kamel, "Detection of Offensive Messages in Arabic Social Media Communications," in 2018 International Conference on Innovations in Information Technology (IIT), Aug. 2018, pp. 24–29, https://doi.org/10.1109/INNOVATIONS.2018.8606030.
[8] A. Sayed, "Data mining tool open source: Analytical evaluation study," Journal of Taibah University Arts and Humanities, vol. 5, no. 10, pp. 791–865, Jun. 2016, https://doi.org/10.12816/0032954.
[9] "Data Analytics Platform: Open Source Software Tools," KNIME. https://www.knime.com/knime-analytics-platform.
[10] "Orange Data Mining." https://orangedatamining.com/.
[11] S. R. Basha, J. K. Rani, and J. J. C. P. Yadav, "A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy," Engineering, Technology & Applied Science Research, vol. 9, no. 6, pp. 5001–5005, Dec. 2019, https://doi.org/10.48084/etasr.3173.
[12] S. R. Basha and J. K. Rani, "A Comparative Approach of Dimensionality Reduction Techniques in Text Classification," Engineering, Technology & Applied Science Research, vol. 9, no. 6, pp. 4974–4979, Dec. 2019, https://doi.org/10.48084/etasr.3146.
[13] M. Alghobiri, "A Comparative Analysis of Classification Algorithms on Diverse Datasets," Engineering, Technology & Applied Science Research, vol. 8, no. 2, pp. 2790–2795, Apr. 2018, https://doi.org/10.48084/etasr.1952.