191 THE EFFECT OF AMOUNT OF DATA ON RESULTS OF ACCURACY VALUE OF C4.5 ALGORITHM ON STUDENT ACHIEVEMENT INDEX DATA Anton Sunardi 1*), Sienny Rusli 2, Christina Juliane 3 Information System Master's Program STMIK LIKMI https://likmi.ac.id antonsunardi@gmail.com1*), siennyrusli@gmail.com2, Christina.juliane@likmi.ac.id3 (*) Corresponding Author Abstract Of the many academic data, data in the form of an achievement index needs to be used in-depth so that it does not become a display of numbers and information only. This achievement index evaluation data reflects the educational process students and teaching staff carries out in an educational process. This study aims to measure the accuracy of data mining processing based on differences in test data by analyzing the C4.5 algorithm using RapidMiner as a data processing tool and determining the decisions students can make and academic institutions in developing study strategies and educational curricula to be maximized. The data processing is carried out by classifying the student achievement index data at a private university using data analysis test equipment. The data source comes from kaggle.com, which consists of 1687 data that have been processed and processed. The conclusion from the results of this study is that the amount of data turns out to have a significant influence on the accuracy value of the C4.5 algorithm, where an accuracy rate of 91.69% is obtained from the test results of 1687 data with four main attributes, namely IPK1, IPK2, IPK3, IPK4 and correctly or not as a label. Keywords: the amount of data, C4.5, achievement index, data mining Abstrak Dari sekian banyak data akademik, data berupa Indeks Prestasi perlu dimanfaatkan secara mendalam agar tidak menjadi tampilan deret angka dan informasi saja. Data evaluasi Indeks Prestasi ini merupakan cerminan dari proses pendidikan yang dilakukan pelajar, mahasiswa, dan tenaga pengajar dalam suatu proses pendidikan. Penelitian ini bertujuan untuk mengukur tingkat akurasi pengolahan data mining berdasarkan perbedaan jumlah data uji dengan menganalisa algoritma C4.5 menggunakan RapidMiner sebagai alat bantu olah data, dan untuk mengetahui keputusan yang dapat diambil oleh pelajar, mahasiswa dan institusi akademis dalam menyusun strategi studi dan kurikulum pendidikan agar lebih maksimal. Proses olah data dilakukan dengan mengklasifikasikan data Indeks Prestasi mahasiswa pada sebuah perguruan tinggi swasta menggunakan alat uji analisis data. Sumber data berasal dari kaggle.com yang terdiri dari 1687 data yang telah diproses dan diolah. Kesimpulan dari hasil penelitian ini adalah jumlah data ternyata memiliki pengaruh signifikan terhadap nilai akurasi algoritma C4.5, dimana tingkat akurasi sebesar 91.69 % didapatkan dari hasil uji terhadap 1687 data dengan 4 atribut utama yaitu IPK1, IPK2, IPK3, IPK4 dengan Tepat atau tidak kelulusan sebagai label. Kata Kunci: jumlah data, C4.5, indeks prestasi, data mining INTRODUCTION From time to time, along with the rapid development of the world of data and education, a person is required to improve knowledge and skills to have a good and quality thinking pattern to plan strategies in the face of future competition. Performance evaluations are stored and collected by various universities in the form of student achievement index data that universities can use as one of the supports that the management of education organizers can use to determine their educational strategies. The way to use historical data is to process it using data mining methods, one of which is by applying data classification methods so that patterns and rules can produce helpful information to support students and educational institutions in developing learning strategies and educational strategies. mailto:antonsunardi@gmail.com1 mailto:siennyrusli@gmail.com2 mailto:Christina.juliane@likmi.ac.id3 192 In previous studies, the level of accuracy in processing student graduation data using several methods can also be tested using machine learning or deep learning techniques, as has been done by (Maryanto, 2017). From the explanation in the research, data processing requires a tool that can measure the level of accuracy. According to 2020 college statistics (Handini et al., 2020) The amount of student data recorded nationally compared to the number of students enrolled in the Ministry of Education and Culture has a significant difference. Therefore, research is needed to apply extensive data collection to support the industrial revolution as a breakthrough in rapid technological progress. Research that has been done previously by (Budiman & Ramadina, 2015) regarding predictions using the data mining classification algorithm conducted by (Windarti & Suradi, 2019) applies large amounts of data to data analysis concepts that can help readers, especially students and teaching staff and related agencies. Their field determines educational strategies by considering various aspects, especially the influence of GPA and the accuracy of the Student Achievement Index data results. This research is presupposed to provide benefits for students as an early reminder about the potential for untimely graduation so that students can develop a more effective study plan strategy. For academic institutions to foreknown provide information in the form of patterns and images that can be used to determine policies in minimizing the student's potential untimely graduation, which is not timely in the scientific field, it is hoped that this research can provide changes in data mining testing techniques. The classification method uses the C4.5 algorithm for varying amounts of data so that this research can be used as a reference for parties in need. RESEARCH METHODS The method used in data collection and reference consists of analytical techniques used to classify data and is done by selecting large amounts of data sourced from Kaggle.com, then sorting and cleaning steps to be arranged according to research needs, namely 1687 data divided into five research attributes. Data processing is carried out using the RapidMiner data processing tool to find C4.5 accuracy through careful calculations to find the best level of precision to produce data references to be implemented. The data is applied through the process of extracting patterns from data using training data of 80% and testing data of 20%, then conducted a comparison test of the accuracy rate of the C4.5 algorithm, with the number of data 100, 400, 900, 1250, 1450, and 1687 from a private university in Indonesia, referring to the Student Achievement Index. The research flow can be presented in figure 1 below: Figure 1. Research Flow Research Type According to (Chapman et al., 2000), CRISP- DM is a method that connects the background to be achieved to data usage and provides an overview of the data cycle. Research using similar techniques was conducted by (Sabna & Muhardi, 2016) regarding the data cycle of the data wealth of college students. This study uses six stages of the data processing process, namely: 1. Business Understanding The focus of this stage is more on understanding the objectives and requirements than turning this knowledge into the purpose of extracting data. 2. Data Understanding Data understanding begins with data collection, then getting to know the data, identifying data quality, looking for insights, and detecting groups of data parts that can generate hypotheses on confidential information. 3. Data Preparation (Amir & Abijono, 2018) It is posited that the stages in the preparation of data are processed in several locations and can be non-sequential as modelling tools are performed labelling, attribute selection, transformation, and data cleaning to build the final dataset. RapidMiner is a data processing tool used to find patterns, designs, knowledge, and evaluation of large amounts of data. It is an open- source learning machine that contains data tools for pre-processing, classification, rule, and association 193 so that it is easy to visualize stated. (Muis & Affandes, 2015) 4. Modelling This research refers to previous research by (Romadhona, Suprapedi, & Himawan, 2017). In his discussion of data modelling using the C4.5 algorithm decision tree, he explains that there is a higher level of accuracy when compared to ID3 and Chaid algorithms. Therefore, modelling is determined and adjusted to achieve optimal values at the modelling stage. At this stage, the authors tested the amount of data against the level of accuracy by comparing data < 1000, namely 100, 400, and 900, and data >1000 data, namely 1250, 1400, and 1687. (Hermawanti, Asriyanik, & Sunarto, 2019) found an accuracy rate of 68.42% using 145 test data, comparing < 1000 data i.e. 100, 400, 900 and data > 1000 data namely 1250, 1400 and 1687. This study used training data of 80% test data of 20% to produce optimal accuracy values referring to discussions done previously by (Musu, Ibrahim, & Heriadi, 2021) 5. Evaluation At the evaluation stage, an assessment of the data that has been generated from accuracy values based on Confusion Matrix, under curve area values (AUC) and execution time (ET). (Olson & Shi, 2007) his book entitled Introduction to the Science of Business Data Excavation explains that a confusion matrix produces four types of classifications, namely true Positive (TP), true negative (TN), false positive (FP), and false-negative (FN). The formulation of the confusion matrix in table form is described in table 1 below: Table 1. Confusion Matrix Table (Olson and Shi 2007) True Value TRUE FALSE Predicted Value True TP (True Positive) Correct result FP (False Positive) Unexpected result False FN (False Negative) Missing result TN (True Negative) Correct absence of result Information: True Positive (TP): The amount of positive and predicted data to be accurate as Positive. False Positive (FP): The amount of harmful data but predicted as Positive. False Negative (FN): The amount of data that is positive but predicted as Negative. True Negative (TN): The amount of negative and predicted data to be accurate as Negative. Referring to research conducted by (Azhari, Situmorang, & Rosnelly, 2021), it is mentioned that precision is the accuracy of getting information from accurate positive and negative class data. In addition to the accuracy value, there is also a value to recall specific information obtained from the recall value. The comparison value in testing against research data is obtained from the system's predictive value and the tester's prediction value called accuracy. Once the results of the Confusion Matrix calculation are obtained, estimations for precision, recall, and accuracy values can be calculated as in the measures in Formula 1 below: Precision = 𝑇𝑃 𝑇𝑃+𝐹𝑃 ............................................................... (1) Recall = 𝑇𝑃 𝑇𝑃+𝐹𝑁 .................................................................... (2) Accuracy = 𝑇𝑃+𝑇𝑁 𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁 .............................................. (3) Formula1. Table Precision, Recall, and Accuracy (Olson & Shi, 2007) Formula with the following description: TP: True Positive FP: False Positive TN: True Negative FN: False Negative 6. Deployment According to (Saefulloh & Moedjiono, 2013), after the formation of the model, further analysis and measurements are carried out at the previous stage. At this stage, the most accurate model or rule is applied to predict timely graduation and can then be used to evaluate new data. The concept of data deployment refers to the application of a model to predict the accuracy results of 91.69% of 1687 data that students, educators, and readers can use to monitor data processing strategy plans with the implementation of data mining if the purpose of the model is to increase knowledge about data, knowledge. What is obtained needs to be arranged and presented so that students, educators, and related institutions can use it? The implementation stage produces research reports that can be used for repetitive data mining. Students or students, educators, and educational institutions that carry out deployment steps need to be understood to redevelop necessary knowledge so that the results have use value. 194 Procedure This research was conducted by referring to various literature studies in the form of research results in journals and books seized according to related research needs as a reference for writing. The steps of the data excavation and processing process procedure using RapidMiner that we do can be presented in Figure 2 below: Figure 2. Process of processing data, methods, knowledge, evaluation using RapidMiner (author source) RESULTS AND DISCUSSIONS (Megna, 2021) In his research in 2021, he believed that large amounts of data are enormous data sets in volume but grow exponentially with the time that traditional data does not have, so it requires management tools that can store or process it to be efficient. Based on this, the diversity of data formats that are increasingly complex and grow over time requires data governance using techniques and technology. Here are the results obtained based on the crisp-dm methodology theory 1. Business Understanding The purpose of the study was to determine the amount of data in the dataset that affects the degree of accuracy in the decision tree classification method with the C4.5 algorithm. 2. Data Understanding The student achievement index dataset is in the form of public data taken from kaggle.com, with 1687 data and four attributes containing student achievement index values and 1 point of appropriate or inappropriate graduation used as a label. The initial data is presented in table 2 and table 3 below: Table 2. Initial dataset table No SAI1 SAI2 SAI3 SAI4 Yes / No 1 2.30 1.97 1.80 1.56 No 2 1.81 1.68 1.57 1.86 No 3 3.07 3.00 2.75 3.21 No 4 2.71 2.33 2.61 1.98 No 5 3.17 3.02 3.28 2.96 No 6 3.16 3.45 3.02 3.06 No 7 2.72 2.50 2.92 3.00 No . . . . . . 1687 3.18 3.05 3.05 3.27 Yes Table 3. Dataset description table Attribute Name Information SAI1 Student Achievement Index 1 SAI2 Student Achievement Index 2 SAI3 Student Achievement Index 3 SAI4 Student Achievement Index 4 Yes / No Right or Not Graduation 3. Data Preparation In the preparatory stage, the dataset was adjusted to the data mining processing process using RapidMiner, and the decision tree classification method was carried out with the C4.5 algorithm and then obtained 1687 data with four attributes. Student Achievement Index and one attribute are appropriate or not timely as labels as presented in table 4 below: Table 4. Pre Processing data table No SAI1 SAI2 SAI3 SAI4 Yes / No 1 2.30 1.97 1.80 1.56 No 2 1.81 1.68 1.57 1.86 No 3 3.07 3.00 2.75 3.21 No 4 2.71 2.33 2.61 1.98 No 5 3.17 3.02 3.28 2.96 No 6 3.16 3.45 3.02 3.06 No 7 2.72 2.50 2.92 3.00 No . . . . . . 1687 3.18 3.05 3.05 3.27 Yes 195 According to (Dengen, Kusrini, & Luthfi, 2020) The calculation steps in the C4.5 algorithm decision tree using manual calculation methods are as follows: The calculation steps in the C4.5 algorithm decision tree using manual calculation methods are as follows: 1. Prepare dataset samples 2. Calculate the entropy value using the formula:: Entropy (S) = ∑ − 𝑝𝑖 ∗ log 2 𝑝𝑖𝑛𝑖=1 ............................ (4) Information: S: set N: number of partitions S Pi: proportion of Si to SS 3. Calculate the gain value of each attribute, followed by selecting the highest gain value Gain (S,A) = Entropy(S) - ∑ ∗ Entropy(Si)ni=1 ...... (5) Information: S: Set A: Attributes n: Number of attribute partitions A | Si |: The amount of data on the ith partition | S |: Number of cases in S 4. Modelling This stage progressed by testing <1000 data with 100, 400, and 900 samples and testing data >1000 with examples taken from 1250, 1400, and 1687. This study processed the composition of training data by 80% and testing data by 20%. Using Rapid Miner, data testing was executed to get performance vector results through the presentation in Table 5 and Table 6. Table 5. In performance, Vector data < 1000 obtained calculation results for 100 data with an accuracy result of 70% with confusion matrix on prediction "Yes" and prediction result "Yes" of 1, and on prediction "Yes" and prediction result "No" by 5. The results of the Performance Vector 400 data calculation were obtained with an accuracy result of 85% with confusion matrix on prediction "Yes" and prediction result "Yes" of 0, and on projection "Yes" and prediction result "No" of 5. Performance Vector calculation results from 900 data obtained an accuracy result of 86.11% with Confusion Matrix on prediction "Yes" and prediction result "Yes" of 1, and on prediction "Yes" and prediction result "No" of 12. Table 5. RapidMiner Vector Performance Table (Data < 1000) Data < 1000 Amount of Data Performance Vector 100 Accuracy 70.00 % Confusion Matrix True No Yes No 1 5 Yes 1 13 400 Accuracy 85.00 % Confusion Matrix True No Yes No 0 5 Yes 7 68 900 Accuracy 86.11 % Confusion Matrix True No Yes No 1 12 Yes 13 154 Table 6. RapidMiner Vector Performance Table (Data > 1000) Data > 1000 Amount of Data Performance Vector 1250 Accuracy 86.60 % Confusion Matrix True No Yes No 0 6 Yes 20 224 1400 Accuracy 90.36 % Confusion Matrix True No Yes No 0 3 Yes 24 253 1687 Accuracy 91.69 % Confusion Matrix True No Yes No 2 3 Yes 25 307 Table 6. In performance, Vector data >1000 obtained calculation results for 1250 data with an accuracy result of 86.6% with Confusion Matrix on prediction "Yes" and prediction result "Yes" of 0, and on prediction "Yes" and prediction result "No" of 6. The results of the Performance Vector 1400 data calculation were obtained an accuracy result of 90.36% with confusion matrix on prediction "Yes" and prediction result "Yes" of 0, and on prediction "Yes" and prediction result "No" of 3. Performance Vector 1687 data obtained an accuracy result of 91.69% with Confusion Matrix on the forecast "Yes" and the prediction result "Yes" of 196 2, and on the prediction "Yes" and the development of the prophecy. "No" of 3. 5. Evaluation The evaluation is seen from the results of calculating accuracy values based on confusion matrix using RapidMiner obtained results as table 7 below: Table 7. C4.5 algorithm data evaluation table Amount of data Accuracy Algorithm of C4.5 100 70.00 % True no True Yes Class Precision Prediction No 1 5 16.67 % Prediction Yes 1 13 92.86 % Class Recall 50.00 % 72.22 % 400 85.00 % True no True Yes Class Precision Prediction No 0 5 0.00 % Prediction Yes 7 68 90.67 % Class Recall 0.00 % 93.15 % 900 86.11 % True no True Yes Class Precision Prediction No 1 12 7.69 % Prediction Yes 13 154 92.22 % Class Recall 7.14 % 92.77 % 1250 89.60 % True no True Yes Class Precision Prediction No 0 6 0.00 % Prediction Yes 20 224 91.80 % Class Recall 0.00 % 97.39 % 1400 90.36 % True no True Yes Class Precision Prediction No 0 3 0.00 % Prediction Yes 24 253 91.34 % Class Recall 0.00 % 98.83 % 1687 91.69 % True no True Yes Class Precision Prediction No 2 3 40.00 % Prediction Yes 25 307 92.47 % Class Recall 7.41 % 99.03 % Table 7 shows the percentage value of C4.5 algorithm calculation accuracy against the number of data > 1000, i.e., 1250, 1400, and 1687 and the amount of information < 1000, i.e., 100, 400, and 900. The C4.5 algorithm showed higher accuracy in > 1000 data, namely 1250, 1400, and 1687. It indicates that data processing using RapidMiner from several data samples has a high degree of accuracy on data amounting to > 1000. The highest accuracy obtained is 91.69%, with 1687 data. 6. Deployment The results of this research are enclosed conceive writing comprise information on differences in the results of calculating the accuracy value of the C4.5 algorithm, which is affected by the amount of data. This paper can be functioned as an information and reference for research by using the following data classification methods. CONCLUSIONS AND SUGGESTIONS Conclusion The study concluded that the Achievement Index data processing test using RapidMiner against <1000 data showed a lower graduation accuracy rate than the graduation accuracy rate with the number of data > 1000. The difference in the amount of data tested causes a difference in the results of the accuracy value. The highest accuracy percentage is shown in data 1687 at 91.69%, while the test data amounting to 100 produced the lowest accuracy value of 70%. 197 Suggestion To obtain the outcome of the achievement index grouping, appropriate data support is needed. To aim for a more accurate potential value in the C4.5 algorithm technique, the suitability of the type of data and the amount of data is essential because it has a significant effect. The more data tested, the accuracy level will be determined, therefore it is concluded that the amount of data influences the result of processing the accuracy value. REFERENCE Amir, S., & Abijono, H. (2018). Penerapan Data Mining untuk Mendukung Pemasaran Produk. CAHAYAtech, 7(2), 161–182. Retrieved from https://ojs.cahayasurya.ac.id/index.php/CT/ article/view/102 Azhari, M., Situmorang, Z., & Rosnelly, R. (2021). Perbandingan Akurasi, Recall, dan Presisi Klasifikasi pada Algoritma C4. 5, Random Forest, SVM dan Naive Bayes. JURNAL MEDIA INFORMATIKA BUDIDARMA, 5(2), 640–651. https://doi.org/10.30865/mib.v5i2.2937 Budiman, I., & Ramadina, R. (2015). Penerapan Fungsi Data Mining Klasifikasi untuk Prediksi Masa Studi Mahasiswa Tepat Waktu pada Sistem Informasi Akademik Perguruan Tinggi. JUPITER (Jurnal Penelitian Ilmu Dan Teknologi Komputer), 7(1), 39–50. Retrieved from https://jurnal.polsri.ac.id/index.php/jupiter/ article/view/709 Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T. P., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. In SPSS inc. CRISP-DM consortium. Retrieved from https://www.kde.cs.uni- kassel.de/wp- content/uploads/lehre/ws2012- 13/kdd/files/CRISPWP-0800.pdf Dengen, C. N., Kusrini, K., & Luthfi, E. T. (2020). Implementasi Decision Tree Untuk Prediksi Kelulusan Mahasiswa Tepat Waktu. SISFOTENIKA, 10(1), 1–11. Retrieved from https://www.stmikpontianak.ac.id/ojs/index .php/ST/article/view/484 Handini, D., Hidayat, F., Attamimi, A. N. R., Putri, D. A. V., Rouf, M. F., & Anjani, N. R. (2020). Statistik Pendidikan Tinggi Tahun 2020. Jakarta: Sekretaris Direktorat Jenderal Pendidikan Tinggi. Retrieved from Sekretaris Direktorat Jenderal Pendidikan Tinggi website: https://pddikti.kemdikbud.go.id/asset/data/ publikasi/Statistik Pendidikan Tinggi 2020.pdf Hermawanti, S. N., Asriyanik, A., & Sunarto, A. A. (2019). Implementasi Algoritma C4.5 untuk Prediksi Kelulusan Tepat Waktu ( Studi Kasus : Program Studi Teknik Informatika ). Jurnal Ilmiah SANTIKA, 9(1), 853–864. https://doi.org/10.37150/jsa.v9i1.552 Maryanto, B. (2017). Big Data dan Pemanfaatannya dalam Berbagai Sektor. Media Informatika, 16(2), 14-19. Retrieved from https://jurnal.likmi.ac.id/Jurnal/7_2017/071 7_02_BudiMaryanto.pdf Megna, A. A. K. (2021). Big Data: Development of Revolutionary technologies in Business. Istanbul. Muis, I. A., & Affandes, M. (2015). Penerapan Metode Support Vector Machine (SVM) Menggunakan Kernel Radial Basis Function (RBF) Pada Klasifikasi Tweet. Jurnal Sains, Teknologi Dan Industri, 12(2), 189–197. Retrieved from http://ejournal.uin- suska.ac.id/index.php/sitekin/article/view/1 010 Musu, W., Ibrahim, A., & Heriadi, H. (2021). Pengaruh Komposisi Data Training dan Testing terhadap Akurasi Algoritma C4.5. Seminar Sistem Informasi Dan Teknologi Informasi (SISITI), 186–195. Makasar: STMIK Dipanegara Makassar. Retrieved from https://www.ejurnal.dipanegara.ac.id/index. php/sisiti/article/view/802 Olson, D., & Shi, Y. (2007). Pengantar Ilmu Penggalian Data Bisnis - Introduction to Business Data Mining. Jakarta: Salemba Empat. Romadhona, A., Suprapedi, S., & Himawan, H. (2017). Prediksi Kelulusan Mahasiswa Tepat Waktu Berdasarkan Usia, Jenis Kelamin, Dan Indeks Prestasi Menggunakan Algoritma Decision Tree. Jurnal Teknologi Informasi CyberKU, 13(1), 69–83. Retrieved from http://research.pps.dinus.ac.id/index.php/Cy berku/article/view/10 Sabna, E., & Muhardi, M. (2016). Penerapan Data Mining Untuk Memprediksi Prestasi Akademik Mahasiswa Berdasarkan Dosen, Motivasi, Kedisiplinan, Ekonomi, dan Hasil Belajar. Jurnal CoreIT: Jurnal Hasil Penelitian Ilmu Komputer Dan Teknologi Informasi, 2(2), 41. https://doi.org/10.24014/coreit.v2i2.2392 Saefulloh, A., & Moedjiono, M. (2013). Penerapan Metode Klasifikasi Data Mining Untuk Prediksi Kelulusan Tepat Waktu. InfoSys 198 Journal, 2(1), 41–54. Windarti, M., & Suradi, A. (2019). Perbandingan Kinerja 6 Algoritme Klasifikasi Data Mining untuk Prediksi Masa Studi Mahasiswa. Telematika, 12(1), 14–30. Retrieved from https://ejournal.amikompurwokerto.ac.id/in dex.php/telematika/article/view/778