Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602, Vol 1, No 1, January 2018, pp. 20–25, eISSN 2597-4637, https://doi.org/10.17977/um017v1i12018p20-25
©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Market Basket Analysis to Identify Customer Behaviors by Way of Transaction Data

Fachrul Kurniawan a,1,*, Binti Umayah a, Jihad Hammad c, Supeno Mardi Susiki Nugroho b, Mochammad Hariadi b
a Dept. of Informatics Eng., Maulana Malik Ibrahim State Islamic Univ., Jl. Gajayana No. 50, Malang 65144, Indonesia
b Dept. of Electrical Engineering, Institut Teknologi Sepuluh Nopember, Jl. Raya ITS, Surabaya 60111, Indonesia
c ICT Faculty, Al-Quds Open University, Beit Jalla-The Main Road-Khallat Al Badd, Bethlehem, Palestine
1 fachrulk@ti.uin-malang.ac.id*
* corresponding author

I. Introduction

Consumer behavior comprises the activities a consumer performs when deciding to purchase, use, and consume goods and services, together with the factors that give rise to the decision to purchase and use a product. Every customer has different needs and inclinations, and behaves differently in fulfilling them. Despite these differences, customers share some similarities; one is the desire to maximize satisfaction when consuming a necessary product or service. From this consumption activity, the behavior, pattern, or habit that customers follow in fulfilling their needs and desires can be inferred. That behavior can be identified through the logging carried out by the intermediary provider of consumer needs (the supermarket). Such logging is performed to meet the requirements of documentation and of identifying historical data on transaction activity.
In recent years, transaction data have commonly been used as research and analysis objects. In this study, too, transaction data are re-processed and re-explored to generate more valuable information, for instance, information on which item sells best, which can then be used to plan stock additions for that item. Moreover, transaction data reveal the relations among the items in a customer's basket, and this information can be exploited for effective product display and assortment to attract customers' interest. The application commonly used to analyze customers' shopping-basket transaction data is market basket analysis, a data mining technique widely employed to analyze the items in one or more shopping baskets that a customer holds at a particular moment [1].

ARTICLE INFO
Article history: Received 30 August 2017; Revised 13 September 2017; Accepted 1 October 2017; Published online 8 January 2018

ABSTRACT
Transaction data are records of the sales-purchase activities of a particular company. In recent years, transaction data have been widely used as research objects as a means of discovering new information. One possible approach is to design an application that analyzes the existing transaction data, in this case a market basket analysis application. The application is desktop-based, with components able to process and re-log the existing transaction data, and it was designed by following the standard steps of the data mining process. The trial results show that the development and implementation of a market basket analysis application using the association rule method with the Apriori algorithm works well,
with a confidence value of 46.69%, a support value of 1.78%, and 30 generated rules. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).
Keywords: Data mining; Market basket analysis; Association rule; Apriori

A market basket analysis application ought to be designed and implemented at a supermarket not only because it can support the design of sales promotions, but also because it can serve as a reference for re-managing incoming and outgoing item stock in the warehouse. In this study, the market basket analysis application is implemented at the BC UIN Malang supermarket, which has so far been unable to exploit its transaction data. The application is expected to work well and to generate the desired results.

A. Market Basket Analysis

Market basket analysis is an analysis of customer behavior while shopping at a supermarket, performed by identifying associations and connections among the various items that customers place in their shopping baskets [2]. Specifically, market basket analysis aims to identify the items most frequently purchased together. Here, an item denotes any of the kinds of products in the supermarket. Using market basket analysis, one can learn which items customers often purchase simultaneously and which therefore have an opportunity to be promoted.
The objective of market basket analysis is to determine which products customers purchase at the same time; the name of the method comes from the customers' behavior of placing products into their shopping baskets or onto their shopping lists. Identifying a customer's shopping-basket pattern can significantly help a company use that information for business-strategy needs; one example is placing the products most frequently purchased together into one specific area.

B. Association Rule

An association rule relates to statements of the form "what goes with what", for example, statements about the transactions carried out by customers at a supermarket. Such statements are closely tied to the study of customer transaction databases to determine which products are habitually purchased together; for this reason, association rule mining is frequently referred to as market basket analysis [1], [2]. The significance of an association rule is captured by two parameters, support and confidence. Support is the percentage of transactions containing a given combination of product items in the database, while confidence is a value that measures the strength of the inter-item relationship in an association rule.

S = Σ(Ta + Tc) / Σ(T)  (1)

Support(A) = (the number of transactions that contain A) / (total transactions)  (2)

where S is the support, Σ(Ta + Tc) is the number of transactions that contain both the antecedent and the consequent, and Σ(T) is the total number of transactions.

C = Σ(Ta + Tc) / Σ(Ta)  (3)

Confidence(A → B) = P(B | A) = (the number of transactions that contain both A and B) / (the number of transactions that contain A)  (4)

where C is the confidence, Σ(Ta + Tc) is the number of transactions that contain both the antecedent and the consequent, and Σ(Ta) is the number of transactions that contain the antecedent.
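As a concrete check of Eqs. (1)–(4), the two measures can be sketched in a few lines of Python. The baskets and item names below are illustrative toy data, not taken from the paper's data set.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset` (Eqs. 1-2)."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(A and C) / support(A) (Eqs. 3-4)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Toy baskets, purely illustrative (not the BC UIN Malang data)
baskets = [
    {"bread", "milk"},
    {"bread", "tea"},
    {"milk", "tea"},
    {"bread", "milk", "tea"},
]
print(round(support(baskets, {"bread", "milk"}), 2))       # 0.5
print(round(confidence(baskets, {"bread"}, {"milk"}), 2))  # 0.67
```

Here "bread" appears with "milk" in 2 of 4 baskets (support 0.5), and in 2 of the 3 baskets containing "bread" (confidence 0.67), matching a hand calculation of the formulas.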
C. Customer Behavior

Customer behavior is defined as a dynamic interaction among cognition, affect, behavior, and the environment, through which people carry out exchange activities on a regular basis [3]. From this statement, three significant points follow:
1. Customer behavior is dynamic and therefore hard to predict.
2. It involves interaction: cognition, affect, behavior, and the events occurring around customers.
3. It involves exchange, such as the exchange of goods and money between merchant and customer.
Four factors can give rise to customer purchases while shopping [4]:
1. Cultural factors
2. Social factors
3. Personal factors
4. Psychological factors
Three variables must be considered in understanding customer behavior: the stimulus variable, the response variable, and the intervening variable [5].

D. Data Mining

Data mining has garnered the interest of the information industry and the public in recent years, driven by the massive availability of data and the need to turn such data into usable knowledge [6], [7]. Data mining is commonly described as the retrieval of patterns from a large set of raw data so that knowledge hidden in the data can be discovered; it is considered the major step in the process of knowledge discovery in databases. Data mining inherits many aspects and techniques from established fields of science and has long roots in fields such as artificial intelligence, machine learning, statistics, databases, and information retrieval.

II. Methods

Within the association rule approach, the algorithm implemented in the market basket analysis application is the Apriori algorithm. The algorithm first builds frequent itemsets from single items, counting the support value of every item.
Items whose support value exceeds the minimum support are selected as the 1-itemset high-frequency pattern and as candidates for 2-itemsets. From these 1-itemsets, frequent 2-itemsets are built recursively, after which the confidence value of each is calculated.

III. Results and Discussion

The application was implemented at the private supermarket of the UIN Malang Business Center by inputting the collected transaction data, namely the transactions recorded at BC UIN Malang on October 1st, 2014: 1,553 records covering stored transaction receipts numbered 890753 through 891319. The next step was inputting the minimum support and the minimum confidence. As an illustration, suppose the user inserts the transaction data with receipts from 890753 up to 890853. From these receipts, 56 transaction records with 20 items could be retrieved; the 56 records are the transactions containing at least 2 items in a single transaction. The system trial was carried out by running the market basket analysis application's association rule calculation using the Apriori algorithm. This section discusses the tests carried out on the system and evaluates the results it returned; the tests show how the patterns in the generated data differ. The test was performed on the 56 records by entering the initial values:
 minimum support = 3
 minimum confidence = 10
Fig. 2 shows the data used for the test: the receipts between 890753 and 890853, comprising 56 transaction records sorted by receipt number. The support value of 1 item is obtained through the following formula:

Support(A) = (the number of transactions that contain A) / (total transactions)  (5)
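Stepping back to the itemset-generation procedure of the Methods section, the 1-itemset/2-itemset pass can be sketched as follows. This is a minimal illustration in Python with hypothetical toy baskets; note that with the same minimum support count for both passes, no 2-itemset survives in this toy data, which mirrors the observation made later about lowering the 2-itemset threshold.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support_count):
    """One Apriori pass as described: count 1-itemsets, keep those meeting the
    minimum support count, then form and count 2-itemset candidates."""
    # Count single items
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    l1 = {item for item, c in counts.items() if c >= min_support_count}

    # Candidate 2-itemsets come only from frequent 1-itemsets (Apriori property)
    pair_counts = {}
    for a, b in combinations(sorted(l1), 2):
        pair_counts[(a, b)] = sum(1 for t in transactions if a in t and b in t)
    l2 = {pair: c for pair, c in pair_counts.items() if c >= min_support_count}
    return l1, l2

baskets = [{"bread", "milk"}, {"bread", "tea"}, {"milk", "tea"},
           {"bread", "milk", "tea"}, {"bread"}]
l1, l2 = frequent_itemsets(baskets, 3)
print(sorted(l1))  # ['bread', 'milk', 'tea']
print(l2)          # {} -- no pair reaches the same threshold
```

All three items occur at least 3 times, but every pair occurs only twice, so the 2-itemset pass returns nothing unless its minimum support count is lowered.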
while the support value for 2 items is obtained with the following formula:

Support(A, B) = (the number of transactions that contain both A and B) / (total transactions)  (6)

The confidence value is generated by the following formula:

Confidence(A → B) = P(B | A) = (the number of transactions that contain both A and B) / (the number of transactions that contain A)  (7)

Fig. 1. Apriori algorithm design
Fig. 2. Application implementation

In calculating the support of 1-itemsets versus 2-itemsets, different minimum support values were entered. The minimum support for the 2-itemset calculation was lowered because, had it been kept the same, there would have been no 2-itemset candidates and the generated output would have been constrained to 1-itemsets only. Hence, in the tests of this study, the researchers deliberately entered a minimum support value for the 2-itemset calculation different from that used for the 1-itemset calculation, so as to generate the desired recommendation outcome.

Fig. 3. The calculation result of 1-item support values
Fig. 4. Confidence value calculation result

IV. Conclusions

From the tests and analysis that have been carried out, it is concluded that the development and implementation of market basket analysis using the association rule method with the Apriori algorithm on the transaction data of the Business Center (BC) UIN Malang supermarket performs well, with an average confidence value of 46.69% at a support value of 1.78% and 30 generated rules. The retrieved transaction rules/patterns had a low association tendency, because the available data did not strongly support association analysis of the relations between items.

References

[1] M. Kaur and S.
Kang, “Market basket analysis: Identify the changing trends of market data using association rule mining,” Procedia Comput. Sci., vol. 85, pp. 78–85, 2016.
[2] A. Mansur and T. Kuncoro, “Product inventory predictions at small medium enterprise using market basket analysis approach–neural networks,” Procedia Econ. Financ., vol. 4, pp. 312–320, 2012.
[3] X. Su, “Intertemporal pricing with strategic customer behavior,” Manage. Sci., vol. 53, no. 5, pp. 726–741, 2007.
[4] G. Armstrong, S. Adam, S. Denize, and P. Kotler, Principles of Marketing. Pearson Australia, 2014.
[5] E. Sherman, A. Mathur, and R. B. Smith, “Store environment and consumer purchase behavior: Mediating role of consumer emotions,” Psychol. Mark., vol. 14, no. 4, pp. 361–378, 1997.
[6] N. Jothi, N. A. Rashid, and W. Husain, “Data mining in healthcare – a review,” Procedia Comput. Sci., vol. 72, pp. 306–313, 2015.
[7] A. Bertoni and T. Larsson, “Data mining in product service systems design: Literature review and research questions,” Procedia CIRP, vol. 64, pp. 306–311, 2017.

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602, Vol 1, No 1, January 2018, pp.
26–32, eISSN 2597-4637, https://doi.org/10.17977/um018v1i12017p26-32
©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Capital Letter Pattern Recognition in Text to Speech by Way of Perceptron Algorithm

Novan Wijaya a,1,*
a Management Informatics Study Prog., AMIK Multi Data Palembang, Jl. Rajawali No. 14, Palembang 30113, Indonesia
1 novan.wijaya@mdp.ac.id*
* corresponding author

I. Introduction

Computer vision transforms data retrieved from a webcam into another form as a means of determining a future decision; all such transformations are carried out to achieve particular objectives [1]. Computer vision involves several operations, starting from capturing an object image with a camera, processing the object image into a more efficient and simpler form without omitting the representative information of that object, and finally analyzing it so the system can determine the action to be taken [2]. One application that can be developed from computer vision is capital letter pattern recognition. The fundamental concept is that an image is captured by a webcam and processed into a digital image, after which the captured image is analyzed to decide which letter it represents. Object image processing in computer vision can employ digital image processing concepts [3]; these techniques are used here to improve the quality of the images of capital letters captured by the webcam.
Once the computer has obtained a good digital image containing the required information, a pattern recognition technique is needed so that the computer can decide which letter the captured image represents. The method used here for pattern recognition is the artificial neural network (ANN). Artificial neural networks have the ability to learn to solve rather complicated problems, because the knowledge in an ANN is not programmed but acquired through a training process. In this work, the artificial neural network is trained using the perceptron algorithm [4]. The perceptron algorithm is an artificial neural network used to classify whether an input pattern belongs to a class; it can also be used to decide which class a pattern belongs to by comparing the pattern against each class. The perceptron is a single-layer learning algorithm that repeats several processing steps until it obtains the right neural weights [5].

ARTICLE INFO
Article history: Received 16 August 2017; Revised 25 September 2017; Accepted 1 November 2017; Published online 8 January 2018

ABSTRACT
Computer vision transforms data retrieved or generated from a webcam into another form as a means of decision making; all such transformations are carried out to attain specific aims. One of the supporting techniques in implementing computer vision in a system is digital image processing, whose objective is to transform a digitally formatted picture so that it can be processed by a computer. Computer vision and digital image processing are implemented here in a system for capital letter recognition and real-time reading of handwriting on a whiteboard, supported by an artificial neural network method, the perceptron algorithm, used as the technique through which the system learns and recognizes the letters.
The system captures a letter pattern using a webcam, producing a continuous image that is transformed into digital form and processed using several techniques such as grayscaling, thresholding, and image cropping. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).
Keywords: Computer vision; Digital image; Perceptron algorithm

Pattern recognition systems and their reading counterparts (text to speech) keep evolving to date, using various methods on both the pattern recognition side and the decision-making side. Text to speech is a system capable of converting text into speech or sound. The pattern recognition here utilizes several digital image processing techniques in order to obtain image information that accurately matches the system requirements without compromising the important information contained in the image [6]. On the pattern recognition side, the perceptron algorithm is used, with a maximum-based approach as the decision-making mode. This paper discusses the use of the perceptron algorithm for capital letter pattern recognition in a text to speech system.

II. Methods

Real-time letter recognition using the perceptron algorithm is a development of a pattern recognition system created by previous researchers [7]. The pattern recognition system tries to recognize handwriting, and a feature is added that reads aloud the letter patterns recognized by the system. The pattern recognition uses the webcam as a sensor for capturing images of the letters written on the whiteboard medium.
The image captured by the webcam is processed by the laptop, and finally the laptop outputs the sound of the letter patterns read by the system. The working principle of the whole system is as follows. First, the webcam captures the image of a letter on the whiteboard. The image is processed to retrieve the information required by the letter pattern recognition system: the pixel dimensions of the image and the binary value contained in each pixel. To obtain the binary value of each pixel, digital image processing techniques are used, namely grayscaling, binarization (thresholding), median filtering, and image cropping [8]. Once the desired image information is obtained, that is, the binary value 0 or 1 at each pixel, those values become the inputs of the perceptron algorithm as the pattern recognition method. The concept of the perceptron algorithm is to work out the existing patterns according to certain rules until the system finally generates a distinctive trait for each trained pattern; with these distinctive features the system can distinguish the existing letter patterns [4]. Image segmentation is a technique for separating the required objects from the background so that the objects in the image can easily be analyzed for pattern recognition. One of the simplest segmentation techniques is image thresholding [9]. Thresholding separates the image into two regions, the object area and the background area; the object area can be set to white while the background is set to black, or vice versa [10]. For the thresholding itself, the Otsu method is used. The approach adopted by the Otsu method is to conduct a discriminant analysis, determining a variable that can distinguish between two or more naturally occurring groups [10]; the discriminant analysis maximizes that variable in order to divide the foreground object from the background.
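As an illustration of the Otsu approach just described, the following Python sketch (assuming NumPy is available) picks the threshold that maximizes the between-class variance over the 0–255 histogram and then binarizes a synthetic image. The image, function name, and values are fabricated for the demonstration, not taken from the paper's implementation.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the gray level that maximizes the between-class
    variance of the histogram, separating object from background."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)  # total intensity
    best_t, best_var = 0, -1.0
    w_b = 0      # background pixel count (levels <= t)
    sum_b = 0.0  # background intensity sum
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic image: a dark 10x10 "stroke" on a bright background
img = np.full((20, 20), 220, dtype=np.uint8)
img[5:15, 5:15] = 30
t = otsu_threshold(img)
binary = (img > t).astype(np.uint8)  # background 1, object 0
print(t, int(binary.sum()))          # 30 300
```

With two well-separated gray levels, the maximum between-class variance is first reached at the darker level, so the threshold cleanly splits the 100 object pixels from the 300 background pixels.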
The result of this thresholding process is a binary image that has only two gray levels, black and white.

III. Results and Discussion

Software design plays a substantial role in building the capital letter pattern recognition system. Fig. 1 shows the general process of the developed system, while Fig. 2 shows the system construction diagram. Fig. 3 shows an example of a capital letter used as system input. The letter goes through five image segmentation steps, as depicted in Fig. 4. The first stage, digital imaging, captures the picture of the letter on the whiteboard with the webcam and renders it as a real-time image.

Fig. 1. Thresholding technique image segmentation process

The next step is image segmentation: the captured real-time image is processed to obtain the necessary information via several operations, namely grayscaling, thresholding, image cropping, and image resizing. A grayscale image holds, in each pixel, a gradation between white and black; the result of this stage is depicted in Fig. 5. Fig. 6 shows the result of the following process, thresholding. A binary image is a digital image whose pixels take only the two values 0 (black)

Fig. 2. System establishment flow diagram
Fig. 3. A letter sample RGB real-time image

and 1 (white), or the other way around. In digital image processing, a binary image can only be obtained by converting a grayscale image to a binary image using a threshold technique. Afterwards, the image that has undergone these stages is cropped to isolate the precise information and ease the succeeding stages. Initially, the image resolution used is 640 x 480 pixels; after cropping and resizing, the resolution becomes 20 x 20 pixels.
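The reduction of a cropped binary letter image to the 20 x 20 input grid can be sketched as a nearest-neighbor resize followed by flattening. This is a minimal Python/NumPy illustration; the function name and test image are hypothetical, not from the paper's implementation.

```python
import numpy as np

def to_input_vector(binary_img, size=20):
    """Nearest-neighbor resize of a binary letter image to size x size,
    flattened into a 0/1 vector (20 x 20 -> 400 perceptron inputs)."""
    h, w = binary_img.shape
    rows = (np.arange(size) * h) // size  # source row for each output row
    cols = (np.arange(size) * w) // size  # source column for each output column
    resized = binary_img[np.ix_(rows, cols)]
    return resized.flatten()

# Hypothetical cropped letter image, 40 x 60, with a filled stroke region
img = np.zeros((40, 60), dtype=np.uint8)
img[10:30, 20:40] = 1
vec = to_input_vector(img)
print(vec.shape)  # (400,)
```

Whatever the cropped size, the output is always a 400-element 0/1 vector, matching the fixed input width the perceptron expects.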
That size is chosen to ease the training input of the perceptron algorithm. Fig. 7 shows the result of the image cropping process. After cropping, the image is resized to 20 x 20 pixels (Fig. 8). Resizing is a technique that reduces the size and resolution of an image without removing its specific information. The information retrieved from the resized image is then processed using the perceptron algorithm, the artificial neural network method employed for the pattern recognition and learning processes.

Fig. 4. The process of image segmentation
Fig. 5. Grayscale image from a real-time image and its grayscale values
Fig. 6. Threshold image of letter A and its threshold values

An artificial neural network is a method designed to emulate the human way of thinking when learning new things or information. The perceptron algorithm is used to train the sample data. The sample data of this study are a collection of capital letter images captured by webcam and processed on a PC using the programs and image segmentation techniques described above. The final result of image segmentation is a binary image, an image containing only the binary values 1 and 0. Given the number of pixels in the image, 20 x 20 as explained above, the perceptron algorithm has 400 inputs for each character, one per binary image pixel. The first stage, carried out ahead of the learning (training) stage of the perceptron algorithm, is sample data collection. The collected data are organized into a folder serving as the storage and classification medium, and then undergo the training process using the perceptron algorithm. Upon completion of the training process, a weight value is generated for each capital letter, stored in a database, and later used for the trial process. Fig.
9 shows the sample data and the weight values prior to the training process. Upon completing the training process, the trial stage is performed. At the trial stage, the system's performance is reviewed with regard to how well it can distinguish the letter patterns that have been trained and validated.

Fig. 7. Image after the image cropping process
Fig. 8. Resized image result, 20 x 20 pixels, in binary form
Fig. 9. Sample data and weight values prior to the training process

The letters are tested directly using the webcam. Upon pressing a button in the program display, the system performs several processes, starting from capturing the letters on the whiteboard, processing the image into a digital image via the image segmentation techniques, and applying the perceptron algorithm as the artificial neural network method. In the end, the system tries to recognize the existing letter patterns and to read the recognized letters aloud. After this series of processes aimed at recognizing the trained letter patterns, the last stage is testing the letters that have been trained using the perceptron algorithm. This testing stage is designed to recognize more than one letter, whole words, even sentences, but fundamentally the recognition of the tested patterns is still done letter by letter. The image segmentation in this testing phase differs slightly from the image segmentation in the training process. The difference lies in separating one letter from the other letters present in an image: when a picture is captured by the camera, it is rendered into one digital image by the computer, so a digital image containing several letters must first be separated into individual letters in order for the test system to recognize the tested letter patterns properly. After the letters are separated, each letter becomes an input for the test system.
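The training and recognition procedure described above can be sketched as a one-vs-all perceptron: one weight vector per letter, the classic perceptron update rule, and recognition by taking the class with the maximum net input. This is an illustrative Python/NumPy sketch in which tiny hypothetical 4-pixel "letters" stand in for the 400-pixel vectors; it is not the paper's implementation.

```python
import numpy as np

def train_perceptrons(samples, labels, classes, epochs=50, lr=1.0):
    """One-vs-all perceptron training: one weight vector (plus bias) per
    letter class, updated with the classic perceptron rule."""
    n = samples.shape[1]
    weights = {c: np.zeros(n + 1) for c in classes}
    x = np.hstack([samples, np.ones((samples.shape[0], 1))])  # append bias input
    for _ in range(epochs):
        for c in classes:
            t = np.where(np.array(labels) == c, 1, -1)  # +/-1 targets for class c
            for xi, ti in zip(x, t):
                out = 1 if weights[c] @ xi >= 0 else -1
                if out != ti:
                    weights[c] += lr * ti * xi  # perceptron update rule
    return weights

def classify(weights, sample):
    """Pick the class whose perceptron produces the maximum net input."""
    xi = np.append(sample, 1.0)
    return max(weights, key=lambda c: weights[c] @ xi)

# Hypothetical 4-pixel "images" standing in for the 400-pixel letter vectors
X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]], dtype=float)
y = ["A", "B", "C"]
w = train_perceptrons(X, y, classes=["A", "B", "C"])
print([classify(w, xi) for xi in X])  # -> ['A', 'B', 'C']
```

Because the toy patterns are linearly separable, the perceptron convergence theorem guarantees the weights settle within the given epochs, and the maximum-net-input decision then recovers each training label.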
As in the training process, this testing phase processes the input images using the image segmentation techniques. After obtaining the input information required by the system, the input is processed to produce the expected output. The decision-making mode for this validation system is decision making under uncertainty by way of the maximax criterion [11]. Decision making under uncertainty denotes a decision condition in which the probabilities of the potential results are unidentified: the decision maker is aware of the alternative outcomes of various events, yet cannot determine the probability of each event. The maximax criterion searches for the best (maximum) alternative among the existing options and then decides on the maximum value of that outcome; it is also called the optimistic criterion, choosing the alternative with the highest benefit. By combining the input data and the weight values with the maximax criterion, the computer generates speech (reading) as output for the identified letter patterns. In the test shown in Fig. 10, an image of the letters "hello world" was tested and the system could recognize the letters. However, across multiple tests, as in Fig. 11, the system could not discern every letter: the letter "h" was identified as "u", while "r" was identified as "k", so the result became "uello wokld".

Fig. 10. The interface of the letters to be tested

IV. Conclusion

By way of several image segmentation techniques, namely grayscaling, binarization, and image cropping, supported by the perceptron algorithm as the capital letter learning method, the capital letter identification system runs well. Across multiple tests, one test failed to distinguish a letter.

References

[1] G. Bradski and A.
Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. USA: O'Reilly Media Inc., 2008.
[2] A. Hamzahan, G. Santosa, and W. Widiarto, “Klasifikasi objek dalam visi komputer dengan analisis diskriminan,” Makara Teknol., vol. 6, no. 1, pp. 24–32, 2002.
[3] G. A. Papakostas, E. G. Karakasis, and D. E. Koulouriotis, “Accurate and speedy computation of image Legendre moments for computer vision applications,” Image Vis. Comput., vol. 28, no. 3, pp. 414–423, 2010.
[4] M. R. W. Dawson, D. M. Kelly, M. L. Spetch, and B. Dupuis, “Using perceptrons to explore the reorientation task,” Cognition, vol. 114, no. 2, pp. 207–226, 2010.
[5] I. L. May, “Pengenalan vokal bahasa Indonesia dengan jaringan syaraf tiruan melalui transformasi wavelet diskret,” Universitas Diponegoro, 2002.
[6] D. Putra, Pengolahan Citra Digital. Yogyakarta: Andi Offset, 2010.
[7] Y.-C. Hu, “Pattern classification by multi-layer perceptron using fuzzy integral-based activation function,” Appl. Soft Comput., vol. 10, no. 3, pp. 813–819, 2010.
[8] E. Nugroho, Susilo, and Akhlis, “Pengembangan program pengolahan citra untuk radiografi digital,” J. MIPA, vol. 1, pp. 46–56, 2012.
[9] Y. Li, D. M. J. Tax, and M. Loog, “Scale selection for supervised image segmentation,” Image Vis. Comput., vol. 30, no. 12, pp. 991–1003, 2012.
[10] T.-H. Min and R.-H. Park, “Eyelid and eyelash detection method in the normalized iris image using the parabolic Hough model and Otsu's thresholding method,” Pattern Recognit. Lett., vol. 30, no. 12, pp. 1138–1143, 2009.
[11] E. D. Handoyo and L. W. Susanto, “Penerapan jaringan syaraf tiruan metode propagasi balik dalam pengenalan tulisan tangan huruf Jepang jenis hiragana dan katakana,” J. Inform., vol. 7, no. 1, pp. 39–55, 2011.

Fig. 11.
Text trial "hello world"

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 1, No 2, September 2018, pp. 39–45 eISSN 2597-4637
https://doi.org/10.17977/um018v1i22018p39-45
©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Signature Pattern Recognition Using Kohonen Network

Nadia Roosmalita Sari a, 1, *, Mohammad Zoqi Sarwani b, 2, Yudha Alif Aulia c, 3, Wayan Firdaus Mahmudy d, 4

a Department of Da'wah Management, Institut Agama Islam Negeri (IAIN) Tulungagung, Jl. Mayor Sujadi Timur No. 46, Tulungagung, 66221, Indonesia
b Department of Information Technology, Universitas Merdeka Pasuruan, Jl. Ir. H. Juanda No. 6, Pasuruan, 67129, Indonesia
c Department of Information Technology, Universitas Jember, Jl. Kalimantan No. 37, Jember, 68121, Indonesia
d Department of Computer Science, Universitas Brawijaya, Jl.
Veteran, Malang, 65145, Indonesia
1 nadiaroosmalitasari@gmail.com*; 2 zoqi.sarwani@unmerpas.ac.id; 3 yudha.alif7@gmail.com; 4 wayanfm@ub.ac.id
* corresponding author

I. Introduction

The manual process of signature identification is extremely ineffective: the signature being recognized has to be compared with a very large number of similar signatures. A computer-based signature recognition system could ease this process. Various studies on signature recognition have been done. A backpropagation neural network was applied for signature pattern recognition [1]; that study produced an accuracy rate of 95% for the training data and 88% for the testing data. Another study used the moment invariant method and Euclidean distance for signature pattern recognition [2]. Moment invariants were used to reduce the dimensions of the matrix, and the resulting vector was used as input data in the digital image recognition process. The recognition method using the Euclidean distance measures the difference between the vectors; the result of that study indicated that the system recognized all of the test data. Akram et al. [3] summarized a study of an offline signature recognition system using an artificial neural network (ANN). An ANN has the ability to learn and generalize the diversity and variation of human signatures, and it recognized signatures precisely [4]. In particular cases, an ANN can be more accurate than other techniques such as SVM and PMT [5]. Chaudhari et al. [6] successfully examined the fuzzy min-max algorithm for signature pattern recognition. Input was given in the form of digital images captured using a writing pad, an optical scanner, or a digital camera. The fuzzy min-max algorithm was implemented to classify the signature patterns, and the method fits well into the ANN framework.
The middle layer of the ANN worked as a defuzzification neuron so that the output could be classified correctly; the accuracy generated by these methods amounted to 92%.

Article history: Received 25 October 2017, Revised 26 November 2017, Accepted 19 January 2018, Published online 31 August 2018

Abstract: A signature is a special form of handwriting that is used in the human identification process. The current identification process is extremely ineffective: people have to manually compare signatures with the previously stored data. This study proposes the SOM Kohonen algorithm as a method for signature pattern recognition. This method is able to visualize high-dimensional data. Image processing is used in the data pre-processing phase. The accuracy of the SOM Kohonen was 70%, indicating that the method used is good enough for pattern recognition. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: signature, image processing, Kohonen

Another study on signature pattern recognition is that of Deore and Handore [7], which discussed offline signature pattern recognition where the features were extracted using the discrete wavelet transform (DWT) and principal component analysis (PCA). The SOM Kohonen (self-organizing feature map) is a neural network method that is often used for pattern recognition. The SOM Kohonen has been used to recognize handwritten numbers: that study used the united moment invariant to extract features from 500 handwritten number samples and the SOM Kohonen as the grouping method [8], with an accuracy of 98%. In addition, sound processing for speaker recognition has been examined by [9]; that study also implemented the SOM Kohonen as a method for sound processing.
The accuracy generated by that study was 96%. A similar study in sound recognition has been done in [9]; linear predictive coding and the SOM Kohonen were used as the sound recognition method, with an accuracy of 78% [8]. Based on these previous studies, this study proposes a neural network with SOM Kohonen models as a method for signature pattern recognition. Besides the fact that the SOM Kohonen method has been successfully implemented in several studies, SOM also has the advantage of visualizing high-dimensional data as simple, low-dimensional geometry.

II. Methods

A. Data Collection and Preprocessing

The data used in this study are signature patterns from 15 people, with 6 signatures per person; therefore, the total signature sample is 90 signatures. Preprocessing is performed during this phase to process the signature data. The existing signatures first undergo a scanning process; the next steps are noise reduction, background elimination, and width normalization. In the scanning stage, the authentic ink signatures that have been obtained are scanned, so the input is the original signatures and the output is digital images. The results of the scanning process are shown in Fig. 1. In the background elimination stage, the background in the signature area, located outside and beside the signature object, is cropped. Background cropping aims to equalize the entire area outside the object so that it has the same background color, as shown in Fig. 2. Noise consists of dots in the image which are not part of the image but appear in it for some reason [10]. Noise is undesirable and is regarded as a cause of the lack of detail of the signature object.
Therefore, noise reduction is used to eliminate objects outside the signature area, making the object or image detail better. The result of the noise reduction process is shown in Fig. 3. In image processing, normalization is a process that converts the pixel intensity range. The length and width dimensions of object images differ, so in this stage the length and width of an object have to be recognized; the width normalization process was done by cutting the space between the edges of the image and the image object using Adobe Photoshop. The outcome of this process is shown in Fig. 4.

Fig. 1. Scanning process

Image processing is done to improve the image quality so that it can be easily interpreted by a human or a computer for certain purposes; in this study, image processing is used for signature recognition. Hence, the SOM Kohonen method is implemented for processing the signature images. The SOM Kohonen method requires feature values extracted from the image. The feature extraction process is divided into two stages: global features and grid features. Global features are features that are common and easily obtained from the image; an example is the resolution of the image, which includes the image length in pixels. In this study, the image resolution is required for the grid feature process. In this process, the features are obtained by dividing the image into several parts; for each grid cell, the ratio of black pixels to white pixels, of white pixels to total pixels, and so on, is calculated. Fig. 5 shows an example of a grid feature process result. This discussion looks for the appropriate composition of the grid distribution. In general, the finer the grid distribution, the greater the accuracy level, but the greater the memory and computation time required.
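The grid feature step described above (splitting the image into cells and computing per-cell pixel ratios) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the black-pixel-ratio feature, and the toy image are invented for the example.

```python
# Illustrative sketch of grid feature extraction (not the authors' code).
# The image is a nested list of grayscale pixels (0 = black ink,
# 255 = white background); each grid cell contributes one feature:
# the ratio of black (ink) pixels to the cell's total pixels.

def grid_features(image, rows, cols, threshold=128):
    """Split `image` into rows x cols cells and return one feature per cell."""
    cell_h = len(image) // rows
    cell_w = len(image[0]) // cols
    features = []
    for r in range(rows):
        for c in range(cols):
            black = total = 0
            for y in range(r * cell_h, (r + 1) * cell_h):
                for x in range(c * cell_w, (c + 1) * cell_w):
                    total += 1
                    if image[y][x] < threshold:  # dark pixel -> ink
                        black += 1
            features.append(black / total)
    return features

# Toy 4x4 image with ink only in the top-left corner.
img = [[0, 0, 255, 255],
       [0, 0, 255, 255],
       [255, 255, 255, 255],
       [255, 255, 255, 255]]
print(grid_features(img, 2, 2))  # -> [1.0, 0.0, 0.0, 0.0]
```

A finer grid (e.g. the 15x15 grid with 225 segments used later in the study) simply yields a longer feature vector at a higher memory and computation cost, which is the trade-off discussed above.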
Therefore, grid testing of 3x3, 3x4, and 5x5 needs to be done. The experimental results are presented in Table 1.

Fig. 2. Background elimination
Fig. 3. The result of the noise reduction process
Fig. 4. The result of the width normalization process

B. Signature Database

Different data are used for the training and the testing of signature recognition, and the testing data and training data are treated equally. The signatures were made in a sitting position using the same pen. The 90 signature data were obtained from 15 people, with 6 signatures taken from each person: 75 training data and 15 testing data. As an example, the signature data undergo the feature value extraction process with a 15x15 grid; because the grid feature distribution used was 15x15, there are 225 segments. The result of the data extraction can be seen in Table 1.

C. Training Using the Self-Organizing Map (SOM) Kohonen

The SOM method aims to cluster the input vectors based on how they are grouped according to the input characteristics. SOM combines a competitive layer process with the topology of the input vectors in an iterative process. A SOM network consists of two layers: the input layer and the output layer [4]. Each neuron in the input layer is connected to each neuron in the output layer, and each neuron in the output layer represents the class of a given input. During the self-organizing process, the cluster whose weight vector is the most appropriate to the input pattern (has the closest distance) is selected as the winner. The winning neuron, along with its neighbouring neurons, then improves its weights. If we want to divide the data into k clusters, the competitive layer will consist of k neurons. Fig. 6 illustrates the architecture of the SOM Kohonen and Fig. 7 shows the diagram of the signature recognition process. As shown in Fig.
6, as an example, there are two input units (p1 and p2), which will be formed into 3 neuron clusters in the output layer (y1, y2, y3). Furthermore, the neurons improve their weights, such as the weight wij; here, the weight wij connects the jth neuron in the input layer to the ith neuron in the output layer.

Table 1. The result of feature extraction on the data

No  Name     Segment 1  Segment 2  ...  Segment 224  Segment 225
1   aahjp    255        255        ...  236.4941     168.0235
2   alif     255        255        ...  255          255
3   dinna    177.5214   177.4714   ...  255          255
4   eko      255        255        ...  255          254.5844
5   evi      255        255        ...  255          254.9846
6   fadli    255        255        ...  255          255
7   ida      254.9944   244.2278   ...  255          240.7111
8   vivi     255        255        ...  255          255
9   wisnu    255        255        ...  211.9375     200.2708
10  asyrofa  255        255        ...  255          255

Fig. 5. Example of a signature grid feature

III. Results and Discussions

Before entering the SOM Kohonen calculation phase, it should first be determined which data would be used as training data and which as testing data. For the training data, the 1st–10th signature data were used, so that the total amount of training data is 75 signatures. The training data and testing data distribution are shown in Table 2 and Table 3. The data training process aims to find the suitable final weights and mapping to be used in the data testing process. The training on 75 data takes 3 hours of SOM Kohonen calculation. The SOM Kohonen mapping results can be seen in Table 4.

Fig. 6. SOM Kohonen architecture
Fig. 7. Flow diagram of the signature recognition process in this study

In this study, the data testing was done after the weights and the mapping were obtained from the previous data training.
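The competitive training loop described above (find the winning neuron by closest distance, then move the winner and its neighbours toward the input) can be sketched as a minimal one-dimensional SOM. This is an illustrative sketch, not the authors' code: the function names, the toy data, the random initialization, and the decaying learning-rate/neighbourhood schedule are all assumptions.

```python
import math
import random

# Minimal illustrative SOM Kohonen training (not the authors' implementation):
# the output neuron whose weight vector is closest to the input wins, and the
# winner plus its neighbours move their weights toward the input.

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def winner(weights, x):
    """Index of the output neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: distance(weights[i], x))

def train_som(data, n_clusters, epochs=50, lr=0.5, radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    for epoch in range(epochs):
        alpha = lr * (1 - epoch / epochs)           # decaying learning rate
        rad = radius if epoch < epochs // 2 else 0  # shrinking neighbourhood
        for x in data:
            w = winner(weights, x)
            for i in range(len(weights)):
                if abs(i - w) <= rad:               # winner and its neighbours
                    weights[i] = [wi + alpha * (xi - wi)
                                  for wi, xi in zip(weights[i], x)]
    return weights

# Toy data: two well-separated groups should map to different winning neurons.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
w = train_som(data, n_clusters=3)
print(winner(w, [0.05, 0.05]), winner(w, [0.95, 0.95]))
```

In the study itself the inputs are the 225 grid-feature segments per signature and the competitive layer has one neuron per cluster, but the winner-selection and weight-update steps are the same in structure.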
The testing data used in this study were 10 signatures, each from a different person; these data were used as input data, while the output data in the testing stage were in the form of a true or false condition. The data testing was carried out in a minute. Table 5 shows the result of the data testing. Based on Table 5, 7 data are recognizable signatures (correct data), while the remaining 3 are signature data that are not recognized (wrong). After the data testing process, the accuracy was calculated using equation (1): the accuracy is obtained from the amount of correct data divided by the total of data used, then multiplied by 100%.

accuracy = (the amount of the correct data / total data) * 100%  (1)

The accuracy obtained by the calculation in equation (1) was 70%. Based on this result, the SOM Kohonen method is good enough to be used as a method for signature pattern recognition.

Table 2. Training data

No  Name     Segment 1  Segment 2  ...  Segment 224  Segment 225
1   aahjp    255        255        ...  236.4941     168.0235
2   aahjp    255        255        ...  164.9074     248.0185
3   aahjp    255        255        ...  228.2039     255
... ...      ...        ...        ...  ...          ...
73  asyrofa  255        255        ...  255          255
74  asyrofa  255        255        ...  255          255
75  asyrofa  255        255        ...  255          255

Table 3. Testing data

No  Name     Segment 1  Segment 2  ...  Segment 224  Segment 225
1   aahjp    255        255        ...  254.6538     206.0529
2   alif     255        255        ...  245.8788     255
3   dinna    177.5214   177.4714   ...  255          255
4   eko      255        255        ...  255          254.5844
5   evi      ...        ...        ...  ...          ...
6   fadli
7   ida      254.9944   244.2278   ...  255          240.7111
8   vivi     255        255        ...  255          255
9   wisnu    255        255        ...  211.9375     200.2708
10  asyrofa  255        255        ...  255          255

Table 4. The result of mapping data

Index  1      2       3        4      5      6
1      aahjp  alif    alif     aahjp  fadli  fadli
2      wisnu  aahjp   eko      alif   dinna  alif
3      fadli  vivin   asyrofa  dinna  eko    jaya
4      vivin  hilman  vivin    dinna  bagus  eko
5      ida    evi     evi      evi    jaya   eko
6      ida    ida     eko      agung  evi    vivi

IV. Conclusion

The SOM Kohonen network can analyse 7 signature data out of 10 correctly, giving an accuracy of 70%. This accuracy is not perfect because it is influenced by many factors, namely the determination of the initial weight dimensions, which is still random, and the grid determination on the data. The algorithm should also be tried on more complex data, e.g. signature data that are rotated. For further study, weight modification will be performed by optimizing the weights using evolutionary algorithms to improve the accuracy.

References

[1] A. Hidayatno, R. R. Isnanto, and D. K. W. Buana, "Identifikasi tanda-tangan menggunakan jaringan saraf tiruan perambatan-balik (backpropagation)," J. Teknol. IST AKPRIND, vol. 1, no. 2, 2008.
[2] David and S. Kosasi, "Penerapan algoritma jaringan saraf tiruan backpropagation untuk pengenalan pola tanda tangan," Teknologi, vol. 6, no. 2, 2013.
[3] M. Akram, R. Qasim, and M. A. Amin, "A comparative study of signature recognition problem using statistical features and artificial neural networks," in 2012 International Conference on Informatics, Electronics & Vision (ICIEV), 2012, pp. 925–929.
[4] A. Karouni, B. Daya, and S. Bahlak, "Offline signature recognition using neural networks approach," Procedia Comput. Sci., vol. 3, pp. 155–161, 2011.
[5] I. Bhattacharya, P. Ghosh, and S. Biswas, "Offline signature verification using pixel matching technique," Procedia Technol., vol. 10, pp. 970–977, 2013.
[6] B. M. Chaudhari, A. A. Barhate, and A. A. Bhole, "Signature recognition using fuzzy min-max neural network," in 2009 International Conference on Control, Automation, Communication and Energy Conservation, 2009, pp. 1–7.
[7] M. R. Deore and S. M. Handore, "Offline signature recognition: artificial neural network approach," in 2015 International Conference on Communications and Signal Processing (ICCSP), 2015, pp.
1708–1712.
[8] G. F. Fitriana and S. Samsuryadi, "Penggunaan united moment invariant dan self organizing maps untuk pengenalan tulisan tangan angka," J. Generic, vol. 10, no. 1, pp. 398–404, 2015.
[9] P. Sahayu and G. Fitriana, "Pengenalan suara menggunakan linear predictive coding dan self organizing maps," Annu. Res. Semin., vol. 1, no. 1, pp. 21–22, 2015.
[10] E. Özgündüz, T. Şentürk, and M. E. Karslıgil, "Off-line signature verification and recognition by support vector machine," in 2005 13th European Signal Processing Conference, 2005, pp. 1–4.

Table 5. The results of the data testing

No  Name     Input    Output   Result
1   aahjp    aahjp    aahjp    true
2   alif     alif     alif     false
3   dina     dina     asyrofa  true
4   eko      eko      eko      true
5   evi      evi      evi      true
6   fadli    fadli    fadli    true
7   ida      ida      ida      true
8   vivi     vivi     vivi     true
9   wisnu    wisnu    ida      false
10  asyrofa  asyrofa  vivi     false

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 3, No 2, December 2020, pp. 99–105 eISSN 2597-4637
https://doi.org/10.17977/um018v3i22020p99-105
©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Segmentation Method for Face Modelling in Thermal Images

Albar a, 1, Hendrick a, 2, *, Rahmat Hidayat b, 3

a Department of Electrical Engineering, Politeknik Negeri Padang, Jl. Kampus, Limau Manis, Kec. Pauh, Kota Padang, Sumatera Barat 25162, Indonesia
b Department of Information Technology, Politeknik Negeri Padang, Jl. Kampus, Limau Manis, Kec. Pauh, Kota Padang, Sumatera Barat 25162, Indonesia
1 albar@pnp.ac.id; 2 hendrick@pnp.ac.id*; 3 rahmat@pnp.ac.id
* corresponding author

I. Introduction

Face recognition has been applied in many areas, especially in security systems. To avoid face spoofing [1], stereo cameras are usually applied in face recognition systems [2].
Nowadays, face recognition is supported by deep learning methods, which reduce the manual procedures of machine learning [3]. When applying machine learning methods, feature extraction is done manually before training to get the models; by using deep learning methods, the created models achieve high accuracy in prediction [4]. Another method to identify a real person is by using a thermal camera, which records the subject's temperature. Thermal imaging is mostly applied in contactless temperature measurement, such as in the steel industry. The thermal camera is applied not only in the industrial field but also in biomedical applications, such as contactless breath rate measurement [5][6], breast health, musculoskeletal and neurological medicine, dermatology, and dental care [7]. The convolutional neural network (CNN) is the common method in deep learning, and it has been applied in many areas, such as biomedical images [8]. Deep learning has several well-known frameworks: TensorFlow, Keras, PyTorch, Caffe, CNTK, and MXNet [4]. Region-based convolutional neural networks (RCNN) are among the best methods for object detection. RCNN has been applied in several applications: finding the optic nerve in fundus images [9], face detection in RGB images [10], and facial detection [11]. In this research, we propose a segmentation method, Mask RCNN, to create a face model from thermal images. The model detects and locates the face in thermal images. Face images were recorded using a FLIR Lepton thermal camera, which has the specification of a military-standard device [12]. The dataset was created by combining directly recorded images and images from an online dataset, and it was expanded using the data augmentation method to achieve accurate prediction models [13]. The face model was created using the segmentation method of Mask RCNN, which is covered by the TensorFlow-GPU and Keras frameworks [14]. To reduce training time, TensorFlow-GPU is applied in this research.
The final model was applied in real-time detection by using OpenCV. In future works, this model will be embedded in a mini PC, such as a Raspberry Pi, and developed to measure face temperature from thermal images.

Article history: Received 01 December 2020, Revised 12 December 2020, Accepted 25 December 2020, Published online 31 December 2020

Abstract: Face detection is mostly applied to RGB images, and object detection usually applies deep learning methods for model creation. One method against face spoofing is using a thermal camera. The well-known object detection methods are YOLO, Fast Region-Based Convolutional Neural Networks (RCNN), Faster RCNN, SSD, and Mask RCNN. We propose a segmentation Mask RCNN method to create a face model from thermal images. This model is able to locate the face area in images. The dataset was established using 1600 images, created by direct capturing and by collecting from an online dataset. The Mask RCNN was configured to train with 5 epochs and 131 iterations. The final model predicted and located the face correctly on the test image. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: face detection, segmentation, thermal images, deep learning

II. Methods and Materials

A. Data Collection

Data collection was done using a FLIR Lepton thermal camera; data were collected not only from FLIR Lepton thermal images but also from an online dataset. The thermal images have several formats: contrast, gray, artic, and lava. Each format has its own function.
In this research, we selected only the contrast format for the whole dataset; Figure 1 shows the thermal image formats. To increase the dataset size, image augmentation was applied to the original dataset. Image augmentation can be done by rotating, flipping, etc. The data augmentation method creates 100 images from each original image, so the final dataset size is 1600 images. Figure 2 shows the image augmentation result for a chest X-ray image [15].

B. Training Preparation

For object detection purposes, every object has to be labelled to indicate the object's location in the images. Label creation was done using labelImg, which was installed with a single command, and the final files were saved as XML files. The system was trained and deployed on the Ubuntu 16.04 operating system. The dataset was separated into an images (train) folder and an annots folder. Figure 3 shows the XML result, which contains the object location, image size, and image depth; the red square box shows the object location.

C. Mask RCNN

Mask RCNN is a development of RCNN and Fast RCNN. Fast RCNN produces a class label and a bounding-box offset for every candidate object. Mask RCNN has the same outputs as RCNN, but it also creates the object mask. The other important thing that makes Mask RCNN better than RCNN is pixel-to-pixel alignment: RoI alignment creates a small feature map for each RoI. The final stage of Mask RCNN is instance segmentation, which generates a pixel-wise mask for each object in the image; even when two objects are in the same class, Mask RCNN treats them as different instances. Figure 4 shows the Mask RCNN framework with instance segmentation.

Fig. 1. Thermal image formats
Fig. 2. Augmentation result of the chest X-ray image

Training Mask RCNN in Python requires some libraries to be installed correctly, especially CUDA and cuDNN.
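The XML annotation files produced in the training-preparation step above (Pascal VOC style, storing the image size and a bounding box per object) can be read with Python's standard library. The sample annotation string, field values, and helper name below are illustrative, not taken from the authors' dataset:

```python
import xml.etree.ElementTree as ET

# Illustrative sketch of reading one labelImg-style (Pascal VOC) annotation:
# the XML stores the image size and a bounding box per labelled object.
SAMPLE = """
<annotation>
  <size><width>160</width><height>120</height><depth>3</depth></size>
  <object>
    <name>face</name>
    <bndbox><xmin>40</xmin><ymin>30</ymin><xmax>100</xmax><ymax>90</ymax></bndbox>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    """Return image (width, height) and a list of (label, xmin, ymin, xmax, ymax)."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    w = int(size.find("width").text)
    h = int(size.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return (w, h), boxes

print(parse_annotation(SAMPLE))  # -> ((160, 120), [('face', 40, 30, 100, 90)])
```

In a training pipeline, such a parser would be run over every file in the annots folder to pair each thermal image with its face bounding box.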
In this training, we used CUDA 9.0 with NVIDIA driver 384; the laptop specification must be considered to decide the versions of CUDA and cuDNN. TensorFlow-GPU and Keras were then installed on the device; an improper installation generates a core-dump error. In this research, we used a laptop with the specification mentioned in Table 1. The thermal image dataset was divided into a train dataset and a test dataset: the train dataset is 80% of the whole dataset, and the test set is 20%.

Fig. 3. Object location in the XML file
Fig. 4. Mask RCNN framework with instance segmentation

Table 1. Laptop specifications

No  Device  Specification
1   CPU     Core i7
2   GPU     GTX 750
3   RAM     16 GByte

III. Results and Discussions

Figure 5 shows the result of the image augmentation process: each source image became 100 images, and the total number of images in the dataset is 1600. The training was configured with epoch = 5 and iterations = 131. By setting the epoch value to 5, the training loop ended at the 5th epoch. After 6 hours, the models were created in h5 format. Figure 6 shows the models created by the training. Because we configured the epoch value to 5, the models were created as mask_rcnn_cfg_0001.h5 for the 1st epoch, mask_rcnn_cfg_0002.h5 for the 2nd epoch, mask_rcnn_cfg_0003.h5 for the 3rd epoch, mask_rcnn_cfg_0004.h5 for the 4th epoch, and mask_rcnn_cfg_0005.h5 for the 5th epoch. All models were saved automatically by the training program. To find out the performance, each model was tested using the test dataset. Figure 7 depicts the test image on which mask_rcnn_cfg_0005.h5 was deployed in the program. The face was predicted perfectly by the face model, and the program automatically created a red rectangle to visualize the detected face in the thermal image. This stage tested the model with a single image.

Fig. 5. Image augmentation result
Fig. 6. The face models
Fig. 7. Face detection on a single image
Fig. 8. Face detection on new images: (a) actual images and (b) predicted images

Besides a single image, the model was also verified using test images, i.e. a number of images prepared to test the model. The models were tested with a set of five test images. Figure 8 shows the result of the image prediction: the result displays two outputs, the actual and the predicted images, with the predicted faces visualized by white squares. Based on the 5th model, all faces in the thermal images were predicted correctly, as shown by the white squares.

IV. Conclusion

This research proposed a segmentation method for face modelling using thermal images. The model was created using the Mask RCNN method. The data collection was done using a FLIR Lepton 3.5 thermal camera, which is a military-standard camera. The model was tested using test images prepared during data preparation. The final model successfully located faces in thermal images of the contrast type, and it predicted all tested images correctly through several experiments. For future work, this model will be deployed on an NVIDIA embedded device such as the Jetson Nano. Our goal is to make a portable device to measure the temperature of all detected faces in a frame. We will extend the dataset by re-capturing images from public areas such as airports.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no conflict of interest.

Additional information. No additional information is available for this paper.
References

[1] W. Sun, Y. Song, H. Zhao, and Z. Jin, "A face spoofing detection method based on domain adaptation and lossless size adaptation," IEEE Access, vol. 8, pp. 66553–66563, 2020.
[2] F. Alqahtani, J. Banks, V. Chandran, and J. Zhang, "3D face tracking using stereo cameras: a review," IEEE Access, vol. 8, pp. 94373–94393, 2020.
[3] R. He, X. Wu, Z. Sun, and T. Tan, "Wasserstein CNN: learning invariant features for NIR-VIS face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1761–1773, July 2019.
[4] H. Hendrick, "The halal logo classification by using NVIDIA DIGITS," in International Conference on Applied Information Technology and Innovation (ICAITI), 2018, pp. 162–165.
[5] G. Scebba, G. Da Poian, and W. Karlen, "Multispectral video fusion for non-contact monitoring of respiratory rate and apnea," IEEE Transactions on Biomedical Engineering, vol. 68, no. 1, pp. 350–359, Jan. 2021.
[6] C. Massaroni, D. S. Lopes, D. Lo Presti, E. Schena, and S. Silvestri, "Contactless monitoring of breathing patterns and respiratory rate at the pit of the neck: a single camera approach," J. Sensors, vol. 2018, 2018.
[7] T. Kasprzyk-Kucewicz, A. Cholewka, K. Bałamut, et al., "The applications of infrared thermography in surgical removal of retained teeth effects assessment," J. Therm. Anal. Calorim., vol. 144, no. 1, pp. 139–144, 2020.
[8] Z. Rustam, S. Hartini, R. Y. Pratama, R. E. Yunus, and R. Hidayat, "Analysis of architecture combining convolutional neural network (CNN) and kernel K-means clustering for lung cancer diagnosis," Int. J. Adv. Sci. Eng. Inf. Technol., vol. 10, no. 3, pp. 1200–1206, 2020.
[9] H. Almubarak, Y. Bazi, and N. Alajlan, "Two-stage Mask-RCNN approach for detecting and segmenting the optic nerve head, optic disc, and optic cup in fundus images," Appl. Sci., vol. 10, no. 11, 2020.
[10] C. Zhang, X. Xu, and D. Tu, "Face detection using improved Faster RCNN," February 2018.
[11] L. Hao and F. Jiang, "A new facial detection model based on the faster R-CNN," IOP Conf. Ser. Mater. Sci. Eng., vol. 439, no. 3, 2018.
[12] C. Fujii, "Thermal camera," J. Inst. Telev. Eng. Japan, vol. 29, no. 9, pp. 705–713, 1975.
[13] Z. Pei, H. Xu, Y. Zhang, M. Guo, and Y. Yee-Hong, "Face recognition via deep learning using data augmentation based on orthogonal experiments," Electron., vol. 8, no. 10, pp. 1–16, 2019.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, 2020.
[15] H. Hendrick, W. Zhi-Hao, C. Hsien-I, C. Pei-Lun, and J. Gwo-Jia, "iOS mobile app for tuberculosis detection based on chest X-ray image," in Proc. ICAITI 2019 2nd Int. Conf. Appl. Inf. Technol. Innov., pp. 122–125, 2019.
https://doi.org/10.1109/tpami.2018.2844175 https://doi.org/10.1109/tpami.2018.2844175 https://doi.org/10.1109/icaiti48442.2019.8982152 https://doi.org/10.1109/icaiti48442.2019.8982152 https://doi.org/10.1109/icaiti48442.2019.8982152 i. introduction ii. methods and materials a. data collection b. training preparation c. mask rcnn iii. results and discussions iv. conclusion declarations author contribution funding statement conflict of interest additional information references [1] w. sun, y. song, h. zhao and z. jin, "a face spoofing detection method based on domain adaptation and lossless size adaptation," in ieee access, vol. 8, pp. 66553-66563, 2020. [2] f. alqahtani, j. banks, v. chandran and j. zhang, "3d face tracking using stereo cameras: a review," in ieee access, vol. 8, pp. 94373-94393, 2020. [3] r. he, x. wu, z. sun and t. tan, "wasserstein cnn: learning invariant features for nir-vis face recognition," in ieee transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1761-1773, 1 july 2019. [4] h. hendrick, “the halal logo classification by using nvidia digits,” in international conference on applied information technology and innovation (icaiti), 2018, pp. 162–165. [5] g. scebba, g. da poian and w. karlen, "multispectral video fusion for non-contact monitoring of respiratory rate and apnea," in ieee transactions on biomedical engineering, vol. 68, no. 1, pp. 350-359, jan. 2021. [6] c. massaroni, d. s. lopes, d. lo presti, e. schena, and s. silvestri, “contactless monitoring of breathing patterns and respiratory rate at the pit of the neck: a single camera approach,” j. sensors, vol. 2018, . [7] kasprzyk-kucewicz, t., cholewka, a., bałamut, k. et al. “the applications of infrared thermography in surgical removal of retained teeth effects assessment". j therm anal calorim , vol 144, no 1, pp 139-144, 2020. [8] z. rustam, s. hartini, r. y. pratama, r. e. yunus, and r. 
hidayat, “analysis of architecture combining convolutional neural network (cnn) and kernel k-means clustering for lung cancer diagnosis,” int. j. adv. sci. eng. inf. technol., vol. 10, no. ... [9] h. almubarak, y. bazi, and n. alajlan, “two-stage mask-rcnn approach for detecting and segmenting the optic nerve head, optic disc, and optic cup in fundus images,” appl. sci., vol. 10, no. 11, 2020. [10] c. zhang, x. xu, and d. tu, “face detection using improved faster rcnn,” no. february 2018, 2018. [11] l. hao and f. jiang, “a new facial detection model based on the faster r-cnn,” iop conf. ser. mater. sci. eng., vol. 439, no. 3, 2018. [12] c. fujii, “thermal camera,” j. inst. telev. eng. japan, vol. 29, no. 9, pp. 705–713, 1975. [13] z. pei, h. xu, y. zhang, m. guo, and y. yee-hong, “face recognition via deep learning using data augmentation based on orthogonal experiments,” electron., vol. 8, no. 10, pp. 1–16, 2019. [14] k. he, g. gkioxari, p. dollár, and r. girshick, “mask r-cnn,” ieee trans. pattern anal. mach. intell., vol. 42, no. 2, pp. 386–397, 2020. [15] h. hendrick, w. zhi-hao, c. hsien-i, c. pei-lun, and j. gwo-jia, “ios mobile app for tuberculosis detection based on chest x-ray image,” proc. icaiti 2019 2nd int. conf. appl. inf. technol. innov. explor. futur. technol. appl. inf. technol. inn... keds_paper_template knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 1, july 2021, pp. 
49–54, eISSN 2597-4637, https://doi.org/10.17977/um018v4i12021p49-54. ©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a SINTA 2 journal (https://sinta.ristekbrin.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Research & Technology.

Face Images Classification Using VGG-CNN

I Nyoman Gede Arya Astawa a,1,*, Made Leo Radhitya b,2, I Wayan Raka Ardana a,3, Felix Andika Dwiyanto c,4
a Electrical Engineering Department, Politeknik Negeri Bali, Kampus Jimbaran, Badung, Bali 80361, Indonesia
b Department of Informatics, STMIK STIKOM Indonesia, Tukad Pakerisan 97, Denpasar, Bali 80225, Indonesia
c Association for Scientific Computing Electronics and Engineering (ASCEE), Jl. Janti, Karangjambe 130B, Banguntapan, Bantul, Yogyakarta, Indonesia
1 arya_kmg@pnb.ac.id*; 2 leo.radhitya@stiki-indonesia.ac.id; 3 rakawyn@pnb.ac.id; 4 felix@ascee.org
* corresponding author

I. Introduction

Facial recognition is one of the most widely studied biometrics fields because of its high level of difficulty [1][2]. Image classification, specifically, is part of the facial recognition process and remains an active problem in computer vision [3]. Classification helps accelerate training because the data have already been grouped before the training process begins, and the choice of classification method also determines the accuracy achieved during training [4]. Popular classifiers in facial recognition include Euclidean distance, KNN, SVM, PCA, and CNN [5][6]. Currently, studies applying deep learning methods provide better results in facial recognition [7], and the most compelling image recognition method is the convolutional neural network (CNN) [8]. Recent research results show that transfer learning solutions are the basis for image classification [7][9][10].
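As a concrete illustration of the simplest classifier in the list above, a Euclidean-distance (1-nearest-neighbour) classifier over numeric feature vectors can be written in a few lines. This sketch is only illustrative; it is not the method used in this paper, which modifies a CNN.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour_classify(sample, train_vectors, train_labels):
    """Assign the label of the closest training vector (1-NN)."""
    distances = [euclidean(sample, v) for v in train_vectors]
    return train_labels[distances.index(min(distances))]
```

Such distance-based classifiers need no training phase, which is why CNN-based approaches, despite their training cost, are preferred when accuracy matters.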
That research claimed that CNN provides significant results. Moreover, for binary image classification, the combination of the ReLU activation function and a sigmoid classifier provided the best classification accuracy [11]. Other studies found that the activation function strongly influences a system's accuracy in identifying and recognizing mushroom images [12].

ARTICLE INFO — Article history: received 4 March 2021, revised 29 March 2021, accepted 4 April 2021, published online 17 August 2021.

ABSTRACT — Image classification is a fundamental problem in computer vision. In facial recognition, image classification can speed up the training process and also significantly improve accuracy. Deep learning methods are now commonly used in facial recognition; one of them is the convolutional neural network (CNN), which achieves high accuracy. This study combines CNN for facial recognition with VGG for the classification process. The process begins by inputting the face image. Then, a feature-extractor preprocessor is used for transfer learning. This study uses the VGG-Face model as the transfer learning optimization model with a pre-trained architecture. The features extracted from an image are numeric vectors, which the model uses to describe specific features in the image. The face images are divided into 17% test data and 83% training data. The results show that the accuracy, validation accuracy (val_accuracy), loss, and validation loss (val_loss) values are excellent. The best training results come from digital camera images with the modified classification layers: val_accuracy is very high (99.84%), not far from the accuracy value (94.69%). This small gap indicates a good model, since too large a gap would indicate underfitting; conversely, an accuracy value higher than the validation accuracy would indicate overfitting.
Likewise, val_loss (0.69%) and loss (10.41%) are both low. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). Keywords: classification; CNN; face image; VGG.

This study aims to examine facial image classification with the CNN method. The pre-trained model used is the VGG-Face model [8], whose strong results were obtained using 16–19 weight layers. The classification modeling in this study changes the last layers of the CNN.

II. Method

This study comprises several processes to achieve the expected result: data collection, feature-map extraction, classification modeling, and result validation testing. It uses the KomNET dataset [13], with 36,600 face images of 224×224 pixels. In computer vision, transfer learning is commonly realized through a pre-trained model; a typical implementation imports a model from an existing library. The next step is to create a new convolutional neural network (CNN) model for image classification using a multiclass CNN. This classification model is generated from the transfer learning approach, which is based on a CNN pre-trained model [14]. In general, CNNs have proved superior in a variety of computer vision tasks [15], and convolutional networks (ConvNets) have shown excellent performance in handwritten digit classification and face detection [16]. Figure 1 outlines the CNN processes in the system.
The process begins by inputting the face image. The method used for transfer learning is the feature-extractor preprocessor. This study uses a transfer learning optimization model with a pre-trained architecture, the VGG-Face model. The features extracted from an image are numeric vectors, which the model uses to describe specific features in the image. VGG-Face was selected because it is well suited to producing facial feature extraction [17]; the feature extractor has the 16-layer VGG-Face architecture. After the VGG-Face model runs, its last layers are modified to achieve the best result. Figure 2 presents the VGG-Face architecture, with the last three layers being the classification layers to be modified. The first-layer features are general and the last-layer features are specific, so there must be a transition from general to specific somewhere in the network [18]. The pre-trained strategy leaves the initial layers untrained and trains only the final layers to avoid overfitting [19]: the initial layers perform convolution (feature extraction), while the last layers perform classification.

Fig. 1. CNN process
Fig. 2. VGG-Face architecture

The last three layers are fully connected + ReLU, and their modification is required to provide better performance. The following is the pseudocode for the last three layers.
# last layers (classification head)
classifier_model = Sequential()
classifier_model.add(Dense(units=100, input_dim=x_train.shape[1], kernel_initializer='glorot_uniform'))
classifier_model.add(BatchNormalization())
classifier_model.add(Activation('tanh'))
classifier_model.add(Dropout(0.3))
classifier_model.add(Dense(units=10, kernel_initializer='glorot_uniform'))
classifier_model.add(BatchNormalization())
classifier_model.add(Activation('tanh'))
classifier_model.add(Dropout(0.2))
classifier_model.add(Dense(units=24, kernel_initializer='he_uniform'))
classifier_model.add(Activation('softmax'))
classifier_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(), optimizer='nadam', metrics=['accuracy'])

The last layers of the VGG-Face model are the fully connected layers before the output layer. They provide a complex set of features describing an input image and useful input for training a new image classification model. After the pre-trained model from the last VGG-Face layer is loaded, the next step is to create the training and test data. This consists of five stages: (1) resize each image in the train or test folder to the target size of 224×224 pixels; (2) convert the image into an array; (3) feed the result into the last VGG-Face layer; (4) enter the results into the training data array and the test data array; (5) repeat from step one until all face images in the train or test folder have been read. After the model is built, the training process uses 100 epochs. This process produces the weight values, which are stored in a file in H5 format. Tests are performed to validate the results by training facial images from several devices, and the test results are displayed in graphical form.
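The five-stage train/test preparation described above can be sketched in Python. This is a minimal illustration, not the authors' code: `load_image` and `extract_features` are hypothetical placeholders for the actual image loader and the truncated VGG-Face feature extractor, and a one-subfolder-per-class layout is assumed.

```python
import os

def build_feature_arrays(folder, extract_features, load_image, target_size=(224, 224)):
    """Stages 1-5: load each face image, convert it, run it through the
    (truncated) VGG-Face extractor, and collect feature vectors with labels."""
    features, labels = [], []
    for label in sorted(os.listdir(folder)):            # assumed: one subfolder per class
        class_dir = os.path.join(folder, label)
        for name in sorted(os.listdir(class_dir)):      # stage 5: repeat for every image
            img = load_image(os.path.join(class_dir, name), target_size)  # stages 1-2
            features.append(extract_features(img))      # stage 3
            labels.append(label)                        # stage 4
    return features, labels
```

The resulting feature and label arrays are what the classification head shown above would be trained on.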
III. Results and Discussion

Massive amounts of data are necessary to produce an ideal result. The pre-trained model is a ready-made model provided by TensorFlow or Keras and can be used directly from the VGG-Face Keras library. After the model is built, the next step is the training process with 100 epochs; this limit keeps the iteration over the large dataset, which would otherwise take a long time, to one training session. The deep learning method has a weakness in its long training process on a server computer, which can be overcome using graphics processing unit (GPU) technology [9][20]; this study used the GPU provided by Google Colab for the training process. The result of the training process is the set of weight values, stored in an H5 file. The training results at 100 epochs for three sources of face images are presented in Table 1.

Table 1 shows the results of facial image training from three devices at epoch 100. In the training process, the facial images were divided into 17% test data and 83% training data. The accuracy, val_accuracy, loss, and val_loss values are impressive. However, the best training result comes from the digital camera images with the modified classification layers: val_accuracy is very high (99.84%), not far from the accuracy (94.69%). This small gap indicates a good model; too large a gap would cause underfitting, while an accuracy value higher than the validation accuracy would cause overfitting. Moreover, val_loss is very low (0.69%) and the loss is 10.41%, the smallest error among the three sources, which means the model is ideal and proper to use for prediction.
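The fit criterion stated above (training accuracy above validation accuracy suggests overfitting; too large a gap suggests underfitting; a small gap indicates a good model) can be expressed as a small helper. This is a sketch of the paper's own rule of thumb, not standard diagnostic practice, and `gap_threshold` is an assumed, illustrative cutoff.

```python
def fit_diagnosis(accuracy, val_accuracy, gap_threshold=10.0):
    """Apply the paper's rule of thumb to training vs. validation accuracy (in %).
    gap_threshold is an illustrative value, not one taken from the paper."""
    if accuracy > val_accuracy:
        return "possible overfit"       # training accuracy exceeds validation accuracy
    if val_accuracy - accuracy > gap_threshold:
        return "possible underfit"      # validation runs far ahead of training
    return "good fit"                   # the two values are close

# Table 1, digital camera images: accuracy 94.69 %, val_accuracy 99.84 %
verdict = fit_diagnosis(94.69, 99.84)
```

Applied to Table 1, the digital camera row comes out as a good fit, while the social media row (accuracy 93.02%, val_accuracy 92.75%) would be flagged as a possible overfit under this rule.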
The training results from start to finish are presented graphically. Figure 3 shows the facial image training results at epoch 100 with the modified classification. The figure shows that the model (with the last three layers modified) is good and ideal, since the differences between the training and validation values are insignificant; likewise, the difference between val_loss and loss is relatively small, and the values are close.

Table 1. Training results of three image sources at epoch 100
Image source    Train images  Test images  Accuracy (%)  Val_accuracy (%)  Loss (%)  Val_loss (%)
Mobile phone    11,000        2,200        94.05         98.69             12.32      6.29
Digital camera  11,000        2,200        94.69         99.84             10.41      0.69
Social media    11,000        2,200        93.02         92.75             20.07     38.20

Fig. 3. Training results at epoch 100 with modification, for facial images sourced from (a) mobile phones, (b) digital cameras, and (c) social media

IV. Conclusion

This study built a pre-trained model on the VGG-Face architecture and modified its last three layers, the classification section. The model yields very high accuracy, and the resulting loss is very low, indicating that the model is good and ideal for prediction. The image data for training were obtained from three sources; of these, the best source is the digital camera, with accuracy = 94.69% and loss = 10.41%. Further research should therefore focus on the quality of the camera image sources to optimally improve classification performance.

Acknowledgment

Politeknik Negeri Bali and STIKI Indonesia supported this research. We thank everyone who contributed to the completion of this paper and hope this research contributes significantly to knowledge development, especially in face image classification.
Declarations

Author contribution: All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest: The authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information: Reprints and permission information is available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] M. Andrejevic and N. Selwyn, “Facial recognition technology in schools: critical questions and concerns,” Learn. Media Technol., vol. 45, no. 2, pp. 115–128, Apr. 2020, doi: 10.1080/17439884.2020.1686014.
[2] C. M. Cook, J. J. Howard, Y. B. Sirotin, J. L. Tipton, and A. R. Vemury, “Demographic effects in facial recognition and their dependence on image acquisition: an evaluation of eleven commercial systems,” IEEE Trans. Biometrics, Behav. Identity Sci., vol. 1, no. 1, pp. 32–41, Jan. 2019, doi: 10.1109/TBIOM.2019.2897801.
[3] Y. Lin and H. Xie, “Face gender recognition based on face recognition feature vectors,” in 2020 IEEE 3rd International Conference on Information Systems and Computer Aided Education (ICISCAE), Sep. 2020, pp. 162–166, doi: 10.1109/ICISCAE51034.2020.9236905.
[4] M. Imani and H. Ghassemian, “Fast feature selection methods for classification of hyperspectral images,” in 7th International Symposium on Telecommunications (IST 2014), Sep. 2014, pp. 78–83, doi: 10.1109/ISTEL.2014.7000673.
[5] Y. Zhu, C. Zhu, and X. Li, “Improved principal component analysis and linear regression classification for face recognition,” Signal Processing, vol. 145, pp. 175–182, Apr.
2018, doi: 10.1016/j.sigpro.2017.11.018.
[6] A. Raikwar and J. Agrawal, “A review of face recognition using feature optimization and classification techniques,” in Information Management and Machine Intelligence (ICIMMI 2019), Algorithms for Intelligent Systems, D. Goyal, V. E. Bălaş, A. Mukherjee, V. H. C. de Albuquerque, and A. K. Gupta, Eds. Singapore: Springer, 2021, pp. 595–604.
[7] A. Bilgic, O. C. Kurban, and T. Yildirim, “Face recognition classifier based on dimension reduction in deep learning properties,” in 2017 25th Signal Processing and Communications Applications Conference (SIU), May 2017, pp. 1–4, doi: 10.1109/SIU.2017.7960368.
[8] T. Purwaningsih, I. A. Anjani, and P. B. Utami, “Convolutional neural networks implementation for chili classification,” in 2018 International Symposium on Advanced Intelligent Informatics (SAIN), Aug. 2018, pp. 190–194, doi: 10.1109/SAIN.2018.8673373.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017, doi: 10.1145/3065386.
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, Sep. 2014.
[11] K. Chauhan and S.
Ram, “Image classification with deep learning and comparison between different convolutional neural network structures using TensorFlow and Keras,” Int. J. Adv. Eng. Res. Dev., vol. 5, no. 2, pp. 533–538, 2018.
[12] A. Fadlil, R. Umar, and S. Gustina, “Mushroom images identification using Orde 1 statistics feature extraction with artificial neural network classification technique,” Journal of Physics: Conference Series, vol. 1373, p. 012037, Nov. 2019, doi: 10.1088/1742-6596/1373/1/012037.
[13] I. N. G. A. Astawa, I. K. G.
D. Putra, M. Sudarma, and R. S. Hartati, “KomNET: face image dataset from various media for face recognition,” Data Br., vol. 31, p. 105677, Aug. 2020, doi: 10.1016/j.dib.2020.105677.
[14] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: a brief review,” Comput. Intell. Neurosci., vol. 2018, pp. 1–13, 2018, doi: 10.1155/2018/7068349.
[15] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009, doi: 10.1561/2200000006.
[16] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer, 2014, pp. 818–833.
[17] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: a dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), May 2018, pp. 67–74, doi: 10.1109/FG.2018.00020.
[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” arXiv preprint arXiv:1411.1792, Nov. 2014.
[19] P. Marcelino, “Transfer learning from pre-trained models,” Towards Data Science, 2018.
[20] Y. E. Wang, G.-Y. Wei, and D. Brooks, “Benchmarking TPU, GPU, and CPU platforms for deep learning,” arXiv preprint arXiv:1907.10701, Jul. 2019.
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 1, No 1, January 2018, pp. 1–7, eISSN 2597-4637, https://doi.org/10.17977/um018v1i12018p1-7. ©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Network Traffic Time Series Performance Analysis Using Statistical Methods

Purnawansyah a,1, Haviluddin b,2,*, Rayner Alfred c,3, Achmad Fanany Onnlita Gaffar d,4
a Faculty of Computer Science, Universitas Muslim Indonesia, Jl. Urip Sumoharjo KM 5, Makassar 90231, Indonesia
b Faculty of Computer Science and Information Technology, Mulawarman University, Jl.
Kuaro No. 1, Samarinda 75123, Indonesia
c Faculty of Computing and Informatics, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Malaysia
d Dept. of Information Technology, Samarinda State Polytechnic, Jl. Dr. Ciptomangunkusumo, Samarinda 75131, Indonesia
1 purnawansyah@gmail.com; 2 haviluddin@gmail.com*; 3 ralfred121@gmail.com; 4 onnygaffar212@gmail.com
* corresponding author

I. Introduction

Remarkably accurate forecasting results are required to support decision making [1, 2]. In this paper, three statistical models, i.e., decomposition, Winter's exponential smoothing, and the autoregressive integrated moving average (ARIMA), were used to forecast daily internet traffic usage, where the traffic data constitute a time series. A time series comprises a series of observations ordered by time; for forecasting, a data series (y1, y2, ..., yn) observed at times (t1, t2, ..., tn) over a particular range is employed [2–4]. The primary factor in choosing a forecasting technique is the identification of the data pattern, with the basic notation: Yt, the time series value during period t; Ŷt, the forecast value of Yt; and et = Yt − Ŷt, the surplus or error in forecasting. A time series comprises (1) trend (T), the tendency of the data to rise or fall; (2) seasonal variation (S), fluctuations that repeat within a year, such as monthly, weekly, and daily patterns; (3) cycles (C), fluctuations spanning more than a year; and (4) a random component (R); the combination of seasonal variation, trend, cycles, and the random factor must be taken into account in a forecasting method [5–7]. This present study aims at comparing forecasting results on time series data using the three statistical methods, i.e., decomposition, Winter's exponential smoothing, and ARIMA. This paper consists of four parts.
The first part explains why the authors were intrigued to conduct this study; the second part presents related theories and techniques for time series forecasting; the third part presents the results of the study; and the fourth part discusses the results and draws a conclusion.

ARTICLE INFO — Article history: received 10 August 2017, revised 12 September 2017, accepted 10 October 2017, published online 8 January 2018.

ABSTRACT — This paper presents an approach for network traffic characterization using statistical techniques: decomposition, Winter's exponential smoothing, and the autoregressive integrated moving average (ARIMA). The decomposition and Winter's exponential smoothing techniques were used with both additive and multiplicative models, while ARIMA followed the Box-Jenkins methodology. The results showed that ARIMA(1,0,2) is the best model for forecasting the internet network traffic. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). Keywords: decomposition; Winter's exponential smoothing; ARIMA; additive; multiplicative.

II. Methods

Numerous statistical methods for forecasting are available in the literature and have been employed by several researchers; these methods have been used widely in finance and demography.
The choice of statistical method is considerably influenced by the time series pattern; hence, the initial step in forecasting is to observe and analyze the type of the data, since every statistical method has a different working phase [8]. Internet traffic data are characterized by seasonal variation, fluctuating periodically. Thus, the three statistical methods considered most applicable for this forecasting task are decomposition, Winter's exponential smoothing, and ARIMA [1–3]. The three methods employed in this study are briefly explained below.

A. Decomposition

The decomposition method comprises two models, additive and multiplicative. The additive model is given by (1):

Yt = Trend + Seasonal + Error    (1)

while the multiplicative model is given by (2):

Yt = Trend × Seasonal × Error    (2)

where Yt is the observation at time t. The basic principle of time series decomposition is to disintegrate the time series data into several patterns, identify each separated component, and model each one separately; the components are then recombined to make a forecast. This disintegration improves forecasting accuracy and yields a better understanding of the time series behavior [4, 7].

B. Winter's Exponential Smoothing

Exponential smoothing is a procedure for continuously revising a forecast in light of the most recent observations; it applies exponentially decreasing weights across the past observed values. Winter's exponential smoothing uses three constants that determine the forecast: α as the level smoothing constant, β for the trend component, and γ for the seasonal component, each with a magnitude between 0 and 1. To generate an accurate forecast, a suitable combination of smoothing constant values must be determined.
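The three smoothing constants drive the standard additive Holt-Winters update recursions for level, trend, and seasonal factors. The sketch below is a plain-Python illustration under a deliberately simple initialization; it is not the software used in the paper.

```python
def holt_winters_additive(y, period, alpha, beta, gamma, horizon=1):
    """Additive Winter's smoothing: level, trend, and seasonal updates, then forecast."""
    # Simple initialization: first observation as level, zero trend,
    # first-period deviations from the period mean as seasonal factors.
    season_mean = sum(y[:period]) / period
    seasonal = [v - season_mean for v in y[:period]]
    level, trend = y[0], 0.0
    for t in range(period, len(y)):
        prev_level = level
        s = seasonal[t % period]
        level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)      # level update (alpha)
        trend = beta * (level - prev_level) + (1 - beta) * trend        # trend update (beta)
        seasonal[t % period] = gamma * (y[t] - level) + (1 - gamma) * s # seasonal update (gamma)
    # h-step-ahead forecast: extrapolated trend plus the matching seasonal factor.
    t_next = len(y)
    return [level + h * trend + seasonal[(t_next + h - 1) % period]
            for h in range(1, horizon + 1)]
```

In practice the constants are chosen (e.g., by grid search) to minimize an error measure such as those described later in the paper.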
Winter's exponential smoothing for time series forecasting comprises two models, i.e., multiplicative and additive. The multiplicative model multiplies the trend component by the seasonal component and is used when the data in a particular season are proportional to the previous season; the formula is (3):

Yt = (b1 + b2·t)·St + εt    (3)

where b1 is the permanent (level) component, b2 is the linear trend component, St is the multiplicative seasonal factor, and εt is the error component. The additive model adds the trend component to the seasonal component and is used when the seasonal differences in the data remain relatively constant in every season, (4):

Yt = (b1 + b2·t) + St + εt    (4)

where b1 is the permanent (level) component, b2 is the linear trend component, St is the additive seasonal factor, and εt is the error component [1, 2, 4].

C. Autoregressive Integrated Moving Average (ARIMA)

The ARIMA method is used to analyze time series consisting of autoregressive (AR) and moving average (MA) components. The model ARIMA(p, d, q)(P, D, Q)s is used on the condition that the time series is stationary, where p is the order of the AR process, d is the order of differencing applied to convert the data into a stationary form, and q is the order of the MA process [1, 5]. In general, a time series may be non-stationary in the mean and the variance. If the time series is not stationary, a transformation process should be carried out for the variance and a differencing process for the mean. For the variance, the rules of transformation are: (1) it applies only to series Zt that are positive; (2) the transformation is done before the differencing process; and (3) the value of λ serving as a standard is selected based on the sum of squared errors (SSE) of the transformation process, where the smallest SSE normally indicates that the transformation has been carried out successfully.
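The differencing step (the d in ARIMA(p, d, q)) that moves a series toward stationarity in the mean can be sketched simply. This is an illustration of the operation, not the tool used in the paper.

```python
def difference(series, lag=1, order=1):
    """Apply order-many rounds of lag-differencing: z[t] = y[t] - y[t-lag].
    One round with lag=1 removes a linear trend in the mean; a seasonal
    lag removes a repeating seasonal level. Each round shortens the series."""
    for _ in range(order):
        series = [series[t] - series[t - lag] for t in range(lag, len(series))]
    return series
```

For example, a series with a linear trend becomes constant after one round of first differencing, and a quadratic trend becomes constant after two rounds.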
Meanwhile, for the mean, the differencing process exposes the specific period between data, Table 1. According to the Box-Jenkins methodology, there are four stages in forecasting with an ARIMA model: (1) identification of models and patterns, which visually inspects the data pattern to be analyzed and checks the validity of the actual data; (2) parameter determination, which can be done using the statistical t-test and p-value; (3) model checking (hypothesis testing and diagnostics), where the widely used test is the Ljung-Box Q statistic to check for white noise with the provision p-value > α = 0.05, and the Kolmogorov-Smirnov test to check for a normal distribution with the provision p-value > α = 0.05; and (4) forecasting, where the results of the ARIMA process are analyzed in three parts, namely the upper limit and the lower limit, which should form a 95% interval, and the forecast values. The finest ARIMA model for forecasting is the model with the smallest error value [1, 2].

D. Dataset Testing

Daily internet traffic usage is the main indicator of telecommunication use in a particular network, and daily traffic data are used by network technicians in controlling and managing the use of the network. In this study, the data used are the daily usage of internet traffic on the network at Mulawarman University, taken from the main server using the Cacti software. These data were collected over the span of 21 to 24 June 2013. Before the forecasting process, the original data are normalized to speed up the computation without eliminating the actual data values [6]. The normalization formula is (5).

X̄ = (X − Xmin) / (Xmax − Xmin)   (5)

in which X̄ is the normalized value, X is the original data value, Xmax is the maximum data value, and Xmin is the minimum data value. Table 2 presents the original data of daily internet traffic usage, while Fig. 1 shows the daily internet traffic plot.
E. Determining the Finest Forecasting Model

The selection of the finest time series method is determined by an indicator measuring the accuracy of the data through a specific method of analysis.

Table 1. ACF and PACF identification
Model          | ACF                  | PACF
AR(p)          | Dies down            | Cuts off after lag p
MA(q)          | Cuts off after lag q | Dies down
ARMA(p,q)      | Dies down            | Dies down
AR(p) or MA(q) | Cuts off after lag q | Cuts off after lag p
Source: [1]

Table 2. Original traffic data on 21–24 June 2013
Date  No.  Time   Inbound + Outbound | Date  No.  Time   Inbound + Outbound
6/21  1    00:00  6293000            | 6/23  97   00:00  10517000
      2    00:30  5185000            |       98   00:30  6715000
      …    …      …                  |       …    …      …
      48   23:30  11661000           |       144  23:30  5236000
6/22  49   00:00  8390000            | 6/24  145  00:00  4528000
      …    …      …                  |       …    …      …
      96   23:30  14530000           |       192  23:30  5969000

In the statistical approach, the indicator determining the finest model is set to a certain measure, among others mean absolute error (MAE) or mean absolute deviation (MAD), mean absolute percentage error (MAPE), mean square error (MSE) or mean square deviation (MSD), root mean square error (RMSE), and mean percentage error (MPE). For the test results, each of these indicators reports an error value for the tested method, so the determination of the finest model is performed by selecting the smallest error value: the forecasting result with the smallest value is the finest model, since it gives test results closest to the actual data [7]–[11]. In this study, the accuracy of forecasting is measured using MAPE, MAD, and MSD, each of which has its own formula. First, the MAPE formula is as follows, (6).

MAPE = (100/n) Σt=1..n |Yt − Y′t| / Yt   (6)

Second, the formula of MAD, (7).

MAD = (Σt=1..n |Yt − Y′t|) / n   (7)

Third, the formula of MSD, (8).
MSD = (Σt=1..n (Yt − Y′t)²) / n   (8)

where Yt is the observation value, Y′t is the forecast value, and n is the number of observations. This present study compares the test results of the predetermined statistical models: decomposition, Winter's exponential smoothing, and ARIMA. Fig. 2 illustrates the flow of the study.

III. Results and Discussion

In this study, observations were made of the daily internet traffic usage (inbound and outbound) at a state university. The data were collected for forecasting in June 2013 over 4 days (21–24 June 2013), amounting to 192 data samples. The data were then analysed and observed using the predetermined statistical methods, comprising decomposition, Winter's exponential smoothing, and ARIMA. These methods were chosen because of the seasonal variation of daily internet traffic usage. SPSS 19 and Minitab 16 were utilized to assist the data analysis.

Fig. 1. Daily usage of internet traffic plot

A. Decomposition Analysis

The first stage is to test the data using the decomposition model. In this study, two decomposition models were used: additive and multiplicative decomposition. The network traffic analysis was done by dividing the dataset into two parts, namely the inbound and outbound data; the data were analysed separately and the analysis results were then re-consolidated. A fairly decent forecasting error rate was obtained by additive decomposition, whose MAPE is 4.69e+01, MAD is 1.65e-01, and MSD is 4.02e-02.

B. Winter's Exponential Smoothing Additive Analysis

The second stage is to test the data using Winter's exponential smoothing models. In this study, both additive and multiplicative Winter's exponential smoothing models were used.
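As an aside (not part of the original paper), the min-max normalization of Eq. (5) and the accuracy measures of equations (6)–(8) can be sketched in Python; the sample numbers are illustrative only.

```python
def min_max_normalize(series):
    """Eq. (5): (x - Xmin) / (Xmax - Xmin) scales each value into [0, 1]."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def mape(actual, forecast):
    """Eq. (6): mean absolute percentage error, (100/n) * sum(|y - y'| / y)."""
    return 100.0 / len(actual) * sum(abs(y - f) / y for y, f in zip(actual, forecast))

def mad(actual, forecast):
    """Eq. (7): mean absolute deviation, sum(|y - y'|) / n."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)

def msd(actual, forecast):
    """Eq. (8): mean squared deviation, sum((y - y')^2) / n."""
    return sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)

def finest_model(msd_by_model):
    """The finest model is the one with the smallest error value."""
    return min(msd_by_model, key=msd_by_model.get)
```

For example, `finest_model({"decomposition": 3.1, "winters": 2.4, "arima": 1.9})` returns `"arima"`; the error values here are made up, not those of Table 3.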
The process of analysis is identical to that for the decomposition model. In this study, the trend and smoothing parameters of the data set were set to 0.2 and 0.1, respectively, to obtain a satisfying forecasting accuracy. A fairly decent forecasting error rate was obtained by Winter's exponential smoothing additive, whose MAPE is 2.35e+01, MAD is 1.89e+06, and MSD is 2.58e+06.

C. ARIMA Analysis

The last stage is to test the data using the ARIMA model. The data testing in the ARIMA phase was done by a stationarity process, converting the data in variance (transformation) and mean (differencing) to obtain the candidate ARIMA models (1,0,0), (1,1,0), (1,1,1), (1,0,1), and (1,2,1). The models were then checked (hypothesis testing and diagnostics) with the Ljung-Box Q statistic to test for white noise, with the provision p-value > α = 0.05, followed by the Kolmogorov-Smirnov test for normal distribution with the provision p-value > α = 0.05. The ARIMA model (1,0,2) then qualified, with upper and lower limits forming a 95% interval. A fair forecasting error rate was obtained with ARIMA, whose MAPE is 2.78e+01, MAD is 2.54e+06, and MSD is 1.89e+06. The forecasting results are illustrated in Figs. 2, 3, and 4, and the comparison of MAPE, MAD, and MSD is exposed in Table 3.

Fig. 2. Research flow: start, input data, then the decomposition model, Winter's exponential smoothing model, and ARIMA model, then forecasting results, compare, end

Table 3. Comparison of the three predetermined statistical models
Model                          | MAPE     | MAD      | MSD
Decomposition
  Additive model               | 4.69e+01 | 1.65e-01 | 4.02e-02
  Multiplicative model         | 5.69e+01 | 1.66e-01 | 4.03e-02
Winter's exponential smoothing
  Additive model               | 2.35e+01 | 1.89e+06 | 2.58e+06
  Multiplicative model         | 2.39e+01 | 2.05e+06 | 3.07e+06
ARIMA (1,0,2)                  | 2.78e+01 | 2.54e+06 | 1.89e+06

Fig. 3. Winter's exponential smoothing additive plot

Fig. 4.
Winter's exponential smoothing additive plot

Fig. 5. ARIMA (1,0,2) plot

IV. Conclusion

This present study utilized the statistical methods of decomposition, Winter's exponential smoothing, and ARIMA to forecast internet traffic usage at Mulawarman University. To assess the forecasting results of the three predetermined models, MAPE, MAD, and MSD were employed. The test results of the three methods confirm that the ARIMA (1,0,2) model has a fair forecasting error rate, with the smallest MSD value of 1.89e+06. This indicates that the accuracy of the ARIMA forecast approaches the actual data. However, the ARIMA model cannot accommodate increases or decreases in the frequency of internet users; in addition, if the data sample is large, the forecasting result becomes constant. Along with the widespread development of computational intelligence, future work will employ forecasting using one of the machine learning methods considered suitable for seasonal-variation time series.

References

[1] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control, 4th ed. Hoboken, NJ: John Wiley & Sons, 2008.
[2] W. W. S. Wei, Time Series Analysis: Univariate and Multivariate Methods, 2nd ed. Pearson Education, 2006.
[3] A. C. F. Santos et al., "Network traffic characterization based on time series analysis and computational intelligence," Journal of Computational Interdisciplinary Sciences, vol. 2, no. 3, pp. 197–205, 2011.
[4] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting, 2nd ed., G. Casella, Ed. New York: Springer-Verlag, 2002.
[5] C. Li and T.-W. Chiang, "Complex neurofuzzy ARIMA forecasting—a new approach using complex fuzzy sets,"
IEEE Transactions on Fuzzy Systems, vol. 21, no. 3, Jun. 2013.
[6] J. Bernacki and G. Kołaczek, "Anomaly detection in network traffic using selected methods of time series analysis," I. J. Computer Network and Information Security, vol. 9, pp. 10–18, 2015.
[7] G. Sermpinis et al., "Forecasting and trading the EUR/USD exchange rate with stochastic neural network combination and time-varying leverage," Decision Support Systems, vol. 54, pp. 316–329, 2012.
[8] M. Khashei and M. Bijari, "An artificial neural network (p, d, q) model for time series forecasting," Expert Systems with Applications, vol. 37, pp. 479–489, 2010.
[9] M. Khashei and M. Bijari, "A new class of hybrid models for time series forecasting," Expert Systems with Applications, vol. 39, pp. 4344–4357, 2012.
[10] G. S. d. S. Gomes and T. B. Ludermir, "Optimization of the weights and asymmetric activation function family of neural network for time series forecasting," Expert Systems with Applications, vol. 40, pp. 6438–6446, 2013.
[11] Haviluddin and R. Alfred, "Forecasting network activities using ARIMA method," Journal of Advances in Computer Networks (JACN), vol. 2, no. 3, pp. 173–179, Sep. 2014.
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602, Vol 3, No 1, July 2020, pp. 11–18, eISSN 2597-4637. https://doi.org/10.17977/um018v3i12020p11-18. ©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Query Rewriting with Thesaurus-Based for Handling Semantic Heterogeneity in Database Integration

I Made Riyan Adi Nugroho a, 1, *, I Wayan Budi Sentana b, 2
a Jurusan Teknik Elektro, Politeknik Negeri Bali, Uluwatu St No.45, Jimbaran, South Kuta, Badung Regency, Bali 80361, Indonesia
b Department of Computing, Macquarie University of Sydney, Balaclava Rd, Macquarie Park NSW 2109, Australia
1 maderiyan@pnb.ac.id *; 2 i-wayan-budi.sentana@hdr.mq.edu.au
* corresponding author

I. Introduction

Integration of data sources is the process of combining two or more data sources so that the data they contain can be accessed simultaneously [1].
In the process of integrating data sources, the data can come from different places or applications; hence, they are potentially heterogeneous in format, structure, syntax, and semantics [2]. Heterogeneity can occur at the schema level or the instance-data level [1]. This paper focuses only on semantic heterogeneity, at both the schema and instance-data levels. Semantic diversity at the schema level is related to name conflicts caused by synonyms, hyponyms, hypernyms, and polysemy. On the other hand, semantic diversity at the instance-data level is associated only with name conflicts caused by synonyms. Research on handling the diversity of data sources has long been conducted, and query rewriting is one of the methods that have been proposed [3]. This method rewrites an original query into a new one by adjusting the concepts or terminology used in each data source [3]. There are several approaches to query rewriting, one of which is ontology-based query rewriting [3][4][5][6][7][8][9]. In this method, an ontology is used as a representation of the schema of each data source [3], and query rewriting with an ontology requires a global ontology as a mediator in identifying the data source schemas [3]. To build a global ontology, a reference ontology is needed to identify the connections between existing concepts [6]. Such a reference is usually specific to a particular problem domain [6] and contains both concepts and relations that refer to a specific standard [6]. The main problem usually seen is that not all problem domains have a reference ontology [6]. In problem domains that have no reference ontology, global ontologies are created based on the developer's knowledge, which potentially produces ambiguity [6]. This paper proposes a query rewriting method that uses a thesaurus to identify the schema of a data source. In this approach, a global schema is not necessary; instead, identification is performed by schema matching using the thesaurus and n-gram similarity.
This process can be seen in Fig. 1.

Article history: received 2 March 2020; revised 23 April 2020; accepted 1 July 2020; published online 17 August 2020.

Abstract: Nowadays, studies on handling semantic heterogeneity are still a challenge for researchers. Several methods have been used to solve these problems, one of which is query rewriting, implemented by rewriting a query into a new one using the selected schema. Semantic query rewriting needs a framework to identify the connections between the data source schemas; these connections are used as a basis for schema selection. Ontology is a model often used in such cases, but the lack of an ontology is a significant problem commonly encountered. Therefore, this paper describes an alternative framework to identify semantic links, assisted by a thesaurus. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: database integration; semantic heterogeneity; query rewriting; thesaurus

This paper consists of several sections. The second section, Method, describes query extraction, schema matching, keyword enrichment, and query generation. Results and Discussion explains the analysis and test results. The last section is the conclusion.

II. Method

The query extraction process is needed to identify the schema of the user query [10]. This process divides the user query into three parts: the domain schema, the property schema, and the keyword [3]. In the relational model, the domain represents the name of a table, the property schema is represented by an attribute name, and the keyword represents a data value. Both the domain and property schemas are processed at the schema matching stage.
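As an illustration (not part of the paper's implementation), the query extraction step can be sketched in Python. The paper does not give the exact input syntax, so a simple space-separated "domain property keyword" form is assumed here.

```python
def extract_query(user_query):
    """Split a user query into the domain schema (table name), property
    schema (attribute name) and keyword (data value) used by later stages."""
    domain, prop, keyword = user_query.split(maxsplit=2)
    return {"domain": domain, "property": prop, "keyword": keyword}

# A query resembling those in the paper's experiments.
parts = extract_query("pasien kelamin pria")
```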
At the same time, the keyword is processed in the keyword enrichment phase. An example of query extraction results can be found in Fig. 2. Schema matching is needed in order to choose the data source most similar to the user query schema. The selection process considers both semantic and syntactic similarity and consists of five stages: schema extraction, get source schema, schema enrichment, string matching, and schema selection. Fig. 3 shows the proposed schema matching process. Schema extraction is the pre-stage of schema matching; its purpose is to extract the schema from each data source. The extracted schema includes the name of the data source, the table names, the table relations, the attribute names, and the attribute data types, and it is stored in a schema repository. Fig. 4 is an example of the schema extraction results. Get source schema is the process of retrieving the data source schemas produced by the schema extraction stage. The obtained schemas are then compared with the schemas generated by the enrichment stage to find their syntactic similarity values; the calculation is performed in the string matching stage. Schema enrichment enriches the user query outline that will be compared with the data sources in the string matching process by adding synonyms, hyponyms, and hypernyms. The purpose of these three relations is to identify data sources that have a semantic correlation. In this paper, synonyms, hyponyms, and hypernyms are identified with a thesaurus; the selected thesaurus is WordNet.

Fig. 1. The proposed query rewriting process

Fig. 2. Sample result of query extraction

The words in WordNet are organized into sets of synonyms (synsets) [11]. Each set is closely related to other synsets through semantic relationships such as synonymy, hyponymy, hypernymy, and antonymy.
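As an aside, the schema enrichment step can be sketched as follows. In the paper, WordNet supplies the relations (e.g., via NLTK); here a tiny hand-made thesaurus stands in so the example is self-contained, and its entries mirror the paper's sample rows.

```python
# A stand-in for WordNet: term -> its semantic relations.
TOY_THESAURUS = {
    "patient": {"synonym": ["sufferer"],
                "hyponym": ["inpatient", "outpatient"],
                "hypernym": ["person", "individual"]},
    "gender": {"synonym": ["sex"],
               "hyponym": ["feminine", "masculine"],
               "hypernym": ["category"]},
}

def enrich_schema(term, thesaurus=TOY_THESAURUS):
    """Return the term together with its synonyms, hyponyms and hypernyms,
    widening the candidate set for the string matching stage."""
    relations = thesaurus.get(term, {})
    enriched = [term]
    for kind in ("synonym", "hyponym", "hypernym"):
        enriched.extend(relations.get(kind, []))
    return enriched
```

With the real WordNet, the same lists would come from the synset a term belongs to and the synsets below and above it in the hierarchy.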
A hierarchy tree can be derived from the synset relations. Fig. 5 describes a synset connection.

Fig. 3. The proposed schema matching process

Fig. 4. Sample result of the schema extraction

Fig. 5. Sample of the synset relation

A synonym can be identified by looking for a similar word located in a common synset. A hyponym can be found by searching for words that stand below it in the hierarchy, and a hypernym by searching for words above it [12]. Fig. 6 shows the proposed schema enrichment process, and sample results of schema enrichment are shown in Table 1. After the data source schemas have been obtained and the user query successfully enriched, the next stage is string matching. String matching is a process that calculates the similarity value between schemas represented as strings [13]. This value is used as the basis for determining which schema will be used in the query. The calculation is carried out between the domain schema and the table names, as well as between the property schema and the attribute names. The string matching technique used in this paper is n-gram similarity, which can be applied to comparisons of multiple strings. With this procedure, the number of common n-grams, i.e., series of n characters shared between the strings, is counted, and the similarity of two strings is computed with the Jaccard coefficient, (1). Fig. 7 is an example of the n-gram similarity calculation [14].

sim(A, B) = |A ∩ B| / |A ∪ B|   (1)

Schema selection is the data source selection process that finds the structure most appropriate to the user query schema. This phase is carried out based on the highest similarity. String matching and schema selection are implemented consecutively, where the calculation and selection

Fig. 6. The proposed schema enrichment process
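As an illustration (not the paper's code), the string matching of Eq. (1), the schema selection, and the subsequent query generation can be sketched in Python; the table and attribute names are taken from the paper's examples, while the function names are assumptions of this sketch.

```python
def ngrams(s, n=2):
    """Character n-grams of a string (bigrams by default)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Eq. (1): Jaccard coefficient |A ∩ B| / |A ∪ B| over n-gram sets."""
    ga, gb = ngrams(a.lower(), n), ngrams(b.lower(), n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def select_schema(enriched_terms, source_names):
    """Schema selection: keep the source name with the highest similarity
    to any enriched form of the user's term."""
    return max(source_names,
               key=lambda s: max(ngram_similarity(q, s) for q in enriched_terms))

def generate_query(table, attribute, keywords):
    """Build the SELECT statement; the WHERE clause ORs the keyword with its
    synonyms so rows stored under any variant are returned. (A production
    system would use parameterized queries instead of string building.)"""
    conds = " OR ".join("{} = '{}'".format(attribute, k) for k in keywords)
    return "SELECT * FROM {} WHERE {}".format(table, conds)

# 'pasien' enriched with its synonym, matched against two source tables.
chosen_table = select_schema(["pasien", "penderita"], ["tbpasien", "tabel_penyakit"])
sql = generate_query(chosen_table, "jeniskelamin", ["pria", "lelaki"])
```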
Table 1. Example of schema enrichment
Schema  | Synonym  | Hyponym               | Hypernym
patient | sufferer | inpatient, outpatient | person, individual
gender  | sex      | feminine, masculine   | category

Fig. 7. Example calculation of the n-gram similarity

are made for the table first. Not only does this order reduce the string matching calculations, it also decreases selection errors caused by homonym conditions. An example of schema selection is presented in Fig. 8. Semantic heterogeneity at the instance-data level occurs due to differences in how entities are saved. This diversity impacts the completeness of the integrated data. The problem is solved by keyword enrichment, a process that adds the synonyms of the keyword. The purpose of this addition is to retrieve information based not only on the keyword that was input but also on its synonyms. Synonym identification is performed with the WordNet thesaurus by recognizing words located in the same synset as the keyword. Fig. 9 shows the proposed keyword enrichment process. Query generation is the process of building a query in accordance with the schema and keywords produced by the schema matching and keyword enrichment processes [15]. In this research, the query is built according to the terminology of the SQL (Structured Query Language) SELECT statement. To show the selected data, the statement must contain the required clauses: SELECT lists the attributes of the table, while FROM names the table itself; WHERE, ORDER BY, GROUP BY, and HAVING are optional clauses that express conditions on the data. The main focus of query development concerns three parts: SELECT, FROM, and WHERE. From the user query perspective, SELECT represents the attribute schema.
FROM represents the domain schema, and WHERE represents the keyword. Fig. 10 is an example of the query generation results.

Fig. 8. Sample results of the schema selection
Fig. 9. The proposed keyword enrichment process

Table 2. Sample result of keyword enrichment
Keyword | Synonym
male    | man, boy
gender  | woman, girl

III. Results and Discussion

To validate the proposal, an SQRE (Semantic Query Rewriting) tool was built and several experiments were performed. SQRE was developed with the CodeIgniter framework (PHP-based) and uses the NLTK library (Python) to access the WordNet API. The experiments were carried out by integrating two databases from different health information systems. Both data sources have semantic diversity at the schema level and the instance-data level. The first database schema is shown in Fig. 11, and the second in Fig. 12. The test was performed by running 5 user queries against the heterogeneity types of the 2 data sources. Table 3 shows the test results. The table shows that this model can handle semantic heterogeneity in database integration, such as 'pria-lelaki', 'pasien-penderita', 'pekerjaan-profesi', and 'kelamin-gender'. However, query 3 and query 5 failed. The failure of query 3 was caused by the matching method being unable to handle a schema consisting of more than one word, such as 'kode penyakit'. In addition, the limitation of the synset data (in query 5), such as 'aktivitas-profesi', left that word connection unidentified.

Fig. 10. Sample results of the query generation process
Fig. 11. Database schema 1

IV. Conclusion

This study has introduced an alternative method to handle semantic heterogeneity in the process of database integration with thesaurus-based query rewriting.
Semantic heterogeneity at the schema level is handled by identifying synonyms, hyponyms, and hypernyms of each user query; the results of this identification are then compared with each data source schema. Semantic heterogeneity at the instance-data level is handled by identifying synonyms of the keywords, which are then used in keyword enrichment. The technique used in this schema comparison is n-gram similarity. The proposed method can be optimized in further research. The sets of synonyms, hyponyms, and hypernyms can be reduced in order to simplify the calculation. Moreover, schema selection can be augmented with analysis of the metadata and instance data of each data source; the schema selection process can combine checks of both the metadata and the instance data of any source schema. This is expected to improve the speed as well as the accuracy of the query rewriting process.

Acknowledgement

This research was supported by Politeknik Negeri Bali and Macquarie University of Sydney. We thank our colleagues from both institutions, and everyone who contributed to the completion of this paper in one way or another. We hope this research proves useful.

Fig. 12. Database schema 2

Table 3. Testing result
Query   | Domain   | Property      | Keyword    | DB  | Matched table  | Matched property | Enriched keywords                 | Status
Query 1 | pasien   | pekerjaan     | wiraswasta | DB1 | tbpasien       | pekerjaan        | wiraswasta, wirausaha             | success
        |          |               |            | DB2 | penderita      | profesi          | wiraswasta, wirausaha             | success
Query 2 | pasien   | kelamin       | pria       | DB1 | tbpasien       | jeniskelamin     | pria, lelaki, jantan              | success
        |          |               |            | DB2 | penderita      | gender           | pria, lelaki, jantan              | success
Query 3 | penyakit | kode penyakit | i10        | DB1 | tbpenyakit     | kodepenyakit     | i10                               | success
        |          |               |            | DB2 | tabel_penyakit | penyakit         | i10                               | fail
Query 4 | pasien   | kota          | bandung    | DB1 | tbpasien       | kota             | bandung, pasang                   | success
        |          |               |            | DB2 | penderita      | kota             | bandung, pasang                   | success
Query 5 | pasien   | aktivitas     | buruh      | DB1 | tbpasien       | pekerjaan        | buruh, karyawan, pegawai, pekerja | success
        |          |               |            | DB2 | penderita      | tanggallahir     | buruh, karyawan, pegawai, pekerja | fail
Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.

References

[1] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems, 3rd ed., 2011.
[2] E. Rahm and H. H. Do, "Data cleaning: problems and current approaches," pp. 49–56, 2000.
[3] A. Aslam, S. Khan, and K. Latif, "Semantic based query rewriting in heterogeneous sources," Proc. 4th IEEE Int. Conf. Emerging Technologies (ICET 2008), pp. 292–297, 2008, doi: 10.1109/ICET.2008.4777517.
[4] Handoko and J. R. Getta, "Query decomposition strategy for integration of semistructured data," ACM Int. Conf. Proceeding Ser., pp. 459–463, 2014, doi: 10.1145/2684200.2684343.
[5] H. Imran and A. Sharan, "Thesaurus and query expansion," Int. J. Comput. Sci. Inf. Technol., vol. 1, no. 2, pp. 89–97, 2009.
[6] K. Ramar and T. T. Mirnalinee, "A semantic web for weather forecasting systems," 2014 Int. Conf. Recent Trends Inf. Technol. (ICRTIT 2014), 2014, doi: 10.1109/ICRTIT.2014.6996127.
[7] F. L. R. Lopes, E. R. Sacramento, and B. F. Lóscio, "Using heterogeneous mappings for rewriting SPARQL queries," Proc. Int. Workshop Database Expert Syst. Appl. (DEXA), pp. 267–271, 2012, doi: 10.1109/DEXA.2012.58.
[8] Thantawi, Wicaksana, and W. Lily, "Query rewriting berbasis semantik menggunakan WordNet dan LCH pada search engine Google," in Konferensi Nasional Sistem Informasi, 2013.
[9] A. Shiri and C.
Revie, "Query expansion behavior within a thesaurus-enhanced search environment: a user-centered evaluation," J. Am. Soc. Inf. Sci. Technol., vol. 57, pp. 462–478, 2006, doi: 10.1002/asi.20319.
[10] H. Jayadianti, C. S. Pinto, L. E. Nugroho, and W. Widayat, "Solving different languages problem (Portuguese, English and Bahasa Indonesia) in digital library with ontology," Proc. 7th ICTS, vol. 7, pp. 197–202, 2013.
[11] Gunawan and A. Saputra, "Building synsets for Indonesian WordNet with monolingual lexical resources," Proc. 2010 Int. Conf. Asian Lang. Process. (IALP 2010), pp. 297–300, 2010, doi: 10.1109/IALP.2010.69.
[12] Gunawan and E. Pranata, "Acquisition of hypernymy-hyponymy relation between nouns for WordNet building," Proc. 2010 Int. Conf. Asian Lang. Process. (IALP 2010), pp. 114–117, 2010, doi: 10.1109/IALP.2010.70.
[13] G. Recchia and M. Louwerse, "A comparison of string similarity measures for toponym matching," COMP 2013 ACM SIGSPATIAL Int. Workshop on Computational Models of Place, pp. 54–61, 2013, doi: 10.1145/2534848.2534850.
[14] N. H. Sulaiman and D. Mohamad, "A Jaccard-based similarity measure for soft sets," IEEE Symp. Humanities, Science and Engineering Research, pp. 634–651, 2012.
[15] J. Wang, Y. Zhang, J. Lu, Z. Miao, and B. Zhou, "Query processing for heterogeneous relational data integration," Int. Conf. Intelligent Computing and Integrated Systems, pp. 777–781, 2010.
https://doi.org/10.1007/978-1-4419-8834-8 http://dc-pubs.dbs.uni-leipzig.de/files/rahm2000datacleaningproblemsand.pdf http://dc-pubs.dbs.uni-leipzig.de/files/rahm2000datacleaningproblemsand.pdf https://doi.org/10.1109/icet.2008.4777517 https://doi.org/10.1109/icet.2008.4777517 https://doi.org/10.1145/2684200.2684343 https://doi.org/10.1145/2684200.2684343 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.9942&rep=rep1&type=pdf http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.9942&rep=rep1&type=pdf https://doi.org/10.1109/icrtit.2014.6996127 https://doi.org/10.1109/icrtit.2014.6996127 https://doi.org/10.1109/dexa.2012.58 https://doi.org/10.1109/dexa.2012.58 http://repository.gunadarma.ac.id/66/ http://repository.gunadarma.ac.id/66/ https://doi.org/10.1002/asi.20319 https://doi.org/10.1002/asi.20319 https://repository.ugm.ac.id/36092/ https://repository.ugm.ac.id/36092/ https://doi.org/10.1109/ialp.2010.69 https://doi.org/10.1109/ialp.2010.69 https://doi.org/10.1109/ialp.2010.70 https://doi.org/10.1109/ialp.2010.70 https://doi.org/10.1145/2534848.2534850 https://doi.org/10.1145/2534848.2534850 https://doi.org/10.1109/shuser.2012.6268901 https://doi.org/10.1109/shuser.2012.6268901 https://doi.org/10.1109/iciss.2010.5657113 https://doi.org/10.1109/iciss.2010.5657113 i. introduction ii. method iii. results and discussions iv. conclusion acknowledgement declarations author contribution funding statement conflict of interest additional information references knowledge engineering and data science (keds) pissn 2597-4602 vol 1, no 1, january 2018, pp. 
33–38, eISSN 2597-4637. https://doi.org/10.17977/um018v1i12017p33-38. ©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

SQL Logic Error Detection Using Start End Mid Algorithm

Jevri Tri Ardiansah a, 1, *, Aji Prasetya Wibawa a, 2, Triyanna Widiyaningtyas a, 3, Okazaki Yasuhisa b, 4
a Department of Electrical Engineering, State University of Malang, Jl. Semarang No.5, Malang 65145, Indonesia
b School of Science and Engineering, Saga University, Honjo-machi, Saga 840-8502, Japan
1 jev.ardian@gmail.com *; 2 aji.prasetya.ft@um.ac.id; 3 triyannaw@gmail.com; 4 okaz@cc.saga-u.ac.jp
* corresponding author

I. Introduction

SQL is an important language since it is used to access databases to extract information and make decisions. There are two types of error that can occur in SQL syntax: syntax errors and logic errors [1]. A syntax error is caused by a mistake in syntax writing, so it is automatically detected by the compiler. In this case, the compiler warns about the error to let users fix their mistakes; through this mechanism, users can easily learn from the given error messages. On the other hand, a logic error occurs when the syntax is correct but does not produce the intended result [2]. This makes logic errors more difficult to learn from, since there is no warning from the compiler about the mistake [3]. Based on observation, SQL is easy to understand in theory, but it is difficult to use SQL syntax in practical implementation [4]. That conclusion is supported by another study of SQL errors: after giving some questions to students and new programmers and letting them answer with SQL syntax, the results show that 32% of participants made a logic error, 8% made a syntax error, and 14% made both syntax and logic errors [5].
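As an aside (not part of the paper), the difference between the two error types can be demonstrated with Python's built-in sqlite3 module; SQLite stands in for MySQL here, and the toy table and queries are invented for illustration.

```python
import sqlite3

# A toy table to query against.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (name TEXT, score INTEGER)")
con.executemany("INSERT INTO student VALUES (?, ?)", [("ana", 80), ("budi", 60)])

# Syntax error: the compiler rejects the misspelled keyword and reports it.
try:
    con.execute("SELEC name FROM student")
    syntax_error_caught = False
except sqlite3.OperationalError:
    syntax_error_caught = True

# Logic error: the task asks for scores above 70, but the predicate is
# reversed; the query compiles and runs, yet returns the wrong rows,
# and the database engine raises no warning at all.
rows = con.execute("SELECT name FROM student WHERE score < 70").fetchall()
```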
based on that research, the probability of a logic error occurring is 46%, larger than that of a syntax error. one way to detect a logic error is to check whether the given task has been solved [6]. in this system, if the compiler detects no syntax error, a result table is produced. the result table can be used to check the task status: if it differs from the expected one, the task is not solved yet, and it can be concluded that the user made a logic error. this approach can be used to show logic error information to users and let them correct their mistakes.

article info — article history: received 27 august 2017; revised 13 september 2017; accepted 3 november 2017; published online 8 january 2018

abstract — a database is an important part of a system; it stores data to be manipulated. sql (structured query language) is used to manipulate those data to extract information and make decisions. two types of error make sql challenging to learn, namely syntax errors and logic errors. a compiler can detect a syntax error, but it shows no warning when a logic error occurs, which makes logic errors more difficult to understand than syntax errors. a web-based sql compiler with error detection ability using the start end mid algorithm is therefore developed, to help database users learn sql in practical implementation. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: sql, logic error, string matching, start end mid algorithm

ii. methods

this system provides cases based on the selection keywords in sql: 36 cases in total that should be answered using sql syntax. users can choose any of the keywords based on what they want to learn. each keyword contains material and a case, so users are able to learn and answer each case using sql syntax. by submitting an sql query to the compiler, the system shows the status of the submitted query and the type of error: no error, syntax error, or logic error. thus the input of this system is an sql query and the output is a warning about which of these occurred; from that warning, users are expected to apply corrections and learn from their mistakes.

the system has three main steps to produce syntax or logic error information for users [7]; those steps are shown in fig. 1. the first step is syntax error checking. the system uses an online sql compiler to detect syntax errors: the query is compiled on the web server by a real database management system, mysql in this system. after the compiling process, any occurred syntax error is shown to warn users about their mistakes.

when there is no syntax error, the system produces a result table based on the submitted query. to check whether the query contains a logic error, the system compares the result table with the key table in the second step. here, the key table is the expected table that should be produced for the given task. if there is a difference between them, the user did not answer the case correctly, which means there is a logic error in the answer query [6]. there are many string matching algorithms based on the brute force algorithm, among them knuth-morris-pratt (kmp), boyer-moore (bm), and karp-rabin [8]–[10].
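a minimal sketch of this logic-error check, assuming the query has already passed syntax checking and tables are represented as lists of row tuples; the names (check_submission, key_table) are illustrative, not the authors' api:

```python
def check_submission(result_table, key_table):
    """classify a query that already passed syntax checking: if its
    result table equals the stored key table the case is solved,
    otherwise the answer contains a logic error."""
    if result_table == key_table:
        return "solved"
    return "logic error"

# hypothetical key table for one case, stored as a list of rows
key_table = [("1", "alice"), ("2", "bob")]

print(check_submission([("1", "alice"), ("2", "bob")], key_table))  # solved
print(check_submission([("1", "alice")], key_table))                # logic error
```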
the start end mid algorithm is also an algorithm developed from the brute force algorithm [11]. it is used here to check the similarity between the result table and the key table: the result table is the table produced by the user's sql, and the key table is the correct table stored in the database. the start end mid algorithm has several advantages in this system. it is simple and easy, because it has no preprocessing phase like other string matching algorithms [12], [13]; those characteristics are very useful in this implementation, considering the system works with small amounts of data. the algorithm checks the first, last, and middle characters before doing sequential checking like brute force. the characters of the result table and the expected table are put into arrays and checked by the algorithm. the steps in similarity checking based on the start end mid algorithm are:

a. compare the first character of the result table and the expected table. if they are similar, go to the next step.
b. compare the last character of the result table and the expected table. if they are similar, go to the next step.
c. compare the middle character of the result table and the expected table. if they are similar, go to the next step.
d. compare the remaining characters from start to end sequentially.

using the steps outlined above, a difference can be detected at any single step. if there is a mismatch, then there is a difference between the result table and the expected table [14], [15], which means the user did not answer the related case correctly; in that case, they made a logic error in the submitted sql syntax. after a logic error is detected, the system warns users about the location of the mistake in their submitted query that contributed to the logic error. by this warning, users are hopefully able to apply corrections and learn from their mistakes. this process compares the submitted sql with the expected sql.
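the four steps above can be sketched as follows, alongside a plain brute-force scan for comparison; the function names and comparison counting are illustrative, not the authors' implementation:

```python
def start_end_mid_equal(a, b):
    """compare two strings with the start end mid algorithm; return
    (equal, number of character comparisons). steps a-c check the first,
    last, and middle characters; step d scans sequentially (for
    simplicity this sketch rechecks the three sentinel positions)."""
    comparisons = 0
    if len(a) != len(b):
        return False, comparisons
    n = len(a)
    for i in ((0, n - 1, n // 2) if n else ()):   # steps a-c
        comparisons += 1
        if a[i] != b[i]:
            return False, comparisons
    for i in range(n):                            # step d
        comparisons += 1
        if a[i] != b[i]:
            return False, comparisons
    return True, comparisons

def brute_force_equal(a, b):
    """plain sequential comparison, used for the loop-count comparison."""
    comparisons = 0
    if len(a) != len(b):
        return False, comparisons
    for x, y in zip(a, b):
        comparisons += 1
        if x != y:
            return False, comparisons
    return True, comparisons

# a difference in the last character is caught on the 2nd comparison by
# start end mid, but only after a full scan by brute force
print(start_end_mid_equal("abcde", "abcdX"))  # (False, 2)
print(brute_force_equal("abcde", "abcdX"))    # (False, 5)
```

this mirrors the paper's later observation that differences near the end of the data are found on the 2nd loop by start end mid.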
the submitted sql is the user's answer to solve the related case, and the expected sql is the key answer stored in the database. the differences between those data indicate the logic error position and are used as a warning to users. because this research material covers the sql selection keywords, the warning is divided into three blocks: a select block, a from block, and a where block. a select block warning means the mistake is in the select clause, i.e., in the selected columns. a from block warning is shown if the mistake occurred near the from clause; a wrongly selected table is an example of this case. a where block warning is shown if the mistake took place in the where clause; this occurs, for instance, when a condition is missing.

fig. 1. the method to produce an error warning: syntax error checking → logic error checking → warning making

this checking method compares the submitted sql and the expected sql sequentially. the idea is to make users correct the first mistake before the rest: for example, if there are logic errors in the select, from, and where blocks, the system shows only the select position. it can be concluded that this system shows the first logic error.

iii. results and discussion

this system predicts users' mistakes regarding sql logic errors, so the system evaluation checks the rates of precision, recall, and accuracy. the evaluation of this prediction system is based on a confusion matrix [5], [16], which compares the predicted value with the real value; the confusion matrix is shown in table 1. there are four types of value produced by the confusion matrix. a true positive value shows that the system gives a correct prediction; in this system's case, it gives a correct logic error warning. a false negative value shows that the system gives an incorrect prediction.
in this research's case, the system gives a "free of logic error" warning when it should give a logic error warning. a false positive is also an incorrect prediction: the system gives a logic error warning when it should report "free of logic error". finally, a true negative value shows a correct result: the system shows that there is no logic error, matching the real one. from those values in the confusion matrix, the precision, recall, and accuracy rates can be calculated [17].

the precision rate compares the number of correct positive predictions with all positive predictions produced by the system, as in (1):

precision = tp / (tp + fp) × 100%  (1)

the recall value compares the number of correct positive predictions with the number of predictions that should be positive, as in (2):

recall = tp / (tp + fn) × 100%  (2)

the accuracy value compares the number of correct predictions with all system trials, as in (3):

accuracy = (tp + tn) / (tp + fn + fp + tn) × 100%  (3)

the system evaluation was conducted by giving input to the system and observing the output. since this system has 36 cases about the selection concept in sql, the evaluation was done with 72 tests: for each case, there was one input free of logic error and one input containing a logic error. for each input, the output was observed and analysed for correctness; the output is the warning predicting the position of the mistake in the user's sql query. syntax error detection was not evaluated, because it is considered a valid tool since it compiles sql directly with the dbms (database management system) on the web server. examples of the evaluation data are shown in table 2, which lists some of the 72 input queries. each query was submitted to the system, and the system produced an error warning predicting the occurred error. the output prediction value column is the value of the system's prediction output.
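as a check, the three formulas can be computed directly from the confusion-matrix counts reported in this evaluation (tp=35, fn=1, fp=5, tn=31); the function names are illustrative:

```python
def precision(tp, fp):
    return tp / (tp + fp) * 100   # eq. (1)

def recall(tp, fn):
    return tp / (tp + fn) * 100   # eq. (2)

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn) * 100   # eq. (3)

# counts from the evaluation's confusion matrix
tp, fn, fp, tn = 35, 1, 5, 31
print(round(precision(tp, fp), 1))          # 87.5
print(round(recall(tp, fn), 1))             # 97.2
print(round(accuracy(tp, fn, fp, tn), 1))   # 91.7
```

the outputs match the percentages reported in the evaluation results.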
there are t (true) and f (false) values. the same kind of values are also in the logical value column, which is the real value based on logical analysis. those values were counted to produce the confusion matrix shown in table 3.

table 1. confusion matrix
                        system prediction value
real value     true                   false
true           true positive (tp)     false negative (fn)
false          false positive (fp)    true negative (tn)

using the values in table 3, the precision, recall, and accuracy values can be calculated with the formulas described before. table 4 shows the percentages of precision, recall, and accuracy. the testing values in table 4 show that the system is able to detect logic errors well: the precision rate of 87.5% means the system is appropriate for detecting logic errors; the recall rate of 97.2% means the system succeeds in logic error detection; and the accuracy rate of 91.7% shows the closeness between the system predictions and the real values.

based on the obtained evaluation results, the system is able to detect logic errors well, although not at 100%. from 72 input queries, there were 6 mistaken predictions, shown in table 5. they are divided into two types of mistake: 5 cases of false negatives and 1 false positive. as discussed earlier, a false negative occurs when the system reports "free of logic error" but a logic error is actually present. based on the evaluation, this happened because the result table was the same as the expected table although the sql was different; in such cases the user's sql syntax has a logic error, but the system uses the result table as the main source for detection, as described before. this happened with the first five cases in table 5. for example, in the first case a logic error occurred in the where clause: the user's sql used <= 101 when it should be <= 100.
since both of those sql syntaxes produce the same result table, the system detects them as free of logic error and therefore does not check the differences between the user's sql and the sql key. the second type of mistake is the false positive. based on the evaluation, the system produced one false positive: case number 6 in table 5. the system detected a logic error, but the user's sql was actually free of logic error. in that case, the user answered "country" instead of "state", so the system showed a warning about a logic error in the select clause. logically, the user should select "country", but the key answer stored in the database was "state". this made the result tables of those sql queries different, so the system flagged a logic error. this type of mistake happened because the admin made a mistake when storing the key answer in the database.

table 2. examples of system evaluation
no  sql query input                          output prediction value  logical value
1   select * from department                 t  t
2   select name from department              f  f
3   select city from department              t  t
4   select city from customer                f  f
5   select distinct country from customer    t  t
6   select country from customer             f  f
7   select * from employee where id=2;       t  t
8   select * from employee where id!=2;      f  f
9   select * from product where stock != 0   t  t
…   …                                        …  …
72  select name, now() from customer         f  f

table 3. evaluation result as a confusion matrix
                        system prediction value
real value     true    false
true           35      1
false          5       31

table 4. evaluation results
testing     value
precision   87.5%
recall      97.2%
accuracy    91.7%

as discussed before, the start end mid algorithm was developed from the brute force algorithm, so this research also compares the two. both are simple algorithms, meaning they have no preprocessing phase. to determine the better algorithm for logic error detection, the number of loops of both algorithms was recorded; those data are shown in table 6.
table 6 shows 50 trials of logic error detection by both the start end mid and the brute force algorithm. in the first trial, 2701 characters in total had to be checked to find the differing data. the checking process was conducted by both algorithms: start end mid found the differing data on the 2nd loop, while brute force found it on the 89th loop, which means start end mid is faster than brute force. in the implementation, however, there was no significant timing difference, since this research compares relatively little data: the compared character counts were below 10,000. the remaining trials gave the same result, with start end mid needing fewer loops than brute force in the logic error detection process. the results show that start end mid was mostly able to find the differing data on the 2nd loop, because the algorithm checks the last character of the data in the 2nd loop, so a difference there is found faster. brute force is different: it checks from the first to the last character sequentially, so if the differing data is in the middle or even at the end, it needs more loops for the checking process. on average, the total was 1132 characters; start end mid found the differing data on the 61st loop on average, while brute force found it on the 479th loop.

table 5. mistake predictions
no  user's sql                                                                   sql key                                                            warning
1   select * from product where stock <= 101                                     select * from product where stock <= 100                           free of logic error
2   select * from product where stock not between 10 and 15                      select * from product where stock not between 10 and 20            free of logic error
3   select id from department order by country desc                              select emp_id from department order by country desc                free of logic error
4   select country from customer union select distinct country from department   select country from customer union select country from department  free of logic error
5   select avg(quantity) from detail_trans where id<5                            select avg(quantity) from detail_trans                             free of logic error
6   select left(country,3) from customer                                         select left(state,3) from customer                                 there is logic error near select clause

table 6. example of looping comparison
trial  total character  start end mid  brute force
1      2701             2              89
2      621              2              195
3      840              2              527
4      2701             71             144
5      3279             2              1388
6      1035             188            984
7      3732             2              424
8      1665             2              221
9      3635             1360           1510
10     2486             2              814
...    ...              ...            ...
50     1020             63             964

iv. conclusion

based on this research, the obtained precision, recall, and accuracy rates show that the system is able to detect logic errors well. however, there are still mistaken predictions, both false negatives and false positives. to avoid false negative predictions, cases that produce unique tables should be created, so that no other sql syntax can produce the same table. besides that, to avoid false positive predictions, the admin should pay close attention to the database design used to store the logic-error-free sql keys. the comparison between the start end mid and brute force algorithms shows that start end mid needs fewer loops than brute force to find the differing character in the data.
it means that the start end mid algorithm is faster than brute force at finding the logic error that occurred in the user's sql query.

references

[1] r. dollinger, "sql lightweight tutoring module – semantic analysis of sql queries based on xml representation and linq," in edmedia: world conference on educational media and technology, 2010.
[2] s. brass and c. goldberg, "semantic errors in sql queries: a quite complete list," j. syst. softw., vol. 79, no. 5, pp. 630–634, 2006.
[3] a. ahadi, v. behbood, a. vihavainen, j. prior, and r. lister, "students' syntactic mistakes in writing seven different types of sql queries and its application to predicting students' success," in proc. 47th acm tech. symp. comput. sci. educ. (sigcse '16), pp. 401–406, 2016.
[4] a. fanani, "pengembangan sumber belajar sql berbasis web untuk matakuliah basis data prodi s1 pendidikan teknik informatika universitas negeri malang," universitas negeri malang, 2014.
[5] e. p. costa, a. c. lorena, a. c. p. l. f. carvalho, and a. a. freitas, "a review of performance evaluation measures for hierarchical classifiers," in aaai-2007 workshop, aaai technical report ws-07-05, 2007.
[6] c. goldberg, "do you know sql? about semantic errors in database queries," in 7th workshop on teaching, learning and assessment in databases, 2009.
[7] j. ardiansah, o. yasuhisa, a. p. wibawa, and t. widyaningtyas, "development and trial use of a web-based database learning system," in jaise (japanese society for information and system in education), 2017.
[8] t. lecroq, "experimental results on string matching algorithms," softw. pract. exp., vol. 25, no. 7, pp. 727–765, 1995.
[9] o. masanori, t. ryo, and s. tadamasa, "an evaluation of string search algorithms at user standing," in proceedings of the 3rd wses international conference on mathematics and computers in mechanical engineering (mcme), 2001, pp. 4231–4236.
[10] t. h. cormen, c. e. leiserson, r. l. rivest, and c. stein, introduction to algorithms, 3rd edition. mit press, 2009.
[11] r. a. abdeen, "an algorithm for string searching based on brute-force algorithm," ijcsns int. j. comput. sci. netw. secur., vol. 11, no. 7, pp. 24–27, 2011.
[12] b. w. watson and r. e. watson, "a boyer–moore-style algorithm for regular expression pattern matching," sci. comput. program., vol. 48, no. 2–3, pp. 99–117.
[13] m. t. goodrich and r. tamassia, algorithm design. wiley, 2002.
[14] l. i. zhulin, "a method for data structure course design based on cdio teaching idea," pp. 418–421.
[15] j. ardiansah, o. yasuhisa, a. p. wibawa, and t. widyaningtyas, "developing of a web-based database learning support system for practical implementation of sql," ieice (institute of electronics, information and communication engineers) tech. rep., vol. 116, no. 266, pp. 57–62, 2016.
[16] t. saito and m. rehmsmeier, "the precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets," plos one, vol. 10, no. 3, 2015.
[17] t. fawcett, "an introduction to roc analysis," pattern recognit. lett., vol. 27, no. 8, pp. 861–874, 2006.
knowledge engineering and data science (keds) pissn 2597-4602 vol 3, no 2, december 2020, pp.
106–111 eissn 2597-4637 https://doi.org/10.17977/um018v3i22020p106-111 ©2020 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

generating javanese stopwords list using k-means clustering algorithm

aji prasetya wibawa a, 1, *, hidayah kariima fithri a, 2, ilham ari elbaith zaeni a, 3, andrew nafalski b, 4 a electrical engineering department, universitas negeri malang, jl semarang 5, malang, east java 65145, indonesia b unisa education futures, school of engineering, university of south australia, sct2-39 mawson lakes campus, adelaide, south australia 5095, australia 1 aji.prasetya.ft@um.ac.id *; 2 hidayah9a20@gmail.com; 3 ilham.ari.ft@um.ac.id; 4 andrew.nafalski@unisa.edu.au * corresponding author

i. introduction

text processing in information retrieval (ir) requires text documents as primary data sources. however, not all words in a text document are used. words that often appear in text documents and carry no meaning are called stopwords [1]; they are stored in a stopword list called a stopword database (corpus) [2][3]. the stopword removal approach depends on this corpus to remove unnecessary words from the text [4]. the formed word list must be in a single language [1][5]. various stopword lists have been developed for popular languages such as english, chinese [6], sanskrit [7], arabic [8], gujarati [9], and indonesian [10]. however, a stopword list for a low-resource language such as javanese is not yet available. javanese is one of the traditional languages in indonesia [11]. the javanese language has levels of politeness, known as unggah-ungguh, namely ngoko, madya, and krama [12][13]. many historical documents, news items, and stories are written in javanese. since the use of javanese tends to become unpopular, retrieving information from this language can be difficult.
the use of stopword removal may ease the ir process on javanese text. despite its benefit, list generation is quite complicated: in general, linguists manually label a substantial corpus, then store and send the results to separate storage. therefore, an alternative way of generating stopword lists is badly needed. this paper explores the use of a clustering approach for creating a stopword list in javanese. the listed words are excluded from the bag of words to speed up the text classification process [14]. the clustering method used is k-means, one of the fast algorithms for big data processing; the method partitions a given data set into a certain number k of clusters [15]. words are assigned to the stopword list by grouping them based on their frequency, and clustering eases the determination of the frequency threshold for words that count as stopwords.

article info — article history: received 1 december 2020; revised 15 december 2020; accepted 29 december 2020; published online 30 december 2020

abstract — stopword removal is necessary in information retrieval: it removes frequently appearing and general words to reduce memory storage. the algorithm eliminates each word that exactly matches a word in the stopword list. however, generating the list can be time-consuming, since the words in a specific language and domain must be collected and validated by specialists. this research aims to develop a new way of generating a stopword list using the k-means clustering method. the proposed approach groups words based on their frequency. a confusion matrix calculates the difference between the findings and a valid stopword list created by a javanese linguist. the accuracy of the proposed method is 78.28% (k=7). the result shows that generating javanese stopword lists using a clustering method is reliable. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).
keywords: stopwords, javanese language, clustering, k-means

ii. materials and methods

the goal of this study is to generate a stopword list from a javanese corpus. the selected javanese level of politeness is ngoko, due to its usage and vocabulary [11][12]. figure 1 shows the four stages of this research.

the first stage is data collection. the dataset was taken from the website ki-demang.com, in the javanese short stories category. the data consist of 106 stories, without considering page numbers and titles. the collected stories are combined into one text document, used as the stopword generation dataset.

the second stage is data preprocessing: case folding, punctuation removal, tokenizing, and filtering. the first preprocessing step, case folding, changes uppercase letters into lowercase letters. punctuation removal deletes punctuation characters and numbers from the dataset. the tokenizing step then splits the dataset into single words; this step produces 17,763 word types with their frequencies. the tokenized words are then filtered: typographical errors, words without meaning, names, and non-ngoko words are removed, leaving 14,384 types. this deletion is based on a javanese–indonesian and indonesian–javanese translation dictionary. table 1 shows examples of deleted words.

the dataset of 14,384 distinct words was submitted to javanese linguists, who grouped it into two classes, namely stopwords and non-stopwords. the general words (conjunctions) considered stopwords number 3,224.
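the case folding, punctuation removal, and tokenizing steps described above can be sketched as follows; the dictionary-based filtering is omitted (the translation dictionary is not reproduced here), and the sample sentence is illustrative, not from the paper's dataset:

```python
import re
from collections import Counter

def preprocess(text):
    """case folding, punctuation/number removal, and tokenizing;
    returns word types with their frequencies."""
    text = text.lower()                       # case folding
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation and digits
    return Counter(text.split())              # tokenize and count

# illustrative ngoko sentence
freqs = preprocess("Aku lunga menyang pasar. Aku tuku endhog!")
print(freqs["aku"])   # 2
```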
the non-stopwords consist of 11,160 specific words: nouns, verbs, and adjectives. table 2 shows examples of the two categories.

the third stage is clustering the 14,384 unique words by their frequency. figure 2 shows the pseudocode of the k-means clustering method [16]. the first k-means stage determines the k value, i.e., the number of clusters; in this study, the values k=3, k=5, k=7, k=9, k=11, k=13, and k=15 are used [17]. the next step calculates the distance between each data point and the centroids using euclidean distance [18]. the results of each case are then recognized as two classes: stopwords and non-stopwords. all words in cluster 0 are labeled non-stopwords, while all words in the other clusters are stopwords. for example, if k=7, each word in clusters 1 to 6 is a stopword, while the rest (cluster 0) are non-stopwords. this first assumption is based on the observation that words with high frequency [19] fall outside cluster 0. table 3 illustrates the frequency distribution of stopwords when k=7: in this case, 680 words are labeled stopwords and 13,704 words are non-stopwords.

fig. 1. research stages: data collection (106 javanese stories) → data preprocessing (case folding, removal of punctuation, tokenizing, filtering) → data clustering (k-means) → stopword evaluation (confusion matrix)

table 1. examples of deleted words
typographical errors   words without meaning   names      non-ngoko words
lungaa                 lha                     ezza       wontening
rilaaaaa               we                      sukartiah  inbox
ómongan                lur                     yono       meresahkan
ewosemono              aaaaaaaa                inah       mengganggu
senaosa                loh                     sumantri   pusaraning
banjarpetambakan       dhuk                    laras      out
sesambhungane          ugh                     yani       awalnya
ampuunn                sttt                    irvan      berbincang

the fourth stage is evaluation, which aims to test the performance of the proposed method, using the opinion of experts as a reference. a confusion matrix is applied to calculate accuracy and precision [20].
at this stage, all cases are tested to decide the best stopword set based on the k-means clustering technique. accuracy is obtained by dividing the number of correctly classified words by all words [21]; a value is true when the clustering result has the same class as the reference. precision, on the other hand, is the ratio of true positives (tp) to the total of true positives and false positives (fp) [21]. tp means the clustering result is a stopword and matches the reference; fp means the predicted result is a stopword while the reference is a non-stopword.

table 2. example of linguists' classified words
stopwords   non-stopwords
aku         artane
ana         birahine
apa         cungkup
dadi        dhialog
iki         endhog
ing         garwamu
kang        jaitan
sing        karak
wae         langgananku
yen         macak

table 3. stopwords and non-stopwords when k=7
cluster  frequency distribution  number of stopwords  number of non-stopwords
0        1–25                    0                    13704
1        2000–3000               3
2        650–1050                13
3        290–600                 28
4        1100–1600               5
5        26–100                  531
6        105–290                 100
total                            680                  13704

fig. 2. k-means clustering algorithm
input: d = {d1, d2, …, dn}, the data used; k ∈ {2, 3, 4, …, n}, the desired number of clusters.
output: a set of k clusters.
step 1: randomly select k centroids from d as the initial cluster centers;
step 2: assign each item to the cluster whose center is closest, then calculate the new mean of each cluster;
step 3: repeat step 2 until the cluster centroids no longer change or the maximum number of iterations is reached.

iii. results and discussions

table 4 shows the performance of the stopword list generation using the k-means algorithm. the accuracy and precision represent the method's performance, comparing the result with the javanese linguists' manual classification. in table 4, the highest accuracy is 78.2%, with 57.3% precision, obtained with k = 7.
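the pseudocode in fig. 2 can be sketched as a one-dimensional k-means over word frequencies; the frequencies below are illustrative, not the paper's dataset:

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """k-means over scalar word frequencies, following fig. 2: pick k
    random centroids, assign points to the nearest centroid, recompute
    means, and repeat until stable (or iters is reached)."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)          # step 1: initial centers
    for _ in range(iters):                     # step 3: repeat
        clusters = [[] for _ in range(k)]
        for v in values:                       # step 2: nearest center
            i = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[i].append(v)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                   # centroids unchanged: done
            break
        centroids = new
    return centroids, clusters

# illustrative frequencies: many rare words, a few very frequent ones
freqs = [1, 2, 2, 3, 40, 45, 500, 520]
centroids, clusters = kmeans_1d(freqs, k=3)
```

euclidean distance reduces to absolute difference in one dimension, which is why `abs(v - centroid)` suffices here.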
the result consists of 680 stopwords and 13,704 non-stopwords, while the experts identify 3,224 and 11,160 words in the same categories. the clustering correctly labels 11,030 of the 14,384 words, dominated by the non-stopwords category. figure 3 shows the distribution of words under the first assumption, that the first cluster holds the non-stopwords. as seen in figure 3, the experts recognize most words as non-stopwords; k-means wrongly places some non-stopwords in the stopwords category (the area within the grey line). the precision of 57.3% corresponds to the orange and grey areas, which means that many stopwords are categorized as non-stopwords. the lowest performance occurs when k=5: the accuracy is 25% and the precision 21.7%, with only 3,089 true stopwords and 65 true non-stopwords.

the second assumption is then applied for comparison: the first cluster is taken as the stopwords, while the rest are non-stopwords. table 5 shows the result.

table 4. stopword generator performance with the first assumption
k    stopwords  non-stopwords  accuracy  precision
3    49         14335          77.9%     100.0%
5    14184      200            21.9%     21.7%
7    680        13704          78.2%     57.3%
9    13281      1103           25.0%     21.5%
11   1500       12884          75.6%     40.6%
13   2145       12239          73.4%     36.1%
15   1750       12634          74.9%     39.2%

table 5. stopword generator performance with the second assumption
k    stopwords  non-stopwords  accuracy  precision
3    14335      49             22.07%    22.1%
5    200        14184          78.07%    67.5%
7    13704      680            21.7%     20.68%
9    1103       13281          74.9%     32.6%
11   12884      1500           24.3%     20.2%
13   12239      2145           26.5%     20.0%
15   12634      1750           25.02%    20.08%

fig. 3. words distribution based on the first assumption

the best performance in table 5 is at k = 5, where the accuracy is 78.07% and the precision 67.5%. this case yields 135 true stopwords and 11,095 true non-stopwords; the precision of 67.5% corresponds to 135 of the 200 predicted stopwords.
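the figures reported for the best second-assumption case can be reproduced from the counts in the text (135 true stopwords of 200 predicted; 11,095 true non-stopwords; 14,384 words in total):

```python
# counts reported for the second assumption at k=5
tp, predicted_stopwords = 135, 200    # true stopwords / predicted stopwords
tn, total_words = 11095, 14384        # true non-stopwords / all words

precision = tp / predicted_stopwords * 100
accuracy = (tp + tn) / total_words * 100
print(round(precision, 1))   # 67.5
print(round(accuracy, 2))    # 78.07
```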
the accuracy of both scenarios (table 4 and table 5) is similar. however, the precision of the best scenario in table 5 (k = 5) is higher than the best in table 4 (k = 3). this means that the second assumption is more promising than the first in recognizing stopwords. therefore, k-means locates the stopwords in the first cluster, while the non-stopwords are in the other clusters.

iv. conclusion

k-means is applicable for javanese stopword list generation. the algorithm indicates that the stopwords are located in the first cluster of the word list. however, the current promising result can still be improved. further research should consider the balance of the frequency distribution and the implementation of word stemming in the preprocessing. the use of more training data may balance the frequency, while stemming may merge unique word forms and unite the occurrences of their inflected variants.

declarations

author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest. the authors declare no conflict of interest.

additional information. no additional information is available for this paper.

references

[1] r. t. lo, b. he, and i. ounis, "automatically building a stopword list for an information retrieval system," j. digit. inf. manag., vol. 3, no. 1, pp. 3–8, 2005.
[2] j. kaur, "a systematic review on stopword removal algorithms," int. j. future revolut. comp. sci. comm. eng., vol. 4, no. 4, pp. 207–210, 2018.
[3] j. kaur and p. k. buttar, "stopwords removal and its algorithms based on different methods," international journal of advanced research in computer science, vol. 9, no. 5, pp. 81–88, oct. 2018.
[4] s. vijayarani, m. j. ilamathi, and m. nithya, "preprocessing techniques for text mining: an overview," int. j. of comp. sci. comm.
net., vol. 5, no. 1, pp. 7–16, 2015.
[5] l. dolamic and j. savoy, "when stopword lists make the difference," j. am. soc. inf. sci. technol., vol. 61, no. 1, pp. 200–203, 2010.
[6] f. zou, f. l. wang, x. deng, s. han, and l. s. wang, "automatic construction of chinese stop word list," proc. 5th wseas int. conf. appl. comp. sci., pp. 1010–1015, 2006.
[7] j. k. raulji, "stop-word removal algorithm and its implementation for sanskrit language," int. j. comp. applica., vol. 150, no. 2, pp. 15–17, 2016.
[8] r. m. duwairi, "arabic sentiment analysis using supervised classification," int. conf. futur. internet things cloud, pp. 579–583, 2014.
[9] r. m. rakholia and j. r. saini, "a rule-based approach to identify stop words for gujarati language," proc. 5th int. conf. front. intell. comput. theory appl., adv. intell. syst. comput., p. 515, 2017.
[10] m. c. kirana, n. p. perkasa, m. z. lubis, and m. fani, "visualisasi kualitas penyebaran informasi gempa bumi di indonesia menggunakan twitter" [visualization of the quality of earthquake information dissemination in indonesia using twitter], journal of applied informatics and computing, vol. 3, no. 1, pp. 23–32, 2019.
[11] a. p. wibawa, a. nafalski, j. tweedale, n. murray, and a. e. kadarisman, "hybrid machine translation for javanese speech levels," proc. 5th int. conf. knowl. smart technol., pp. 64–69, 2013.
[12] s. poedjosoedarmo, "javanese speech levels," indonesia, vol. 6, no. 6, pp. 54–81, 1968.
[13] a. p. wibawa, a. nafalski, a. e. kadarisman, and w. f. mahmudy, "indonesian-to-javanese machine translation," int. j. innov. manag. tech., vol. 4, no. 4, pp. 451–454, 2013.
[14] s. v. s. gunasekara and p. s. haddela, "context aware stopwords for sinhala text classification," 2018 natl. inf. technol. conf., pp. 1–6, 2018.
[15] t. m. kodinariya, "review on determining number of cluster in k-means clustering," international journal of advance research in computer science and management studies, vol. 1, no. 6, pp. 90–95, 2013.
[16] k. a. a. nazeer and m. p. sebastian, "improving the accuracy and efficiency of the k-means clustering algorithm," proceedings of the world congress on engineering 2009, vol. i, pp. 1–5, 2009.
[17] d. t. pham, s. s. dimov, and c. d. nguyen, "selection of k in k-means clustering," proc. inst. mech. eng. part c: j. mech.
eng. sci., vol. 219, no. 1, pp. 103–119, 2005.
[18] f. leisch, "a toolbox for k-centroids cluster analysis," computational statistics & data analysis, vol. 51, no. 2, pp. 526–544, 2006.
[19] n. grozavu, y. bennani, and m. lebbah, "from variable weighting to cluster characterization in topographic unsupervised learning," proc. int. jt. conf. neural networks, pp. 1005–1010, 2009.
[20] v. m. patro and m. r. patra, "augmenting weighted average with confusion matrix to enhance classification accuracy," transactions on machine learning and artificial intelligence, vol. 2, no. 4, pp. 77–91, 2014.
[21] a. mishra and s. vishwakarma, "analysis of tf-idf model and its variant for document retrieval," int. conf. comput. intell. commun. networks anal., pp. 772–776, 2015.

knowledge engineering and data science (keds) pissn 2597-4602 vol 3, no 1, july 2020, pp.
19–27 eissn 2597-4637 https://doi.org/10.17977/um018v3i12020p1-27 ©2020 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

human intestinal condition identification based on blended spatial and morphological feature using artificial neural network classifier

ummi athiyah a, 1, *, arif wirawan muhammad b, c, 2, ahmad azhari d, 3
a department of data science, institut teknologi telkom purwokerto, jl di pandjaitan 128 karangreja, banyumas 53147, indonesia
b department of informatics, institut teknologi telkom purwokerto, jl di pandjaitan 128 karangreja, banyumas 53147, indonesia
c fakulti sains komputer dan teknologi maklumat, universiti tun hussein onn malaysia, jl delta 6 parit raja, johor 86400, malaysia
d department of informatics, universitas ahmad dahlan, jl ringroad selatan, tamanan, banguntapan, bantul, yogyakarta 55166, indonesia
1 ummi@ittelkom-pwt.ac.id *; 2 arif@ittelkom-pwt.ac.id; 3 ahmad.azhari@tif.uad.ac.id
* corresponding author

i. introduction

colorectal cancer is a type of cancer that attacks the cells of the human intestinal wall. the wisconsin reporting system (wrs) states that colorectal cancer is the third-highest cause of death after lung cancer and breast cancer, with deaths amounting to 9.5% of the total world population [1]. in indonesia, colorectal cancer is ranked as the third cause of death after breast cancer and cervical cancer. therefore, abnormalities in the human intestinal wall need to be identified early to minimize the growth of more virulent cancerous cells, which can cause death. health screening, such as endoscopy, is a simple step to detect abnormal growths in the cells of the human intestinal wall [2][3].
also, the process of early detection plays an essential role in helping health practitioners/gynecologists determine the prognosis and the type of treatment that patients must receive [4]. the right prognosis, accompanied by the right dosage of drugs, helps speed the recovery of patients from colorectal cancer. the endoscopic screening technique is a common step carried out by health experts/gynecologists to determine the condition of the human intestine by inserting a camera through the rectum to obtain a picture of the intestine [5]. the result of endoscopic screening is a digital image of the area around the intestinal wall [6]. to date, the interpretation of endoscopic images has been carried out manually by the naked eye, and thus requires quite a long time to interpret and produce results.

article info

article history: received 15 june 2020; revised 23 june 2020; accepted 30 june 2020; published online 17 august 2020

abstract: colon cancer is a type of disease that attacks the cells of the human intestinal wall. the colorectal endoscopic screening technique is a common step carried out by health experts/gynecologists to determine the condition of the human intestine. manual interpretation requires quite a long time to reach a result. along with the development of increasingly advanced digital computing techniques, some of the weaknesses of the manual endoscopic image interpretation model can be corrected by automating the detection of the presence or absence of cancerous cells in the gut. identification of human intestinal conditions using an artificial neural network with the blended input feature produces a higher accuracy value compared to the artificial neural network with the non-blended input feature. the difference in classifier performance between the two is quite significant: 0.065 (6.5%) for accuracy, 0.074 (7.4%) for recall, 0.05 (5.0%) for precision, and 0.063 (6.3%) for f-measure.
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: intestinal condition; blended spatial morphological feature; neural-network classifier

manual interpretation is the cause of the long duration of time needed by patients to find out the results of endoscopic screening [7]. this image interpretation also has a weakness: it depends on the gynecologist's expertise and experience [8]. along with the development of increasingly advanced digital computing techniques, some of the weaknesses of the manual endoscopic image interpretation model mentioned previously can be corrected by automating the detection of the presence or absence of cancer cells in the gut, utilizing digital image processing techniques supported by machine learning methods. automating the detection process can speed up the production of results and minimize the errors arising from the manual analysis of endoscopic image interpretation.

ii. method

the identification of the intestinal condition to find a cancerous colon condition in this study was divided into several steps, presented in fig. 1. the stages of fig. 1 are explained in the following subsections. the first step, retrieving the dataset, is to get a colorectal endoscopic dataset. the colorectal endoscopy dataset consists of 200 image files generated from the colorectal endoscopic screening process, in .png format with vga resolution (640 × 480 pixels). the images in the dataset are divided into two categories: (1) endoscopic images of normal intestinal wall conditions and (2) endoscopic images of abnormal intestinal wall conditions that are the origin of cancerous conditions.
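loading such a two-category image dataset might look like the sketch below; the directory names (dataset/normal, dataset/abnormal) are assumptions for illustration, not the paper's actual layout.

```python
from pathlib import Path

# hypothetical layout (an assumption, not the paper's actual structure):
#   dataset/normal/*.png    -- normal intestinal wall images
#   dataset/abnormal/*.png  -- abnormal (pre-cancerous) images
def load_dataset(root="dataset"):
    """collect (path, label) pairs; decoding is deferred to preprocessing."""
    samples = []
    for label in ("normal", "abnormal"):
        for png in sorted(Path(root, label).glob("*.png")):
            samples.append((png, label))
    return samples
```

keeping the label in the directory name means no separate annotation file is needed, which suits a small, fixed dataset like the 200-image collection described here.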
example images of normal intestinal wall conditions can be seen in fig. 2.

fig. 1. colon cancer identification approach

fig. 2. endoscopic imagery of normal intestinal wall conditions

examples of endoscopic images of abnormal intestinal wall conditions can be seen in fig. 3. the overall endoscopic dataset was obtained from the internal medicine laboratory of dr. sardjito hospital, yogyakarta, indonesia, under the supervision of dr. putut bayupurnama, sp.pd. preprocessing aims to ensure that the original image is ready for further processing at the feature extraction stage. preprocessing also plays a vital role in avoiding bias in the output of a machine learning classifier. preprocessing in this study is divided into several stages:

1. convert the rgb image to grayscale. at this stage, the rgb channel images are converted into grayscale using the formula presented in (1), taking a red contribution of 30%, green of 59%, and blue of 11% [2]. the grayscale result is later used to extract spatial image features in the form of circularity, aspect ratio, triangularity, and co-occurrence matrix values.

gray = 0.30 r + 0.59 g + 0.11 b   (1)

where r indicates the value of the red channel, g the green channel, and b the blue channel. an example of the result of converting an rgb endoscopic image under normal conditions to grayscale is presented in fig. 4.

2. image resize. the image size reduction speeds up image processing and reduces the computational burden by changing the pixel size of the original image from 640 × 480 pixels to 320 × 240, keeping the .png format [9].

3. histogram normalization. at this stage, image histogram normalization is carried out, which aims to equalize the brightness and contrast patterns of the image and the distribution of pixel intensities [10]. histogram normalization is carried out using the formula presented in (2).

fig. 3.
endoscopic imagery of abnormal intestinal wall conditions

fig. 4. converted image from rgb to grayscale

equation (2) ensures that the interval of gray image values [0-255] is mapped into the range of values 0 to 1 only [11].

inew(x, y) = Σ (i = 0 to i(x,y)) p(i), with p(i) = ni / n   (2)

where inew is the pixel value after the normalization mapping from the range [0-255] to the range [0-1]; ni is the number of pixels in the image i(x, y) with gray level (i); n is the total number of pixels; and p(i) represents the pixel probability at gray level (i).

feature extraction from the endoscopic images is carried out to retrieve relevant information from each image so that it can be used as input to the machine learning classifier. in this study, the relevant information extracted from the endoscopic imagery includes morphological information and spatial information. the morphological information is taken based on the circularity measure, which relates the pixel area to the pixel perimeter of the region of interest (roi). the circularity formula is presented in (3) [6].

c = 4π a / p^2   (3)

where a is the pixel area and p is the pixel perimeter of the roi. the spatial feature information is extracted based on the co-occurrence matrix, which produces the following information:

1. energy. energy is a measure of pixel conformity in an image and reflects the degree of texture smoothness: the lower the energy value, the rougher the surface texture of the image, and vice versa [12]. the calculation of the energy value is presented in (4).

energy = Σi Σj p(i, j)^2   (4)

2. contrast. the contrast value is a simple comparison between foreground objects and the image background; it is a measure of local image variation [12][13]. the calculation of the contrast value is presented in (5).

contrast = Σi Σj (i - j)^2 p(i, j)   (5)

3. correlation. correlation is a gray-level linearity value of two or more adjacent pixels in an image [14]. the calculation of the correlation value is presented in (6).

correlation = Σi Σj ((i - μi)(j - μj) p(i, j)) / (σi σj)   (6)

4.
homogeneity. homogeneity is a value based on the distance between elements in the co-occurrence matrix of the gray image [11]. the calculation of the homogeneity value is presented in (7).

homogeneity = Σi Σj p(i, j) / (1 + (i - j)^2)   (7)

where p(i, j) denotes the elements of the co-occurrence matrix; μi and μj express the mean values, and σi and σj the standard deviations, of row i and column j.

the machine learning model built in this study is an artificial neural network that utilizes the backpropagation function. the architecture of the artificial neural network is presented in table 1. at the input layer, 4 or 5 neurons are used, following the number of features extracted from the colonic endoscopy images. the hidden (intermediate) layer uses 9 or 11 neurons, following the rule stated by [12][15] that a hidden layer of 2n+1 neurons (where n is the number of input neurons) can accelerate the training process and the generalization of the neural network. at the output layer, the binary-shot coding concept is used, where 1 represents the cancerous image condition and 0 represents the normal image condition [16].

the classification results are validated using accuracy, recall, precision, and f-measure. these parameters are obtained from the true positive (tp), true negative (tn), false positive (fp), and false negative (fn) metrics [17].

1. tp is the condition when machine learning identifies an input as "a" and the ground truth is also "a".
2. tn is the condition when machine learning identifies an input as "non-a" and the ground truth is also "non-a".
3. fp is the condition when machine learning identifies an input as "a" while the ground truth is "non-a".
4. fn is the condition when machine learning identifies an input as "non-a" while the ground truth is "a".

from the indicators mentioned above, the equations stating the performance measures are presented in (8) until (11) [18].
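these four indicators can be computed directly from the tp, tn, fp, and fn counts. the sketch below plugs in the blended-feature confusion counts reported in table 5 and reproduces the blended column of table 7:

```python
def classifier_metrics(tp, tn, fp, fn):
    """accuracy, recall, precision, and f-measure per equations (8)-(11)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * (precision * recall) / (precision + recall)
    return accuracy, recall, precision, f_measure

# blended-feature confusion counts from table 5: tp=94, fp=6, fn=8, tn=92
acc, rec, prec, f1 = classifier_metrics(tp=94, tn=92, fp=6, fn=8)
# → accuracy 0.930, recall ≈ 0.921, precision 0.940, f-measure ≈ 0.93 (table 7)
```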
accuracy = (tp + tn) / (tp + tn + fp + fn)   (8)

recall = tp / (tp + fn)   (9)

precision = tp / (tp + fp)   (10)

f-measure = 2 × (precision × recall) / (precision + recall)   (11)

iii. results and discussions

the matlab 2015r programming platform running on the windows 10 64-bit operating system was used as the experimental base for this research. the endoscopic images used in this study total 200 images in two categories: 100 endoscopic images of the cancer category and 100 endoscopic images of the normal category. the dataset is divided into three parts to avoid bias in the results of the artificial neural network training: a training dataset (70%), a testing dataset (15%), and a validation dataset (15%). the default matlab 2015r function (dividerand) is used as the dataset divider. feature extraction produces two kinds of features: morphological and spatial. the morphological features produce information on circularity values, while the spatial features produce information on energy, contrast, correlation, and homogeneity values. the values of the spatial and morphological features are used as the input of the artificial neural network classifier. some morphological and spatial extraction values are presented in table 2. although there are many algorithm choices available for the artificial neural network training process [19][20], this research uses the quasi-newtonian (matlab: trainlm) algorithm, because it can produce an optimal artificial neural network learning process and achieves generalization of output values faster than training algorithms such as scaled-conjugate or resilient-propagation [21]. the parameters of the artificial neural network (ann) training process are presented in table 3. training and testing are carried out in the matlab r2016a environment running on a windows 10 (64-bit) operating system with an intel core i5 4310 processor, 8 gb memory, and an intel hd vga card. for simplification purposes, only selected results are presented. table 1.
the architecture of the artificial neural network

  layer a: input, 4 neurons (non-blended, spatial feature only) or 5 neurons (blended, spatial & morphological)
  layer b: intermediate (hidden), 9 or 11 neurons respectively (using the 2n+1 rule, where n is the number of input features), logsig activation
  layer c: output, 2 neurons (using binary-shot coding), purelin activation

the results of the artificial neural network training process with the 5-(11)-2 architecture and the blended (spatial and morphological) input feature are presented in fig. 5. from fig. 5 it can be seen that the training process does not experience overfitting. the absence of overfitting is indicated by the blue (train), green (validation), and red (test) lines decreasing simultaneously without intersecting each other. a summary of the metrics for the artificial neural network training results is presented in table 4. a summary of the confusion matrix of the artificial neural network classifier for the blended (spatial and morphological) input feature is presented in table 5, and for the non-blended (spatial only) input feature in table 6. the artificial neural network training process with the blended (spatial and morphological) input feature produces a regression value of 0.97232, presented in fig. 6, whereas the training process with the non-blended (spatial only) input feature produces a regression value of 0.89151. table 2.
spatial and morphological feature extraction result

  image condition   energy    contrast   correlation   homogeneity   circularity
  normal image-1    0.18948   0.96020    0.20885       0.93779       0.95982
  normal image-2    0.18523   0.96077    0.17188       0.93618       0.93752
  normal image-3    0.19485   0.97017    0.21369       0.94112       0.85313
  normal image-4    0.20669   0.95677    0.17129       0.93349       0.77344
  normal image-5    0.16762   0.96966    0.17314       0.94374       0.90534
  polyp image-1     0.29893   0.94862    0.15432       0.91132       0.73008
  polyp image-2     0.20979   0.96249    0.21411       0.94192       0.91683
  polyp image-3     0.26171   0.96698    0.15399       0.92710       0.79852
  polyp image-4     0.24069   0.95111    0.18719       0.92481       0.66581
  polyp image-5     0.29893   0.94862    0.15432       0.91132       0.82152

table 3. ann training set parameters

  no   parameter              value
  1    epoch                  25,000
  2    performance function   mse (mean squared error)
  3    goal                   0.01
  4    max. fail              6 (matlab default)
  5    min. gradient          1.00e-07
  6    µ                      1.00e10

fig. 5. artificial neural network training result for blended feature

based on table 5 and table 6, the performance of the artificial neural network classifiers with blended and non-blended input features can be evaluated using the indicators of accuracy, recall, precision, and f-measure. the values of these performance indicators, derived from table 5 and table 6, are presented in table 7. for ease of comparison, the contents of table 7 are also presented in fig. 7.

table 4. summary of artificial neural network training

  no   parameter              original set               after training
  1    epoch                  25,000                     14,218
  2    performance function   mse (mean squared error)   mse
  3    goal                   0.01                       0.00969
  4    max. fail              6 (matlab default)         2
  5    min. gradient          1.00e-07                   1.00e-07
  6    µ                      1.00e10                    1.00e10

table 5. confusion matrix for blended (spatial and morphological) input feature

  image condition   detected as cancer   detected as normal   total
  cancer            94 (tp)              6 (fp)               100
  normal            8 (fn)               92 (tn)              100
  total                                                       200

table 6.
confusion matrix for non-blended (spatial only) input feature

  image condition   detected as cancer   detected as normal   total
  cancer            89 (tp)              11 (fp)              100
  normal            16 (fn)              84 (tn)              100
  total                                                       200

fig. 6. artificial neural network training regression result (blended input feature)

iv. conclusion

identification of human intestinal conditions using an artificial neural network with the blended input feature produces a higher accuracy value compared to the artificial neural network with the non-blended input feature. the difference in classifier performance between the two is quite significant: 0.065 (6.5%) for accuracy, 0.074 (7.4%) for recall, 0.05 (5.0%) for precision, and 0.063 (6.3%) for f-measure. it can therefore be concluded that the use of blended features as neural network inputs substantially influences the results of identifying the condition of the human intestine.

acknowledgment

many thanks to dr. putut bayupurnama, sp.pd, from the internal medicine laboratory of dr. sardjito hospital, yogyakarta, indonesia, who kindly provided the colorectal endoscopic image data that served as the basis of this research.

declarations

author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest. the authors declare no conflict of interest.

additional information. no additional information is available for this paper.

table 7. artificial neural network classifier performance indicators

  no   perf. indicator   blended         non-blended     gap
  1    accuracy          0.930 (93.0%)   0.865 (86.5%)   0.065 (6.5%)
  2    recall            0.921 (92.1%)   0.847 (84.7%)   0.074 (7.4%)
  3    precision         0.940 (94.0%)   0.890 (89.0%)   0.050 (5.0%)
  4
   f-measure         0.930 (93.0%)   0.867 (86.7%)   0.063 (6.3%)

fig. 7. comparison graph

references

[1] n. sengar, n. mishra, m. k. dutta, j. prinosil, and r. burget, "grading of colorectal cancer using histology images," 2016 39th int. conf. telecommun. signal process. (tsp), pp. 529–532, 2016, doi: 10.1109/tsp.2016.7760936.
[2] a. ratheesh, p. soman, m. revathy nair, r. g. devika, and r. p. aneesh, "advanced algorithm for polyp detection using depth segmentation in colon endoscopy," 2016 int. conf. commun. syst. networks (comnet), pp. 179–183, 2017, doi: 10.1109/csn.2016.7824010.
[3] u. athiyah, i. muhimmah, and e. marfianti, "ekstraksi ciri polip dan pendarahan berdasarkan citra endoskopi kolorektal" [polyp and bleeding feature extraction based on colorectal endoscopic imagery], j. inform. j. pengemb. it, vol. 3, no. 1, pp. 81–85, 2018.
[4] y. shin, h. a. qadir, and i. balasingham, "abnormal colon polyp image synthesis using conditional adversarial networks for improved detection performance," ieee access, vol. 6, pp. 56007–56017, 2018, doi: 10.1109/access.2018.2872717.
[5] x. wei, j. xie, w. he, m. min, z. ma, and j. guo, "quantitative comparisons of linked color imaging and white-light colonoscopy for colorectal polyp analysis," proc. 2018 6th ieee int. conf. netw. infrastruct. digit. content (ic-nidc), pp. 140–144, 2018, doi: 10.1109/icnidc.2018.8525753.
[6] g. tarik, a. khalid, k. jamal, and d. a. benajah, "polyps's region of interest detection in colonoscopy images by using clustering segmentation and region growing," colloq. inf. sci. technol. (cist), pp. 455–459, 2017, doi: 10.1109/cist.2016.7805090.
[7] o. bardhi, d. sierra-sosa, b. garcia-zapirain, and a. elmaghraby, "automatic colon polyp detection using convolutional encoder-decoder model," 2017 ieee int. symp. signal process. inf. technol. (isspit), pp. 445–448, 2018, doi: 10.1109/isspit.2017.8388684.
[8] q.
li et al., “colorectal polyp segmentation using a fully convolutional neural network,” proc. 2017 10th int. congr. image signal process. biomed. eng. informatics, cisp-bmei 2017, vol. 2018-janua, pp. 1–5, 2018, doi: 10.1109/cisp-bmei.2017.8301980. [9] i. o. petre and c. buiu, “a colon cancer microarray analysis technique,” 2017 e-health bioeng. conf. ehb 2017, pp. 265–268, 2017, doi: 10.1109/ehb.2017.7995412. [10] n. tajbakhsh, s. r. gurudu, and j. liang, “automated polyp detection in colonoscopy videos using shape and context information,” ieee trans. med. imaging, vol. 35, no. 2, pp. 630–644, 2016, doi: 10.1109/tmi.2015.2487997. [11] y. hu et al., “texture feature extraction and analysis for polyp differentiation via computed tomography colonography,” ieee trans. med. imaging, vol. 35, no. 6, pp. 1522–1531, 2016, doi: 10.1109/tmi.2016.2518958. [12] a. w. muhammad, g. w. sasmito, and i. riadi, “colorectal polyp detection using feedforward neural network with image feature selection,” proceeding 2018 int. symp. adv. intell. informatics revolutionize intell. informatics spectr. humanit. sain 2018, pp. 26–31, 2019, doi: 10.1109/sain.2018.8673371. [13] s. dutta, p. sasmal, m. k. bhuyan, and y. iwahori, “automatic segmentation of polyps in endoscopic image using level-set formulation,” 2018 int. conf. wirel. commun. signal process. networking, wispnet 2018, pp. 1–5, 2018, doi: 10.1109/wispnet.2018.8538615. [14] j. qu, n. hiruta, k. terai, h. nosato, m. murakawa, and h. sakanashi, “gastric pathology image classification using stepwise fine-tuning for deep neural networks,” j. healthc. eng., vol. 2018, 2018, doi: 10.1155/2018/8961781. [15] a. k. palit and d. popovic, computational intelligence in time series forecasting, advances in industrial control, 2005. [16] y. h. hu and j.-n. hwang, “handbook of neural network signal processing.” crc press, london, united kingdom, 2002. [17] l. c. jain, “recent advances in artificial neural networks,” recent adv. artif. 
neural networks, 2018, doi: 10.1201/9781351076210. [18] a. geron, hands-on machine learing with scikit-learn & tensor flow. o’reilly media, 2017. [19] i. riadi, sunardi, and a. w. muhammad, “ddos detection using artificial neural network regarding variation of training function,” adv. sci. lett., vol. 24, no. 12, pp. 9163–9167, 2018, doi: 10.1166/asl.2018.12117. [20] i. riadi, a. wirawan, and s. -, “network packet classification using neural network based on training function and hidden layer neuron number variation,” int. j. adv. comput. sci. appl., vol. 8, no. 6, pp. 248–252, 2017, doi: 10.14569/ijacsa.2017.080631. [21] a. azhari, a. w. muhammad, and c. f. m. foozy, “machine learning-based distributed denial of service attack detection on intrusion detection system regarding to feature selection,” international journal of artificial intelligence research, vol. 4, no. 1, feb. 2020. https://doi.org/10.1109/tsp.2016.7760936 https://doi.org/10.1109/tsp.2016.7760936 https://doi.org/10.1109/csn.2016.7824010 https://doi.org/10.1109/csn.2016.7824010 https://doi.org/10.1109/csn.2016.7824010 https://ejournal.poltektegal.ac.id/index.php/informatika/article/view/704 https://ejournal.poltektegal.ac.id/index.php/informatika/article/view/704 https://doi.org/10.1109/access.2018.2872717 https://doi.org/10.1109/access.2018.2872717 https://doi.org/10.1109/access.2018.2872717 https://doi.org/10.1109/icnidc.2018.8525753 https://doi.org/10.1109/icnidc.2018.8525753 https://doi.org/10.1109/icnidc.2018.8525753 https://doi.org/10.1109/cist.2016.7805090 https://doi.org/10.1109/cist.2016.7805090 https://doi.org/10.1109/cist.2016.7805090 https://doi.org/10.1109/isspit.2017.8388684 https://doi.org/10.1109/isspit.2017.8388684 https://doi.org/10.1109/isspit.2017.8388684 https://doi.org/10.1109/cisp-bmei.2017.8301980 https://doi.org/10.1109/cisp-bmei.2017.8301980 https://doi.org/10.1109/cisp-bmei.2017.8301980 https://doi.org/10.1109/ehb.2017.7995412 https://doi.org/10.1109/ehb.2017.7995412 
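The indicator values reported for the non-blended feature in Table 7 follow directly from the confusion-matrix counts above; a minimal sketch (the function name is illustrative, not from the paper):

```python
def classifier_metrics(tp, fp, fn, tn):
    """Derive the four performance indicators from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)              # sensitivity to the cancer class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_measure

# Counts from the non-blended (spatial-only) confusion matrix
acc, rec, prec, f1 = classifier_metrics(tp=89, fp=11, fn=16, tn=84)
# acc = 0.865 and prec = 0.890 exactly; rec and f1 agree with the
# reported 0.847 and 0.867 to within rounding
```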
knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp.
138–144, eISSN 2597-4637, https://doi.org/10.17977/um018v4i22021p138-144. ©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a Sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Education, Culture, Research, and Technology.

CNN Based Face Recognition System for Patients with Down and William Syndrome

Endang Setyati a,1,*, Suharyono Az a,2, Subroto Prasetya Hudiono b,3, Fachrul Kurniawan c,4
a Pascasarjana Teknologi Informasi, Institut Sains dan Teknologi Terpadu Surabaya, Jl. Ngagel Jaya Tengah 73-77, Surabaya, Indonesia
b School of Media Science, Tokyo University of Technology, 192-0914 Tokyo, Hachioji, Katakuramachi 1404-1, Japan
c Teknik Informatika, UIN Maulana Malik Ibrahim, Jl. Gajayana No. 50, Malang 65144, Indonesia
1 endang@stts.edu*; 2 suharyono@gmail.com; 3 subrotoph@gmail.com; 4 fachrulk@ti.uin-malang.ac.id
* corresponding author

I. Introduction

Down syndrome, also known as trisomy 21, is a genetic disorder that affects many people. It is caused by an additional copy of chromosome 21 [1]. The extra chromosome increases the amount of particular proteins, interfering with the body's natural growth, and can also alter brain development. These abnormalities can lead to developmental delays, learning impairments, heart problems, and blood cancer. Race, country, religion, and socioeconomic level have no bearing on this illness [2][3]. Williams syndrome is a hereditary disorder that can affect anyone at birth. It is characterized by medical and cognitive issues, such as cardiovascular illness, developmental delays, and learning impairments [4], accompanied by exceptional verbal abilities, a gregarious attitude, and a passion for music.
Williams syndrome is a neurogenetic condition that has been extensively researched. Williams syndrome, also known as Williams-Beuren syndrome or infantile hypercalcemia, is a multisystem neurodevelopmental condition that causes intellectual impairment [5][6][7].

Article Info

Article history: Submitted 16 November 2021; Revised 19 December 2021; Accepted 27 December 2021; Published online 31 December 2021.

Abstract

Down syndrome, also known as trisomy 21, is a genetic disorder that affects many people. Williams syndrome is a hereditary disorder that can affect anyone at birth. It is marked by medical and cognitive issues, such as cardiovascular illness, developmental delays, and learning impairments, accompanied by exceptional verbal abilities, a gregarious attitude, and a passion for music. Down syndrome and Williams syndrome are both genetic illnesses; however, they can be distinguished by the arrangement of chromosome 21. Down syndrome and Williams syndrome can also be identified from faces, for example by observing particular facial features. Therefore, this research develops convolutional neural network (CNN) architectures to recognize Down syndrome and Williams syndrome using a facial recognition approach. A total of 480 facial photos were used in the study, with 390 images for training and 90 for testing. The identification classes are divided into three categories: Down syndrome, William syndrome, and normal, with 160 photos in each class. This research presents two CNN architectures using grayscale images of 256×256 pixels. The first CNN architecture comprises 12 layers, while the second comprises 15 layers. Averaged over six training and testing runs, the 12-layer architecture reached 91% accuracy and the 15-layer architecture 89%, so the first architecture has the higher accuracy.

This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: convolutional neural network; Down syndrome; face recognition; William syndrome

E. Setyati et al. / Knowledge Engineering and Data Science 2021, 4 (2): 138–144

Williams syndrome is a rare genetic illness that manifests in various symptoms and learning difficulties [8]. The heart, blood vessels, kidneys, and other body organs can all be affected in children with Williams syndrome. In addition, people with Williams syndrome have distinct facial traits, such as a different nose, mouth, and facial features [9][10][11][12]. Children with Williams syndrome have a flat nasal bridge, a short nose with a large tip, a large mouth with full lips, a small chin, small and spaced teeth, missing or crooked teeth, uneven eyes, creases covering the corners of the eyes, and a white starburst pattern surrounding the iris, the colored area of the eye [13][14]. Down syndrome, in turn, has physical characteristics such as a smaller head than usual with a flat area at the nape of the neck, a larger fontanelle that closes more slowly (at an average age of two years), slanted eyes whose inner corners fold toward the center, and a small mouth with a long tongue, giving it a projecting appearance [15][16]. Physical characteristics of people with Down syndrome include a flat face, a small head and ears, a short neckline, a large tongue, upward-slanting eyes, poor muscle tone, and short stature [17].
Furthermore, the mental characteristics of Down syndrome include a short attention span, impulsive behavior, sluggish learning, and delayed language and speech development [18]. Cognitive problems, such as intellectual and developmental delays, learning disabilities, and speech disorders, are characteristic of Down syndrome. Down syndrome impairs the hippocampus, which is essential for memory and learning [19]. People with Down syndrome are more likely to have the following health problems: thyroid disease, leukemia, obesity, chronic constipation, sleep apnea, poor vision, cataracts, strabismus, anemia, congenital heart defects, and hearing loss [20][21]. Down syndrome and Williams syndrome can be diagnosed by analyzing the face, for example by picking up particular facial traits [22][23]. Down syndrome and Williams syndrome are both genetic illnesses; however, they can be precisely distinguished by the arrangement of chromosome 21, since Down syndrome involves an additional chromosome 21. Williams syndrome can manifest in many body parts, including the face, heart, and other organs, and can also affect a child's learning abilities [24]. Facial landmarks can be used to identify people with Down syndrome automatically from a non-standard snapshot of a patient's face. Facial landmarks are fitted with a local model, and geometric features are then extracted from the anatomical landmarks along with texture features from binary patterns. After feature selection, multiple classifiers are used to distinguish between Down syndrome and normal cases; the accuracy of the SVM classification with the RBF kernel using texture features was 94.6% [20]. Other studies have found that facial characteristics can identify people with Down syndrome. Down syndrome affects one out of every 1000 newborns worldwide, and in the recent decade the number of people diagnosed with Down syndrome has increased.
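The RBF kernel at the heart of the SVM classifier cited above [20] is a Gaussian similarity between feature vectors; a minimal sketch (the gamma value and sample vectors are illustrative assumptions, not taken from the cited study):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical feature vectors have similarity 1; distant ones approach 0
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # 1.0
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # exp(-0.5 * 25), close to 0
```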
It has been noted that people with Down syndrome exhibit various facial traits. One face feature-based approach for detecting patients with Down syndrome uses a deep convolutional neural network to extract and merge deep representations of different face areas; a random forest-based pipeline then classifies the merged representations. This model was evaluated on a dataset of over 800 people with Down syndrome and recognized 98.47% of them [19]. Previous studies have shown that face characteristics and landmark features can distinguish Down syndrome from normal. Therefore, this research designs a system that can recognize Down syndrome and Williams syndrome using a facial recognition approach and a deep learning algorithm, the convolutional neural network. Previous research has addressed similar cases; however, this study distinguishes three categories: Down syndrome, William syndrome, and normal. In addition, the CNN network design used in this study differs from earlier studies.

II. Method

In this study, three categories of data were used: Down syndrome, William syndrome, and normal. The input dataset, as well as the test data, were retrieved from the web. The Down syndrome, William syndrome, and normal classifications were created for this research. The training and testing data storage for the convolutional neural network (CNN) technique is illustrated in Figure 1.

This system detects Down syndrome, William syndrome, and normal using the CNN technique. As shown in Figure 2, the stages of CNN identification for these three classifications are as follows:
1. The RGB picture dataset is processed, transformed to grayscale, and size-equalized to a 256×256 input size.
2. After processing, each class is organized into a folder, subsequently collected into a train folder, as shown in Figure 1.
3. CNN training covers all photos in each class folder and the train folder. Several distinct architectures are used during the training phase; Figure 2 shows the CNN architectural model.
4. The fourth step is testing the data.

The CNN network design, as indicated in Figure 2, consists of the following layers: convolutional layer, activation layer, max pooling layer, and fully connected layer. In terms of functionality, a convolutional neural network is similar to an artificial neural network [25]. Neurons in a CNN have weights, biases, and activation functions. A CNN has a convolution layer, a pooling layer, an activation layer, and a fully connected layer. Figure 2 and Figure 3 present the CNN network architecture. The next process is training and testing. This research employs two alternative architectures: Figure 3 shows the first architecture, while Figure 4 shows the second.

The system to be constructed is a CNN algorithm-based identification system that classifies Down syndrome, William syndrome, and normal. The total number of images used in this investigation was 480, in grayscale at a resolution of 256×256. Table 1 shows the complete data distribution, separated into two categories: training and testing.

Table 2 shows the image dataset, which comprises three sorts of photographs: Down syndrome, William syndrome, and normal facial images. A grayscale image is used as the training and testing input. All photos are equalized in size to 256×256 pixels. The images are initially processed by conversion to grayscale and size equalization.

Fig. 1. Image data storage for CNN algorithm
Fig. 2. CNN model for syndrome identification
Fig. 3. CNN first architecture
Fig. 4. CNN second architecture

III.
Results and Discussions

An RGB image is used, transformed to grayscale, and equalized to 256×256 pixels. Two alternative architectures are used in the CNN training and testing procedure: Figure 3 illustrates the first architecture, and Figure 4 shows the second. The first architecture uses 20 epochs for training (Figure 5) and 50 epochs for testing (Figure 6); the same epochs are also applied in the second architecture. The first layer in Figure 4 is the input layer, whose input image is [70 80 1], a grayscale image with a resolution of 256×256 pixels. Each convolution layer has a number of nodes or filters; for example, the convolution layer in Figure 4 has 32 nodes with a 3×3 filter size. Figure 4 further indicates that the pooling layers differ from one another in the number of nodes, yet both have a 2×2 filter size. The pooling layers reduce the image size from 256×256 to 35×40. CNN training, testing, and identification of the test data were run six times. Table 3 shows the test results and the average accuracy of the two CNN architectural models; in general, the 12-layer network's accuracy outperforms the 15-layer network's.

Table 1. Dataset distribution

No   Classification     Training   Testing
1    Down syndrome      130        30
2    William syndrome   130        30
3    Normal             130        30
     Total              390        90

Table 2. Example of image dataset (three sample face photos each for the Down syndrome, William syndrome, and normal classes; images not reproduced here)

Table 3. CNN accuracy per run (fraction correct)

Run   12 layers   15 layers
1     0.94        0.88
2     0.92        0.91
3     0.86        0.89
4     0.91        0.89
5     0.94        0.88
6     0.92        0.91

IV. Conclusion

Two CNN architectures are proposed in this study. The first CNN architecture, with 12 layers, was trained and evaluated six times and reached an average accuracy of 91%, while the second architecture, with 15 layers, averaged 89%. The first CNN architecture therefore has the highest accuracy and provides effective identification.
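The averages quoted in the conclusion can be checked against the six per-run accuracies in Table 3; a quick sketch (note the exact 12-layer mean comes to 0.915, quoted as 91% in the text, and the 15-layer mean to about 0.893):

```python
# Per-run test accuracies from Table 3
runs_12 = [0.94, 0.92, 0.86, 0.91, 0.94, 0.92]   # 12-layer architecture
runs_15 = [0.88, 0.91, 0.89, 0.89, 0.88, 0.91]   # 15-layer architecture

mean_12 = sum(runs_12) / len(runs_12)   # 0.915
mean_15 = sum(runs_15) / len(runs_15)   # about 0.893
```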
Future research will identify Down syndrome, William syndrome, and normal using CNN transfer learning.

Fig. 5. 20 epochs for training the first architecture
Fig. 6. 50 epochs for testing the first architecture

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] P. N. Alexandrov, M. E. Percy, and W. J. Lukiw, "Chromosome 21-encoded microRNAs (mRNAs): impact on Down's syndrome and trisomy-21 linked disease," Cell. Mol. Neurobiol., vol. 38, no. 3, pp. 769–774, Jul. 2017.
[2] V. Dima, A. Ignat, and C. Rusu, "Identifying Down syndrome cases by combined use of face recognition methods," in Advances in Intelligent Systems and Computing, 2018, pp. 472–482.
[3] J. Y. R. Cornejo, H. Pedrini, A. Machado-Lima, and F. de L. dos S. Nunes, "Down syndrome detection based on facial features using a geometric descriptor," J. Med. Imaging, vol. 4, no. 04, p. 1, Dec. 2017.
[4] R. P. Thom, B. R. Pober, and C. J. McDougle, "Psychopharmacology of Williams syndrome: safety, tolerability, and effectiveness," Expert Opin. Drug Saf., vol. 20, no. 3, pp. 293–306, Mar. 2021.
[5] M. Lugo et al., "Social, neurodevelopmental, endocrine, and head size differences associated with atypical deletions in Williams–Beuren syndrome," Am. J. Med. Genet. Part A, vol. 182, no. 5, pp. 1008–1020, May 2020.
[6] S. Kaya, K. Orhan, and F. Tulga Öz, "Williams-Beuren syndrome: a case report," Cumhur. Dent. J., pp. 481–485, Dec. 2019.
[7] C. G. del Cole, S. C. Caetano, W. Ribeiro, A. M. E. E. Kümmer, and A. P. Jackowski, "Adolescent adaptive behavior profiles in Williams–Beuren syndrome, Down syndrome, and autism spectrum disorder," Child Adolesc. Psychiatry Ment. Health, vol. 11, no. 1, p. 40, Dec. 2017.
[8] H. Liu et al., "Automatic facial recognition of Williams-Beuren syndrome based on deep convolutional neural networks," Front. Pediatr., vol. 9, pp. 1–7, 2021.
[9] D. Dimitriou, H. C. Leonard, A. Karmiloff-Smith, M. H. Johnson, and M. S. C. Thomas, "Atypical development of configural face recognition in children with autism, Down syndrome and Williams syndrome," J. Intellect. Disabil. Res., vol. 59, no. 5, pp. 422–438, May 2015.
[10] A. Santos, C. Silva, D. Rosset, and C. Deruelle, "Just another face in the crowd: evidence for decreased detection of angry faces in children with Williams syndrome," Neuropsychologia, vol. 48, no. 4, pp. 1071–1078, Mar. 2010.
[11] N. Shalev, A. Steele, A. C. Nobre, A. Karmiloff-Smith, K. Cornish, and G. Scerif, "Dynamic sustained attention markers differentiate atypical development: the case of Williams syndrome and Down's syndrome," Neuropsychologia, vol. 132, p. 107148, Sep. 2019.
[12] C. Ji, D. Yao, M. Li, W. Chen, S. Lin, and Z. Zhao, "A study on facial features of children with Williams syndrome in China based on three-dimensional anthropometric measurement technology," Am. J. Med. Genet. Part A, vol. 182, no. 9, pp. 2102–2109, Sep. 2020.
[13] Vincy Devi V. K. and Rajesh R., "A study on Down syndrome detection based on artificial neural network in ultra sonogram images," in 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), Mar. 2016, pp. 204–209.
[14] P. Shukla, T. Gupta, A. Saini, P. Singh, and R. Balasubramanian, "A deep learning frame-work for recognizing developmental disorders," in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2017, pp. 705–714.
[15] Ş. Saraydemir, N. Taşpınar, O. Eroğul, H. Kayserili, and N. Dinçkan, "Down syndrome diagnosis based on Gabor wavelet transform," J. Med. Syst., vol. 36, no. 5, pp. 3205–3213, Oct. 2012.
[16] R. Al-Shawaf and W. Al-Faleh, "Craniofacial characteristics in Saudi Down's syndrome," King Saud Univ. J. Dent. Sci., vol. 2, no. 1–2, pp. 17–22, Jul. 2011.
[17] S. O. Wajuihian, "Down syndrome: an overview," African Vis. Eye Heal., vol. 75, no. 1, Mar. 2016.
[18] J. D. Santoro et al., "Neurologic complications of Down syndrome: a systematic review," J. Neurol., vol. 268, no. 12, pp. 4495–4509, Dec. 2021.
[19] A. Mittal, H. Gaur, and M. Mishra, "Detection of Down syndrome using deep facial recognition," in Advances in Intelligent Systems and Computing, 2020, pp. 119–130.
[20] Q. Zhao, K. Rosenbaum, R. Sze, D. Zand, M. Summar, and M. G. Linguraru, "Down syndrome detection from facial photographs using machine learning techniques," in Medical Imaging 2013: Computer-Aided Diagnosis, Feb. 2013, p. 867003.
[21] Q. Zhao et al., "Automated Down syndrome detection using facial photographs," in 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jul. 2013, pp. 3670–3673.
[22] S. M. Tabatabaei and A. Chalechale, "Using DLBP texture descriptors and SVM for Down syndrome recognition," in 2014 4th International Conference on Computer and Knowledge Engineering (ICCKE), Oct. 2014, pp. 554–558.
[23] W. Song et al., "Multiple facial image features-based recognition for the automatic diagnosis of Turner syndrome," Comput. Ind., vol. 100, pp. 85–95, Sep. 2018.
[24] J. Grieco, M. Pulsifer, K. Seligsohn, B. Skotko, and A. Schwartz, "Down syndrome: cognitive and behavioral functioning across the lifespan," Am. J. Med. Genet. Part C Semin. Med. Genet., vol. 169, no. 2, pp. 135–149, Jun. 2015.
[25] A. Qayyum, S. M. Anwar, M. Awais, and M. Majid, "Medical image retrieval using deep convolutional neural network," Neurocomputing, vol. 266, pp. 8–20, Nov. 2017.
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 4, No 1, July 2021, pp. 14–28, eISSN 2597-4637, https://doi.org/10.17977/um018v4i12021p14-28. ©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a Sinta 2 journal (https://sinta.ristekbrin.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Research & Technology.

Backpropagation Neural Network with Combination of Activation Functions for Inbound Traffic Prediction

Purnawansyah a,1,*, Haviluddin b,2, Herdianti Darwis a,3, Huzain Azis a,4, Yulita Salim a,5
a Department of Informatics, Universitas Muslim Indonesia, Jl. Urip Sumoharjo km 5, Makassar, 90231 Indonesia
b Department of Informatics, Mulawarman University, Jl. Kuaro No. 1, Samarinda, 75123 Indonesia
1 purnawansyah@umi.ac.id*; 2 haviluddin@gmail.com; 3 herdianti.darwis@umi.ac.id; 4 huzain.azis@umi.ac.id; 5 yulita.salim@umi.ac.id
* corresponding author

I. Introduction

Numerous studies have been conducted on traffic measurement, covering traffic patterns, volumes, applications, and user activity characteristics [1][2].
predicting network traffic is one of the crucial tasks in network management, informing admission and congestion control, anomaly detection, and bandwidth allocation to achieve superior quality of service and cost reduction. traffic can be described in two dimensions: when users actively engage with the internet and how much capacity they consume, both of which can be represented in a single series of traffic data. a time series records the values of a given measurement at successive points in time; when the data are expressed as \(y_1, y_2, \dots, y_n\), \(y_i\) denotes the value measured at time \(i\) [3]. for time series forecasting, conventional statistical methods are widely used, such as arima and its variants, decomposition, winter's exponential smoothing [4], the hidden markov model [5], and threshold autoregressive (tar) models [6]. in machine learning, the neural network has been widely developed to deal with network data prediction [7][8][9][10]. furthermore, hybrid methods have been extensively studied, particularly for time series forecasting, such as hybrids of neural networks and arima [11][12] and a hybrid of hmm and multilayer perceptron [13]. the backpropagation neural network itself is a multilayer perceptron algorithm that has been widely studied for forecasting and classifying various cases, including modeling ten risk factors of traffic accidents among elderly female and male drivers in the west midlands of the uk [14] and estimating nuclear accident source terms [15].

article info
article history: received 26 march 2021; revised 12 april 2021; accepted 9 june 2021; published online 17 august 2021

abstract
predicting network traffic is crucial for preventing congestion and gaining superior quality of network services.
this research aims to use backpropagation to predict the inbound traffic level in order to understand and determine internet usage. the architecture consists of one input layer, two hidden layers, and one output layer. the study compares three activation functions: sigmoid, rectified linear unit (relu), and hyperbolic tangent (tanh). three learning rates, 0.1, 0.5, and 0.9, represent low, moderate, and high rates, respectively. based on the results, in terms of a single activation function, although sigmoid provides the lowest rmse and mse values, the relu function is superior in learning the high traffic pattern with a learning rate of 0.9. in addition, relu is more powerful when used in the first order of a combination. hence, combining a high learning rate with pure relu, relu-sigmoid, or relu-tanh is more suitable and recommended for predicting upper traffic utilization. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: backpropagation; combination of activation functions; forecasting; inbound traffic; relu-sigmoid; relu-tanh

the architecture of the backpropagation neural network remains a preferred topic in neural network research, including optimization of the number of hidden layers [16][17] and, in sensor research, backpropagation modeling based on a genetic algorithm that specifies the number of hidden-layer neurons [18]. despite its slow training, a backpropagation neural network is easy to use and to design according to the input characteristics, whether the inputs are univariate or multivariate [19].
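for a univariate series such as inbound traffic, the input to such a network is commonly built by sliding a fixed-width window over the series. the paper does not state its windowing scheme, so the construction below (window width, variable names, toy values) is an illustrative assumption, not the authors' preprocessing:

```python
import numpy as np

def sliding_window(series, width):
    """Build supervised pairs from a univariate series y_1..y_n:
    each input row is [y_i, ..., y_{i+width-1}], the target is y_{i+width}."""
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    t = np.array(series[width:])
    return X, t

# toy daily inbound-traffic values, already normalized to [0, 1]
series = [0.10, 0.12, 0.30, 0.25, 0.40, 0.38, 0.55]
X, t = sliding_window(series, width=3)
print(X.shape, t.shape)  # (4, 3) (4,)
print(X[0], t[0])        # first window predicts the 4th value, 0.25
```

each row of `X` then feeds the input layer, and the corresponding entry of `t` is the value the network is trained to predict.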
thus, backpropagation is proposed to predict inbound traffic in order to understand and determine internet usage through the network. three different activation functions are implemented, i.e., the sigmoid, relu, and tanh functions. they are implemented both in single form and in combination, making up nine permutation models for optimizing the weights between layers.

ii. methodology
a. backpropagation neural network
the neural network is a reliable nonlinear technique for modeling a wide range of applications owing to its architectural flexibility. a neural network architecture may consist of two or more layers. the neural network applied in this study is backward propagation of errors, or backpropagation, a supervised learning algorithm for multilayer perceptrons. backpropagation is simply a gradient descent method that optimizes the weights connecting adjacent layers among the input layer, hidden layer(s), and output layer. with optimized weights, the error between the observed data and the prediction can be minimized. figure 1 shows a neural network with two hidden layers; this study uses this architecture since two hidden layers are usually superior to one [20]. the architecture uses dense networks, which means that each unit (node) in a layer is densely connected with all units in the neighboring layers. each connection is associated with a weight \(w_{ij}\) reflecting the strength of the connection between the units. given the inputs \(x_1, x_2, \dots, x_n\), the hidden unit value \(h_j\) is determined by applying an activation function to a weighted sum of all inputs plus a bias, as written in (1), while the output unit \(y_i\) is defined by (2) [21].

\( \mathrm{net}(h_j \mid x) = f\left( \sum_{i=1}^{n} w_{ij}\, x_i + b \right) \)  (1)

\( \mathrm{net}(y_i \mid h) = f\left( \sum_{j=1}^{m} w_{ij}\, h_j + b \right) \)  (2)

in a backpropagation neural network, the signal first propagates forward from the input layer to the output layer through the hidden layers.
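equations (1) and (2) amount to repeated weighted sums followed by an activation; a minimal numpy sketch of this forward pass is shown below. the layer sizes and random weights are illustrative assumptions, not the paper's trained configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Dense forward pass: each layer computes f(W h + b),
    i.e. f(sum_i w_ij * x_i + b) as in equations (1) and (2)."""
    h = x
    for w, b in zip(weights, biases):
        h = activation(w @ h + b)
    return h

rng = np.random.default_rng(0)
# illustrative sizes: 3 inputs -> two hidden layers of 5 units -> 1 output
sizes = [3, 5, 5, 1]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = forward(np.array([0.2, 0.5, 0.1]), weights, biases)
print(y)  # a single prediction in (0, 1)
```

training then adjusts `weights` and `biases` by propagating the error backward, as described next.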
after that, the error is calculated and moves in the opposite direction, from the output layer to the input layer through the hidden layers. after the iterative training process, the neural network attains the optimal weights and thresholds that reduce the error to the desired level [15]. the weight parameter is updated with the rate of change in (3), where \(y_i\) denotes the observed data and \(\hat{y}_i\) the predicted value.

\( \Delta w_{ij} = \epsilon \left( y_i - \hat{y}_i \right) \)  (3)

fig. 1. proposed backpropagation neural network architecture

furthermore, \(f\) itself is an activation function that maps the values nonlinearly (figure 2). the sigmoid function, expressed in (4), is commonly used, mainly for forecasting probability-based output. relu, given by (5), is another widely used function; it is nearly linear and preserves the properties of linear models that make them easy to optimize with gradient descent [22]. the hyperbolic tangent, known as the tanh function and formulated in (6), is a zero-centered function that provides better training performance for multilayer neural networks.

\( f(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \)  (4)

\( f(x) = \max(0, x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases} \)  (5)

\( f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \)  (6)

b. experimental design
this paper collected time series inbound traffic data from a backbone network using cacti and a traffic controller deployed at mulawarman university, indonesia. the series consists of weekday inbound traffic recorded daily from 27 august 2019 to 17 february 2021. the traffic data, measured in bits/second, were normalized to a scale of 0 to 1 to prevent huge numbers in the bpnn process. the study uses the first 80% of the series as training data and the rest as testing data. figure 3 illustrates the research flow implemented in this paper, where an activation function is applied at each hidden layer.
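the three activation functions in (4), (5), and (6) translate directly into code, and the nine ordered combinations for the two hidden layers can be enumerated mechanically; a small illustrative numpy sketch (not the paper's implementation):

```python
from itertools import product

import numpy as np

def sigmoid(x):
    # equation (4): 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # equation (5): max(0, x)
    return np.maximum(0.0, x)

def tanh(x):
    # equation (6): (e^x - e^(-x)) / (e^x + e^(-x)), zero-centered in (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # ~[0.119 0.5 0.881]
print(relu(x))     # [0. 0. 2.]
print(tanh(x))     # ~[-0.964 0. 0.964]

# the nine ordered activation pairs for the two hidden layers
orders = list(product([sigmoid, relu, tanh], repeat=2))
print(len(orders))  # 9
```

note that `product(..., repeat=2)` yields ordered pairs, so sigmoid-relu and relu-sigmoid are distinct models, matching the study's design.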
three activation functions were used, i.e., the sigmoid, relu, and tanh functions, designed in single form and in combination, making up nine permutation models in terms of order: pure sigmoid; pure relu; pure tanh; sigmoid-relu; sigmoid-tanh; relu-sigmoid; relu-tanh; tanh-sigmoid; and tanh-relu. furthermore, table 1 lists the bpnn settings used. to enable a comparative analysis of the learning rate, the experiments were designed with three learning rates, i.e., 0.1, 0.5, and 0.9, reflecting low, middle, and high rates, respectively.

c. accuracy metrics
for accuracy comparison, the mean square error (mse) and root mean square error (rmse) were used, as expressed in (7) and (8) respectively, where \(y_i\) denotes the observed data and \(\hat{y}_i\) the predicted value. the smaller the value, the smaller the error.

\( \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \)  (7)

\( \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } \)  (8)

fig. 2. activation functions: sigmoid, relu, and tanh

iii. results and discussions
nine permutation models of activation functions were used in this paper with three learning rates. table 2 and table 3 show the mse and rmse values for each model and learning rate based on the simulations performed. overall, in terms of a single activation function, although pure sigmoid provided the smallest rmse, the rmse values obtained from the three models were not significantly different. the results of the single-form activation functions are shown in figure 4, figure 5, and figure 6 for sigmoid, relu, and tanh, respectively. the sigmoid-sigmoid and tanh-tanh configurations could not recognize the higher traffic pattern.
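as a quick sanity check on the reported figures, each rmse in table 2 should equal the square root of its mse per (8); the small illustrative python check below (values copied from the single-form rows of table 2 at learning rate 0.9) confirms this within rounding:

```python
import math

# (mse, rmse) for the single-form models at learning rate 0.9, copied from table 2
table2_lr09 = {
    "sigmoid-sigmoid": (0.01314681, 0.11465952),
    "relu-relu": (0.01373752, 0.11720719),
    "tanh-tanh": (0.01653229, 0.12857797),
}

# equation (8): rmse is the square root of the mse
for model, (mse, rmse) in table2_lr09.items():
    assert math.isclose(math.sqrt(mse), rmse, abs_tol=1e-7), model

# at this learning rate, pure sigmoid attains the smallest error of the three
best = min(table2_lr09, key=lambda m: table2_lr09[m][1])
print(best)  # sigmoid-sigmoid
```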
on the contrary, relu-relu performed better in terms of pattern recognition; although its rmse was not the smallest, it was still superior to the tanh function.

table 1. parameter settings of the backpropagation neural network
parameter           value
maximum iteration   3000
momentum            0.9
epoch               3000
hidden layers       2
learning rate       0.1; 0.5; 0.9

fig. 3. research flow

table 2. mse and rmse of the backpropagation neural network with single-form activation functions
activation function order   learning rate   mse          rmse
sigmoid – sigmoid           0.1             0.01519463   0.12326648
                            0.5             0.01474695   0.12143701
                            0.9             0.01314681   0.11465952
relu – relu                 0.1             0.01679769   0.12960591
                            0.5             0.01536174   0.12394247
                            0.9             0.01373752   0.11720719
tanh – tanh                 0.1             0.01555421   0.12471651
                            0.5             0.01402152   0.11841251
                            0.9             0.01653229   0.12857797

on the other hand, when it comes to combining two different activation functions in the architecture, sigmoid could not recognize the high pattern properly, whether in single form or in combination, as presented in figure 4, figure 7, and figure 8, unless it was mixed with relu, with relu placed in the first order as shown in figure 9. relu remained powerful whether combined with the sigmoid or tanh function, provided it was placed in the first order; the combinations with relu first are shown in figure 9 and figure 10. in terms of order, the tanh function behaved similarly to relu: its accuracy did not change significantly between single-form usage and combination as long as it was placed first and followed by another activation function, as illustrated in figure 11 and figure 12.

iv. conclusion
a backpropagation neural network is applied to predict inbound traffic, designed with one input layer, two hidden layers, and one output layer, using three activation functions, i.e., the sigmoid, rectified linear unit (relu), and hyperbolic tangent (tanh) functions. the design is used in single and combined forms, yielding nine permutations, with three learning rates, i.e., 0.1, 0.5, and 0.9, representing low, middle, and high rates. based on the results, relu recognizes the inbound traffic pattern better than the sigmoid and tanh functions under the same architecture and parameters. hence, an interesting conclusion is that, when two different activation functions are used in a bpnn architecture, the selection of the first-order activation function is crucial for superior prediction, and the relu function is recommended in the initial order to catch the high pattern in the data. in addition, for predicting upper traffic utilization, a high learning rate combined with pure relu, relu-sigmoid, or relu-tanh is more suitable and recommended. for future work, it is recommended to optimize the architecture and parameters, particularly the number of neurons in the hidden layers and the learning rate, respectively. nevertheless, overfitting and convergence problems may be encountered in the process, so the architecture, activation function order, and parameters should be determined carefully.

table 3. mse and rmse of the backpropagation neural network with combined activation functions
activation function order   learning rate   mse          rmse
sigmoid – relu              0.1             0.01608174   0.12681380
                            0.5             0.01583461   0.12583564
                            0.9             0.01518564   0.12323002
sigmoid – tanh              0.1             0.01629544   0.12765358
                            0.5             0.01575635   0.12552430
                            0.9             0.01553580   0.12464268
relu – sigmoid              0.1             0.01380587   0.11749840
                            0.5             0.01430379   0.11959846
                            0.9             0.01541954   0.12417544
relu – tanh                 0.1             0.01376507   0.11732464
                            0.5             0.01501619   0.12254058
                            0.9             0.01799144   0.13413217
tanh – sigmoid              0.1             0.01602616   0.12659446
                            0.5             0.01476072   0.12149372
                            0.9             0.01651042   0.12849286
tanh – relu                 0.1             0.01481934   0.12173470
                            0.5             0.01708461   0.13070810
                            0.9             0.01541536   0.12415862

fig. 4. prediction result of the single-form sigmoid activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 5. prediction result of the single-form relu activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 6. prediction result of the single-form tanh activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 7. prediction result of the combined sigmoid-relu activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 8. prediction result of the combined sigmoid-tanh activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 9. prediction result of the combined relu-sigmoid activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 10. prediction result of the combined relu-tanh activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 11. prediction result of the combined tanh-sigmoid activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9
fig. 12. prediction result of the combined tanh-relu activation function for each learning rate in bpnn: (a) learning rate 0.1, (b) learning rate 0.5, and (c) learning rate 0.9

declarations
author contribution
all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest
the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information
reprints and permission information is available at http://journal2.um.ac.id/index.php/keds.
publisher’s note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references
[1] m. kihl, p. ödling, c. lagerstedt, and a. aurelius, “traffic analysis and characterization of internet user behavior,” 2010 int. congr. ultra mod. telecommun. control syst. work. (icumt 2010), pp. 224–231, 2010.
[2] v. j. ribeiro, z. l. zhang, s. moon, and c. diot, “small-time scaling behavior of internet backbone traffic,” comput. networks, vol. 48, no. 3, pp. 315–334, 2005.
[3] j. k. taylor and c. cihon, statistical techniques for data analysis, second edition, 2004.
[4] p. purnawansyah, h. haviluddin, r. alfred, and a. f. o. gaffar, “network traffic time series performance analysis using statistical methods,” knowl. eng. data sci., vol. 1, no. 1, p. 1, 2017.
[5] m. hanif, f. sami, m. hyder, and m. i. ch, “hidden markov model for time series prediction,” j. asian sci. res., vol. 7, no. 5, pp. 196–205, 2017.
[6] c. you and k. chandra, “time series models for internet data traffic,” conf. local comput. networks, pp. 164–171, 1999.
[7] m. s. mahdavinejad, m. rezvan, m. barekatain, p. adibi, p. barnaghi, and a. p. sheth, “machine learning for internet of things data analysis: a survey,” digit. commun. networks, vol. 4, no. 3, pp. 161–175, 2018.
[8] m. wang, y. cui, x. wang, s. xiao, and j. jiang, “machine learning for networking: workflow, advances and opportunities,” arxiv, pp. 1–8, 2017.
[9] e. s. yu and c. y. r. chen, “traffic prediction using neural networks,” ieee glob. telecommun. conf., vol. 2, pp. 991–995, 1993.
[10] r. boutaba, m. a. salahuddin, n. limam, s. ayoubi, n. shahriar, f. estrada-solano, and o. m. caicedo, “a comprehensive survey on machine learning for networking: evolution, applications and research opportunities,” j. internet serv. appl., vol. 9, no. 5, pp. 1–99, 2018.
[11] c. n. babu and b. e. reddy, “a moving-average filter based hybrid arima-ann model for forecasting time series data,” appl. soft comput. j., vol. 23, pp. 27–38, 2014.
[12] c. narendra babu and b. eswara reddy, “performance comparison of four new arima-ann prediction models on internet traffic data,” j. telecommun. inf. technol., vol. 2015, no. 1, pp. 67–75, 2015.
[13] j. rynkiewicz, “hybrid hmm/mlp models for time series prediction,” eur. symp. artif. neural networks, pp. 455–462, 1999.
[14] s. amin, “backpropagation – artificial neural network (bp-ann): understanding gender characteristics of older driver accidents in west midlands of united kingdom,” saf. sci., vol. 122, p. 104539, 2020.
[15] y. ling, q. yue, c. chai, q. shan, d. hei, and w. jia, “nuclear accident source term estimation using kernel principal component analysis, particle swarm optimization, and backpropagation neural networks,” ann. nucl. energy, vol. 136, p. 107031, 2020.
[16] j. n. ogunbo, o. a. alagbe, m. i. oladapo, and c. shin, “n-hidden layer artificial neural network architecture computer code: geophysical application example,” heliyon, vol. 6, no. 6, p. e04108, 2020.
[17] m. lopez-martin, b. carro, and a. sanchez-esguevillas, “neural network architecture based on gradient boosting for iot traffic prediction,” futur. gener. comput. syst., vol. 100, pp. 656–673, 2019.
[18] s. wang, w. zhu, y. shen, j. ren, h. gu, and x. wei, “temperature compensation for mems resonant accelerometer based on genetic algorithm optimized backpropagation neural network,” sens. actuators a phys., vol. 316, p. 112393, 2020.
[19] g. panchal, a. ganatra, y. p. kosta, and d. panchal, “behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers,” int. j. comput. theory eng., pp. 332–337, 2011.
[20] a. j. thomas, m. petridis, s. d. walters, s. m. gheytassi, and r. e. morgan, “two hidden layers are usually better than one,” commun. comput. inf. sci., vol. 744, pp. 279–290, 2017.
[21] s. narejo and e. pasero, “an application of internet traffic prediction with deep neural network,” smart innov. syst. technol., vol. 69, pp. 139–149, 2017.
[22] c. e. nwankpa, w. ijomah, a. gachagan, and s. marshall, “activation functions: comparison of trends in practice and research for deep learning,” arxiv, pp. 1–20, 2018.
knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp. 145–152 eissn 2597-4637
https://doi.org/10.17977/um018v4i22021p145-152
©2021 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)
keds is sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by indonesian ministry of education, culture, research, and technology

stress classification using deep learning with 1d convolutional neural networks

abdulrazak yahya saleh 1, *, lau khai xian 2
faculty of cognitive sciences & human development (fcshd), universiti malaysia sarawak (unimas), kota samarahan, sarawak, malaysia
1 ysahabdulrazak@unimas.my *; 2 lkhixian@hotmail.com
* corresponding author

i. introduction
the covid-19 pandemic has contributed to psychological disorders among people, leading to a higher risk of suicide.
article info
article history: submitted 18 october 2021; revised 27 october 2021; accepted 29 october 2021; published online 31 december 2021

abstract
stress has been a major problem impacting people in various ways, and it grows more serious every day. identifying whether someone is suffering from stress is crucial before it becomes a severe illness. artificial intelligence (ai) interprets external data, learns from such data, and uses the learning to achieve specific goals and tasks. deep learning (dl) has created an impact in the field of artificial intelligence, as it can perform tasks with high accuracy. therefore, the primary purpose of this paper is to evaluate the performance of 1d convolutional neural networks (1d cnns) for stress classification. a psychophysiological stress (ps) dataset is utilized in this paper; it consists of twelve features obtained from the expert. the 1d cnns are trained and tested with 10-fold cross-validation on the ps dataset, and performance is evaluated with accuracy and loss metrics. the 1d cnns achieve 99.7% accuracy in stress classification, outperforming backpropagation (bp), which achieves only 65.57%. the findings therefore yield a promising outcome: the 1d cnns classify stress effectively compared to bp. further explanation is provided in this paper to demonstrate the efficiency of 1d cnns for the classification of stress. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: 1d convolutional neural networks; artificial intelligence; classification; deep learning; stress

stress endurance varies between individuals; some people can handle it, and others cannot [1]. research has been conducted to analyze the impact of covid-19 on individuals' stress [2][3][4][5][6]. in the modern era, stress has extended to finances, work, and relationships; it is now a familiar feeling people face each day of their lives. the number of people facing stress is increasing at a fast rate. willis towers watson [7] reported that "75% of the u.s. employers ranked stress as their top health and productivity concern, but employers and employees disagreed on its causes". these findings are based on responses from 487 u.s. employers in willis towers watson's 2015/2016 global staying@work survey and more than 5000 u.s. employees in willis towers watson's 2015/2016 global benefits attitudes survey. the american psychological association [8] reported that the percentage of americans who experienced at least one symptom of stress over the past month rose from 71 percent in august 2016 to 80 percent in january 2017. according to cacioppo, tassinary, and berntson [9], chronic psychological stress carries many pathophysiological risks, such as cerebrovascular disease, cardiovascular disease, immune deficiencies, and diabetes. francis [10] noted that young adults spend more than six hours a day "stressed out." it is essential to classify whether someone is stressed before stress becomes a serious illness [11]. this problem calls for a solution by which stress can be recognized before it worsens. machine learning is a branch of artificial intelligence that focuses on machines' ability to learn from experience [12]. a computer with this ability has rules put into it that allow it to resolve problems without human intervention [13].
Machine learning gives machines the ability to train on data and adapt when new data arrive; it enables prediction from known data and learning from new, unknown data. One of the machine learning methods that can be used to recognize stress is artificial neural networks (ANNs). ANNs are "a non-linear data-driven self-adaptive approach" as opposed to traditional model-based methods [14]. ANNs can determine and learn correlated patterns between input data and corresponding target values, and after training they can be used to predict the outcome of new independent input data. Neural networks (NNs) have been used in a wide variety of applications where statistical methods are traditionally employed, such as classification problems or outcome prediction [14]. For a very long period, ANNs had to be limited in size due to difficulties in training. However, Hinton, Osindero, and Teh [15] suggested a procedure that could be used with much larger networks, now known as deep learning (DL). LeCun, Bengio, and Hinton [16] state that deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. Convolutional neural networks (CNNs), one of the most well-known DL models, have become the de facto standard for many computer vision and machine learning tasks over the last decade. CNNs are feed-forward artificial neural networks with alternating convolutional and subsampling layers. When trained on a large visual database with ground-truth labels, deep 2D CNNs with many hidden layers and millions of parameters can learn complex objects and patterns. With proper training, this ability makes them the primary tool for various engineering applications involving 2D signals such as images and video frames.
As an alternative, 1D convolutional neural networks (1D CNNs) have recently been developed as modified 2D CNNs. 1D CNNs have been applied to several classification problems, such as heart sound and soil texture classification [17][18][19]. It is expected that classification between stress and non-stress will be more accurate with multiple processing layers than with ANNs that consist of only a single processing layer. Smets et al. [20] note that previous approaches to measuring stress do not allow continuous monitoring and often suffer from biases such as demand effects and response and memory biases; the focus has therefore shifted towards measuring bodily responses as indicators of stress. This study aims to use a stress dataset to produce accurate classifications whose outcomes can be used by society. The remaining parts of this paper are organized as follows: the methodology and empirical studies are presented in Section 2; the findings and the analysis of the results are presented in Sections 3 and 4; and Section 5 concludes the paper and outlines future work.

II. Methods

This section introduces the research methodology used in this study. Its components include dataset preparation, data pre-processing, research design, research implementation, and instruments, as shown in Figure 1. DL has become a growing research interest, and the method has shown clear advantages in learning performance. DL can learn from past data to solve complex problems and has been widely used in classification; it allows computational models with multiple processing layers to learn data representations at multiple levels of abstraction. The result of the DL method is compared with a standard neural network, the backpropagation (BP) algorithm, which was used in previous studies on stress classification [21][22]. This comparison analyzes the effectiveness of the DL algorithm in this study.
The research focuses on psychophysiological stress, which looks at heart rate variability (HRV), galvanic skin response (GSR), skin temperature (TEMP), and respiration.

A. 1D Convolutional Neural Networks (1D CNNs)

1D CNNs are very similar to ordinary CNNs, except that their input is one-dimensional, whereas the input of an ordinary CNN is two-dimensional. A convolutional neural network (CNN) resembles an ordinary neural network in that it is made up of neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. A loss function is still found in the last, fully connected layer, and all the techniques developed for training regular neural networks still apply. In a 1D CNN, the first hidden layer is always a convolutional layer, followed by a pooling layer. The convolutional and pooling layers can appear multiple times in the network, until the final hidden layer, which is the fully connected layer. The schematic representation of 1D convolutional neural networks is shown in Figure 2. The convolution layer is where the feature maps of the previous layer are convolved with learnable kernels, and an activation function is applied to form the output feature map. Each output map may combine convolutions with multiple input maps. The convolution layer is constructed using (1):

$x_j^{\ell} = f\Big(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\Big)$ (1)

where $M_j$ is a selection of input maps and each output map is given an additive bias $b$. For a given output map, the input maps are convolved with distinct kernels: if output maps $j$ and $k$ both sum over input map $i$, the kernels applied to map $i$ differ between maps $j$ and $k$. The pooling layer produces downsampled versions of the input maps.
If there are N input maps, there will be exactly N output maps, although the output maps are smaller (2).

Fig. 1. An overview flow of the research methodology
Fig. 2. Schematic representation of 1D convolutional neural networks

$x_j^{\ell} = f\big(\beta_j^{\ell}\,\mathrm{down}(x_j^{\ell-1}) + b_j^{\ell}\big)$ (2)

where $\mathrm{down}(x_j^{\ell-1})$ represents a sub-sampling function. This function sums over each distinct n-by-n block in the input, so that the output is n times smaller along both spatial dimensions. Each output map is given its own multiplicative bias $\beta$ and additive bias $b$. The fully connected layer combines the features learned by the different convolution kernels so that the network can build a global representation of the whole input. Let $\alpha_{ij}$ denote the weight given to input map $i$ when forming output map $j$. Output map $j$ is then given by (3):

$x_j^{\ell} = f\Big(\sum_{i=1}^{N_{in}} \alpha_{ij}\,(x_i^{\ell-1} * k_i^{\ell}) + b_j^{\ell}\Big)$ (3)

subject to the constraints (4):

$\sum_i \alpha_{ij} = 1, \qquad 0 \le \alpha_{ij} \le 1$ (4)

These constraints can be enforced by setting the $\alpha_{ij}$ variables equal to the softmax over a set of unconstrained underlying weights $c_{ij}$, as in (5):

$\alpha_{ij} = \frac{\exp(c_{ij})}{\sum_k \exp(c_{kj})}$ (5)

The set of weights $c_{ij}$ for a fixed $j$ is independent of the sets for all other $j$, so each output map is updated in the same way, only with different $j$ indices.

B. Empirical Studies

Several experiments are conducted on the psychophysiological stress dataset. The implementation is a multivariate analysis to evaluate the 1D CNN deep learning model. There are a total of twelve features in the raw data provided by Dr.
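The convolution (1) and subsampling (2) operations can be sketched in plain NumPy for the 1D case. This is a minimal illustration only, not the authors' implementation; the kernel values, kernel size, pooling factor, and tanh activation are assumptions:

```python
import numpy as np

def conv1d_layer(inputs, kernels, bias, f=np.tanh):
    """Eq. (1): convolve each input map with its kernel, sum, add bias, apply f."""
    # inputs: list of 1-D input maps x_i^{l-1}; kernels: one kernel k_ij per input map
    summed = sum(np.convolve(x, k, mode="valid") for x, k in zip(inputs, kernels))
    return f(summed + bias)

def pool1d_layer(x, n, beta=1.0, bias=0.0, f=lambda z: z):
    """Eq. (2): sum over each distinct length-n block (the 'down' function),
    scale by the multiplicative bias beta, then add the additive bias."""
    trimmed = x[: len(x) // n * n]
    down = trimmed.reshape(-1, n).sum(axis=1)
    return f(beta * down + bias)

x = [np.arange(8, dtype=float)]        # one 1-D input map
k = [np.array([0.25, 0.5, 0.25])]      # one learnable kernel (assumed size 3)
feature_map = conv1d_layer(x, k, bias=0.1)
pooled = pool1d_layer(feature_map, n=2)
print(feature_map.shape, pooled.shape)
```

With a length-8 input and a size-3 kernel, the valid convolution yields a length-6 feature map, which the pooling layer reduces to length 3, matching the "n times smaller" behavior of (2).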
Elena Smets [20]: skin conductance / galvanic skin response (SC/GSR), skin temperature (TEMP), blood volume pulse (BVP), respiration changes (RSP), heart rate (HR), electromyography amplitude (EMG amplitude), blood volume pulse amplitude (BVP amplitude), respiration changes rate (RSP rate), respiration changes amplitude (RSP amplitude), heart rate variability amplitude (HRV amplitude), respiration changes + heart rate coherence (RSP+HR coherence), and segments. The segments indicate the state of the respondents at each moment. For each sensing modality, features were calculated on a sliding window of 30 s with a 29 s overlap. The dataset is illustrated in Figure 3.

Fig. 3. Psychophysiological stress dataset

The dataset was separated into two parts: a training set and a testing set. The training set consists of input vectors and corresponding output vectors that are fitted into the model to produce a result, which is compared with the target for each input vector. The testing set is used to provide an unbiased evaluation of the final model fitted on the training data. The dataset was normalized using the z-score technique so that each feature has zero mean and unit variance. The normalized data are then fed into the proposed deep learning model for classification. The classification performance of the deep learning model is evaluated using k-fold cross-validation. Brownlee [23] describes cross-validation as a resampling procedure used to evaluate machine learning models on a limited data sample. In k-fold cross-validation, the dataset is divided into k subsets of equal size; the model is built k times, each time leaving one of the subsets out of training and using it as the test set.
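The preprocessing and evaluation protocol described above (z-score normalization, then k-fold cross-validation with each subset held out once) can be sketched as follows; the random data stands in for the PS dataset, and the fold logic is a generic NumPy illustration rather than the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(120, 12))   # 12 features, as in the PS dataset

# z-score normalization: subtract the per-feature mean, divide by the
# per-feature standard deviation, giving zero mean and unit variance
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

def k_fold_indices(n_samples, k=10):
    """Split sample indices into k equal folds; each fold serves as the
    test set once while the remaining folds form the training set."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

for train_idx, test_idx in k_fold_indices(len(X_norm), k=10):
    pass  # fit the model on X_norm[train_idx], evaluate on X_norm[test_idx]
```

In practice a library helper such as scikit-learn's `KFold` (which also supports shuffling) would typically replace the hand-rolled generator.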
Python is a hugely popular general-purpose programming language and lends itself well to AI and deep learning. Keras is a high-level neural network API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano; TensorFlow serves as the backend engine for the Keras API, handling low-level operations such as tensor products and convolutions. The machine used is an Intel® Core™ i7-4720HQ CPU @ 2.60 GHz with 8.00 GB RAM. The proposed method uses the 1D CNN deep learning model for classification. A 1D CNN works exactly like a normal CNN but with a different input dimension: the input for a normal CNN is usually two-dimensional (image datasets), whereas the input for a 1D CNN is one-dimensional (signal datasets). Table 1 shows the dataset, batch size, epochs, activation functions, and optimizer used in this study. The batch size is 256; batch size is the number of training examples used in one iteration. The data were run for 15 epochs to observe the model's performance; an epoch is one forward pass and one backward pass over all the training examples. The activation functions are ReLU and softmax: the rectified linear unit (ReLU) layer thresholds its input at zero, while the softmax (normalized exponential) function is used to represent a categorical distribution. The optimizer is SGD; stochastic gradient descent is a simple yet efficient approach to discriminative learning of linear classifiers under convex loss functions, such as support vector machines and logistic regression. Figure 4 shows the schematic representation of the 1D CNN model. The classification accuracy of the output is calculated using (6):

Accuracy = Number of correct predictions / Total number of predictions (6)

Fig. 4. Schematic representation of the 1D CNN model

Table 1. Hyperparameters of the 1D CNN
Dataset: Psychophysiological stress dataset | Batch size: 256 | Epochs: 15 | Activation: ReLU, softmax | Optimizer: SGD
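Equation (6) amounts to a one-line computation; the labels below are hypothetical (1 = stressed, 0 = not stressed), added purely to illustrate the metric:

```python
def accuracy(y_true, y_pred):
    """Eq. (6): number of correct predictions divided by total predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# hypothetical ground truth and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75
```

Keras reports the same quantity per epoch when the model is compiled with `metrics=["accuracy"]`.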
III. Results and Discussion

Table 2 shows the results of the 1D CNN deep learning model with 10-fold cross-validation. The data were divided into ten groups: in the first fold, the first group is used for testing and the remaining groups for training; in the second fold, the second group is used for testing; and so on until the 10th fold is completed. The accuracy and loss for each training and testing process are also summarized in Table 2. The final accuracy for the 1D CNN is 99.97%. For further verification, the dataset was tested in a second experiment with the backpropagation (BP) algorithm, fitted to the same data to compare the performance of both models. As Table 2 shows, the training classification accuracy of the 1D CNN is 100% and its testing accuracy is 99.97%, whereas the training accuracy of BP is 65.85% and its testing accuracy is 65.57%. Figure 5 clearly shows that the 1D CNN deep learning model outperforms the BP algorithm. Moreover, our algorithm achieved higher accuracy than the methods mentioned in [24]. Table 3 compares the implementation of the CNN with other recent works based on different machine and deep learning algorithms. As shown in Table 3, our work achieves better accuracy than most of the other recent works reported in this paper, except [25], which achieved 99.8% using a physiological signals dataset; however, our work attains better accuracy than prior results on the same dataset as ours. The datasets must be structured appropriately to fit the model and support a sound analysis of the results.
Every step, from loading the dataset from a text or CSV file to evaluating the trained model's performance, is critical, because simple mistakes can cause errors when the code is run. The parameters used to build the model must be tested repeatedly to find the best configuration; for example, a sigmoid may work well where ReLU is a poor choice, and vice versa. Building the best model requires a great deal of trial and error. The deep learning model is well known for its high accuracy, whether it is solving a classification problem or a regression problem.

Table 2. Summary of the accuracy of the 1D CNN
Fold accuracy: 1st 99.98 | 2nd 100.00 | 3rd 99.99 | 4th 99.98 | 5th 99.97 | 6th 99.93 | 7th 99.99 | 8th 99.97 | 9th 99.94 | 10th 99.98 | Final accuracy: 99.97

Table 3. Comparison with other published works
Deep neural networks [25]: physiological signals, 99.80%
ANN [26]: WESAD dataset, 95.21%
SVM [20]: psychophysiological stress, 93.4 ± 3.2%
Convolutional neural network (CNN) [27]: heart rate signal, 98.69 ± 0.45%
Machine learning [28]: Mind-Brain-Body dataset, 81.33%
Machine learning [29]: questionnaire (DASS 21), between 0.628 and 0.798
Hybrid deep learning technique [30]: employee data, 96.2%
This work: psychophysiological stress, 99.7%

IV. Conclusion

In this study, the 1D CNN model was used to classify stress and was compared with BP and various techniques from the literature. A psychophysiological stress dataset was used. The performance of the developed model was evaluated by comparison with the BP model and other works. The results show that the 1D CNN model is reliable in classifying stress, with a low loss of 0.001 and a high accuracy of 99.97%. In the future, more datasets are required to enhance stress classification results.
Skill and experience are essential in helping the researcher select and check the best method in the modeling process. The spiking neural network, the latest generation of neural networks, and other areas of study will be applied to stress classification.

Acknowledgment
The authors wish to thank Dr. Elena Smets for sharing the datasets used to complete this study. Universiti Malaysia Sarawak (UNIMAS) supported and funded this work under the CDRG Cross-Disciplinary Research grant (F04/CDRG/1839/2019).

Declarations
Author contribution: All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest: The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information: Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] P. Priya, "How to overcome stress during COVID-19 pandemic," International Journal of Entrepreneurship and Economic Issues, vol. 4, no. 1, pp. 75-78, 2020.
[2] M. Fawaz and A. Samaha, "COVID-19 quarantine: post-traumatic stress symptomatology among Lebanese citizens," International Journal of Social Psychiatry, vol. 66, no. 7, pp. 666-674, 2020.
[3] L. Liang, T. Gao, H. Ren, R. Cao, Z. Qin, Y. Hu, C. Li, and S. Mei, "Post-traumatic stress disorder and psychological distress in Chinese youths following the COVID-19 emergency," Journal of Health Psychology, vol. 25, no. 9, pp. 1164-1175, Jul. 2020.
Fig. 5.
Comparison of training and testing accuracy of the 1D CNN
[4] R. N. Kumar Anil, S. C. Karumaran, D. Kattula, R. Thavarajah, and A. M. Anusa, "Perceived stress and psychological (dis)stress among Indian endodontists during COVID-19 pandemic lockdown," medRxiv, 2020.
[5] D. Choudhury, S. Bhowmick, S. Parolia, S. Jana, D. Kundu, N. Das, K. Ray, and S. Karpurkaysatha, "A study on the anxiety level and stress during COVID-19 lockdown among the general population of West Bengal, India: a must know for primary care physicians," Journal of Family Medicine and Primary Care, vol. 10, no. 2, p. 978, 2021.
[6] E. Esterwood and S. A. Saeed, "Past epidemics, natural disasters, COVID-19, and mental health: learning from history as we deal with the present and prepare for the future," Psychiatric Quarterly, pp. 1-13, 2020.
[7] Willis Towers Watson Public Limited Company, "Seventy-five percent of U.S. employers say stress is their number one workplace health concern," Globe Newswire, June 29, 2016. [Online]. Available: https://www.globenewswire.com/news-release/2016/06/29/852338/0/en/seventy-five-percent-of-u-s-employers-say-stress-is-their-number-one-workplace-health-concern.html.
[8] American Psychological Association, "Stress in America: coping with change," Stress in America Survey, PsycEXTRA dataset, 2017.
[9] J. T. Cacioppo, L. G. Tassinary, and G. Berntson, Handbook of Psychophysiology. Cambridge University Press, 2007.
[10] G. Francis, "Young adults spend more than six hours per day feeling 'stressed out', finds mental health study," The Independent, 2018. [Online].
Available: https://www.independent.co.uk/life-style/mental-health-young-adults-stress-depression-anxiety-ocd-study-a8233046.html.
[11] K. G. Kim, "Book review: Deep learning," Healthcare Informatics Research, vol. 22, no. 4, pp. 351-354, 2016.
[12] K. Das and R. N. Behera, "A survey on machine learning: concept, algorithms and applications," International Journal of Innovative Research in Computer and Communication Engineering, vol. 5, no. 2, pp. 1301-1309, 2017.
[13] I. Arel, D. C. Rose, and T. P. Karnowski, "Deep machine learning: a new frontier in artificial intelligence research [research frontier]," IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, 2010.
[14] D. P. Solomatine and A. Ostfeld, "Data-driven modelling: some past experiences and new approaches," Journal of Hydroinformatics, vol. 10, no. 1, pp. 3-22, 2008.
[15] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[16] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[17] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, "1D convolutional neural networks and applications: a survey," Mechanical Systems and Signal Processing, vol. 151, p. 107398, 2021.
[18] F. Li et al., "Feature extraction and classification of heart sound using 1D convolutional neural networks," EURASIP Journal on Advances in Signal Processing, vol. 2019, no. 1, pp. 1-11, 2019.
[19] F. M. Riese and S. Keller, "Soil texture classification with 1D convolutional neural networks based on hyperspectral data," arXiv preprint arXiv:1901.04846, 2019.
[20] E. Smets et al., "Comparison of machine learning techniques for psychophysiological stress detection," in International Symposium on Pervasive Computing Paradigms for Mental Health, Springer, 2015, pp. 13-22.
[21] Z. Qin, M. Li, L. Huang, and Y.
Zhao, "Stress level evaluation using BP neural network based on time-frequency analysis of HRV," in 2017 IEEE International Conference on Mechatronics and Automation (ICMA), IEEE, 2017, pp. 1798-1803.
[22] C. Liu, Y. Feng, and Y. Wang, "An innovative evaluation method for undergraduate education: an approach based on BP neural network and stress testing," Studies in Higher Education, pp. 1-17, 2020.
[23] J. Brownlee, "A gentle introduction to k-fold cross-validation," Machine Learning Mastery, 2018.
[24] K. Sardeshpande and V. R. Thool, "Psychological stress detection using deep convolutional neural networks," in International Conference on Computer Vision and Image Processing, Springer, 2019, pp. 180-189.
[25] R. Li and Z. Liu, "Stress detection using deep neural networks," BMC Medical Informatics and Decision Making, vol. 20, no. 11, pp. 1-10, 2020.
[26] P. Bobade and M. Vani, "Stress detection with machine learning and deep learning using multimodal physiological data," in 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), IEEE, 2020, pp. 51-57.
[27] N. Hakimi, A. Jodeiri, M. Mirbagheri, and S. K. Setarehdan, "Proposing a convolutional neural network for stress assessment by means of derived heart rate from functional near infrared spectroscopy," Computers in Biology and Medicine, vol. 121, p. 103810, 2020.
[28] H. Baumgartl, E. Fezer, and R. Buettner, "Two-level classification of chronic stress using machine learning on resting-state EEG recordings," 2020.
[29] A. Priya, S. Garg, and N. P. Tigga, "Predicting anxiety, depression and stress in modern life using machine learning algorithms," Procedia Computer Science, vol. 167, pp. 1258-1267, 2020.
[30] R. Reshma, "Emotional and physical stress detection and classification using thermal imaging technique," Annals of the Romanian Society for Cell Biology, pp. 8364-8374, 2021.
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 4, No 1, July 2021, pp.
29–37, eISSN 2597-4637, https://doi.org/10.17977/um018v4i12021p29-37. ©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a SINTA 2 journal (https://sinta.ristekbrin.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Research & Technology.

Do Missing Link Community Smell Affect Developers Productivity: An Empirical Study

Toukir Ahammed 1,*, Sumon Ahmed 2, Mohammed Shafiul Alam Khan 3
Institute of Information Technology, University of Dhaka, Suhrawardi Udyan Rd, Dhaka 1000, Bangladesh
1 bsse0806@iit.du.ac.bd*; 2 sumon@du.ac.bd; 3 shafiul@du.ac.bd
* corresponding author

I. Introduction

Community smells can be described as organizational and social anti-patterns in a development community that lead to unforeseen project costs [1]. Although community smells may not be an immediate obstacle to software development, they can negatively affect software maintenance in the long run [2]. The missing link is one of the most common community smells occurring in software development communities: it arises when developers contribute to the same source code but do not communicate with each other [3]. The productivity of developers is one of the essential factors of software development, since it is connected to the cost of the software project; personnel-related factors are among those found in the literature to affect productivity the most [4]. Missing link community smell can create a knowledge gap among developers due to the lack of communication [5]. As a software product can be regarded as the combined effort of all developers, a lack of communication and cooperation can erode mutual awareness and trust among developers [3] and thus affect the development of the software product.
this raises the need to understand how missing link smell relates to productivity to manage development productivity more effectively. the research community has been studied community smells from different perspectives. some studies worked with the definition [1][6], and detection [3][7] of community smells, while others studied the diffuseness [5] and variability [8] of community smells. a few studies [9][10][11] worked on the prediction of community smells. the effect of community smells on predicting the intensity of code smell [2][12], and bug [13] is also studied. the role of gender diversity on community smells is studied in [14][15]. the refactoring of community smells was investigated in [16]. however, there has been no study investigating the impact of missing link smell on developers' productivity. in this context, the current study analyzes the productivity of developers involved in missing link smell and who is not. seven open-source projects such as activemq and cassandra are selected for analysis based on several criteria (e.g., availability of developer mailing list). first, missing link smells are identified in each project, finding cases where a collaboration link does not have its communication article info a b s t r a c t article history: received 05 june 2021 revised 24 june 2021 accepted 20 july 2021 published online 17 august 2021 missing link smell occurs when developers contribute to the same source code without communicating with each other. existing studies have analyzed the relationship of missing link smells with code smell and developer contribution. however, the productivity of developers involved in missing link smell has not been explored yet. this study investigates how productivity differs between smelly and non-smelly developers. for this purpose, the productivity of smelly and non-smelly developers of seven open-source projects are analyzed. 
The result shows that the developers not involved in missing link smell are more productive than the developers involved in smells. The observed difference is also found to be statistically significant. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: community smell; empirical study; missing link smell; productivity

Then, the developers involved with each smell are identified by extracting the instances of smell, and the developers are categorized into smelly and non-smelly developers. Next, the productivity of individual developers is measured as the number of changes per active day. Finally, statistical analysis is performed on the productivity of smelly and non-smelly developers. The study results show a significant difference between the productivity of smelly and non-smelly developers: the average productivity of non-smelly developers is significantly higher than that of smelly developers.

II. Methods

A. Missing Link Community Smell

Missing link community smell arises when two developers collaborate on a part of the source code but do not communicate with each other [3]. This smell can be detected by finding those collaborations for which no communication is found in the defined communication channel, e.g., a mailing list. The occurrence of missing link smell is described below with a sample software development community. A sample community of six developers is illustrated in Figure 1; the example is taken from [17].
Developers are connected through a solid line in the network if they communicate with each other; dashed lines connect developers to the source code on which they work. From the development community, two types of developer social network (DSN) can be generated: a communication DSN and a collaboration DSN. First, the communication DSN is generated from Figure 1 by considering only communication links, as displayed in Figure 2. Then, the collaboration DSN is generated by linking developers who work on the same part of the source code; Figure 3 represents the collaboration DSN for the considered community. For example, developer A and developer B work on the same source code file (Figure 1), so they are connected in the collaboration DSN (Figure 3).

Missing link smell can now be detected by comparing the collaboration network with the communication network. It can easily be observed that one link, EF, in the collaboration network (Figure 3) does not have a corresponding counterpart in the communication network (Figure 2). Hence, it represents an instance of missing link smell between developer E and developer F.

Fig. 1. Software development community
Fig. 2. Communication network

In recent times, community smells have been studied to incorporate the organizational and social aspects of the software development community into software engineering research. Some studies [1][6] focused on defining different types of community smells, while others focused on identifying [3][5] and predicting [9][10][11] these smells in open-source projects. Besides, a few studies investigated the relationship and impact of community smells on different software artifacts, such as code smells and bugs [2][13][18]. The concept of community smell was first introduced in an industrial case study [1].
The authors defined nine different community smells and proposed a list of possible mitigations for these smells, such as learning community, cultural conveyor, and stand-up voting. Later, Magnoni [3] proposed identification patterns for four community smells and developed a tool named codeface4smells (https://github.com/maelstromdat/codeface4smells), extending an existing socio-technical network analysis tool, codeface (http://siemens.github.io/codeface). The enhanced tool detects both community smells and code smells in an automated fashion [7]. Besides detection, a few studies [9][10][11] tried to predict community smells. Palomba et al. [9] worked on the prediction of community smells from socio-technical factors. Almarimi et al. [11] also built a model to predict community smells using ensemble classifier chain (ECC) and genetic programming (GP) techniques.

Tamburri et al. [5] explored the diffuseness of community smells and developers' perception of the presence and effect of community smells. The authors found that the diffuseness of community smells is high in open-source projects and that developers recognize community smells as an obstacle that may hinder software evolution. The authors also analyzed the relationship between community smells and different socio-technical factors, such as socio-technical congruence, turnover, and truck factor. Catolino et al. [14] investigated the role of gender diversity and women's participation in community smells. The authors found that gender-diverse teams had fewer community smells than non-gender-diverse teams and that the involvement of women in teams can reduce the number of community smells. In another study, Catolino et al. [16] suggested refactoring strategies to deal with community smells in practice, such as mentoring, creating communication plans, and restructuring the development community. In a recent study, Catolino et al.
[8] investigated the impact of socio-technical factors on community smells and found that communicability is essential in most cases to prevent an increase in community smells. Ahammed et al. [18] investigated how missing link community smell relates to the introduction of bugs, i.e., fix-inducing changes (FIC) in the system. The authors found that the number of smelly commits (commits by developers involved in community smells) and FIC commits are positively correlated. The authors also found that bugs introduced by developers involved in missing link smells had the highest severity. In another study [17], the same authors conducted an exploratory study on seven Apache projects on the engagement of developers in missing link community smell. They found that the contribution activities of developers are positively correlated with their involvement in missing link smell.

Existing studies investigated the impact of community smells on technical artifacts such as code smell intensity [2] or bugs [13] by employing community-aware prediction models. Palomba et al. [2] conducted an empirical study on nine open-source projects. They measured how community smells impact code smell intensity by proposing a code smell intensity prediction model, and they found that community smells contribute to the intensity of code smells. Eken et al. [13] conducted an empirical investigation on ten open-source projects to find out how community smells can predict bugs.

Fig. 3. Collaboration network

The authors found that community smells act as a contributing factor in predicting bug-prone classes. The current study aims at understanding the impact of community smell from the perspective of developers and how they perform in the software project.
The study performs an empirical investigation on 1004 developers from seven open-source projects, where each project is divided into six-month windows. The study reveals how missing link community smell affects the productivity of developers in open-source projects by measuring productivity in terms of the number of changes per active day.

B. Proposed Framework

This study aims to understand how missing link smell affects the productivity of developers. First, missing link smells are detected from the project repository and mailing list. Then, the developers involved with the extracted missing link smells are identified. Thus, the developers of a project can be divided into two categories: smelly and non-smelly developers. Next, the number of changes made by individual developers to the repository is computed, and the productivity of each developer is calculated as the number of changes per active day. Finally, the productivity of smelly and non-smelly developers is compared to identify the effect of missing link smell. The overview of the methodology is illustrated in Figure 4.

1) Data Collection

Data is collected from seven open-source projects. The choice of these projects is guided by the availability of the source code and the developer mailing list archive. The source code of the selected projects is available on GitHub, and the development mailing list archive is available on Gmane, a mailing list archive. The project names, source code repositories, numbers of commits, numbers of files, lines of code, analyzed periods, project ages, and numbers of developers are reported in Table 1. The analyzed projects have different sizes in terms of kLOC (ranging from 483 to 1392 kLOC) and different community sizes (from 44 to 438 developers).

2) Missing Link Smell Detection

Missing link smells are detected in the projects according to the identification pattern introduced by [3].
First, the source code repository of a project is cloned locally from GitHub (https://github.com/), and the mailing list archive is downloaded from Gmane (http://gmane.io/). The projects are analyzed using six-month windows. For each window, a collaboration DSN is generated by analyzing the project's repository: all commits are analyzed, and developers who contribute to the same part of the source code within that window are connected through an edge. Next, a communication DSN is constructed by analyzing the project's mailing list: all emails in the mailing list are analyzed, and developers who replied to the same email within a given window are connected. Finally, the collaboration DSN and communication DSN are compared to find missing link smells. For each edge in the collaboration network, the corresponding communication link is searched for in the communication DSN. Any edge that is present in the collaboration DSN but absent in the communication DSN is identified as a missing link smell.

Fig. 4. Proposed framework

The steps mentioned above are performed on the selected projects using the codeface4smells tool. The tool preprocesses the provided artifacts, i.e., the source code repository and mailing list, and generates the developers' collaboration and communication networks [3]. The generated networks are then used to detect occurrences of missing link smells. For each evaluated project, the tool returns the list of missing link smells along with the developers involved with these smells. The developers involved in at least one missing link smell are identified as smelly developers, and the rest are considered non-smelly developers. Figure 5 illustrates the collaboration network and the instances of missing link smell for a six-month window of the Mahout project. There are 15 developers in the collaboration network for this specific window.
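The comparison step described above reduces to a set difference over undirected edges. The following is a minimal sketch; the developer names and edge lists are hypothetical (loosely mirroring the E and F example from the Methods section), while the actual study uses the networks produced by codeface4smells:

```python
# A minimal sketch of missing link smell detection. An undirected edge is
# stored as a frozenset so that (a, b) and (b, a) compare equal.
def missing_links(collaboration_edges, communication_edges):
    collab = {frozenset(e) for e in collaboration_edges}
    comm = {frozenset(e) for e in communication_edges}
    # Edges present in the collaboration DSN but absent from the
    # communication DSN are instances of missing link smell.
    return collab - comm

# Hypothetical six-developer community:
collaboration = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
communication = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

smells = missing_links(collaboration, communication)
print(smells)  # the single missing link between E and F

# Developers involved in at least one missing link are "smelly";
# everyone else is non-smelly.
smelly = {dev for edge in smells for dev in edge}
print(sorted(smelly))  # ['E', 'F']
```

Representing edges as frozensets keeps the comparison direction-independent, which matches the undirected nature of both DSNs.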
The original names of the developers are not disclosed for privacy reasons. The instances of missing links are marked in the network, and the developers involved with missing link smells are marked in red. There are three instances of missing link smell, i.e., B-I, B-G, and K-L, involving five developers: B, I, G, K, and L. These five developers are considered smelly developers.

Table 1. List of analyzed projects
#  Project     Repository                    Commits  Files  kLOC  Analysis period       Authors  Age (years)
1  ActiveMQ    github.com/apache/activemq      10771   5454   970  Apr 2006 to Jan 2021      143           15
2  Cassandra   github.com/apache/cassandra     25896   3989   989  Oct 2009 to Sep 2020      438           11
3  Cayenne     github.com/apache/cayenne        6644   5093   539  Nov 2007 to Aug 2020       62           13
4  CXF         github.com/apache/cxf           16080  11701  1392  Nov 2010 to Sep 2020      203           10
5  Jackrabbit  github.com/apache/jackrabbit     8848   3610   660  Dec 2005 to Sep 2020       50           15
6  Mahout      github.com/apache/mahout         4480   2095   483  Oct 2008 to Aug 2020       64           12
7  Pig         github.com/apache/pig            3696   2458   591  Oct 2010 to Aug 2020       44           10
   Average                                     10916   4914   803                            143           12

Fig. 5. Instances of missing link smells in a window of the Mahout project

3) Measuring Productivity

The productivity of an individual can be measured as the amount of output generated per unit of time [19]. The most straightforward approach to measuring the contribution of a developer is to count the number of commits. However, assessing contribution by the number of commits alone is not a viable measurement because not all commits are equal in size. Therefore, the size of commits should be taken into account when measuring a developer's contribution. The total number of modified lines in a commit is used to measure the size of that commit; a previous study used a similar approach to measure developer contribution [20].
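As an illustration of this contribution measure, the sketch below sums added and deleted lines per developer from hypothetical commit records (in practice these would be extracted from the repository, e.g., with git log) and divides by the number of active days:

```python
from collections import defaultdict

# Hypothetical commit records: (author, date, added_lines, deleted_lines).
commits = [
    ("alice", "2020-01-05", 120, 30),
    ("alice", "2020-01-05", 10, 2),
    ("alice", "2020-01-07", 50, 8),
    ("bob",   "2020-01-05", 300, 100),
]

def productivity(commits):
    changes = defaultdict(int)       # total added + deleted lines per author
    active_days = defaultdict(set)   # days with at least one commit
    for author, day, added, deleted in commits:
        changes[author] += added + deleted
        active_days[author].add(day)
    # Productivity: total number of changes per active day.
    return {a: changes[a] / len(active_days[a]) for a in changes}

print(productivity(commits))
# alice: 220 changes over 2 active days -> 110.0; bob: 400 over 1 day -> 400.0
```

Note that two commits on the same day count as one active day, which is why the day is tracked as a set rather than a counter.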
The contribution of a developer is extracted from the project repository. First, all the commits of an individual developer and all the files modified in these commits are identified. Then the number of changes, i.e., the sum of added and deleted lines, in the modified files is calculated, and the total number of changes is computed as the sum of all changes by that developer. Next, the number of active days of the developer is measured by analyzing the developer's commit history: an active day is a day on which the developer made at least one commit to the repository. Productivity is then calculated as the number of changes per active day, as shown in Equation (1).

Productivity = Number of total changes / Active days (1)

4) Data Analysis

This study aims to understand whether smelly developers exhibit different productivity compared to non-smelly developers. The following null hypothesis is formulated to investigate the impact of missing link smell on developers' productivity:

H0: The productivity of smelly and non-smelly developers is not significantly different.

To attempt to reject H0, the Wilcoxon rank sum test, a non-parametric statistical test, is used. This test can determine whether two ordinal or interval non-parametric distributions differ significantly. The test statistic (W) indicates a significant difference between two sample sets if their ranks differ significantly. The test is used to assess whether productivity differs between the smelly and non-smelly developer groups and whether the observed difference is statistically significant. The result is considered significant if the p-value is less than 0.01.
III. Results and Discussion

This section presents and discusses the results obtained through experimentation on the selected projects, performed according to the methodology stated above. The resulting dataset consists of 1004 developers from seven different projects. The number of smelly and non-smelly developers in all evaluated projects is reported in Table 2. The total number of smelly developers is 468, and the number of non-smelly developers is 536. Figure 6 illustrates the project-wise ratio of smelly and non-smelly developers.

Table 2. Number of smelly and non-smelly developers
#  Project     #Committers  #Non-smelly  #Smelly
1  ActiveMQ            143           74       69
2  Cassandra           438          233      205
3  Cayenne              62           25       37
4  CXF                 203          116       87
5  Jackrabbit           50           26       24
6  Mahout               64           40       24
7  Pig                  44           22       22
   Average             143          536      468

The productivity of both smelly and non-smelly developers is measured as the number of changes per active day. Thus, the dataset contains two developer groups, i.e., smelly and non-smelly, with their corresponding productivity values. The Wilcoxon rank sum test is then performed to assess the null hypothesis H0, which states that productivity does not differ between these two groups; the p-value obtained from the test is used to accept or reject the null hypothesis. The mean productivity of each group is also calculated. The productivity of smelly and non-smelly developers is reported in Table 3. The mean productivity of smelly developers is 333.90, whereas the mean productivity of non-smelly developers is 445.84. The observed difference is significant according to the Wilcoxon rank sum test (W = 72374, p-value < 0.01), so the null hypothesis H0 can be rejected. Thus, the result implies that the productivity of smelly and non-smelly developers is significantly different.
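The group comparison can be reproduced in outline with a small standard-library implementation of the rank sum test (normal approximation, no tie handling). The productivity values below are illustrative, not the study's data; in practice a library routine such as scipy.stats.ranksums would be used:

```python
import math

def rank_sum_test(x, y):
    # Wilcoxon rank sum test using the normal approximation; this minimal
    # sketch assumes there are no tied values across the two samples.
    combined = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(combined)}
    w = sum(rank[v] for v in x)                 # rank sum of the first sample
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2               # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = 1 - math.erf(abs(z) / math.sqrt(2))     # two-sided p-value
    return w, p

# Illustrative productivity values (changes per active day) for two groups:
smelly = [100, 120, 150, 90, 110]
non_smelly = [400, 450, 380, 500, 420]

w, p = rank_sum_test(smelly, non_smelly)
print(w, p)  # the smelly group occupies the lowest ranks: w = 15, p < 0.05
```

Because the test compares ranks rather than raw values, it makes no normality assumption about the productivity distributions, which is why the study prefers it over a t-test.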
The mean productivity of non-smelly developers is significantly higher than that of smelly developers.

Table 3. Mean productivity of smelly and non-smelly developers
Developer    Productivity (mean)  p-value  Decision
Smelly       333.90               < 0.01   Effect exists
Non-smelly   445.84

Fig. 6. Number of smelly and non-smelly developers

The result suggests that the developers involved in missing link smell show lower productivity, in terms of the number of changes per active day, than the developers who are not involved in missing link smell. This indicates that missing link smell affects the productivity of developers negatively. Lower developer productivity can increase the cost of a software project; hence, missing links should be monitored carefully, and steps should be taken to mitigate these smells if necessary.

IV. Conclusion

This study investigates the effect of missing link smell on developers' productivity. The productivity of 1004 developers from seven open-source projects is analyzed. Missing link smells are identified in these projects, and the developers are categorized into two groups, i.e., smelly and non-smelly. Productivity is measured as the number of changes performed by a developer per active day. The Wilcoxon rank sum test shows that productivity differs significantly between smelly and non-smelly developers: the developers who are not involved in any missing link smell show higher productivity than the developers involved in smells. The result suggests that missing link smells should be taken care of to manage development productivity effectively; they should be monitored, and necessary steps should be taken to mitigate them to maintain productivity and control software cost.

As a limitation, the missing link smells detected by codeface4smells are included in the study directly, without further verification.
Moreover, the tool uses the mailing list as the source of communication data when generating the communication network. The results could differ if other communication channels, such as Skype or Slack, were considered. However, according to the contribution guidelines of the evaluated projects, the mailing list is the primary communication channel in these communities. In the future, more open-source projects can be analyzed to generalize the results. Moreover, other types of community smell, such as organizational silo and radio silence, can also be considered to study their effect on productivity.

Acknowledgment

Bangladesh Research and Education Network (BdREN) provided the virtual machine facility used in this research.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information is available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] D. A. Tamburri, P. Kruchten, P. Lago, and H. van Vliet, "Social debt in software engineering: insights from industry," J. Internet Serv. Appl., vol. 6, no. 1, pp. 1–17, 2015.
[2] F. Palomba, D. A. Tamburri, F. Arcelli Fontana, R. Oliveto, A. Zaidman, and A. Serebrenik, "Beyond technical aspects: how do community smells influence the intensity of code smells?," IEEE Trans. Softw. Eng., vol. 47, no. 1, pp. 108–129, 2018, doi: 10.1109/tse.2018.2883603.
[3] S.
Magnoni, "An approach to measure community smells in software development communities," Politecnico di Milano, Italy, 2016.
[4] A. Trendowicz and J. Münch, "Factors influencing software development productivity: state-of-the-art and industrial experiences," Advances in Computers, vol. 77, Elsevier, pp. 185–241, 2009, doi: 10.1016/s0065-2458(09)01206-6.
[5] D. A. Tamburri, F. Palomba, and R. Kazman, "Exploring community smells in open-source: an automated approach," IEEE Trans. Softw. Eng., vol. 47, no. 3, pp. 630–652, 2021, doi: 10.1109/tse.2019.2901490.
[6] D. A. Tamburri, "Software architecture social debt: managing the incommunicability factor," IEEE Trans. Comput. Soc. Syst., vol. 6, no. 1, pp. 20–37, 2019, doi: 10.1109/tcss.2018.2886433.
[7] F. Giarola, "Detecting code and community smells in open-source: an automated approach," Politecnico di Milano, Italy, 2018.
[8] G. Catolino, F. Palomba, D. A. Tamburri, and A. Serebrenik, "Understanding community smells variability: a statistical approach," in 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), 2021, pp. 77–86, doi: 10.1109/icse-seis52602.2021.00017.
[9] F. Palomba and D. A. Tamburri, "Predicting the emergence of community smells using socio-technical metrics: a machine-learning approach," J. Syst. Softw., vol. 171, p. 110847, 2021, doi: 10.1016/j.jss.2020.110847.
[10] N. Almarimi, A. Ouni, and M. W.
Mkaouer, "Learning to detect community smells in open source software projects," Knowledge-Based Syst., vol. 204, p. 106201, 2020, doi: 10.1016/j.knosys.2020.106201.
[11] N. Almarimi, A. Ouni, M. Chouchen, I. Saidani, and M. W. Mkaouer, "On the detection of community smells using genetic programming-based ensemble classifier chain," in Proceedings of the 2020 ACM/IEEE 15th International Conference on Global Software Engineering (ICGSE 2020), 2020, pp. 43–54, doi: 10.1145/3372787.3390439.
[12] F. Palomba, D. A. Tamburri, A. Serebrenik, A. Zaidman, F. A. Fontana, and R. Oliveto, "How do community smells influence code smells?," in Proceedings of the International Conference on Software Engineering, 2018, pp. 240–241, doi: 10.1145/3183440.3194950.
[13] B. Eken, F. Palma, B. Ayşe, and T. Ayşe, "An empirical study on the effect of community smells on bug prediction," Softw. Qual. J., vol. 29, no. 1, pp. 159–194, 2021.
[14] G. Catolino, F. Palomba, D. A. Tamburri, A. Serebrenik, and F. Ferrucci, "Gender diversity and women in software teams: how do they affect community smells?," in Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS 2019), 2019, pp. 11–20, doi: 10.1109/icse-seis.2019.00010.
[15] G. Catolino, F. Palomba, D. A. Tamburri, A. Serebrenik, and F. Ferrucci, "Gender diversity and community smells: insights from the trenches," IEEE Softw., vol. 37, no. 1, pp. 10–16, 2020, doi: 10.1109/ms.2019.2944594.
[16] G. Catolino, F. Palomba, D. A. Tamburri, A. Serebrenik, and F. Ferrucci, "Refactoring community smells in the wild: the practitioner's field manual," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Society, 2020, pp. 25–34.
[17] T. Ahammed, M. Asad, and K.
Sakib, "Understanding the involvement of developers in missing link community smell: an exploratory study on Apache projects," in Proceedings of the 8th International Workshop on Quantitative Approaches to Software Quality, co-located with APSEC 2020, Singapore (virtual), 2020, pp. 64–70.
[18] T. Ahammed, M. Asad, and K. Sakib, "Understanding the relationship between missing link community smell and fix-inducing changes," in Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE), 2021, pp. 469–475, doi: 10.5220/0010500604690475.
[19] S. Wagner and F. Deissenboeck, "Defining productivity in software engineering," in Rethinking Productivity in Software Engineering, Springer, 2019, pp. 29–38.
[20] G. Gousios, E. Kalliamvakou, and D. Spinellis, "Measuring developer contribution from software repository data," in Proceedings of the International Conference on Software Engineering, 2008, pp. 129–132, doi: 10.1145/1370750.1370781.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol 3, No 2, December 2020, pp. 60–66, eISSN 2597-4637, https://doi.org/10.17977/um018v3i22020p60-66
©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Efficient Scheduling of Plantation Company Workers Using Genetic Algorithm

Wayan Firdaus Mahmudy 1,*, Andreas Pardede 2, Agus Wahyu Widodo 3, Muh Arif Rahman 4
Faculty of Computer Science, Brawijaya University, Jl. Veteran No. 8, Malang 65145, Indonesia
1 wayanfm@ub.ac.id *; 2 andreas.pgpard@gmail.com; 3 a_wahyu_w@ub.ac.id; 4 m_arif@ub.ac.id
* Corresponding author
I. Introduction

Scheduling is the activity of allocating certain resources to perform a job or task in relation to time. Scheduling is part of the industry's decision-making process for allocating existing data or resources so that they are utilized more optimally [1]. Good scheduling is required by a company that works in the field of plantation and garden management. The company has a number of tasks, such as planning all plant and green-area care, providing all fertilizer and plant material needs to ensure that all plants grow well and remain productive, harvesting crops, and maintaining the stock of fertilizer. Plantation activities are carried out by plantation workers whose schedules are determined by the company.

The density of worker activities must be balanced with efficient and fair work scheduling. A good schedule will minimize worker dissatisfaction and work stress while maintaining physical health [2][3]. The schedule should consider a fair allocation of working time on holidays, ensure all tasks are assigned to a worker, and even out the number of working days of each worker every month. Scheduling that involves complex constraints is a challenging task, and several methods have been proposed in the literature. Meta-heuristic algorithms are often applied as they have the capability to deal with complex constraints. Examples of algorithms that have been applied to complex scheduling are the ant colony optimization algorithm [4][5], simulated annealing [6][7], tabu search [8], particle swarm optimization [9][10], variable neighborhood search [11][12][13], and the genetic algorithm [14][15][16]. The genetic algorithm is a class of evolutionary algorithm that is the most widely used for optimization [17]. Several studies have reported the genetic algorithm's robustness in dealing with complex problems, including scheduling [18][19][20]. This study aims to optimize the plantation company workers' schedules using a genetic algorithm.
A proper chromosome representation for the genetic algorithm is key to efficiently exploring the large search space of the problem [21]. Thus, an efficient chromosome representation is designed to produce a good schedule in a reasonable amount of time.

Article Info
Article history: Received 23 August 2020; Revised 21 October 2020; Accepted 2 November 2020; Published online 31 December 2020.

Abstract: Workers at large plantation companies have various activities. These activities include caring for plants, regularly applying fertilizers according to schedule, and crop harvesting. The density of worker activities must be balanced with efficient and fair work scheduling. A good schedule will minimize worker dissatisfaction while also maintaining their physical health. This study aims to optimize workers' schedules using a genetic algorithm. An efficient chromosome representation is designed to produce a good schedule in a reasonable amount of time. The mutation method combines reciprocal mutation and exchange mutation, the crossover type used is one cut point, and the selection method is elitism selection. A set of computational experiments is carried out to determine the best parameter values of the genetic algorithm. The final result is a better 30-day worker schedule compared to the previous schedule that was produced manually. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: scheduling; genetic algorithm; crossover; mutation; best parameter
This study considers a scheduling problem in a company that has a large garden and plantation area: a total plantation area of ±50 ha and plantation management of ±200 ha. 18 workers are assigned to 3 work shifts every day. The schedule produced in this study serves as a guideline for the working time and vacation time of plantation workers. In the scheduling process, the components used are workers, days, and shifts. The worker shift schedule aims to provide a daily schedule divided into thirty days of shifts for each worker. In this study, thirteen employees are assigned to the 3 shifts of working time. The morning shift starts at seven o'clock in the morning and ends at three o'clock in the afternoon, followed by the afternoon shift from three o'clock in the afternoon until eleven o'clock at night. The night shift starts at eleven o'clock at night and ends at seven o'clock in the morning.

II. Method

The current schedule is made manually. Some workers feel dissatisfied because they get more working time on national holidays than other workers. Other problems arise when a worker is assigned to 2 consecutive night shifts, as this affects the worker's physical health. A genetic algorithm provides an alternative to traditional search techniques by adapting the mechanisms found in natural genetics. The genetic algorithm has been successfully applied to complex scheduling [18][19][20]. There are several stages in implementing a genetic algorithm, including determining the chromosome structure as the solution representation, crossover and mutation to produce new solutions, and selection to pass the chosen solutions to the next iterations [22]. Determining the chromosome representation is a crucial step in implementing the genetic algorithm. The chromosome representation used here is an integer representation.
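The stages just listed can be tied together in a small, generic loop. This is an illustrative sketch, not the authors' implementation: the toy bit-string problem, the operator stubs, and the default parameter values are placeholders (though the 0.4/0.6 defaults echo the crossover and mutation rates used in the tests later in the paper).

```python
import random

def genetic_algorithm(init, evaluate, crossover, mutate, pop_size=20,
                      generations=50, crossover_rate=0.4, mutation_rate=0.6):
    """Generic GA cycle: representation -> crossover/mutation -> elitism selection."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(int(crossover_rate * pop_size) // 2):
            a, b = random.sample(population, 2)
            offspring.extend(crossover(a, b))
        for _ in range(int(mutation_rate * pop_size)):
            offspring.append(mutate(random.choice(population)))
        # elitism: pool parents and offspring, keep the fittest pop_size
        pool = population + offspring
        pool.sort(key=evaluate, reverse=True)
        population = pool[:pop_size]
    return population[0]

# Toy demonstration only (not the paper's problem): evolve a 5-bit string toward all ones
best = genetic_algorithm(
    init=lambda: [random.randint(0, 1) for _ in range(5)],
    evaluate=sum,
    crossover=lambda a, b: (a[:2] + b[2:], b[:2] + a[2:]),
    mutate=lambda p: [1 - g if random.random() < 0.2 else g for g in p],
)
```

The worker-scheduling problem plugs in through the `init`, `evaluate`, `crossover`, and `mutate` callbacks, which the following paragraphs make concrete.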
Chromosomes are constructed based on the division of the number of plants, the number of workers, and the work shifts. Each chromosome contains worker numbers assigned to each existing shift. An example of part of a chromosome for 2 working days is presented in Fig. 1. There are three shifts: morning, afternoon, and night. The workers are then divided according to the working days and work shifts set by the company. In Fig. 1, the numbers refer to worker numbers. The number of workers in this calculation is 13, so numbers from 1 to 13 are assigned to 9 cells per day (each work shift requires three workers). This assignment is repeated for the 30 days of one month; a worker must not be scheduled twice on the same day and time, and the distribution of the scheduling must be balanced. Therefore, good scheduling is needed in which each worker's workload is on par with the others'.

Fig. 1. Example of a chromosome (genes for 2 working days)
Day 1: morning 1, 4, 2; afternoon 3, 1, 11; night 12, 6, 5
Day 2: morning 6, 7, 9; afternoon 9, 1, 2; night 3, 8, 10

Penalty calculation is used to find the fitness value of each possible solution [23]. The assessment measures how good a chromosome is for worker scheduling. The number of violations (penalties) that appear on a chromosome is required to calculate the fitness value; each violation increases the penalty value by 1. The list of penalties that have been determined is presented in Table 1.

Table 1. List of penalties
P1: one worker is registered more than once on the same day/shift
P2: a worker has more vacation days than specified
P3: a worker who had the night shift works again on the morning shift of the next day

The fitness of a chromosome is calculated by first counting the number of penalties on the chromosome.
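The encoding and penalty counting described above can be sketched in Python. This is a minimal sketch, not the authors' code: the constants mirror the paper (13 workers, 30 days, 3 shifts of 3 workers), while `MAX_VACATION_DAYS` is a made-up limit, since the paper does not state the exact vacation allowance behind penalty P2.

```python
import random

NUM_WORKERS = 13        # worker ids 1..13
DAYS = 30               # one-month schedule
SHIFTS_PER_DAY = 3      # morning, afternoon, night
WORKERS_PER_SHIFT = 3   # three workers fill each shift
MAX_VACATION_DAYS = 10  # assumed limit; the paper does not specify the allowance for P2

GENES_PER_DAY = SHIFTS_PER_DAY * WORKERS_PER_SHIFT  # 9 cells per day

def random_chromosome():
    """A chromosome is a flat list of worker ids: 30 days x 3 shifts x 3 slots."""
    return [random.randint(1, NUM_WORKERS) for _ in range(DAYS * GENES_PER_DAY)]

def count_penalties(chrom):
    """Count the violations P1-P3 listed in Table 1."""
    days = [chrom[d * GENES_PER_DAY:(d + 1) * GENES_PER_DAY] for d in range(DAYS)]
    p1 = p2 = p3 = 0
    for d, day in enumerate(days):
        # P1: the same worker registered more than once on one day
        p1 += len(day) - len(set(day))
        # P3: a worker on the night shift (last 3 cells) works the
        # next morning shift (first 3 cells of the following day)
        if d + 1 < DAYS:
            p3 += len(set(day[6:9]) & set(days[d + 1][0:3]))
    # P2: a worker has more vacation (unassigned) days than allowed
    for w in range(1, NUM_WORKERS + 1):
        if sum(1 for day in days if w not in day) > MAX_VACATION_DAYS:
            p2 += 1
    return p1 + p2 + p3
```

A chromosome with no violations yields a total penalty of 0 and hence, by the fitness formula that follows in the text, a fitness of 1.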
For example, suppose there are 9, 12, and 4 violations of P1, P2, and P3, respectively. The total penalty is calculated as in equation (1):

total penalty = penalty 1 + penalty 2 + penalty 3 = 9 + 12 + 4 = 25 (1)

The fitness is calculated from the total penalty as shown in equation (2):

fitness = 100 / (100 + total penalty) = 100 / (100 + 25) = 0.8 (2)

Initial population generation is required as the first step of the computation. The population is formed from randomized data with a predetermined population size, and the population size is kept equal in every generation. Crossover is needed in the genetic process to obtain offspring (new chromosomes) by choosing two parents. One cut point crossover crosses 2 randomly chosen chromosomes (individuals) by cutting both chromosomes at one point and recombining each part with the other parent's cut result. The mutation process obtains offspring from 1 parent. Exchange mutation is used; it randomly selects two positions (exchange points) and then exchanges the values at those two positions. After calculating the fitness of all chromosomes resulting from the crossover and mutation processes, it is necessary to select which chromosomes will be passed to the next generation. Elitism selection is implemented by gathering both parents and offspring into a pool; the chromosomes with the highest fitness values are then selected.

III. Results and Discussions

A. Population Size Testing

Testing at this stage measures the effect of population size (the number of chromosomes in the population) on the fitness value. The test uses a crossover rate of 0.4 and a mutation rate of 0.6. The results of the test are presented in Table 2 and Fig. 2. Fig. 2 shows that at low population sizes, the fitness value is lower than at the others.
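The fitness formula and the three operators described in the Method section above can be sketched as follows. This is an illustrative sketch under the paper's formulas, not the authors' implementation.

```python
import random

def fitness_from_penalty(total_penalty):
    # Equations (1)-(2): fitness = 100 / (100 + total penalty)
    return 100 / (100 + total_penalty)

def one_cut_point_crossover(parent_a, parent_b):
    """Cut two parent chromosomes at one random point and swap the tails."""
    cut = random.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def exchange_mutation(parent):
    """Swap the values at two randomly chosen gene positions."""
    child = list(parent)
    i, j = random.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def elitism_selection(pool, fitness_values, pop_size):
    """Keep the pop_size fittest chromosomes from the parent+offspring pool."""
    ranked = sorted(zip(fitness_values, range(len(pool))), reverse=True)
    return [pool[i] for _, i in ranked[:pop_size]]
```

For the worked example above, `fitness_from_penalty(25)` gives 0.8, matching equation (2).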
The greater the population size, the better the fitness value tends to be, but at some point an increase in population size no longer provides a significant increase in fitness. At population sizes of 140 and 200, the fitness value achieves the best value (0.69444) compared to the other population sizes, so a population size of 200 is taken as optimal and is used for further testing.

Table 2. Result of population size testing (population size: fitness average)
20: 0.66225
40: 0.66666
60: 0.68493
80: 0.67567
100: 0.68027
120: 0.67567
140: 0.69444
160: 0.68027
180: 0.68965
200: 0.69444
220: 0.68027
240: 0.67114

B. Number of Generations Testing

Testing the number of generations aims to find the optimum number of generations by looking at the best average fitness. In this stage, the population size used is 200, and the crossover rate and mutation rate are 0.4 and 0.6, respectively. As with population size, the greater the number of generations, the better the fitness value tends to be, but beyond a certain point adding generations no longer provides a significant increase in fitness. Fig. 3 shows a significant increase in the average fitness value from generation 0 to generation 140. From the 172nd generation, the average fitness value no longer increases; the average fitness obtained there is 0.71942. Therefore, 172 generations is taken as the best setting.

C. Crossover and Mutation Rate Testing

Tests of the crossover rate and mutation rate are carried out to determine the optimal values for reaching the best fitness. As explained previously, crossover and mutation are used to combine parents and produce offspring for the next generations.
In this test, rate values from 0 to 1 with an interval of 0.1 are used, with the mutation and crossover rates summing to 1 in each combination. From these results, the best combination for this optimization can be seen; the results are presented in Table 3. The testing starts from a mutation rate of 1 and a crossover rate of 0; this first combination produces a fitness of 0.609, which is relatively low compared with the other results. The highest fitness average is found at a mutation rate of 0.4 and a crossover rate of 0.6, so these values are used for the next stage.

D. Scheduling Result

This stage obtains the best schedule using the best parameter values found in the previous tests: a population size of 200 and 172 generations, with a crossover rate and mutation rate of 0.6 and 0.4, respectively. The change of fitness values over the generations is shown in Fig. 4. Within 172 generations, the genetic algorithm obtained a solution with a smaller number of penalties; the fitness value obtained is 0.7246, and the schedule is shown in Table 4. In the 30-day worker schedule there are 3 divisions, one for each shift, starting from the morning, then the afternoon, and then the night shift; each shift is filled with workers' ids.

Fig. 2. Result of population size testing (average fitness versus population size)

Table 3.
Result of crossover and mutation rate testing (no: mutation rate, crossover rate, fitness average)
1: 1.0, 0.0, 0.6092052235648612
2: 0.9, 0.1, 0.6244287846532304
3: 0.8, 0.2, 0.6415168703054328
4: 0.7, 0.3, 0.6672733216846897
5: 0.6, 0.4, 0.6775241304953841
6: 0.5, 0.5, 0.6938093972311976
7: 0.4, 0.6, 0.6951620993246688
8: 0.3, 0.7, 0.6950206375328978
9: 0.2, 0.8, 0.6837869424975724
10: 0.1, 0.9, 0.6747470681063829
11: 0.0, 1.0, 0.6027972532168469

Table 4. Worker schedule for 30 days (day: morning, afternoon, night)
1: [3, 9, 13], [12, 5, 2], [6, 8, 7]
2: [12, 10, 1], [13, 4, 6], [9, 5, 8]
3: [2, 3, 1], [4, 6, 8], [11, 5, 10]
4: [1, 9, 12], [6, 7, 3], [13, 10, 5]
5: [8, 2, 11], [10, 12, 3], [5, 1, 9]
6: [7, 11, 13], [2, 6, 3], [4, 8, 5]
7: [12, 1, 3], [9, 4, 7], [8, 10, 6]
8: [3, 5, 12], [7, 9, 8], [4, 2, 6]
9: [11, 13, 5], [1, 4, 2], [6, 10, 7]
10: [11, 1, 12], [4, 2, 6], [5, 8, 9]
11: [6, 10, 3], [13, 8, 2], [9, 7, 4]
12: [8, 12, 3], [1, 4, 11], [13, 10, 7]
13: [1, 12, 3], [8, 9, 5], [13, 4, 6]
14: [8, 11, 9], [2, 4, 5], [13, 7, 3]
15: [10, 6, 9], [1, 12, 3], [2, 13, 11]
16: [7, 9, 6], [13, 5, 4], [2, 8, 12]
17: [1, 10, 7], [9, 11, 3], [13, 5, 12]
18: [9, 1, 4], [8, 12, 7], [11, 3, 2]
19: [7, 4, 5], [2, 1, 3], [13, 10, 9]
20: [11, 7, 1], [12, 2, 6], [10, 9, 3]
21: [6, 12, 7], [5, 2, 4], [1, 13, 3]
22: [8, 12, 5], [7, 3, 6], [9, 11, 10]
23: [8, 12, 6], [10, 4, 9], [3, 7, 1]
24: [11, 6, 5], [1, 8, 13], [12, 4, 10]
25: [13, 11, 5], [10, 6, 1], [3, 2, 4]
26: [13, 1, 9], [11, 2, 8], [4, 12, 6]
27: [2, 13, 11], [5, 8, 7], [12, 4, 10]
28: [2, 13, 11], [5, 8, 7], [12, 4, 10]
29: [1, 9, 6], [5, 11, 8], [2, 3, 12]
30: [4, 11, 1], [2, 8, 12], [7, 10, 5]

IV. Conclusion

A genetic algorithm for optimizing worker scheduling can be implemented with an integer chromosome representation. The chromosome representation encodes the worker's id by date of work and daily shift. An initial stage is required to obtain the best parameter values for the genetic algorithm.
The best parameter values are required to ensure that the genetic algorithm produces a good worker schedule in a reasonable amount of time. Using the best parameter values, the genetic algorithm obtained a solution with a smaller number of penalties within 172 generations. A future study will consider adding local searches into the genetic algorithm's cycle to produce a better solution.

Fig. 3. Result of number of generations testing
Fig. 4. Result of the genetic algorithm run using the best parameter values

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.

References

[1] W. F. Mahmudy, R. M. Marian, and L. H. S. Luong, "Real coded genetic algorithms for solving flexible job-shop scheduling problem. Part I: Modelling," vol. 701, 2013, doi: 10.4028/www.scientific.net/AMR.701.359.
[2] H. Kikuchi et al., "Association of overtime work hours with various stress responses in 59,021 Japanese workers: Retrospective cross-sectional study," PLOS ONE, vol. 15, no. 3, p. e0229506, Mar. 2020, doi: 10.1371/journal.pone.0229506.
[3] K. Bhui, S. Dinos, M. Galant-Miecznikowska, B. de Jongh, and S. Stansfeld, "Perceptions of work stress causes and effective interventions in employees working in public, private and non-governmental organisations: a qualitative study," BJPsych Bull., vol. 40, no. 6, pp. 318–325, Dec. 2016, doi: 10.1192/pb.bp.115.050823.
[4] Z. Jia, J. Yan, J. Y. T. Leung, K. Li, and H. Chen, "Ant colony optimization algorithm for scheduling jobs with fuzzy processing time on parallel batch machines with different capacities," Appl. Soft Comput., vol. 75, pp. 548–561, 2019, doi: 10.1016/j.asoc.2018.11.027.
[5] L. Wang, J. Cai, M. Li, and Z. Liu, "Flexible job shop scheduling problem using an improved ant colony optimization," Sci. Program., vol. 2017, p. 9016303, 2017, doi: 10.1155/2017/9016303.
[6] C. Gallo and V. Capozzi, "A simulated annealing algorithm for scheduling problems," J. Appl. Math. Phys., vol. 7, Oct. 2019, doi: 10.4236/jamp.2019.711176.
[7] F. Chahyadi, A. Azhari, and H. Kurniawan, "Hospital nurse scheduling optimization using simulated annealing and probabilistic cooling scheme," Indones. J. Comput. Cybern. Syst., vol. 12, no. 1, pp. 21–32, 2018, doi: 10.22146/ijccs.23056.
[8] A. Dabah, A. Bendjoudi, and A. AitZai, "An efficient tabu search neighborhood based on reconstruction strategy to solve the blocking job shop scheduling problem," J. Ind. Manag. Optim., vol. 13, no. 4, pp. 2015–2031, 2017, doi: 10.3934/jimo.2017029.
[9] A. I. Awad, N. A. El-Hefnawy, and H. M. Abdel_Kader, "Enhanced particle swarm optimization for task scheduling in cloud computing environments," Procedia Comput. Sci., vol. 65, pp. 920–929, 2015, doi: 10.1016/j.procs.2015.09.064.
[10] H. Jiang, J. Liu, H.-W. Cheng, and Y. Zhang, "Particle swarm optimization based space debris surveillance network scheduling," Res. Astron. Astrophys., vol. 17, no. 3, p. 30, 2017, doi: 10.1088/1674-4527/17/3/30.
[11] S. Thevenin and N. Zufferey, "Learning variable neighborhood search for a scheduling problem with time windows and rejections," Discret. Appl. Math., vol. 261, pp. 344–353, 2019, doi: 10.1016/j.dam.2018.03.019.
[12] W. Jomaa, M. Eddaly, and B. Jarboui, "Variable neighborhood search algorithms for the permutation flowshop scheduling problem with the preventive maintenance," Oper. Res., 2019, doi: 10.1007/s12351-019-00507-y.
[13] M. Samà, A. D'Ariano, F. Corman, and D. Pacciarelli, "A variable neighbourhood search for fast train scheduling and routing during disturbed railway traffic situations," Comput. Oper. Res., vol. 78, pp. 480–499, 2017, doi: 10.1016/j.cor.2016.02.008.
[14] R. Rody, W. F. Mahmudy, and I. P. Tama, "Using guided initial chromosome of genetic algorithm for scheduling production-distribution system," J. Inf. Technol. Comput. Sci., vol. 4, no. 1, pp. 26–32, 2019, doi: 10.25126/jitecs.20194195.
[15] M. L. Seisarrina, I. Cholissodin, and H. Nurwarsito, "Invigilator examination scheduling using partial random injection and adaptive time variant genetic algorithm," J. Inf. Technol. Comput. Sci., vol. 3, no. 2, pp. 113–119, 2018, doi: 10.25126/jitecs.20183250.
[16] H. Algethami, R. L. Pinheiro, and D. Landa-Silva, "A genetic algorithm for a workforce scheduling and routing problem," in 2016 IEEE Congress on Evolutionary Computation (CEC), 2016, pp. 927–934, doi: 10.1109/CEC.2016.7743889.
[17] V. Meilia, B. D. Setiawan, and N. Santoso, "Extreme learning machine weights optimization using genetic algorithm in electrical load forecasting," J. Inf. Technol. Comput. Sci., vol. 3, no. 1, pp. 77–87, 2018, doi: 10.25126/jitecs.20183154.
[18] A. Rahmi, W. F. Mahmudy, and M. Z. Sarwani, "Genetic algorithms for optimization of multi-level product distribution," Int. J. Artif. Intell., vol. 18, no. 1, pp. 135–147, 2020.
[19] V. N. Wijayaningrum and W. F. Mahmudy, "Optimization of ship's route scheduling using genetic algorithm," Indones. J. Electr. Eng. Comput. Sci., vol. 2, no. 1, 2016, doi: 10.11591/ijeecs.v2.i1.pp180-186.
[20] L. R. Abreu, J. O. Cunha, B. A. Prata, and J. M. Framinan, "A genetic algorithm for scheduling open shops with sequence-dependent setup times," Comput. Oper. Res., vol. 113, p. 104793, 2020, doi: 10.1016/j.cor.2019.104793.
[21] W. F. Mahmudy, R. M. Marian, and L. H. S. Luong, "Hybrid genetic algorithms for multi-period part type selection and machine loading problems in flexible manufacturing system," in 2013 IEEE Int. Conf. on Computational Intelligence and Cybernetics (CyberneticsCom), Dec. 2013, doi: 10.1109/CyberneticsCom.2013.6865795.
[22] M. Gen and R. Cheng, Genetic Algorithms and Engineering Optimization. New York: John Wiley & Sons, Inc., 2000, doi: 10.1002/9780470172261.
[23] B. F. Rosa, M. J. F. Souza, S. R. de Souza, M. F. de França Filho, Z. Ales, and P. Y. P. Michelon, "Algorithms for job scheduling problems with distinct time windows and general earliness/tardiness penalties," Comput. Oper. Res., vol. 81, pp. 203–215, 2017, doi: 10.1016/j.cor.2016.12.024.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 1, No 1, January 2018, pp. 8–19. https://doi.org/10.17977/um018v1i12018p8-19
©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Decision Support System Determination of Main Work Unit in WPP-711 Using Fuzzy TOPSIS

Hozairi a, 1, *, Yaser Krisnafi b, 2
a Informatics Eng. Study Program, Universitas Islam Madura, Jl. PP Miftahul Ulum Bettet, Pamekasan 69317, Indonesia
b Fishing Technology Study Prog., Sekolah Tinggi Perikanan, Jl. AUP No. 1 Pasar Minggu, Jakarta 12520, Indonesia
1 dr.hozairi@gmail.com*, 2 yaser_bunda@yahoo.co.id
* corresponding author

I. Introduction

The marine and fishery sector has a strategic role in supporting the development of the national economy.
Indonesia has major potential fisheries resources; in fact, production reaches ±6.26 million tons per year. As a result, Indonesia has become a target of illegal fishing by fishermen from several neighboring countries. Republic of Indonesia Law No. 27 of 2007 mandates the management of coastal areas and small islands, which provides the legal basis for fishery resource control activities. Surveillance and law enforcement in the field of fisheries is one of the main tasks and functions of the Directorate of Fishery Patrol Boats, implemented through patrol boats that conduct surveillance operations over marine resources and fisheries. In accordance with the main duties and functions of the fishery patrol boats in Table 1, the area of operation of the fishery patrol boats is divided into 2 (two): the west region (Strait of Malacca, South China Sea, the Indian Ocean, Mentawai of western Sumatra to the south of Java) and the east region (Indian Ocean, northeast Flores, Banda Sea, Arafura Sea, the Maluku Sea, Gulf of Tomini, Sulawesi Sea, and Pacific Ocean). The fisheries management areas of the Republic of Indonesia, often referred to as WPP NRI in Indonesian (or WPP for short), are fishery management areas for fishing, conservation, research, and development of fisheries, covering inland waters, archipelagic waters, the territorial sea, the contiguous zone, and the exclusive economic zone (EEZ) of Indonesia. Pursuant to the Regulation of the Minister of Marine Affairs and Fisheries No. 01/MEN/2009 on fisheries management areas of the Republic of Indonesia, the waters are divided into 11 WPP (Indonesia, 2014); the WPP division is illustrated in Fig. 1.

Article Info. Article history: Received 27 August 2017; Revised 15 September 2017; Accepted 20 October 2017; Published online 8 January 2018.

Abstract: Decision-making to determine the working units to be prioritized for development in order to improve fishery monitoring in WPP-711 is imperative.
The Ministry of Maritime Affairs and Fisheries should avoid mismatched decision-making through long-term calculation and analysis. The problem of determining the priority of working units is a complex problem; thus an appropriate method is required to avoid a mismatched decision. Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is a decision-making method capable of solving multi-criteria problems. The working principle of TOPSIS is to determine the alternative by considering the shortest distance from the positive ideal solution and the furthest distance from the negative ideal solution. To improve the performance of TOPSIS, this research integrates it with fuzzy logic with the aim of giving the right numeric preference values. From the test of 11 alternatives against 6 criteria, the priorities for development of fishery monitoring in WPP-711 are: Pontianak working unit = 0.883, Batam working unit = 0.767, Natuna working unit = 0.681, and Tanjung Pinang working unit = 0.423. Furthermore, the ranking result will be used as the basis for determining the strategy for increasing the monitoring of WPP-711, to minimize state losses due to illegal fishing within Indonesia's WPP-711 region. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). Keywords: decision support system; fuzzy TOPSIS; working unit; WPP-711.

According to data from the Ministry of Marine Affairs and Fisheries (KKP) in 2015, there are 3 (three) areas in Indonesian waters with high vulnerability to illegal fishing by foreign fishing vessels, as presented in Table 2.
Of the three prone areas, one region, the Natuna Sea, is included in WPP-711. Based on Table 2, highly vulnerable water areas are present in WPP-711, which covers several work units spread across the region. This study aims at determining the priority of work unit areas in WPP-711 from 11 (eleven) alternatives and 6 (six) criteria, so as to find the work unit area with the greatest potential to improve the supervision of wild fisheries. The criteria and alternative names can be seen in Tables 3 and 4. Working unit area priority determination is considered a discrete issue: it aims at designating the outstanding alternative from a number of provided alternatives in accordance with several particular criteria, so the problem can be resolved using a multi-criteria decision making (MCDM) method [1], [2].

Table 1. Fishery patrol boats (no. type: amount, size (m), material)
1. FPB Hiu Macan Tutul: 2, 42 m, iron + aluminium
2. FPB Hiu Macan: 6, 36 m, iron + aluminium, fiberglass
3. FPB Hiu: 15, 27 m, aluminium, fiberglass
4. FPB Takalamongan: 1, 21 m, fiberglass
5. FPB Padaido: 1, 21 m, fiberglass
6. FPB Todak: 2, 17 m, fiberglass
7. FPB Baracuda: 2, 17 m, fiberglass
8. FPB Paus: 1, 36 m, steel
9. FPB Akar Bahar: 1, 15 m, fiberglass
10. FPB Orca: 4, 60 m, iron + aluminium
Source: processed primary data

Fig. 1. Indonesian fishery management areas map

Table 2. Indonesia's areas prone to illegal fishing
Area 1: Natuna Sea (Chinese, Vietnamese, and Thai fishermen)
Area 2: Sulawesi Sea (Philippine and Malaysian fishermen)
Area 3: Arafura Sea (Chinese and Thai fishermen)
Source: DPKP KKP RI 2015

The method developed to determine working unit priorities in WPP-711 is the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS).
This method works on the principle that the chosen alternative should have the shortest distance from the positive ideal solution and the farthest distance from the negative ideal solution, using Euclidean distance to determine the relative proximity of an alternative to the optimal solution [3], [4]. TOPSIS considers both the distance to the positive ideal solution and the distance to the negative ideal solution by taking the proximity relative to the positive ideal solution. Based on a comparison of the relative distances, a priority order of alternatives can be achieved [5], [6]. The main advantages of TOPSIS compared with other MCDM methods for complex decision making are that it is simple to use; it can take into account all kinds of criteria (subjective and objective); its logic is rational and easy for practitioners to understand; the calculation process is very easy; the concept allows the best alternative to be depicted mathematically in a simple way; and importance weights can be incorporated easily [2], [7], [8], [9]. In classical TOPSIS, performance ratings and criteria weights are given as crisp values. One of the problems of traditional TOPSIS is the use of crisp values in the evaluation process; a further difficulty is that some criteria are hard to measure with crisp values, so such criteria are usually ignored during evaluation [10], [11]. Triangular fuzzy numbers are used in fuzzy TOPSIS to simplify the calculation in the decision-making process. In addition, it has been verified that modeling with triangular fuzzy numbers is an effective way of formulating decision problems where the available information is subjective and inaccurate [12], [13], [14].

II. Methods

This research started by distributing questionnaires to 25 respondents (experts) who are acquainted with and understand the condition of WPP-711. The purpose of the questionnaire is to obtain input data to test the consistency of the assessment of each alternative.
The rating scale is very bad (VB), bad (B), average (A), good (G), and very good (VG). When converted to crisp values, the aggregation of the questionnaire uses five triangular fuzzy numbers, as shown in Fig. 2.

Hozairi and Y. Krisnafi / Knowledge Engineering and Data Science 2018, 1 (1): 8-19

Table 3. Working unit determination priority criteria
Code  Criteria
K1    Border area
K2    Potential fish resources
K3    International sea lanes
K4    Infrastructure
K5    Patrol ship amount
K6    Law enforcement

Table 4. Working unit WPP-711 alternatives
Code  Working unit WPP-711
A1    SDKP Pontianak
A2    Pemangkat
A3    Teluk Batang
A4    Sungai Liat
A5    Tanjung Balai Karimun
A6    Moro
A7    Batam
A8    Tarempa
A9    Natuna
A10   Pulau Kijang
A11   Tanjung Pinang

The results of the questionnaire assessment based on the predetermined rating can be seen in Table 5, for example on alternative A1 (Pontianak working unit). The next step is to convert these ratings into fuzzy numbers according to the fuzzy set in Table 6, which gives the following results:
- K1 = 5 = (0.75, 1, 1)
- K2 = 5 = (0.75, 1, 1)
- K3 = 4 = (0.5, 0.75, 1)
- K4 = 4 = (0.5, 0.75, 1)
- K5 = 4 = (0.5, 0.75, 1)
- K6 = 4 = (0.5, 0.75, 1)

To form a decision matrix from these fuzzy numbers, the next step is defuzzification: for every alternative, the average value is taken on each criterion, yielding the decision matrix values in Table 7. Afterwards, the following TOPSIS stages must be completed:
1. Create a normalized and weighted decision matrix.
2. Determine the positive and negative ideal solution matrices.
3. Determine the distance between each alternative and the positive ideal solution matrix and the negative ideal solution matrix.
4. Compute the preference value for each alternative.
5. Determine the ranking.

Fig. 2. Membership function

Table 5.
Questionnaire recapitulation results
Alternative  K1 K2 K3 K4 K5 K6
A1           5  5  4  4  4  4
A2           4  2  4  3  3  1
A3           3  1  4  1  2  1
A4           3  2  3  4  2  1
A5           4  4  3  1  2  1
A6           5  1  3  3  2  1
A7           5  4  4  3  3  3
A8           4  3  4  2  2  1
A9           4  5  5  4  3  3
A10          5  3  3  2  2  1
A11          5  2  3  3  2  3
Source: survey result

Table 6. Square and root value results
Value   K1     K2     K3     K4     K5     K6
Square  6.951  3.507  4.903  2.826  1.750  1.361
Root    2.637  1.873  2.214  1.681  1.323  1.167

The decision matrix with 11 alternatives and 6 criteria is given in Table 7:

D = [x_11 ... x_1n; ...; x_m1 ... x_mn]   (1)

where D is the decision matrix, m is the number of alternatives, n is the number of criteria, and x_ij is the value of the i-th alternative on the j-th criterion. The following steps and formulas solve the problem using the TOPSIS method.

A. Normalization of the decision matrix
Each element of the matrix D is normalized to obtain the normalization matrix R. Each normalized value r_ij is calculated for i = 1, 2, ..., m and j = 1, 2, ..., n. The squared and root values of the decision matrix are shown in Table 8.

r_ij = x_ij / sqrt(sum_{i=1}^{m} x_ij^2)   (2)

B. Weighted normalized matrix
Given weights W = (w1, w2, ..., wn), the weighted normalized matrix V is generated with i = 1, 2, ..., m and j = 1, 2, ..., n, where v_ij = w_j * r_ij. The result of the normalized matrix is presented in Tables 9 and 10.

V = [w1*r_11 ... wn*r_1n; ...; w1*r_m1 ... wn*r_mn]   (3)

Table 7. Normalized matrix results
Value  K1     K2     K3     K4     K5     K6
A1     0.348  0.489  0.339  0.446  0.567  0.643
A2     0.284  0.133  0.339  0.297  0.378  0.071
A3     0.190  0.044  0.339  0.050  0.189  0.071
A4     0.190  0.133  0.226  0.446  0.189  0.071
A5     0.284  0.400  0.226  0.050  0.189  0.071
A6     0.348  0.044  0.226  0.297  0.189  0.071
A7     0.348  0.400  0.339  0.297  0.378  0.429
A8     0.284  0.267  0.339  0.149  0.189  0.071
A9     0.284  0.489  0.414  0.446  0.378  0.429
A10    0.348  0.267  0.226  0.149  0.189  0.071
A11    0.348  0.133  0.226  0.297  0.189  0.429

Table 8.
Normalized weighting matrix results
Value  K1     K2     K3     K4     K5     K6
A1     0.271  0.234  0.221  0.199  0.206  0.166
A2     0.222  0.064  0.221  0.133  0.137  0.018
A3     0.148  0.021  0.221  0.022  0.069  0.018
A4     0.148  0.064  0.147  0.199  0.069  0.018
A5     0.222  0.191  0.147  0.022  0.069  0.018
A6     0.271  0.021  0.147  0.133  0.069  0.018
A7     0.271  0.191  0.221  0.133  0.137  0.110
A8     0.222  0.127  0.221  0.066  0.069  0.018
A9     0.222  0.234  0.270  0.199  0.137  0.110
A10    0.271  0.127  0.147  0.066  0.069  0.018
A11    0.271  0.064  0.147  0.133  0.069  0.110

C. Determining the positive and negative ideal solutions
The positive ideal solution is denoted by A+ and the negative ideal solution by A-:

A+ = {(max_i v_ij | j in J), (min_i v_ij | j in J')} = {v1+, v2+, ..., vn+}
A- = {(min_i v_ij | j in J), (max_i v_ij | j in J')} = {v1-, v2-, ..., vn-}   (4)

where v_ij is the element of matrix V in the i-th row and j-th column, J = {j = 1, 2, ..., n | j is a benefit criterion}, and J' = {j = 1, 2, ..., n | j is a cost criterion}.

D. Separation measure
The separation measure is the distance of each alternative from the positive and negative ideal solutions. The mathematical calculations are:
- Separation measure from the positive ideal solution, for i = 1, 2, ..., m:
  S_i+ = sqrt(sum_{j=1}^{n} (v_ij - v_j+)^2)   (5)
- Separation measure from the negative ideal solution, for i = 1, 2, ..., m:
  S_i- = sqrt(sum_{j=1}^{n} (v_ij - v_j-)^2)   (6)

The results of the separation measures are presented in Tables 10 and 11.

E. Calculating the relative proximity to the positive ideal
The relative proximity of an alternative A_i to the ideal solution is:

C_i = S_i- / (S_i- + S_i+)   (7)

Table 9. Maximum and minimum value on each criterion
Value    K1     K2     K3     K4     K5     K6
Maximum  0.271  0.234  0.270  0.199  0.206  0.166
Minimum  0.148  0.021  0.147  0.022  0.069  0.018

Table 10.
Square value on each alternative
Value      Benefit  Cost
Square 1   0.002    0.138
Square 2   0.065    0.030
Square 3   0.135    0.005
Square 4   0.100    0.033
Square 5   0.091    0.034
Square 6   0.105    0.027
Square 7   0.016    0.075
Square 8   0.074    0.024
Square 9   0.010    0.110
Square 10  0.085    0.028
Square 11  0.070    0.038

F. Sorting options
The alternatives can be ranked in order of C_i; the best alternative is the one closest to the positive ideal solution and farthest from the negative ideal solution.

III. Results and Discussion

Table 7 presents the questionnaire assessment results for each criterion of each alternative against the predetermined criteria. Below is an example of the assessment of the Pontianak working unit alternative = [0.92, 0.92, 0.75, 0.75, 0.75, 0.75], meaning:
1. Border area [K1] = 0.92 [very good]
2. Potential fish resources [K2] = 0.92 [very good]
3. International sea lanes [K3] = 0.75 [good]
4. Infrastructure [K4] = 0.75 [good]
5. Patrol ship amount [K5] = 0.75 [good]
6. Law enforcement [K6] = 0.75 [good]

After weighting the preference of each criterion for each alternative, the next phase is to find the square and root values of each criterion, as shown in Table 8. Below is the calculation of the square and root values for the border area criterion [K1]:

|K1|^2 = [A1]^2 + [A2]^2 + [A3]^2 + [A4]^2 + [A5]^2 + [A6]^2 + [A7]^2 + [A8]^2 + [A9]^2 + [A10]^2 + [A11]^2
|K1|^2 = [0.92]^2 + [0.75]^2 + [0.50]^2 + [0.50]^2 + [0.75]^2 + [0.92]^2 + [0.92]^2 + [0.75]^2 + [0.75]^2 + [0.92]^2 + [0.92]^2
|K1|^2 = 6.951 (square value)
|K1| = sqrt(6.951) = 2.637 (root value)

Table 11. Root value on each alternative
Value    Benefit  Cost
Root 1   0.049    0.371
Root 2   0.254    0.172
Root 3   0.367    0.074
Root 4   0.316    0.182
Root 5   0.302    0.185
Root 6   0.324    0.166
Root 7   0.128    0.274
Root 8   0.273    0.155
Root 9   0.101    0.332
Root 10  0.291    0.169
Root 11  0.265    0.194

Table 12.
Priority value on each alternative
Satker          A1     A2     A3     A4     A5     A6     A7     A8     A9     A10    A11
Priority value  0.883  0.404  0.167  0.366  0.380  0.338  0.681  0.363  0.767  0.367  0.423

Table 13. Alternative ranking results
Code            A1     A7     A9     A11    A10    A6     A2     A8     A5     A4     A3
Priority value  0.883  0.767  0.681  0.423  0.404  0.380  0.367  0.366  0.363  0.338  0.167

Using the same approach, the root values of the remaining criteria are obtained as follows:
K2 = sqrt(3.507) = 1.873
K3 = sqrt(4.903) = 2.214
K4 = sqrt(2.826) = 1.681
K5 = sqrt(1.750) = 1.323
K6 = sqrt(1.361) = 1.167

After obtaining the square and root values of each criterion as in Table 8, the next process is calculating the normalization matrix of the border area criterion (K1) for each alternative, as in Table 9:

r11 = x11/|K1| = 0.92/2.637 = 0.348
r21 = x21/|K1| = 0.75/2.637 = 0.284
r31 = x31/|K1| = 0.50/2.637 = 0.190
r41 = x41/|K1| = 0.50/2.637 = 0.190
r51 = x51/|K1| = 0.75/2.637 = 0.284
r61 = x61/|K1| = 0.92/2.637 = 0.348
r71 = x71/|K1| = 0.92/2.637 = 0.348
r81 = x81/|K1| = 0.75/2.637 = 0.284
r91 = x91/|K1| = 0.75/2.637 = 0.284
r10,1 = x10,1/|K1| = 0.92/2.637 = 0.348
r11,1 = x11,1/|K1| = 0.92/2.637 = 0.348

Having obtained the normalized matrix values, the next step is to determine the weighted normalization matrix. Before calculating it, the weight of each criterion must first be determined. The importance of each criterion is assessed on a scale from 1 to 5, namely:
1. Not at all important
2. Slightly important
3. Fairly important
4. Important
5. Very important

The initial weight values (W) indicate the relative importance of each criterion; the weights of each criterion are listed in Table 7. After determining the weights, the weighted normalization matrix in Table 8 can be calculated based on the first step and Equation (2). The following is an example of the weighted matrix calculation.
y11 = w1*r11 = 0.78 * 0.348 = 0.271
y21 = w1*r21 = 0.78 * 0.284 = 0.222
y31 = w1*r31 = 0.78 * 0.190 = 0.148
y41 = w1*r41 = 0.78 * 0.190 = 0.148
y51 = w1*r51 = 0.78 * 0.284 = 0.222
y61 = w1*r61 = 0.78 * 0.348 = 0.271
y71 = w1*r71 = 0.78 * 0.348 = 0.271
y81 = w1*r81 = 0.78 * 0.284 = 0.222
y91 = w1*r91 = 0.78 * 0.284 = 0.222
y10,1 = w1*r10,1 = 0.78 * 0.348 = 0.271
y11,1 = w1*r11,1 = 0.78 * 0.348 = 0.271

The next step is to determine the positive ideal solution matrix and the negative ideal solution matrix based on Equations (3) and (4):

A+ = (y1+, y2+, y3+, ..., yn+)
A- = (y1-, y2-, y3-, ..., yn-)

y_j+ = max_i y_ij (benefit criterion) or min_i y_ij (cost criterion)   (8)

The positive ideal solution is calculated as follows:
y1+ = max(0.271, 0.222, ...) = 0.271   (9)
y2+ = max(0.234, 0.064, ...) = 0.234
and so on, hence:
A+ = (0.271, 0.234, 0.270, 0.199, 0.206, 0.166)

The negative ideal solution is calculated as follows:
y1- = min(0.271, 0.222, ...) = 0.148   (10)
y2- = min(0.234, 0.064, ...) = 0.021
and so on, hence:
A- = (0.148, 0.021, 0.147, 0.022, 0.069, 0.018)

The final positive and negative ideal solutions are shown in Table 9. The next phase is to determine the distance between each alternative and the positive ideal solution matrix and the negative ideal solution matrix. The distance between an alternative and the positive ideal solution matrix is:

D_i+ = sqrt(sum_{j=1}^{n} (y_j+ - y_ij)^2)   (11)

The distance between alternative A_i and the negative ideal solution is:

D_i- = sqrt(sum_{j=1}^{n} (y_ij - y_j-)^2)   (12)

The resulting positive and negative ideal solution distances can be seen in Tables 10 and 11. The next phase is to determine the square and root values of the positive ideal value and the negative ideal value.
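The chain of steps above (Eq. (2) normalization, Eq. (3) weighting, ideal solutions, and the distances of Eqs. (11)-(12)) can be sketched end to end. This is an illustrative Python sketch with a small made-up 3x2 matrix and made-up weights, not the paper's data, and it treats all criteria as benefit criteria, as the maxima/minima in Table 9 are taken.

```python
import math

# End-to-end sketch of the TOPSIS core steps:
#   Eq. (2) vector normalization, Eq. (3) weighting,
#   ideal solutions A+/A-, and Eqs. (11)-(12) distances.
def topsis_distances(X, w):
    n = len(X[0])
    roots = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n)]
    R = [[row[j] / roots[j] for j in range(n)] for row in X]         # Eq. (2)
    V = [[w[j] * R[i][j] for j in range(n)] for i in range(len(X))]  # Eq. (3)
    a_pos = [max(row[j] for row in V) for j in range(n)]             # A+
    a_neg = [min(row[j] for row in V) for j in range(n)]             # A-
    d_pos = [math.dist(row, a_pos) for row in V]                     # Eq. (11)
    d_neg = [math.dist(row, a_neg) for row in V]                     # Eq. (12)
    return d_pos, d_neg

X = [[0.92, 0.75], [0.75, 0.50], [0.50, 0.25]]  # illustrative ratings
w = [0.78, 0.60]                                # illustrative weights
d_pos, d_neg = topsis_distances(X, w)
```

In this toy matrix the first row dominates on every criterion, so its distance to A+ is zero; symmetrically, the last row coincides with A-.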
The results of the square values can be seen in Table 10, while the root values can be seen in Table 11. The final phase of the TOPSIS calculation is determining the preference value for each alternative according to the following equation:

V_i = D_i- / (D_i- + D_i+)   (13)

A greater V_i value indicates that alternative A_i is preferred.

G. Calculating preference values
The preference value of each working unit is calculated as follows:
a. Pontianak SDKP working unit: V1 = 0.547 / (0.547 + 0.049) = 0.917   (14)
b. Pemangkat working unit: V2 = 0.294 / (0.294 + 0.298) = 0.497   (15)
c. Teluk Batang working unit: V3 = 0.074 / (0.074 + 0.541) = 0.120   (16)
d. Sungai Liat working unit: V4 = 0.182 / (0.182 + 0.507) = 0.264   (17)
e. Tanjung Balai Karimun working unit: V5 = 0.302 / (0.302 + 0.340) = 0.470   (18)
f. Moro working unit: V6 = 0.431 / (0.431 + 0.323) = 0.572   (19)
g. Batam working unit: V7 = 0.483 / (0.483 + 0.128) = 0.791   (20)
h. Tarempa working unit: V8 = 0.285 / (0.285 + 0.314) = 0.475   (21)
i. Natuna working unit: V9 = 0.409 / (0.409 + 0.188) = 0.685   (22)
j. Pulau Kijang working unit: V10 = 0.432 / (0.432 + 0.289) = 0.599   (23)
k. Tanjung Pinang working unit: V11 = 0.435 / (0.435 + 0.281) = 0.607   (24)

The ranked preference values are shown in Tables 12 and 13. Based on the preference value ranking of each working unit, four (4) working units in WPP-711 will be prioritized for development, so that security improvement in WPP-711 can be optimized in the long run.
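The closeness coefficient and ranking step of Eq. (13) can be sketched as follows. The (D-, D+) pairs below are three of the values quoted in the text; results are rounded to three decimals, so minor last-digit differences from the paper's rounding are possible.

```python
# Closeness coefficient V_i = D_i- / (D_i- + D_i+), Eq. (13),
# then ranking by descending score.
def closeness(d_minus, d_plus):
    return d_minus / (d_minus + d_plus)

# (D-, D+) pairs for three working units, as quoted in the text.
units = {
    "a1": (0.547, 0.049),  # Pontianak SDKP
    "a7": (0.483, 0.128),  # Batam
    "a9": (0.409, 0.188),  # Natuna
}
scores = {k: round(closeness(dm, dp), 3) for k, (dm, dp) in units.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The resulting order (Pontianak, Batam, Natuna) matches the top of the ranking reported in Table 13.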
As illustrated in Fig. 3, the four areas recommended for development as central monitoring areas in WPP-711 are:
1. Pontianak SDKP = 0.883
2. Batam = 0.767
3. Natuna = 0.681
4. Tanjung Pinang = 0.423

These four areas are well suited to be developed as central monitoring areas in WPP-711 considering the six criteria, and they are capable of representing the other areas in WPP-711. Fig. 4 describes the priorities of the working unit areas in WPP-711: the selected working units are able to cover the other areas based on the distance between regions and the level of vulnerability. The development of these four areas will improve the security of Indonesia's marine resources, which in the long run minimizes losses to the state.

Fig. 3. Priority values of each working unit in WPP-711
Fig. 4. Area map according to the priority values of the four working units in WPP-711

IV. Conclusion

This study has determined 4 working units out of all the existing working units spread across WPP-711. The basic criteria taken into consideration to improve fisheries monitoring in WPP-711 are (1) the border area, (2) the potential of fisheries resources, (3) the international sea lanes, (4) facilities and infrastructure, and (5) the number of patrol ships and law enforcement. Based on the calculation of the fuzzy TOPSIS method, the resulting working unit priorities are: (1) SDKP Pontianak = 0.883, (2) Batam = 0.767, (3) Natuna = 0.681, and (4) Tanjung Pinang = 0.423. The fuzzy TOPSIS results will be taken into consideration in determining the strategy to improve the supervision of fishery areas in WPP-711, reducing state losses due to illegal fishing within Indonesia's legitimate territory.

References

[1] A. N.
Fitriana, "Decision support system to determine student's academic achievement by TOPSIS method," vol. 2, no. 2, pp. 153-164, 2015.
[2] C. Wang and S. Chen, "Multiple attribute decision making based on interval-valued intuitionistic fuzzy sets, linear programming methodology, and the extended TOPSIS method," Inf. Sci., vol. 397-398, pp. 155-167, 2017.
[3] G. Torlak, M. Sevkli, M. Sanal, and S. Zaim, "Analyzing business competition by using fuzzy TOPSIS method: an example of Turkish domestic airline industry," Expert Syst. Appl., vol. 38, no. 4, pp. 3396-3406, 2011.
[4] S. K. Patil and R. Kant, "A fuzzy AHP-TOPSIS framework for ranking the solutions of knowledge management adoption in supply chain to overcome its barriers," Expert Syst. Appl., vol. 41, no. 2, pp. 679-693, 2014.
[5] S. Abdurrahman, Book of Data and Information on Marine Resource and Fishery Monitoring. 2013.
[6] A. Hatami-Marbini and F. Kangi, "An extension of fuzzy TOPSIS for a group decision making with an application to Tehran stock exchange," Appl. Soft Comput., vol. 52, pp. 1084-1097, 2017.
[7] E. Roszkowska and D. Kacprzak, "The fuzzy SAW and fuzzy TOPSIS procedures based on ordered fuzzy numbers," Inf. Sci., vol. 369, pp. 564-584, 2016.
[8] S. H. Zyoud, L. G. Kaufmann, H. Shaheen, S. Samhan, and D. Fuchs-Hanusch, "A framework for water loss management in developing countries under fuzzy environment: integration of fuzzy AHP with fuzzy TOPSIS," Expert Syst. Appl., vol. 61, pp. 86-105, 2016.
[9] A. Hozairi, "Selection of creative industry sector ICT suitable developed in pesantren using fuzzy AHP," vol. 82, no. 1, pp. 131-136, 2015.
[10] P. P. T. Samafitro, "Admission selection prospective manager using fuzzy-TOPSIS at PT. Samafitro," Inf. Manag. Educ. Prof., vol. 1, no. 1, pp. 86-95, 2016.
[11] F. T. Wulandari, "Fuzzy TOPSIS implementation in business strategy planning," Magistra, no. 85, pp. 80-91, 2013.
[12] H. Gupta and M. K. Barua, "Supplier selection among SMEs on the basis of their green innovation ability using BWM and fuzzy TOPSIS," J. Clean. Prod., vol. 152, pp. 242-258, 2017.
[13] P. U. Onu, X. Quan, L. Xu, J. Orji, and E. Onu, "Evaluation of sustainable acid rain control options utilizing a fuzzy TOPSIS multi-criteria decision analysis model framework," J. Clean. Prod., vol. 141, pp. 612-625, 2017.
[14] H. Shakerian, H. D. Dehnavi, and S. B. Ghanad, "The implementation of the hybrid model SWOT-TOPSIS by fuzzy approach to evaluate and rank the human resources and business strategies in organizations (case study: Road and Urban Development Organization in Yazd)," Procedia Soc. Behav. Sci., vol. 230, pp. 307-316, 2016.

Knowledge Engineering
and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 3, No 1, July 2020, pp. 40-49, https://doi.org/10.17977/um018v3i12020p40-49
(c)2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Parallelization of Partitioning Around Medoids (PAM) in K-Medoids Clustering on GPU

Adhi Prahara a,1,*, Dewi Pramudi Ismi a,2, Ahmad Azhari a,3
a Informatics Department, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Jalan Ring Road Selatan, Tamanan, Banguntapan, Bantul, Yogyakarta 55166, Indonesia
1 adhi.prahara@tif.uad.ac.id*; 2 dewi.ismi@tif.uad.ac.id; 3 ahmad.azhari@tif.uad.ac.id
* corresponding author

I. Introduction

Clustering is the task of assigning unlabeled data points to a finite number of clusters. The assignment is usually based on similarity or distance, so that data points located in the same cluster are similar to each other. Clustering techniques have been implemented and play important roles in a wide range of application domains, such as image segmentation [1][2][3][4], image clustering [5], bioinformatics [6][7], and data mining [3][8]. Through clustering, the underlying patterns of the data can be revealed. Clustering is unsupervised learning that gives information based on the intrinsic properties of the data when no labels are assigned to the data. The most widely studied clustering algorithms are partitional clustering and hierarchical clustering [9]. Partitional clustering, such as k-means and k-medoids, is the most widely used in practice. Partitional clustering divides the dataset into a number of partitions, which must be smaller than the number of data points in the dataset. K-means clustering is simpler than k-medoids clustering, but its main drawback is its sensitivity to outliers.
K-medoids clustering offers better results when dealing with outliers and arbitrary distance metrics, and also in situations where the mean or median does not have a clear definition. However, k-medoids clustering suffers from high computational complexity, so the efficiency of k-medoids clustering has become a major concern in improving the algorithm. Researchers have been working on attempts to improve the performance of k-medoids clustering [11][12]. In general, the efforts to improve k-medoids clustering focus on three different approaches [13]: 1) empowering the local and global search for medoid selection; 2) the amount of data used for the medoid calculation, i.e., the entire dataset (PAM: Partitioning Around Medoids) or just a sample of the data (CLARA: Clustering LARge Applications); and 3) the computation method: serial or parallel.

Article history: received 25 June 2020, revised 15 July 2020, accepted 9 August 2020, published online 17 August 2020.

Abstract: K-medoids clustering is categorized as partitional clustering. K-medoids offers better results when dealing with outliers and arbitrary distance metrics, and also in situations where the mean or median does not exist within the data. However, k-medoids suffers from high computational complexity. Partitioning Around Medoids (PAM) has been developed to improve k-medoids clustering; it consists of build and swap steps and uses the entire dataset to find the best potential medoids, thus producing better medoids than other algorithms. This research proposes a parallelization of PAM in k-medoids clustering on GPU to reduce the computational time of the swap step of PAM. The parallelization scheme utilizes shared memory, a reduction algorithm, and optimization of the thread-block configuration to maximize occupancy.
Based on the experiment results, the proposed parallelized PAM k-medoids is faster than the CPU and MATLAB implementations and efficient for large datasets. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: k-medoids; PAM; parallel computing; CUDA; GPU

There have been studies implementing clustering methods using parallel computing approaches [13][14][15][16][17], and especially k-medoids clustering, which is the focus of this research. One of the technologies used to develop parallel k-medoids clustering is Hadoop MapReduce [18][19][20][21][22][23]. MapReduce consists of a map function and a reduce function. The map function computes the distances between each data point and the medoids, then assigns the data points to their clusters. The reduce function checks the results of the map function and searches for new medoids; it returns the results to the map function in the next MapReduce round. In [14], the optimal search for medoids is performed based on the basic properties of triangular geometry; the speed of k-medoids clustering is improved while the validity of the clustering result is maintained [18]. Parallel k-medoids clustering can also be implemented on the graphics processing unit (GPU). Several GPU-accelerated k-medoids clustering methods have been developed: a parallel PAM implementation using CUDA [24][25], GPU-based parallel k-medoids (combined PAM-CLARA) clustering for remote sensing data [26], and GPU-accelerated parallel clustering algorithms including k-means, k-medoids, and hierarchical clustering [27]. This work focuses on the development of parallel k-medoids clustering to increase the efficiency of Partitioning Around Medoids (PAM) in k-medoids clustering. Our main contribution is to speed up the computation of the PAM algorithm using a parallelization scheme on GPU.
The proposed parallelization scheme can handle a large dataset of n data points without creating an n x n distance table that would consume a huge amount of memory, while still maintaining computation speed. This paper is organized as follows: Section II presents the proposed parallelized PAM in k-medoids clustering, Section III presents the results and discussion, and Section IV concludes this work.

II. Method

A. K-Medoids Clustering
K-medoids clustering is a partitional clustering method similar to k-means clustering: the goal of both methods is to divide a set of measurements or observations into k subsets or clusters such that the subsets minimize the sum of distances between each measurement and the center of its cluster. Unlike k-means clustering, which minimizes the distance between the data points within a cluster and the mean value of those data points (called the centroid), k-medoids clustering minimizes the distance between the data points within a cluster and a representative data point of the same cluster. This representative data point has the minimum total dissimilarity/distance to the other data points in the same cluster and is called the medoid. Each generated cluster in k-medoids clustering has a medoid; thus, k-medoids clustering guarantees that the center of a cluster is the most centrally located data point. Medoids are initialized by selecting k data points arbitrarily, and k-medoids clustering iterates until the objective function reaches its minimum value. The objective function, the absolute-error criterion (AEC) E, is defined in (1):

E = sum_{i=1}^{k} sum_{p in C_i} |p - o_i|   (1)

where p is a data point in cluster C_i and o_i is the medoid of C_i. The k-medoids function involves several iterative algorithms that minimize the sum of distances from each object to its cluster's medoid, over all of the clusters.
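The absolute-error criterion of Eq. (1) can be sketched directly; this is an illustrative Python reference with made-up points, using Euclidean distance as the dissimilarity.

```python
import math

# Absolute-error criterion, Eq. (1): the sum over all points of the
# distance between each point and the medoid of its cluster.
def aec(points, labels, medoids):
    """points: list of tuples; labels: cluster index per point;
    medoids: one representative point per cluster."""
    return sum(math.dist(p, medoids[c]) for p, c in zip(points, labels))

# Two well-separated toy clusters with their medoids.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 1, 1]
medoids = [(0.0, 0.0), (10.0, 0.0)]
print(aec(points, labels, medoids))  # 0 + 1 + 0 + 1 = 2.0
```

PAM's swap step compares exactly this quantity before and after a candidate medoid swap, keeping the swap only if E decreases.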
A well-known implementation of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, developed by [28]. It takes two inputs, the number of clusters to generate (k) and a dataset D containing the data points, and generates k different clusters. The detailed method of PAM is:
1) Build step: select k data points arbitrarily from the dataset D as the initial medoids.
2) Swap step: repeat
   a) Assign each non-medoid data point to the cluster with the most similar (nearest) medoid.
   b) Randomly select a non-medoid data point.
   c) Compute the total cost of swapping, i.e., the difference between the AEC calculated using the current medoid and the AEC calculated using the non-medoid data point selected in step b.
   d) If the AEC calculated using the non-medoid data point is lower than the AEC calculated using the current medoid, swap the medoid with that non-medoid data point, which becomes the new medoid of the cluster.
3) Until no more changes occur in the clusters.

K-medoids clustering is better than k-means clustering when dealing with outliers [10]: unlike k-means, which is sensitive to outliers, the medoids are not influenced by their presence. K-medoids is also useful for clustering categorical data, where the mean of the data does not exist within the dataset. However, k-medoids is costlier than k-means clustering due to the iteration that examines every data point. The computational complexity of k-medoids clustering is O(k(n - k)^2), where k is the number of clusters and n is the number of data points [29]. For large values of n and k, the computation is very costly: the computation time increases quickly as the amount of data grows. Thus, classical k-medoids clustering is only suitable for small datasets and is inefficient for big data.

B.
CUDA Parallel Computing
The GPU (graphics processing unit) is a highly parallel architecture originally used for fast computer graphics operations, but it can now be used for computation other than graphics, known as GP-GPU (general-purpose GPU) [30]. A well-known general-purpose parallel computing platform and programming model is the Compute Unified Device Architecture (CUDA) from NVIDIA. The GPU is highly parallel and multithreaded, with many processor cores and very high memory bandwidth. The difference between how the CPU and GPU process data is shown in Fig. 1 (a) and (b): the GPU devotes more transistors to data processing rather than data caching and flow control. The GPU is built on an array of streaming multiprocessors (SM), and its work is organized into grids, blocks, and threads. Data-parallel processing maps data elements to parallel processing threads; Fig. 1 (c) shows the parallel processing threads in the GPU. A multithreaded program is partitioned into blocks of threads that execute independently from each other.

Fig. 1. GPU devotes more transistors to data processing: (a) CPU configuration; (b) GPU configuration; and (c) grid of thread blocks in GPU [30]

III. Results and Discussion

The proposed parallelized k-medoids is written in CUDA based on MATLAB's PAM implementation [31][32], and runs on a Core i7-7700K processor, 16 GB of RAM, and an NVIDIA GTX 1070. The parallel implementation and performance evaluation are explained in the following subsections.

A. Parallelized K-Medoids Clustering
The implementation of the parallelized PAM k-medoids is shown in Algorithm 1. It consists of three kernels: one computes the gain of the medoid to each data point within its cluster, one computes the gain of non-medoids to each data point, and one computes the new medoids. In Algorithm 1, the data and the initial cluster medoids, picked randomly from the data, are copied from host to device.
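As a serial reference for the per-thread work of the first kernel, the sketch below assigns each point to its nearest medoid and accumulates the "gain" of the current cluster's medoid. This is a hedged Python illustration, not the authors' CUDA code: the loop body corresponds to what one GPU thread does for one point, and the shared-memory `atomicAdd` accumulation is shown as a plain sum.

```python
import math

# Serial sketch of the first kernel: nearest-medoid assignment plus
# accumulation of the current cluster's medoid gain (sum of distances
# of its members to the medoid).
def assign_and_gain(data, medoids, current_cluster):
    labels = []
    gain = 0.0
    for p in data:                      # one GPU thread handles one point
        dists = [math.dist(p, m) for m in medoids]
        c = dists.index(min(dists))     # index of the nearest medoid
        labels.append(c)
        if c == current_cluster:        # atomicAdd(smem gain, min distance)
            gain += dists[c]
    return labels, gain

data = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
medoids = [(0.0, 0.0), (10.0, 0.0)]
labels, gain = assign_and_gain(data, medoids, current_cluster=0)
print(labels, gain)  # [0, 0, 1, 1] 1.0
```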
The algorithm computes a new medoid per cluster and iterates until no medoid changes. The data partitioning in the swap step is used to reduce the number of data points processed by the thread blocks.

Algorithm 1. Proposed parallelized PAM k-medoids clustering.
  read parameters of k-medoids and GPU configuration
  set random data as initial cluster medoids
  copy data and initial medoids from host to device
  while cluster medoids changed do
    for each cluster do
      compute gain of medoid to each data within cluster
      for n = 0 to number of data partitions do
        compute gain of non-medoids to each data in n-th partition
      end
      compute new medoid
    end
  end
  copy cluster medoids and labels from device to host

Algorithm 2. Compute the gain of the medoid to each data point within the cluster.
  GPU configuration:
    blocks <- 256
    grids <- (number of data + blocks - 1) / blocks
  gain medoids <- 0
  read data, cluster medoids, cluster labels, current cluster index, number of data, number of clusters
  allocate shared memory (smem):
    smem cluster medoids <- cluster medoids
    smem gain <- 0
  i <- index of data
  if i < number of data then
    min distance index <- 0
    min distance <- maximum value of data type
    for k = 0 to number of clusters do
      distance <- compute distance between i-th data and k-th smem cluster medoids
      if distance < min distance then
        min distance index <- k
        min distance <- distance
      end
    end
    i-th cluster labels <- min distance index
    if min distance index = current cluster index then
      atomicAdd(smem gain, min distance)
    end
  end
  synchronize the threads
  if threadIdx.x = 0 then
    gain medoids <- smem gain
  end

Algorithm 3. Compute the gain of non-medoids to each data point.
    GPU configuration:
        blocks ← 256
        grids ← number of data after data partition
    Read data, cluster medoids, cluster labels, gain of potential medoids, current cluster index, number of data, number of clusters
    Allocate shared memory (smem) to store:
        smem cluster medoids ← cluster medoids
        smem gain int ← 0
        smem gain ext ← 0
    i ← index of data after data partition
    i-th gain of non-medoids ← 0
    if i is not medoid then
        gain int ← 0
        gain ext ← 0
        j ← index of data
        if j < number of data then
            distance ← compute distance between i-th data and j-th data
            p ← j-th cluster labels
            if p = current cluster index then
                min distance ext ← maximum value of data type
                for k = 0 to number of clusters do
                    if k ≠ current cluster index then
                        distance ext ← compute distance between j-th data and k-th smem cluster medoids
                        if distance ext < min distance ext then
                            min distance ext ← distance ext
                        end
                    end
                end
                if min distance ext < maximum value of data type then
                    min value ← minimum of min distance ext and distance
                    gain int ← gain int + min value
                end
            else
                distance int ← compute distance between i-th data and p-th smem cluster medoids
                max value ← maximum of (distance int − distance) and 0
                gain ext ← gain ext + max value
            end
        end
        q ← threadIdx.x
        q-th smem gain int ← gain int
        q-th smem gain ext ← gain ext
        Synchronize the threads
        gain int ← sum reduction of smem gain int
        gain ext ← sum reduction of smem gain ext
        Synchronize the threads
        if threadIdx.x = 0 then
            i-th gain of potential medoids ← gain ext + gain medoids − gain int
        end
    end

Algorithm 4. Compute the new medoids.
    GPU configuration:
        blocks ← 256
        grids ← 1
    stop iteration ← true
    Read data, cluster medoids, cluster labels, medoid labels, gain of non-medoids, current cluster index, number of data, number of clusters
    Allocate shared memory (smem) to store:
        smem max gain ← 0
        smem max gain index ← 0
    i ← index of data
    j ← current cluster index
    max gain ← 0
    max gain index ← j-th cluster labels
    if i < number of data then
        if i-th gain of potential medoids > max gain then
            max gain index ← i
            max gain ← i-th gain of potential medoids
        end
    end
    k ← threadIdx.x
    k-th smem max gain ← max gain
    k-th smem max gain index ← max gain index
    Synchronize the threads
    max gain ← max reduction of smem max gain
    max gain index ← index of max reduction value of smem max gain
    Synchronize the threads
    p ← max gain index
    if max gain > 0 then
        if threadIdx.x = 0 then
            j-th medoid labels ← p
            stop iteration ← false
        end
        j-th cluster medoids ← p-th data
    end

Based on the complexity of k-medoids, the most complex computation is the process of finding new medoids; thus, this is the part that is parallelized. If n is the number of data points, the PAM algorithm can be optimized using an n × n table of distances pre-calculated before the k-medoids computation is executed. However, creating an n × n table of distances requires a large amount of memory. The goal is to optimize the O(n − k) swap computation using a parallelization scheme without creating a table of distances, avoiding large memory consumption on the GPU. Algorithm 2 shows the computation of the gain of the medoid to each data point within the cluster. Gain is the sum of the distances between the medoid and each data point in the same cluster. Shared memory is utilized to store the cluster medoids that are repeatedly used in the distance calculation; this replaces high-latency global memory accesses with faster shared memory accesses. Data are partitioned and processed by the thread blocks.
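A serial Python analog of what the Algorithm 2 kernel computes may help: each data point is assigned to its nearest medoid, and the distances of the points that land in the current cluster are accumulated into that cluster's gain. This is an illustration only (1-D data, hypothetical names), not the authors' CUDA kernel:

```python
# Serial reference (illustration only) for what the Algorithm 2 kernel does:
# assign each point to its nearest medoid and accumulate, for the current
# cluster, the sum of distances to its medoid ("gain of the medoid").

def medoid_gain(data, medoids, current_cluster):
    """Return (labels, gain); gain sums distances of points assigned to
    `current_cluster` to that cluster's medoid. 1-D data for simplicity."""
    labels, gain = [], 0.0
    for x in data:
        # nearest-medoid search (the k-loop serialized inside each thread)
        dists = [abs(x - m) for m in medoids]
        k = dists.index(min(dists))
        labels.append(k)
        if k == current_cluster:        # analog of atomicAdd(smem gain, ...)
            gain += dists[k]
    return labels, gain

labels, gain = medoid_gain([0.0, 1.0, 2.0, 9.0, 10.0],
                           medoids=[1.0, 10.0], current_cluster=0)
print(labels, gain)  # [0, 0, 0, 1, 1] 2.0 — points 0,1,2 join cluster 0
```

On the GPU, the per-point loop body runs in one thread each, with the per-cluster accumulation done via atomic addition in shared memory as described above.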
Each thread computes the nearest medoid to each data point, assigns the nearest medoid's index to the cluster label, and sums up the gain using CUDA's atomic addition in shared memory. The total gain summed over the data is then copied to device memory. Algorithm 3 shows the computation of the gain of non-medoids to each data point. The configuration of blocks and grids is different from Algorithm 2: with n the number of data points and k the number of clusters, Algorithm 3 compares (n − k) candidates against n data points, whereas Algorithm 2 only compares n data points against k medoids, where k is usually a small number. Therefore, in Algorithm 2 the thread blocks are assigned to handle the outer (n) loop and serialize the inner (k) loop, while in Algorithm 3 the threads are assigned to handle the inner (n) loop and the blocks are assigned to handle the outer (n − k) loop. With this configuration, maximum occupancy can be achieved. In the inner loop computation, each non-medoid is compared to the entire dataset. The total gain of each non-medoid is computed by adding the gain of the medoid to each data point within the cluster (from Algorithm 2) to the gain of the non-medoid to each data point from outer clusters, then subtracting the gain of the non-medoid to each data point within the cluster. Because the threads handle the inner loop, a sum reduction in shared memory can be used to accumulate these gains. Shared memory is also utilized to store the cluster medoids, providing faster access in the distance calculation, similar to Algorithm 2. The total gain of each non-medoid is then copied to device memory to be used in the computation of new medoids. Algorithm 4 shows the computation of new medoids. A new medoid is found by locating the maximum gain greater than zero among the gains of non-medoids from Algorithm 3. One block is used for this kernel to perform a max reduction over the n data points.
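The swap decision that Algorithm 4 implements can be sketched serially as follows (an illustrative Python stand-in for the max-reduction kernel, with hypothetical names; the real kernel performs this as a parallel reduction in shared memory):

```python
# Serial sketch (illustrative) of the Algorithm 4 swap decision: the candidate
# with the largest positive gain becomes the new medoid; if no gain is
# positive, the current medoid is kept and the stop flag remains true.

def pick_new_medoid(gains, current_medoid_index):
    """gains[i] is the improvement if point i replaced the current medoid."""
    best_index = max(range(len(gains)), key=lambda i: gains[i])  # max reduction
    if gains[best_index] > 0:
        return best_index, False       # swap happened -> keep iterating
    return current_medoid_index, True  # no improvement -> iteration may stop

idx, stop = pick_new_medoid([0.0, 2.5, -1.0, 0.4], current_medoid_index=0)
print(idx, stop)  # 1 False — index 1 has the largest positive gain
```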
The index of the maximum gain indicates that the data point corresponding to that index has the highest potential as the new cluster medoid. The index and the new cluster medoid are then copied to device memory. If the maximum gain of non-medoids is not greater than zero for all clusters, the k-medoids iteration stops; otherwise, the iteration continues.

B. Performance Comparison

The proposed parallelized PAM k-medoids is tested using the KDD Cup dataset. The dataset is reduced into smaller sets with various numbers of data points. We use k = 5 to cluster 1,000, 2,000, 3,000, 4,000, 5,000, 10,000, and 20,000 data points with 41 attributes. The same initial cluster medoids are used, and each algorithm is run 10 times to obtain the average computational time. The experiment compares the computational time of the CPU implementation of PAM k-medoids with the proposed GPU implementation, which does not create a pre-calculated table of distances. Fig. 2 shows the performance, measured as computational time in milliseconds against the number of data points. The red, orange, and green lines present the CPU, MATLAB, and the proposed GPU implementation, respectively. Based on Fig. 2, the proposed parallelized PAM k-medoids achieves more than an 11-fold speedup over the CPU implementation. For larger datasets, the computational time of the CPU implementation rises significantly, which indicates that the proposed GPU implementation is efficient for this problem. Fig. 2 also shows the performance of the proposed GPU and MATLAB implementations. MATLAB uses a table of distances in its PAM k-medoids computation and recommends its PAM implementation only for small datasets of fewer than 3,000 data points. Its speedup over the CPU implementation is approximately 3 times, and more for larger datasets.
By accessing the lookup table, the distance computation in each iteration is reduced to a table read, which lowers the computational time. However, using a pre-calculated table of distances increases space complexity significantly for large datasets. The proposed GPU implementation achieves approximately a 2–3 times speedup over the MATLAB implementation. The result is acceptable because the proposed parallelized PAM does not use a pre-calculated table of distances; thus, implementation on larger datasets is possible.

IV. Conclusion

In this research, a parallelized PAM k-medoids on GPU is proposed. The PAM algorithm for k-medoids returns the best medoids compared to the other algorithms because it uses the entire dataset to find the best potential medoids; however, it has high computational complexity. PAM is usually implemented with a pre-calculated table of distances to avoid repeated distance calculation, which leads to massive memory consumption. The proposed implementation optimizes the distance computation of the PAM algorithm using a parallel scheme without the pre-calculated table of distances. From the experiment, the proposed parallelized PAM k-medoids is 2–3 times faster than MATLAB and 11–15 times faster than the CPU implementation. The proposed method can handle large datasets by partitioning the data and processing it in parallel. For future work, the proposed method will be improved to handle categorical datasets, to increase performance using multiple GPUs, and to be compared with other parallel k-medoids clustering libraries.

Acknowledgment

This research is supported by the Indonesian Ministry of Research, Technology and Higher Education (RISTEK-DIKTI) research grant no. PEKERTI-056/SP3/LPP-UAD/IV/2017.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper.
All authors read and approved the final paper.

Funding statement. This research received funding from the Indonesian Ministry of Research, Technology and Higher Education (RISTEK-DIKTI) research grant no. PEKERTI-056/SP3/LPP-UAD/IV/2017.

Conflict of interest. The authors declare no conflict of interest.

Fig. 2. PAM k-medoids performance evaluation

Additional information. No additional information is available for this paper.

References

[1] Y. Zou and B. Liu, "Survey on clustering-based image segmentation techniques," in Proc. 2016 IEEE 20th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2016), Sep. 2016, pp. 106–110, doi: 10.1109/cscwd.2016.7565972.
[2] N. Dhanachandra and Y. J. Chanu, "A survey on image segmentation methods using clustering techniques," Eur. J. Eng. Res. Sci., vol. 2, no. 1, p. 15, Jan. 2017, doi: 10.24018/ejers.2017.2.1.237.
[3] A. Saxena et al., "A review of clustering techniques and developments," Neurocomputing, vol. 267, pp. 664–681, Dec. 2017, doi: 10.1016/j.neucom.2017.06.053.
[4] A. Prahara, I. T. R. Yanto, and T. Herawan, "Histogram thresholding for automatic color segmentation based on k-means clustering," in Advances in Intelligent Systems and Computing, 2017, vol. 549 AISC, pp. 344–354, doi: 10.1007/978-3-319-51281-5_35.
[5] S. Wazarkar and B. N. Keshavamurthy, "A survey on image data analysis through clustering techniques for real world applications," J. Vis. Commun. Image Represent., vol. 55, pp. 596–626, Aug. 2018, doi: 10.1016/j.jvcir.2018.07.009.
[6] D. K. Tasoulis, V. P. Plagianakos, and M. N. Vrahatis, "Unsupervised clustering of bioinformatics data," in European Symposium on Intelligent Technologies, Hybrid Systems and Their Implementation on Smart Adaptive Systems (EUNITE), 2004, pp. 47–53.
[7] J. D. MacCuish and N. E. MacCuish, Clustering in Bioinformatics and Drug Discovery. CRC Press, 2010.
[8] P. Berkhin, "A survey of clustering data mining techniques," in Grouping Multidimensional Data: Recent Advances in Clustering, Springer Berlin Heidelberg, 2006, pp. 25–71.
[9] C. K. Reddy and B. Vinzamuri, "A survey of partitional and hierarchical clustering algorithms," Data Clustering: Algorithms and Applications, vol. 87, 2013.
[10] P. Arora, Deepali, and S. Varshney, "Analysis of k-means and k-medoids algorithm for big data," in Procedia Computer Science, Jan. 2016, vol. 78, pp. 507–512, doi: 10.1016/j.procs.2016.02.095.
[11] E. Schubert and P. J. Rousseeuw, "Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Oct. 2019, vol. 11807 LNCS, pp. 171–187, doi: 10.1007/978-3-030-32047-8_16.
[12] P. O. Olukanmi, F. Nelwamondo, and T. Marwala, "PAM-lite: fast and accurate k-medoids clustering for massive datasets," in Proc. 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA 2019), May 2019, pp. 200–204, doi: 10.1109/robomech.2019.8704767.
[13] H. Song, J.-G. Lee, and W.-S. Han, "PAMAE: parallel k-medoids clustering with high accuracy and efficiency," in Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1087–1096, doi: 10.1145/3097983.3098098.
[14] Y.-T. Zhu, F.-Z. Wang, X.-H. Shan, and X.-Y. Lv, "K-medoids clustering based on MapReduce and optimal search of medoids," in 2014 9th International Conference on Computer Science & Education, Aug. 2014, pp. 573–577, doi: 10.1109/iccse.2014.6926527.
[15] A. Martino, A. Rizzi, and F. M. Frattale Mascioli, "Efficient approaches for solving the large-scale k-medoids problem: towards structured data," in Studies in Computational Intelligence, Nov. 2019, vol. 829, pp. 199–219, doi: 10.1007/978-3-030-16469-0_11.
[16] A. Prahara, D. P. Ismi, A. I. Kistijantoro, and M. L. Khodra, "Parallelized k-means clustering by exploiting instruction level parallelism at low occupancy," in Proc. 2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE 2017), Feb. 2018, vol. 2018-January, pp. 30–34, doi: 10.1109/icitisee.2017.8285516.
[17] X. Wang, "A survey of clustering algorithms based on parallel mechanism," Apr. 2018, pp. 119–122, doi: 10.2991/cmsa-18.2018.28.
[18] Y. Jiang and J. Zhang, "Parallel k-medoids clustering algorithm based on Hadoop," in 2014 IEEE 5th International Conference on Software Engineering and Service Science, Jun. 2014, pp. 649–652, doi: 10.1109/icsess.2014.6933652.
[19] M. O. Shafiq and E. Torunski, "A parallel k-medoids algorithm for clustering based on MapReduce," in 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec. 2016, pp. 502–507, doi: 10.1109/icmla.2016.0089.
[20] X. Yue, W. Man, J. Yue, and G. Liu, "Parallel k-medoids++ spatial clustering algorithm based on MapReduce," pp. 1–8, Aug. 2016, accessed: Jun. 21, 2020. [Online]. Available: http://arxiv.org/abs/1608.06861.
[21] D. Rajendran, S. Jangiti, S. Muralidharan, and M. Thendral, "Incremental MapReduce for k-medoids clustering of big time-series data," in Proc. 2nd International Conference on Trends in Electronics and Informatics (ICOEI 2018), Nov. 2018, pp. 1143–1146, doi: 10.1109/icoei.2018.8553756.
[22] Y. Zhao, B. Chen, and M. Li, "Parallel k-medoids improved algorithm based on MapReduce," in Proc. 2018 6th International Conference on Advanced Cloud and Big Data (CBD 2018), Nov. 2018, pp. 18–23, doi: 10.1109/cbd.2018.00013.
[23] R. Wu, B. Zhang, and M. Hsu, "Clustering billions of data points using GPUs," in Proc. Combined Workshops on Unconventional High Performance Computing Workshop plus Memory Access Workshop (UCHPC-MAW '09), co-located with the 2009 ACM Int. Conf. on Computing Frontiers (CF '09), 2009, pp. 1–5, doi: 10.1145/1531666.1531668.
[24] E. Zhou, S. Mao, M. Li, and Z. Sun, "PAM spatial clustering algorithm research based on CUDA," in International Conference on Geoinformatics, Sep. 2016, vol. 2016-September, doi: 10.1109/geoinformatics.2016.7578971.
[25] Y. Li, K. Zhao, X. Chu, and J. Liu, "Speeding up k-means algorithm by GPUs," J. Comput. Syst. Sci., vol. 79, no. 2, pp. 216–229, Mar. 2013, doi: 10.1016/j.jcss.2012.05.004.
[26] K. R. Kurte and S. S. Durbha, "High resolution disaster data clustering using graphics processing units," in 2013 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Jul. 2013, pp. 1696–1699, doi: 10.1109/igarss.2013.6723121.
[27] K. J. Kohlhoff, M. H. Sosnick, W. T. Hsu, V. S. Pande, and R. B. Altman, "CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms," Bioinformatics, vol. 27, no. 16, pp. 2321–2322, Aug. 2011, doi: 10.1093/bioinformatics/btr386.
[28] L. Kaufman and P. J. Rousseeuw, Clustering by Means of Medoids. Faculty of Mathematics and Informatics, 1987.
[29] R. T. Ng and J. Han, "CLARANS: a method for clustering objects for spatial data mining," IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003–1016, Sep. 2002, doi: 10.1109/tkde.2002.1033770.
[30] J. Fung and S. Mann, "Using graphics devices in reverse: GPU-based image processing and computer vision," in 2008 IEEE International Conference on Multimedia and Expo, Jun. 2008, pp. 9–12, doi: 10.1109/icme.2008.4607358.
[31] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[32] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for k-medoids clustering," Expert Syst. Appl., vol. 36, no. 2, pp. 3336–3341, Mar. 2009, doi: 10.1016/j.eswa.2008.01.039.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol 5, No 1, December 2022, pp. 67–77, eISSN 2597-4637, https://doi.org/10.17977/um018v5i12022p67-77
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Fish Image Classification Using Transfer Learning Method with Adaptive Learning Rate

Rizka Suhana 1,*, Wayan Firdaus Mahmudy 2, Agung Setia Budi 3
Faculty of Computer Science, Brawijaya University, Jl. Veteran No. 8, Malang 65145, Indonesia
1 rizka28294@student.ub.ac.id *; 2 wayanfm@ub.ac.id; 3 agungsetiabudi@ub.ac.id
* corresponding author

I. Introduction

Indonesia is an archipelagic country with a coral reef area of more than 85,700 km² [1]; consequently, it holds abundant natural resources and very high biodiversity. More than 50% of fishery production in Indonesia comes from coastal areas, especially from seagrass, mangrove, and coral reef ecosystems. Indonesia lies at the center of the Coral Triangle [2]. More than 412 species, spanning 44 families and 146 genera of fish, have been identified in the Karimunjawa National Park area, Jepara Regency, Central Java Province [3]. The diversity of reef fish and other organisms living on coral reefs indicates that the ecosystem is healthy [4]. Conservation activities are critical for monitoring the coral reef environment regularly. Conservation data are recorded as video and then processed into fish image data. The fish images are analyzed by experts to determine which species each image shows; experts use the level of diversity of fish species as an indicator of a healthy coral reef ecosystem [5]. The study of Villon et al.
[6] obtained an accuracy of 89.3% for manual classification of fish images, that is, direct observation with the naked eye by researchers; errors may still occur in identifying which fish species appear in an image. Image classification is a primary research area in image processing, with broad prospects in fields such as image segmentation, image recognition, and many more. K-nearest neighbors (KNN) [7], random forest [8], and XGBoost [9][10][11][12] are all machine learning methods that can be applied to image classification. In essence, the image classification process depends on feature extraction and feature classification. The first, feature extraction, extracts all features from the image and stores them in tabular form. The second, feature classification, derives the label of the image from those features. After going through the feature extraction and classification processes, the image data can be processed using each of the above methods.

Article Info. Article history: received 23 June 2021; revised 14 July 2022; accepted 14 August 2022; published online 7 November 2022. Keywords: fish images; image classification; CNN; transfer learning; MobileNet V2.

Abstract. The diversity of fish species in coral reef ecosystems is one of the indications used in determining the health of coral reef ecosystems. Many experts of the Indonesian Fisheries and Marine Research and Development Agency carefully classify fish images. A reliable technique for performing image classification is the convolutional neural network (CNN). Transfer learning adopts part of a CNN, namely a modified convolution layer. This paper aims to solve the fish classification problem using the pre-trained MobileNet V2 model. The model has a low computational cost and does not use too many memory resources when training image data. The image data comprise 49,281 images of various sizes covering 18 types of fish. The images enter a transformation process (random rotation, random resized crop, random horizontal flip) on the training and test data to produce varied data. After the transformation process, the image data enter the training process using the MobileNet V2 architecture. Testing the MobileNet V2 architectural model obtained an accuracy score of 99.54%, which is reliable for classifying fish images. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

The application of deep learning methods can be one solution to the fish image classification problem. Convolutional neural networks (CNN) can solve problems related to fish classification, according to the research of Alshdaifat et al. [13] and Cui et al. [14]. In fish image classification, using learning methods that utilize pre-trained models, or transfer learning, is also more efficient than building deep-learning architectures from scratch [15]. Classification methods for fish images benefit researchers in terms of the speed of identifying fish [6][16][17]. The pre-trained MobileNet architecture is reliable for image recognition. MobileNet V2 is efficient because it can be deployed on mobile or other vision devices [18]. The MobileNet V1 architecture uses a convolution layer type called depthwise separable convolution, which makes its computing process faster than that of a traditional CNN architecture.
MobileNet V2 [19] updated this architecture, using inverted residual blocks and linear bottlenecks in its convolution layers. A model with good performance depends on optimal hyper-parameters, which directly affect the performance of the model, so hyper-parameter selection is very important [11]. Two of the hyper-parameters used here are the learning rate and the batch size. The learning rate is a hyper-parameter that controls how fast or slow the neural network model learns to solve a problem [19][20]; hence the interest in an adaptive learning rate that changes gradually to reach the global minimum [21]. Batch size is a hyper-parameter that controls the accuracy of the estimated error gradient when training the neural network, as well as the speed and stability of the learning process [22].

Experts need to maintain the diversity of fish species and want it to be easier to classify fish species in the field of conservation. Building on methods from previous research, this study uses the MobileNet V2 architecture combined with an adaptive learning rate optimization technique. It is hoped that MobileNet V2 with an adaptive learning rate can relieve and help the experts at the Fisheries and Marine Research and Development Agency.

II. Methods

A. Dataset

The dataset is obtained from the Fish4Knowledge website, a European project formed for water conservation [23]. Table 1 shows the distribution of the fish images. The data originate from video recordings totaling 87,000 hours across 524,000 recordings.

Table 1. Quantity distribution of images

ID  Species                      Data    Training (80%)  Testing (20%)
01  Abudefduf vaigiensis         403     322             81
02  Acanthurus nigrofuscus       2,729   2,183           546
03  Amphiprion clarkii           7,034   5,627           1,407
04  Chaetodon lunulatus          5,028   4,022           1,006
05  Chaetodon trifascialis       565     452             113
06  Chromis chrysura             7,186   5,748           1,438
07  Dascyllus aruanus            738     590             148
08  Dascyllus reticulatus        15,308  12,246          3,062
09  Hemigymnus fasciatus         238     190             48
10  Hemigymnus melapterus        189     151             38
11  Lutjanus fulvus              206     164             42
12  Myripristis kuntee           3,454   2,763           691
13  Neoglyphidodon nigroris      145     116             29
14  Neoniphon sammara            299     239             60
15  Pempheris vanicolensis       78      62              16
16  Plectroglyphidodon dickii    5,139   4,111           1,028
17  Pomacentrus moluccensis      181     144             37
18  Zebrasoma scopas             361     288             73

In this study, the experts built the images by cropping screen captures from the videotape, which produced an image dataset of various sizes.

Dataset transformation is the initial process before the data are used to train the architectural model; it consists of several transformation stages. The images in this dataset have different dimensions; as in Figure 1, the image dimensions are 36x36 pixels. Because this study uses a pre-trained model, the images are resized to 224x224 pixels.

Fig. 1. Species of fish type Abudefduf vaigiensis

The training-data transformation transforms the fish images in the training data, including random rotation, random resized crop, random horizontal flip, conversion to tensor data, and data normalization. Random rotation of 10° applies a random rotation to the left or right within the predetermined degree of inclination. The second setting, random resized crop (scale 0.8–1), randomly resizes and crops with a scale chosen between 0.8 and 1. Random horizontal flip flips the fish images horizontally at random. The next step converts the data into tensor data (PyTorch).
The last step in transforming the training data is normalizing the tensor data according to the normalization used in the MobileNet V2 architectural model: mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225] for each of the three image channels (RGB).

The test-data transformation for the fish images includes resize, center crop, conversion to tensor data, and data normalization. The test-data transformation does not flip the image, so that it stays as close as possible to the original image. The resize process changes the image size to 230x230 pixels because the fish images in the test data have different sizes. The second step, center crop, crops the resized 230x230 image at its center to 224x224 pixels. The next step converts the data into tensor data (PyTorch). The last step normalizes the tensor data with the same mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225] on each of the three channels (RGB).

The image structuring phase transforms the image data into a tabular dataset by extracting/flattening each image's pixels into features, as shown in Figure 2: each pixel of the image is extracted to obtain the feature data, and the fish image is labeled using the folder name of the extracted image as a record. Figure 3 shows the image features and labels: 12,289 columns for a 3-channel image of 64x64 pixels (12,288 pixel features plus one label).

Fig. 2. Flowchart to create metadata for machine learning

Fig. 3. Feature images and labels from images
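The flattening step described above can be sketched in a few lines of Python (an illustration with synthetic data, not the authors' pipeline; the function and label names are hypothetical):

```python
# Illustration of the image-structuring step: flattening a 3-channel 64x64
# image into one tabular row of 12,288 pixel features plus one label column,
# giving the 12,289 columns mentioned in the text. The data here are synthetic.

def image_to_row(image, label):
    """image: nested list [channel][row][col]; returns one flat feature row."""
    features = [px for channel in image for row in channel for px in row]
    return features + [label]

image = [[[0] * 64 for _ in range(64)] for _ in range(3)]  # dummy 3x64x64 image
row = image_to_row(image, label="abudefduf_vaigiensis")
print(len(row))  # 3*64*64 + 1 = 12289
```

In a PyTorch workflow this corresponds to reshaping a tensor with `tensor.flatten()` before attaching the class label taken from the folder name.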
B. Architecture Configuration

The model architecture used in this research is MobileNet V2 [19], with modifications to its original classification layer, which classifies 1,000 image types. The MobileNet V2 architecture consists of a full convolution layer with 32 filters and 19 residual bottleneck layers. The classification layer is modified into a fully connected layer with an input of 1,280 and an output of 18. The architecture of MobileNet V2 follows Table 2, where input is the image size before entering the convolution process, operator is the name of the convolution layer, t is the expansion factor, c is the number of output channels, n is the number of repetitions of the layer, and s is the stride. The architecture can be simplified further by describing all the inverted residual blocks as one bottleneck stage that produces the feature extraction, followed by the final classification layer. This research uses a transfer learning architecture with an adaptive learning rate; the transfer learning model is the pre-trained MobileNet V2. Figure 4 is a simple description of this architecture.

Fig. 4. Simple architectural model in this research

The steps in modeling the MobileNet V2 architecture are as follows. The input layer, the first layer in the architectural model, receives a fish image that has gone through image data preprocessing. A bottleneck is a simple arrangement describing the various layers that make up the MobileNet V2 architecture; 19 such layers comprise the bottleneck, each consisting of a depthwise convolution layer and a pointwise convolution layer with a skip connection. The flatten layer turns the feature maps of the previous layer into features that can be processed by the neural network.
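The modified classification head (1,280 features in, 18 fish classes out) can be illustrated with a plain-Python stand-in for a fully connected layer; this is a conceptual sketch with made-up zero weights, not the real trained network:

```python
# Conceptual stand-in (plain Python, not the real network) for the modified
# classification head: a fully connected layer mapping the 1280 extracted
# features to 18 fish-class scores, with argmax giving the predicted class.

def linear_head(features, weights, biases):
    """weights: 18 rows of 1280 coefficients; returns 18 class logits."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

num_features, num_classes = 1280, 18
features = [0.0] * num_features
weights = [[0.0] * num_features for _ in range(num_classes)]
biases = [0.0] * num_classes
biases[3] = 1.0                      # make class 3 the obvious winner
logits = linear_head(features, weights, biases)
predicted = max(range(num_classes), key=lambda k: logits[k])
print(len(logits), predicted)  # 18 3
```

In a PyTorch implementation, one would typically load `torchvision.models.mobilenet_v2(pretrained=True)` and replace its final linear layer with `nn.Linear(1280, 18)`, which is the modification the text describes.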
The last layer is used for the classification process to determine the class to which the processed image belongs.

Table 2. Architecture of MobileNet V2

Input        Operator      t    c      n    s
224² × 3     conv2d        -    32     1    2
112² × 32    bottleneck    1    16     1    1
112² × 16    bottleneck    6    24     2    2
56² × 24     bottleneck    6    32     3    2
28² × 32     bottleneck    6    64     4    2
14² × 64     bottleneck    6    96     3    1
14² × 96     bottleneck    6    160    3    2
7² × 160     bottleneck    6    320    1    1
7² × 320     conv2d 1x1    -    1280   1    1
7² × 1280    avgpool 7x7   -    -      1    -
1² × 1280    conv2d 1x1    -    k      -    -

C. Optimizer (AdamW)

The AdamW optimizer is the Adam optimization combined with L2 regularization and weight decay [21], while the Adam optimizer is an optimization algorithm that replaces stochastic gradient descent in the training stage of deep learning models. Adam combines the best properties of other optimization algorithms, such as AdaGrad and RMSProp, which have the advantage of an adaptive learning rate. The AdamW algorithm uses the hyper-parameters α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, and λ ∈ ℝ. The hyper-parameters are pre-set; the step counter is initialized as t ← 0, the first-moment vector as m_t ← 0, the second-moment vector as v_t ← 0, and the schedule multiplier η_t ∈ ℝ is set by the learning-rate schedule. In the AdamW algorithm, the counter t increases with each iteration, as in (1).

t ← t + 1    (1)

Then the gradient of the loss with respect to the weights, together with the weight-decay term, is computed in (2).

g_t ← ∂L_t/∂W_{i,t−1} + λ W_{i,t−1}    (2)

Next, the first moment is updated with a formula similar to momentum, m_t in (3), and the second moment with a formula similar to RMSProp, v_t in (4).

m_{i,t} = β₁ m_{i,t−1} + (1 − β₁) g_t    (3)

v_{i,t} = β₂ v_{i,t−1} + (1 − β₂) g_t²    (4)

Of course, the AdamW algorithm still applies a bias-correction technique, adding formulas so that m_t and v_t do not remain biased toward 0 or close to 0.
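Putting the moment updates together with the bias correction and update rule completed in (5)–(7) below, one AdamW step can be sketched in plain Python. This is an illustration, not the authors' code; the quadratic toy objective and the weight-decay value λ = 0.01 are assumptions, while α, β₁, β₂, and ε are the values quoted above:

```python
# Plain-Python sketch (illustrative) of one AdamW update following equations
# (1)-(7): moment updates, bias correction, and the decoupled weight-decay term.
import math

def adamw_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, eta=1.0):
    m = beta1 * m + (1 - beta1) * grad          # eq. (3), first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # eq. (4), second moment
    m_hat = m / (1 - beta1 ** t)                # eq. (5), bias correction
    v_hat = v / (1 - beta2 ** t)                # eq. (6), bias correction
    w = w - eta * (alpha * m_hat / (math.sqrt(v_hat) + eps)
                   + weight_decay * w)          # eq. (7), decoupled decay
    return w, m, v

# Toy example (an assumption for illustration): minimize L(w) = w**2 from
# w = 1.0; the weight shrinks toward the minimum at 0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * w                 # dL/dw
    w, m, v = adamw_step(w, grad, m, v, t)
print(w)
```

In PyTorch this corresponds to `torch.optim.AdamW`, whose `weight_decay` argument implements the decoupled decay term.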
in the first iteration (t = 1), the momentum and rmsprop terms start from 0, so additional formulas are needed in the next step to avoid this bias in the initial iterations: m̂_t (m hat, the bias-corrected momentum in (5)) and v̂_t (v hat, the bias-corrected rmsprop term in (6)).

m̂_t = m_t / (1 − β1^t)   (5)
v̂_t = v_t / (1 − β2^t)   (6)

then (7) is the weight update in adamw: the new weight equals the old weight minus the schedule multiplier η_t times the sum of α·m̂_t divided by (√v̂_t + ε) and the decoupled weight-decay term λ·W_{t−1}.

W_t = W_{t−1} − η_t·(α·m̂_t / (√v̂_t + ε) + λ·W_{t−1})   (7)

iii. results and discussion
a. learning rate testing
the learning rate search in the testing process covers the range 0.1 to 1e−6, and a value of 1.74e−3 is obtained. the learning rate is an essential component that must be considered: if it is too large, training will not reach the global minimum of the loss, while if it is too low, training will take too long to reach the global minimum and may even get stuck in a local minimum. figure 5 shows the learning rate recommended for the modified transfer learning architecture because, according to smith's research [24], the ideal learning rate is neither too large nor too small. the learning rate value of 1.74e−3 is the better choice in this study.

fig. 5. suggested learning rate

b. batch size testing
in the batch size tests, the researchers obtained different results for train cost, test cost, train score, test score, and number of epochs, shown in table 3 and table 4. the phase 1 (adaptation) batch size results in table 3 show that a small batch size worsens the test cost and test score, because having less training data in one iteration affects the results.
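the adamw update in (1)–(7) can be sketched as a minimal numpy step. the hyperparameter defaults follow the values stated in the text; the schedule multiplier eta and the weight-decay value are illustrative assumptions, and this is a sketch of the algorithm rather than the authors' training code:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, eta=1.0):
    """one adamw update following (1)-(7), with decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad        # momentum-like first moment (3)
    v = beta2 * v + (1 - beta2) * grad ** 2   # rmsprop-like second moment (4)
    m_hat = m / (1 - beta1 ** t)              # bias correction (5)
    v_hat = v / (1 - beta2 ** t)              # bias correction (6)
    # weight update with the decoupled weight-decay term (7)
    w = w - eta * (alpha * m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

each call advances one iteration t; the caller keeps m, v, and t between calls, exactly as the algorithm's state initialization in the text describes.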
meanwhile, with a large batch size (256), more training data is processed in one iteration, as the good test cost and test score show. the small number of epochs for the batch size 64 model indicates it is stuck in a local minimum. the phase 2 batch size results in table 4 show good values for train cost, test cost, train score, and test score. the most striking change is in the number of epochs, which decreases at large batch sizes. from the phase 1 (adaptation) and phase 2 tests, the chosen learning rate and early stopping are efficient with respect to the accuracy obtained. in phase 1 (adaptation), with learning rate = 0.001, a small batch size indeed yields a lower accuracy value, because it is affected by early stopping, which stops the training process when the accuracy value no longer increases [20].

c. performance of the modified transfer learning model
performance testing using the modified transfer learning architecture in phase 1 and phase 2 is shown in table 5 and table 7. table 5 shows the highest accuracy value of 93.85% with two early stops. in this phase, the values obtained in training and validation are not far apart: neither overfit nor underfit. table 5 can be visualized as a graph, as shown in figure 6.

table 3. phase 1 (adaptation) batch size test results
batch size | train_cost | test_cost | train_score | test_score | epoch
64 | 0.3885 | 0.4139 | 0.8735 | 0.8708 | 7
128 | 0.1759 | 0.1830 | 0.9402 | 0.9407 | 17
256 | 0.2251 | 0.2341 | 0.9263 | 0.9270 | 8

table 4. phase 2 batch size test results
batch size | train_cost | test_cost | train_score | test_score | epoch
64 | 0.0052 | 0.0287 | 0.9985 | 0.9921 | 48
128 | 0.0018 | 0.0167 | 0.9994 | 0.9947 | 46
256 | 0.0038 | 0.0204 | 0.9994 | 0.9938 | 36
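the early stopping behavior referenced above, halting training when validation accuracy stops improving, can be sketched as a simple patience rule. the patience value and function name are illustrative, not the authors' exact configuration:

```python
def train_with_early_stopping(val_scores, patience=2):
    """return the epoch at which training stops: when the validation
    accuracy has not improved for `patience` consecutive epochs."""
    best, wait = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0   # new best: reset the patience counter
        else:
            wait += 1               # no improvement this epoch
            if wait >= patience:
                return epoch        # stop early
    return len(val_scores)          # ran all epochs without triggering
```

with a small batch size the validation score plateaus sooner, so this rule stops training earlier, which matches the lower accuracy the text attributes to early stopping at batch size 64.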
in figure 6, the training and validation accuracy values show no significant difference when viewed from table 5, but the effect of the parameters prepared above, such as early stopping and the learning rate, is evident in the figure.

fig. 6. phase 1 accuracy and loss graph

this section discusses the performance of the prediction model the researchers use. the confusion matrix is one of the tools used to evaluate prediction models in supervised learning: it serves as a benchmark by allowing accuracy, precision, and sensitivity/recall to be calculated. from the prediction results, the phase 1 (adaptation) confusion matrix was obtained, as shown in figure 7, together with the accuracy, precision, and sensitivity/recall calculations for it. this confusion matrix calculation is described in tabular form in table 6.
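the accuracy, precision, and sensitivity/recall calculations from a confusion matrix can be sketched in numpy. this is a generic sketch, not the authors' code; the layout assumed here is rows = true class, columns = predicted class:

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = count of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # correct predictions per class
    precision = tp / cm.sum(axis=0)        # column sums: all predicted as j
    recall = tp / cm.sum(axis=1)           # row sums: all truly class i
    f1 = 2 * precision * recall / (precision + recall)  # assumes p + r > 0
    accuracy = tp.sum() / cm.sum()         # overall fraction correct
    return accuracy, precision, recall, f1
```

averaging the per-class precision, recall, and f1 vectors gives the macro averages reported at the bottom of the classification report tables.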
table 5. accuracy and loss values in phase 1 (adaptation)
epoch | training avg accuracy | training avg loss | validation avg accuracy | validation avg loss
1 | 89.28 | 0.36 | 89.16 | 0.38
2 | 90.97 | 0.29 | 91.05 | 0.29
3 | 92.31 | 0.25 | 91.60 | 0.27
4 | 92.87 | 0.23 | 92.05 | 0.25
5 | 92.88 | 0.22 | 92.33 | 0.24
6 | 93.52 | 0.20 | 92.69 | 0.23
7 | 93.87 | 0.19 | 93.32 | 0.21
8 | 93.94 | 0.19 | 93.16 | 0.22
9 | 93.98 | 0.18 | 93.49 | 0.20
10 | 93.81 | 0.19 | 92.94 | 0.21
11 | 94.42 | 0.17 | 93.64 | 0.20
12 | 94.38 | 0.17 | 93.85 | 0.19
13 | 94.39 | 0.17 | 93.24 | 0.20
14 | 94.64 | 0.16 | 93.62 | 0.20

table 6. classification report phase 1
class | precision | recall | f1-score
abudefduf vaigiensis | 0.929 | 1 | 0.963
acanthurus nigrofuscus | 0.808 | 0.886 | 0.845
amphiprion clarkii | 0.981 | 0.987 | 0.983
chaetodon lunulatus | 0.967 | 0.995 | 0.98
chaetodon trifascialis | 0.915 | 0.788 | 0.846
chromis chrysura | 0.92 | 0.959 | 0.939
dascyllus aruanus | 0.936 | 0.993 | 0.963
dascyllus reticulatus | 0.956 | 0.921 | 0.938
hemigymnus fasciatus | 1 | 0.916 | 0.956
hemigymnus melapterus | 0.906 | 0.763 | 0.829
lutjanus fulvus | 1 | 0.976 | 0.987
myripristis kuntee | 0.959 | 0.9 | 0.928
neoglyphidodon nigroris | 0.733 | 0.379 | 0.499
neoniphon sammara | 1 | 1 | 1
pempheris vanicolensis | 1 | 0.75 | 0.857
plectroglyphidodon dickii | 0.891 | 0.906 | 0.898
pomacentrus moluccensis | 1 | 1 | 1
zebrasoma scopas | 0.6 | 0.739 | 0.662
average | 0.916 | 0.881 | 0.892

fig. 7. confusion matrix in phase 1 (adaptation)

in table 7, the highest validation accuracy value is 99.54375%, and the training accuracy reaches 99.92389% with two early stops, so the values obtained in training and validation overfit at this stage. however, the overfitting does not make a big difference; table 7 can be visualized in the form of a graph, as shown in figure 8.

table 7. accuracy and loss values in phase 2
epoch | training avg accuracy | training avg loss | validation avg accuracy | validation avg loss
1 | 90.27348 | 0.42452 | 89.87 | 0.44
2 | 90.21006 | 0.42567 | 89.87124 | 0.44435
3 | 94.33254 | 0.23217 | 93.69 | 0.25
4 | 94.22853 | 0.23249 | 93.69360 | 0.25335
5 | 96.58278 | 0.14528 | 96.04 | 0.17
… | … | … | … | …
27 | 99.91374 | 0.00455 | 99.54 | 0.02
28 | 99.92389 | 0.00455 | 99.54375 | 0.01831
29 | 99.93911 | 0.00417 | 99.49 | 0.02
30 | 99.89852 | 0.00456 | 99.49305 | 0.01938
31 | 99.93911 | 0.00370 | 99.52 | 0.02
32 | 99.94672 | 0.00359 | 99.52347 | 0.01674
33 | 99.94926 | 0.00339 | 99.52 | 0.02

fig. 8. phase 2 accuracy and loss graph

in figure 8, the training and validation accuracy values from table 7 form a smooth graph, and the difference between them is less significant, since a small learning rate can reach the global minimum. figure 9 shows phase 2, where the performance of the fish image classification system in the transfer learning process is analyzed using the mobilenet v2 architecture. the modified transfer learning architecture model improves performance: the fn and fp values decrease and the tp values increase. the accuracy, precision, and sensitivity/recall calculations for the phase 2 confusion matrix in figure 9 are described in tabular form in table 8.

fig. 9. confusion matrix in phase 2

d. testing with other ai models
in this section, the researchers compare machine learning and deep learning models, using a traditional cnn that has five convolution blocks and two hidden layer blocks with softmax activation. each convolution block has 3×3 filters, with stride = 1, padding = 1, relu activation, and max pooling. table 9 shows that the modified transfer learning model gives the best results for several reasons: it uses a pre-trained architectural model, and the data used by the researchers do not differ much from the data the architecture was trained on previously, since the pre-trained model has been trained on 1000 different types of images.
traditional cnns are computationally faster to train on images than the modified transfer learning model, which uses inverted residual layers, although their accuracy differs only slightly from the modification used by the researchers. the machine learning models knn, random forest, and xgboost did not achieve accuracy values above 90%, but they are already adequate for classifying fish images. however, the process of restructuring image data (unstructured data) into tabular (structured) data still takes much time.

iv. conclusion
this study aims to classify fish images using modified transfer learning and to determine the best performance. using a pre-trained mobilenet model, the classification layer can be modified to give the modified transfer learning results. traditional cnns can also be used to classify fish images, but designing the hidden layers is time-consuming and computationally expensive; modified transfer learning can be used to solve this problem. the modified transfer learning performance and confusion matrix test results are excellent. in phase 1 testing, the accuracy value = 0.8751, precision value = 0.9355, and recall/sensitivity value = 0.93055. in phase 2 testing, the accuracy value = 0.9895, precision value = 0.9947, and recall/sensitivity value = 0.9947. based on the study's results, we can conclude that modified transfer learning can be the best model.
table 8. classification report phase 2
class | precision | recall | f1-score
abudefduf vaigiensis | 1 | 1 | 1
acanthurus nigrofuscus | 0.973 | 0.99 | 0.981
amphiprion clarkii | 0.997 | 1 | 0.998
chaetodon lunulatus | 0.998 | 0.9 | 0.946
chaetodon trifascialis | 0.9 | 1 | 0.947
chromis chrysura | 0.997 | 0.998 | 0.997
dascyllus aruanus | 1 | 1 | 1
dascyllus reticulatus | 0.996 | 0.994 | 0.994
hemigymnus fasciatus | 1 | 0.979 | 0.989
hemigymnus melapterus | 0.947 | 0.947 | 0.947
lutjanus fulvus | 0.976 | 1 | 0.987
myripristis kuntee | 0.997 | 0.988 | 0.922
neoglyphidodon nigroris | 0.933 | 0.965 | 0.948
neoniphon sammara | 1 | 1 | 1
pempheris vanicolensis | 1 | 0.875 | 0.933
plectroglyphidodon dickii | 0.993 | 0.933 | 0.933
pomacentrus moluccensis | 1 | 1 | 1
zebrasoma scopas | 0.985 | 0.931 | 0.957
average | 0.982 | 0.972 | 0.971

table 9. benchmarking table with machine learning models
no | method | accuracy
1 | modified transfer learning | 99.64%
2 | traditional cnn | 98.58%
3 | knn | 85.5%
4 | random forest | 81.63%
5 | xgboost | 86.55%

declarations
author contribution: all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement: this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest: the authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information: reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: the department of electrical engineering, universitas negeri malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

references
[1] j. p. schulze rojas, "reef front heterogeneity analysis and coral genera diversity pattern in the bunaken national park, indonesia." 2010.
[2] i. asaad, c. j. lundquist, m. v. erdmann, and m. j.
costello, "delineating priority areas for marine biodiversity conservation in the coral triangle," biol. conserv., vol. 222, pp. 198–211, jun. 2018.
[3] e. yuliana, i. farida, nurhasanah, m. boer, a. fahrudin, and m. m. kamal, "habitat quality and reef fish resources potential in karimunjawa national park, indonesia," aacl bioflux, vol. 13, no. 4, pp. 1836–1848, 2020.
[4] i. cáceres, e. c. ibarra-garcía, m. ortiz, m. ayón-parente, and f. a. rodríguez-zaragoza, "effect of fisheries and benthic habitat on the ecological and functional diversity of fish at the cayos cochinos coral reefs (honduras)," mar. biodivers., vol. 50, no. 1, p. 9, feb. 2020.
[5] b. j. boom et al., "long-term underwater camera surveillance for monitoring and analysis of fish populations," work. vis. obs. anal. anim. insect behav. (vaib), in conjunction with icpr 2012, pp. 2–5, 2012.
[6] s. villon et al., "a deep learning method for accurate and fast identification of coral reef fishes in underwater images," ecol. inform., vol. 48, pp. 238–244, 2018.
[7] s. winiarti, f. i. indikawati, a. oktaviana, and h. yuliansyah, "consumable fish classification using k-nearest neighbor," iop conf. ser. mater. sci. eng., vol. 821, no. 1, p. 012039, apr. 2020.
[8] z. jin, j. shang, q. zhu, c. ling, w. xie, and b. qiang, "rfrsf: employee turnover prediction based on random forests and survival analysis," in lecture notes in computer science, vol. 12343 lncs, 2020, pp. 503–515.
[9] y. c. chang, k. h. chang, and g. j. wu, "application of extreme gradient boosting trees in the construction of credit risk assessment models for financial institutions," appl. soft comput. j., vol. 73, pp. 914–920, 2018.
[10] t. chen and c. guestrin, "xgboost: a scalable tree boosting system," proc. acm sigkdd int. conf. knowl. discov. data min., pp. 785–794, 2016.
[11] w. jiao, x. hao, and c.
qin, "the image classification method with cnn-xgboost model based on adaptive particle swarm optimization," information, vol. 12, no. 4, p. 156, apr. 2021.
[12] j. brownlee, xgboost with python: gradient boosted trees with xgboost and scikit-learn. 2018.
[13] n. f. f. alshdaifat, a. z. talib, and m. a. osman, "improved deep learning framework for fish segmentation in underwater videos," ecol. inform., vol. 59, p. 101121, 2020.
[14] s. cui, y. zhou, y. wang, and l. zhai, "fish detection using deep learning," appl. comput. intell. soft comput., vol. 2020, 2020.
[15] b. s. rekha, g. n. srinivasan, s. k. reddy, d. kakwani, and n. bhattad, fish detection and classification using convolutional neural networks, vol. 1108 aisc. springer international publishing, 2020.
[16] f. kratzert and h. mader, "fish species classification in underwater video monitoring using convolutional neural networks," 2018.
[17] d. li, z. wang, s. wu, z. miao, l. du, and y. duan, "automatic recognition methods of fish feeding behavior in aquaculture: a review," aquaculture, vol. 528, p. 735508, 2020.
[18] a. g. howard et al., "mobilenets: efficient convolutional neural networks for mobile vision applications," 2017.
[19] j. brownlee, better deep learning: train faster, reduce overfitting, and make better predictions, vol. 1.3. 2019.
[20] i. goodfellow, y. bengio, and a. courville, deep learning. 2016.
[21] i. loshchilov and f. hutter, "decoupled weight decay regularization," 7th int. conf. learn. represent. (iclr), 2019.
[22] d. masters and c. luschi, "revisiting small batch training for deep neural networks," pp. 1–18, 2018.
[23] b. j. boom, p. x. huang, j. he, and r. b. fisher, "supporting ground-truth annotation of image datasets using clustering," proc. int. conf. pattern recognit., pp. 1542–1545, 2012.
[24] l. n. smith, "cyclical learning rates for training neural networks," proc. 2017 ieee winter conf. appl. comput.
vision, wacv 2017, pp. 464–472, 2017.
knowledge engineering and data science (keds), pissn 2597-4602, eissn 2597-4637, vol 4, no 1, july 2021, pp. 55–68. https://doi.org/10.17977/um018v4i12021p55-68

detection of disease and pest of kenaf plant based on image recognition with vggnet19
diny melsye nurul fajri a, 1, *, wayan firdaus mahmudy a, 2, titiek yulianti b, 3
a faculty of computer science, brawijaya university, 8th veteran road, malang 65145, jawa timur, indonesia
b indonesian research institute for sweetener and fiber crops (irisfc), karangploso km 4, po box 199, kabupaten malang, jawa timur, indonesia
1 dimelnf@gmail.com*; 2 wayanfm@ub.ac.id; 3 tyuliant@gmail.com
* corresponding author

i. introduction
kenaf plant fiber is one of the basic ingredients for environmentally friendly products that have become the center of global attention. this fiber is intended to replace plastic materials used in car interiors at popular automotive companies [1], which has increased the export value of the raw material to meet production needs. a problem was encountered in one of kenaf's plantations, where some plants died and the harvest failed; the cause of this crop failure was kenaf plants attacked by diseases and pests. farmers do not recognize the plants affected by disease because of a lack of counseling from experts, which can be fatal.
with the development of information technology systems, this can be overcome by detecting symptoms of diseases/pests in kenaf plants early to avoid crop failure. one of the methods offered is to detect disease/pest symptoms in kenaf plants through pictures: a machine performs image recognition and can then make decisions based on the trained images. this machine-driven decision-making system is called machine learning [2]. one application of the machine learning method was carried out by saragih in a decision-making system for classifying jatropha plant disease, with an accuracy of 60.61% [3]. this result is categorized as quite good because the accuracy is above 50%. the data used in that study [3] are scores for disease types in jatropha, given directly by an expert. the more data that is trained, the more complex the machine learning system becomes, given the data contained in one image: one image can include hundreds of thousands or even millions of data points that must be processed.

article info
article history: received 14 january 2021; revised 9 may 2021; accepted 13 august 2021; published online 17 august 2021
abstract: one of the advantages of kenaf fiber as an environmentally friendly product currently in the center of attention is the use of kenaf fiber for luxury car interiors in place of plastic materials. the opportunity to export kenaf fiber raw material will provide significant benefits, especially for the agricultural sector in indonesia. however, there are problems in several areas of kenaf's gardens, namely plants attacked by diseases and pests, which reduce yields and even kill the plants. this problem is caused by the limited expertise and working hours of extension workers, as well as farmers' limited knowledge about kenaf plants, which has a terrible effect on the crop.
the development of information technology makes it possible to overcome this by imparting knowledge into machines, known as artificial intelligence. in this study, the convolutional neural network method was applied to identify symptoms and provide information about disease symptoms in kenaf plants based on images, so that plant diseases can be controlled early. training on data obtained directly from kenaf plantations gave an accuracy of 57.56% for the first recognition stage, with two classes, on the vggnet19 architecture, and 25.37% for the second recognition stage, with four classes, on the vggnet19 architecture. a 5×5 block matrix input feature was added in training to get maximum results.
keywords: convolutional neural network; image recognition; kenaf; neural network; vggnet19

complex machine learning that processes large amounts of data, such as image data, is called deep learning. deep learning is often found in everyday cellphone applications such as google photos, qr scanners, and even biometric security systems like face unlock [4]. several studies using deep learning [5][6][7][8] achieved 90%–95% accuracy, which indicates that deep learning learns better than classical machine learning. the convolutional neural network (cnn) is the primary method for classifying images in machine learning. this method consists of several layers, i.e., a convolutional layer, a subsampling layer, and a fully connected layer. the arrangement of these layers is called the architecture.
cnn has several architectures. one of them is the visual geometry group network (vggnet) 19, which focused on the effect of cnn depth on accuracy [9]. several studies using the vgg architecture obtain reasonably high accuracy values. one, conducted by khrisne and suyadna, identified types of indonesian herbs and spices [10]: the objects have many similarities in model and shape, and recognition using the vgg architecture reached 70% on 3527 images divided into 27 classes. other studies also show that the vggnet architecture is good enough to recognize types of diseases. one such study compared three architectures, stridednet, lenet, and vggnet; the results show that vggnet has the highest accuracy, reaching 95.40%, with lenet at 93.65% and stridednet at 90.10% [11]. based on field conditions and supported by current technological advances, almost all farmers use smartphones, which helps them control the condition of their kenaf plants and receive early supervision through their smartphones to prevent crop failures caused by pests and diseases. in information technology, this is quite a challenge: to help farmers and save time. this research introduces recognition of the types of pests and diseases on kenaf plants using image-based deep learning techniques. this research continues previous work carried out by fajri [12], who applied a similar method to the same object. the change made in this study is to add horizontal 5×5 matrix mean calculations on the image data in the input model, so that recognition training is more effective, efficient, and maximal. the final result will certainly change when the input model is changed.
the image data used in this study are the same as in previous studies, which is included in the research boundaries due to the plantation conditions. this research is dedicated to helping plantation farmers, especially kenaf farmers, through a technological approach: an artificial intelligence decision-making system in the field of information technology. it also evaluates, based on the research results, whether the method used is the right method for detection and how accurate the designed method is.

ii. methods
some of the methods that underlie this research include deep learning, cnn and its architecture, and the proposed method for building a system that detects symptoms of pests and diseases of kenaf plants with good results.

a. object recognition
the object used in this study is the kenaf plant. there are various kinds of kenaf leaf morphology, such as unlobed (figure 1a), partially lobed (figure 1b), and deeply lobed (figure 1c); deeply lobed leaves have 5–7 lobes and serrations. the cross-section of the kenaf leaf is shown in figure 1. healthy leaves are fresh green without any reddish or yellowish-to-brownish spots. figure 2 shows a healthy kenaf leaf: fresh leaves that stretch without any curved parts are categorized under the criterion "sehat". many types of diseases can attack kenaf plants, including fusarium wilt, root rot, bacterial wilt, leaf blight, and anthracnose [13]. the most frequently encountered, and the one matching the object used in this study, is leaf blight on the leaves: the leaf surface looks brownish, with black spots containing fungal pycnidia in the middle of the spots [13]. leaf blight is shown in figure 3. the pests that attack also give symptoms on the leaves: leaf edges start to turn yellow, then appear reddish and curl inward. symptoms like this are due to the attack of the amrasca bigutulla pest, called sundapteryx, on kenaf leaves. the symptoms of sundapteryx can be seen in figure 4.
fig. 1. kenaf's leaf cross-section: (a) unlobed, (b) partially lobed, and (c) deeply lobed
fig. 2. kenaf's healthy leaf
fig. 3. leaf blight disease on kenaf leaf

another symptom that affects kenaf leaves is reddish patches that spread over almost the entire leaf surface and cause the leaves to dry out. the dried leaves fall off, causing the plant to die. the cause of this symptom is mites (tungau urticae). t. urticae is usually found in areas with temperatures around 30 °c and low (dry) humidity [14]. symptoms of mites are shown in figure 5.

b. deep learning
deep learning is a family of neural network algorithms. the input to learning is metadata, which is then processed through hidden layers to produce the output value. the advantage of the deep learning technique is its unique ability to perform automatic feature extraction, which means deep learning can find distinguishing/unique features relevant to the problem [15]. the way the technique works is quite complex because it focuses on the network architecture and its optimal procedural features; deep learning can self-tune and select the optimal training model without requiring extra information [16]. deep learning on image objects can be done with a convolutional neural network.

fig. 4. sundapteryx bigutulla on kenaf leaf
fig. 5. tungau-affected kenaf leaf

c. convolutional neural network
the convolutional neural network is a development of the neural network designed to process two-dimensional data such as images and audio.
yann lecun developed cnn in 1990 through handwriting and digit recognition [17]; a cnn presented by alex krizhevsky later won the imagenet large scale visual recognition challenge, trained on the lsvrc-2010 imagenet training set of 1.3 million high-resolution images in 1000 different classes. the basic concept of cnn is that it receives two-dimensional data and then passes it on to the next layer to be processed into an output. for every datum that enters a layer, a linear operation is carried out with determined weight values, then transformed using a non-linear operation called the activation function.

1) activation function
the activation function determines whether neurons are active or inactive based on their weighted input. one widely used activation function is the rectified linear unit (relu). relu applies a threshold at zero: if x ≤ 0 then f(x) = 0, and if x > 0 then f(x) = x [18]. the equation can be seen in (1), and the curve is shown in figure 6.

f(x) = max(0, x)   (1)

2) dropout regularization
dropout is a neural network regularization technique that reduces the number of active neurons by removing some of them: the contribution of the dropped neurons is temporarily suspended. the selection of neurons is made to reduce the risk of overfitting [19]. the dropout technique is illustrated in figure 7.

fig. 6. relu activation function
fig. 7. dropout technique

3) cnn architecture
layers and neurons cannot be defined by precise rules and are subject to different treatments for different data types [20]. several layers can be combined to get more accurate results and better processing performance. the main layers of cnn are as follows:
a) convolution layer
the convolution layer is the primary layer of the cnn architecture; it performs convolution on the output of the previous layer.
convolution applies a kernel function at all possible offsets: the kernel moves from the top-left corner to the bottom-right corner. the purpose of this process is to extract features from the image. the convolution result is a linear transformation of the input according to the spatial information in the data. the cnn convolution operation is shown in figure 8.
b) subsampling layer
the subsampling layer is a layer that reduces the size of the image; it aims to increase the position invariance of the features. techniques in the subsampling layer include average pooling, k-max pooling, and max pooling. average pooling takes the average value of each dimension; k-max pooling takes the k largest values of each dimension and combines them; and the subsampling technique most often used is max pooling. max pooling works by breaking the output of the convolution layer into smaller parts and taking the highest value from each part to compose the reduced image matrix. the pooling concept can be seen in figure 9.
c) fully connected layer
a fully connected layer is a layer commonly used in an mlp, which transforms the data dimensions so that the data can be classified linearly. this layer is located at the very end of the network. each neuron passed to the fully connected layer must first be flattened to one dimension, since the data loses spatial information and the operation is not reversible.

fig. 8. convolution process in cnn
fig. 9. pooling concept

d. vggnet architecture
the input to cnn uses 224×224 rgb images with pixel values in the range 0 to 255, subtracting the average image value calculated over the training set. these images then pass through the convolutional layers and the fully connected layers. the vggnet architecture uses smaller 3×3 filters; a stack of three of them has the same receptive field as a single 7×7 convolution layer [9].
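the relu activation in (1) and the max pooling step of figure 9 can be sketched in numpy; this is a generic sketch of the two operations, not the authors' implementation:

```python
import numpy as np

def relu(x):
    """rectified linear unit, f(x) = max(0, x), as in (1)."""
    return np.maximum(0, x)

def max_pool(img, k=2):
    """max pooling: break the feature map into k x k parts and keep the
    highest value of each part, composing the reduced image matrix."""
    h, w = img.shape
    trimmed = img[:h - h % k, :w - w % k]   # drop rows/cols that don't fill a block
    return trimmed.reshape(h // k, k, w // k, k).max(axis=(1, 3))
```

a 2×2 max pool halves each spatial dimension, which is how the subsampling layer increases position invariance while shrinking the feature map.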
An illustration of the architecture can be seen in Figure 10.

E. Proposed Methods
In the type-recognition training, the target was four outputs, namely "hawar daun" (leaf blight), "sehat" (healthy), "sundapteryx", and "tungau" (mites). The basic architectural model used in this training is VGGNet 19, a variant of VGGNet. The layers designed for this training consist of an input layer → 2 blocks of 2 convolution layers and max pooling → 3 blocks of 4 convolution layers and max pooling → a flatten layer → 2 blocks of a dense layer and dropout → a dense layer. The architectural design of this system can be seen in Figure 11.

The input model performs a matrix calculation on the image data to make the recognition more distinctive. The calculation divides the image data into 5×5-pixel matrix blocks; after the block division, the average value of the matrix of each block is computed. The average over the image matrix was computed with the NumPy module in Python, namely the matrix.mean() method, implemented in the source code shown in Figure 12. This calculation is carried out horizontally. The training aims to strengthen the learned features and thereby the accuracy of the result.

After the CNN training process has been completed, the next process is classification. This process determines the classification result based on the previous training. The classification flow begins with entering image data, which is processed according to the CNN training; scanning is then carried out using a 96×96 sliding window. During the scanning process, the system detects whether the image is a leaf image or not. If it is not a leaf, the process is terminated after issuing the result "bukan daun" (not a leaf), and the window gives the option to re-select an image.
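One possible reading of the 5×5 block-averaging step described above (the paper's Figure 12 uses NumPy's matrix.mean()) is sketched below. The function name and the assumption that the image dimensions are multiples of 5 are illustrative choices, not details confirmed by the source.

```python
import numpy as np

def block_mean_5x5(img):
    """Average each 5x5 pixel block of a 2D image matrix using NumPy's mean().

    Assumes image dimensions are multiples of 5, matching the block division
    described in the text.
    """
    h, w = img.shape
    # Group pixels into non-overlapping 5x5 blocks and average within each block.
    blocks = img.reshape(h // 5, 5, w // 5, 5)
    return blocks.mean(axis=(1, 3))

img = np.arange(100.0).reshape(10, 10)  # toy 10x10 "image"
print(block_mean_5x5(img).shape)  # one mean value per 5x5 block -> (2, 2)
```

The result is a smaller matrix of block averages that can then be fed to the recognition network, as the text describes.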
If the inserted image is a leaf image, the result is saved and the classification process proceeds: a region is selected using the non-max suppression algorithm, which gives the final decision for the scanned image, and the predicted disease type is then displayed with its percentage.

Fig. 10. VGG architecture
Fig. 11. VGGNet 19 architecture
Fig. 12. Matrix mean source code

III. Results and Discussion
A. Leaf Recognition
Testing the accuracy of the architectural model in the leaf recognition training, i.e. leaf shape recognition, gives an accuracy value of 57.56%. Across the two classes, the macro-averaged precision, recall, and F1 score amount to 51%, and the weighted average is 72%. After the model is formed, the plot of the resulting values can be seen in Figure 13. The leaf recognition training ran for 100 epochs; the accuracy values obtained over the 100 epochs can be seen in Table 1.

B. Disease Classification
Based on the recognition training designed with this model, the accuracy of the classification results was tested in the same way as the previous training. Testing the accuracy of the disease recognition training, i.e. the kenaf plant disease classification, gives an accuracy value of 25.37%. Across the four classes, the macro-averaged recall is 30%, precision and F1 score are 29%, and the weighted average is 32%. The graph shows increasing accuracy and decreasing loss, which is good enough to continue to the next stage; this can be seen in Figure 14. The disease recognition training ran for 100 epochs; the accuracy values obtained over the 100 epochs can be seen in Table 2.

C.
Feature Input Matrix Block
Based on the input-feature recognition training designed with this model, the accuracy of the classification results was tested in the same way as the previous training. The horizontal input training gives an accuracy value of 24%. Across the four classes, the macro-averaged recall, precision, and F1 score of 28% can be seen in Figure 15. The results of the disease recognition test can be seen in Table 3. The total sample of 200 data is divided into 4 classes: 50 leaf blight, 50 sundapteryx, 50 mites, and 50 healthy. The machine correctly recognized 106 of the 200 tested leaf disease data. The machine misrecognized 2 leaf blight data, 17 sundapteryx data, 50 mite data, and 25 healthy data. Based on these counts, the classification accuracy is 53%.

Fig. 13. Accuracy plot of the architectural model
Fig. 14. Accuracy and classification charts

An object with a correct prediction is shown in Figure 16. In Figure 16, the actual data show that the leaves are affected by symptoms of leaf blight disease, and the system classifies the image as leaf blight with 100% confidence. In Figure 17, the system classifies the kenaf plant as healthy with 75% confidence. Since 75% is the highest score among the categories, the image falls into the healthy category, which matches the actual data. A wrongly recognized disease object can be seen in Figure 18. In Figure 18, the actual data show that the image is in the Sundapteryx bigutulla category, while the system predicts the mite category with a confidence of 42.1%.
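The 53% accuracy quoted above follows directly from the per-class counts; a quick arithmetic check in Python (the dictionary keys are the paper's class labels):

```python
# Correctly predicted samples per class, as reported in the text
correct = {"hawar daun": 48, "tungau": 0, "sundapteryx": 33, "sehat": 25}
actual_per_class = 50                       # 50 test images per class
total = actual_per_class * len(correct)     # 200 tested images
total_correct = sum(correct.values())       # 106 recognized correctly
accuracy = total_correct / total
print(total_correct, total, f"{accuracy:.0%}")  # 106 200 53%
```

The 94 misrecognized images cited in the conclusion are the complement, 200 − 106.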
The prediction error for this image is due to the training data images, which have high similarity between categories. The similarity can be seen in the image histograms shown in Figure 19. A histogram shows the spread of pixel intensity values in an image. Figure 19 shows that the histograms of the sundapteryx and mite categories are similar, which is one cause of the prediction errors made by the system.

Table 1. Leaf recognition training accuracy over 100 epochs

Epoch  Train accuracy  Train loss    Time
1      0.7672          0.525         52s
2      0.8307          0.3751        57s
3      0.8594          0.3038        59s
4      0.8942          0.1964        59s
5      0.9032          0.1389        59s
…
96     1.0000          0.00008467    64s
97     1.0000          0.000091677   63s
98     1.0000          0.000078224   63s
99     1.0000          0.000098397   64s
100    1.0000          0.000060537   63s

Table 2. Disease recognition training accuracy over 100 epochs

Epoch  Train accuracy  Train loss  Time
1      0.3231          1.6566      53s
2      0.3393          1.3856      65s
3      0.4036          1.3153      50s
4      0.4564          1.2401      60s
5      0.4911          1.1098      70s
…
96     1.0000          0.0179      67s
97     0.9949          0.0222      67s
98     0.9949          0.0157      67s
99     0.9955          0.0135      77s
100    1.0000          0.0144      67s

Table 3. Disease recognition results per class

No  Class                     Actual data  System prediction
1   Leaf blight (hawar daun)  50           48
2   Mites (tungau)            50           0
3   Sundapteryx bigutulla     50           33
4   Healthy (sehat)           50           25
    Total                     200          106

Fig. 15. Horizontal input feature training accuracy
Fig. 16. Identification system detection with correct prediction
Fig. 17. System detection in the healthy plant category

IV. Conclusion
In conclusion, detection of kenaf plant diseases and pests can be achieved by developing image recognition technology, coded in the Python programming language with the open-source Keras library running on the TensorFlow machine learning platform.
The architecture used in this study is the convolutional neural network architecture VGGNet 19 for leaf recognition and disease identification. The VGGNet 19 layers used for detection in this study include the input layer, convolution layers, max pooling layers, dense layers, dropout layers, and an additional flatten layer in the disease identification part. Input feature training is applied so that the characteristics of each disease cluster become more distinctive, thereby facilitating the classification process. The input feature training is designed by dividing the image data into 5×5-pixel matrix blocks and computing the average value of each block so that a new matrix is formed, which is then reprocessed using the specified architecture.

Fig. 18. Detection system with a prediction error
Fig. 19. Image histograms for the sundapteryx and mite categories

The recognition training achieved an accuracy value of 57.56% for the first (two-class) recognition task and 25.37% for the second (four-class) task on the VGGNet 19 architecture. Based on the architectural design established in this study, of the 200 leaf data tested, 94 diseased leaf images could not be recognized correctly by the system, giving an accuracy of 53%. Based on the tests carried out in this study, the convolutional neural network method can be recommended as a solution for image-based recognition cases. Some suggestions for further research: add more data so that the sample results can achieve better accuracy; test more diverse architectures; and expand the input feature training with more diverse block matrix sizes, such as 3×3, 8×8, and 16×16.
Other methods can also be added, such as an adaptive fuzzy filter or other filtering methods to reduce noise or adjust the light in the image, to obtain more accurate training results through image data improvement.

Declarations
Author contribution: All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest: The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information: Reprints and permission information is available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] Miyagawa and Tranggono, "Kebutuhan serat kenaf sebagai bahan baku industri PT TBINA," in Seminar Nasional Serat Alam: Inovasi Teknologi Serat Alam Mendukung Agroindustri yang Berkelanjutan, 2015, pp. 54–59.
[2] E. Alpaydin, Introduction to Machine Learning. London: The MIT Press, 2004.
[3] T. H. Saragih, W. F. Mahmudy, A. L. Abadi, D. M. N. Fajri, and Y. P. Anggodo, "Jatropha curcas disease identification with extreme learning machine," Indones. J. Electr. Eng. Comput. Sci., vol. 12, no. 2, 2018, doi: 10.11591/ijeecs.v12.i2.pp883-888.
[4] M. Hassaballah and A. I. Awad, Deep Learning in Computer Vision, 1st ed. CRC Press, 2020.
[5] J. Akbar, M. Shahzad, M. I. Malik, A. Ul-Hasan, and F. Shafait, "Runway detection and localization in aerial images using deep learning," 2019 Digit. Image Comput. Tech. Appl., pp. 1–8, 2019.
[6] F.
Ertam, "Deep learning based text classification with web scraping methods," in 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), 2018, pp. 1–4, doi: 10.1109/IDAP.2018.8620790.
[7] S. T. Kebir and S. Mekaoui, "An efficient methodology of brain abnormalities detection using CNN deep learning network," in 2018 International Conference on Applied Smart Systems (ICASS), 2018, pp. 1–5, doi: 10.1109/ICASS.2018.8652054.
[8] M. D. Radu, I. M. Costea, and V. A. Stan, "Automatic traffic sign recognition artificial intelligence deep learning algorithm," 2020, doi: 10.1109/ECAI50035.2020.9223186.
[9] P. Nepal, "VGGNet architecture explained," 2020. https://medium.com/analytics-vidhya/vggnet-architecture-explained-e5c7318aa5b6.
[10] D. C. Khrisne and I. M. A. Suyadnya, "Indonesian herbs and spices recognition using smaller VGGNet-like network," in 2018 International Conference on Smart Green Technology in Electrical and Information Systems (ICSGTEIS), Oct. 2018, pp. 221–224, doi: 10.1109/ICSGTEIS.2018.8709135.
[11] S. V. Militante and B. D. Gerardo, "Detecting sugarcane diseases through adaptive deep learning models of convolutional neural network," 2019 IEEE 6th Int. Conf. Eng. Technol. Appl. Sci., pp. 1–5, 2019, doi: 10.1109/ICETAS48360.2019.9117332.
[12] D. M. N. Fajri, W. F. Mahmudy, and T. Yulianti, "Detection of disease and pest of kenaf plant using convolutional neural network," J. Inf. Technol. Comput. Sci., vol. 6, no. 1, p. 18, Apr. 2021, doi: 10.25126/jitecs.202161195.
[13] T. Yulianti and Supriyono, "Penyakit tanaman kenaf dan pengendaliannya," in Monograf Balittas: Kenaf (Hibiscus cannabinus L.), 2009, p. 107.
[14] P. Ranalli, "A survey of hemp pest and disease," in Advances in Hemp Research, P. Ranalli, Ed.
Boca Raton, London, New York: CRC Press Taylor & Francis Group, 1999, pp. 109–122.
[15] J. K. Gill, "Automatic log analysis using deep learning and AI," 2018.
[16] Z. Monge, "Does deep learning really require 'big data'? — No!," Medium Towards Data Science, 2018.
[17] Y.
Le Cun et al., Handwritten Digit Recognition with a Back-Propagation Network. San Francisco: Morgan Kaufmann Publishers Inc., 1990.
[18] R. H. R. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. Douglas, and H. S. Seung, "Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit," Nature, vol. 405, pp. 947–951, 2000, doi: 10.1038/35016072.
[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[20] D. Stathakis, "How many hidden layers and nodes?," Int. J. Remote Sens., vol. 30, no. 8, pp. 2133–2147, 2009, doi: 10.1080/01431160802549278.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol. 2, No. 2, December 2019, pp.
90–100, eISSN 2597-4637, https://doi.org/10.17977/um018v2i22019p90-100. ©2019 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Comparison of Indonesian Imports Forecasting by Limited Period Using SARIMA Method
Harits Ar Rosyid a,1, Mutyara Whening Aniendya a,2, Heru Wahyu Herwanto a,3,*
a Electrical Engineering Department, Universitas Negeri Malang, Jl. Semarang No. 5, Malang 65145, Indonesia
1 harits.ar.ft@um.ac.id; 2 mutyaraaniendya@gmail.com; 3 heru_wh@um.ac.id
* corresponding author

I. Introduction
Indonesia is a country with rapid economic growth. Good economic growth is one of the national benchmarks capable of bringing welfare to its people. Economic growth in Indonesia, especially international trade in exports and imports, is among the largest. Import is the activity of bringing goods from another country into the Indonesian customs area. Imports cover three types of goods often needed by Indonesian society: consumption goods, raw materials, and capital goods. The import trade balance data of the Ministry of Trade of the Republic of Indonesia from January 2002 until July 2019 show imports experiencing unstable increases and decreases. Indonesian imports in January 2019 – July 2019 amounted to 111.88 billion USD, a decrease of 9.89% compared with the imports of January 2018 – July 2018, which amounted to 124.167 billion USD. The still very high need for imports can decrease Indonesian income because of domestic payments made abroad, while exports add money because of overseas purchases of domestic goods. If the value of imports is higher than that of exports, it can threaten the Indonesian economy, especially local businesses.
For instance, recent imports of rice appear to have been excessive. This has caused the decision to exterminate tons of local rice products just to maintain the market price. When the trade balance is out of control, inflation can act like a time bomb for the Indonesian economy.

Article history: Received 9 December 2019; Revised 13 December 2019; Accepted 13 December 2019; Published online 23 December 2019

Abstract: The development of Indonesia's imports fluctuates over the years. Inability to anticipate such rapid changes can cause an economic slump due to inappropriate policy. For instance, recent years' rice imports led to the extermination of rice reserves in order to maintain the market price of rice in Indonesia. To anticipate these changes, forecasting the amount of imports should assist the government in determining the optimum policy. This can be done by utilizing an algorithm that forecasts time series data, in this case the amount of imports in the next few months, with a high degree of accuracy. This study uses data obtained from the official website of the Indonesian Ministry of Trade. The Seasonal Autoregressive Integrated Moving Average (SARIMA) method is then applied to forecast the imports. This method is suitable for interconnected dependent variables, as well as for forecasting seasonal data patterns. The results of the experiment showed that the 6-period forecast is the most accurate compared with forecasting over 16 and 24 periods. The research resulted in the best model, ARIMA(0, 1, 3)(0, 1, 1)12, which produces forecasts with a MAPE value of 7.210%, or an accuracy rate of 92.790%. By applying this import forecasting model, the government can make forward strategic plans, such as selectively importing products and carefully deciding the amount of incoming products to Indonesia. Hence, it could maintain or improve the economic condition so that local businesses can grow confidently.
Keywords: import dataset; forecasting model; limited period; SARIMA; MAPE

H.A. Rosyid et al. / Knowledge Engineering and Data Science 2019, 2 (2): 90–100

These inconsistent import developments can be anticipated by forecasting imports in future periods. With the assistance of a forecasting method, the forecast results can be used by the government as material for consideration in taking new policies or steps to reduce import spending so that the Indonesian economy improves. The main thing to note in forecasting is the level of accuracy of the methods used. Several studies have been conducted to forecast imports, for instance forecasting iron ore imports and consumption of China using a grey model optimized by the particle swarm optimization algorithm [1]. This research concluded that the proposed hybrid model performs better than a single method such as basic GM(1,1), PSO-GM(1,1), or rolling GM(1,1). The PSO-rolling GM(1,1) approach to modeling iron ore imports and consumption in China is both reliable and efficient; the prediction accuracies of the proposed model for imports and consumption reached 3.2% and 2.3%, respectively. Research on forecasting using the ARIMA method includes identifying an appropriate forecasting model for the total imports of Bangladesh [2]. That research produced the best model, ARIMA(0,1,1)(1,0,0)12, with an MSE value of 15747374 and a MAPE value of 22.97802%. Another study forecast international tourism demand in Malaysia using a Box-Jenkins SARIMA application [3]; it produced the best model, ARIMA(1,0,1), with an RMSE of 0.2914, an MAE of 0.2075, and a MAPE of 1.4319%. SARIMA is a development of ARIMA models for data with seasonal patterns.
ARIMA is a forecasting model that fully ignores independent variables and uses dependent variables whose data are interconnected. The advantages of the ARIMA method are that it produces highly accurate short-term forecasts, is flexible and can represent a wide range of time series characteristics occurring in the short term, and can analyze random, trending, and seasonal data. For data with seasonal patterns, such as Indonesian import data, the appropriate method is the Seasonal Autoregressive Integrated Moving Average (SARIMA). Based on the problem in import trade, this research uses the SARIMA method to forecast Indonesian imports. The SARIMA method was chosen because it is capable of forecasting time series data and generates high accuracy for short-term forecasting. Using the SARIMA method is thus expected to produce good forecasts and become a step toward the development of innovation and the establishment of a strategic plan in determining policies to reduce import spending.

II. Materials and Methods
The research is divided into 5 main phases (shown in Figure 1), namely data collection, preprocessing, model candidate determination, model assessment and evaluation, and best model determination.

A. Data Collection
The dataset used in this research is sourced from the official website kemendag.go.id. The website of the Ministry of Trade of the Republic of Indonesia (Kemendag) provides various information about trading in Indonesia, such as the development of exports and imports, the trade balance, foreign exchange rates against the rupiah, inflation, and other trading activities. The dataset contains 211 monthly records of Indonesia's imports from January 2002 until July 2019. The dataset has 5 attributes: year, total, consumption goods, raw material support, and capital goods.

Fig. 1. Research design

B.
Preprocessing
1) Attribute Removal
Attribute removal is a trivial process of eliminating unused attributes before the forecasting process. The original data consist of time-series records of imports in Indonesia. Since forecasting imports only requires the time attribute as the independent variable (x-axis) and the import amount as the target output (y-axis), the remaining attributes were removed from the dataset. The removal was done manually via a spreadsheet application. In addition, the resulting dataset was converted to dd-mm-yyyy format, and the order of the dataset was reversed into increasing time order.

2) Stationarity Test
Stationarity of data means that the statistical properties of the time series do not change over time. This does not mean the series is constant: a linear function, for example, changes constantly as time progresses, yet it has a constant slope, a value representing its rate of change. Time series with seasonal occasions or trends are therefore not stationary. In contrast, a stationary time series contains no predictable long-term patterns; wherever one observes it, the values are relatively the same. A stationarity test is performed to determine whether the data is stationary or not [4]. Stationarity tests can be performed in two ways. The first is by viewing the graph of the dataset: if the graph fits a straight line, or the average of the chart is close to zero, then the dataset is already stationary. The second is to inspect the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the dataset. The ACF plot measures the correlation between the time series data and its time lags; the PACF plot measures the correlation between a variable and a time lag.
If the ACF and PACF plots display a clear change in value between lags, in the form of a cut off or a dies down pattern, then the dataset is stationary.

3) Differencing
Differencing is a technique to make time-series data stationary, a requirement of the SARIMA model; it therefore only applies to non-stationary time-series data. Differencing removes the dependency of the series on time, including structures like trends and seasonality, since a non-stationary time series is not suitable for forecasting. Differencing is done by calculating the change, or difference, between subsequent observations. The differenced values are then checked for stationarity; if they are still non-stationary, the process is repeated. Equation (1) gives the first difference between Yt and Yt−1 [5]:

Y′t = Yt − Yt−1  (1)

Higher-order differences are calculated in the same way. For example, the second difference (d = 2) expands to include the second lag of the series, as in (2) [5]:

Y″t = Y′t − Y′t−1 = Yt − 2Yt−1 + Yt−2  (2)

The number of differencing steps performed determines the order of the coefficient d, which is then used together with the candidate autoregressive (AR) and moving average (MA) components to determine candidate models.

C. SARIMA
The Seasonal Autoregressive Integrated Moving Average (SARIMA) is a development of the ARIMA model for data with seasonal patterns. Seasonal patterns are patterns that repeat every season, such as weekly, monthly, quarterly, or yearly. ARIMA is a method developed by George Box and Gwilym Jenkins in 1970 and is commonly referred to as the Box-Jenkins method [6][7]. ARIMA is one of the models used in time-series forecasting, and its accuracy is well recognized for short-term forecasting.
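Equations (1) and (2) correspond directly to NumPy's diff; a minimal sketch on toy data (not the actual import series):

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])  # toy non-stationary series

# First difference, equation (1): Y't = Yt - Yt-1
d1 = np.diff(y, n=1)
print(d1)  # [2. 3. 4. 5.]

# Second difference, equation (2): Y''t = Yt - 2*Yt-1 + Yt-2
d2 = np.diff(y, n=2)
print(d2)  # [1. 1. 1.]

# A seasonal difference at lag s (for the D order) would be y[s:] - y[:-s].
```

Note how the upward trend in `y` disappears after differencing, which is exactly why differencing is used to reach stationarity.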
ARIMA is a forecasting model that ignores independent variables entirely and uses only the dependent variable, assuming the data are interconnected and satisfy assumptions such as autocorrelation, trend, or seasonality [8]. ARIMA uses its own past values to produce accurate short-term forecasts. In the SARIMA model (p,d,q)(P,D,Q)s, the parameters p and P indicate the non-seasonal and seasonal AR orders, q and Q indicate the non-seasonal and seasonal MA orders, and d and D indicate the differencing process on the non-seasonal and seasonal parts, respectively [9].

H.A. Rosyid et al. / Knowledge Engineering and Data Science 2019, 2 (2): 90–100

The ARIMA family is divided into four groups, namely AR, MA, ARMA, and ARIMA. Fitting a SARIMA model to data involves a recurring four-step cycle: (a) identification of the SARIMA structure (p,d,q)(P,D,Q); (b) estimation of the unknown parameters; (c) diagnostic tests on the residual estimates; and (d) forecasting future results based on known data [10]. The SARIMA method applies to data that repeat a pattern within a fixed period of time. Since seasonal patterns are present, the mathematical model used is ARIMA(p,d,q)(P,D,Q)s with the formulation in (3) [11].

Φ_P(B^s) φ_p(B) (1 − B)^d (1 − B^s)^D X_t = θ_q(B) Θ_Q(B^s) e_t (3)

D. Model candidate determination

The SARIMA model in this study has three orders p, d, and q for the non-seasonal part, three orders P, D, and Q for the seasonal part, and the order s for the frequency of the data. Candidate SARIMA orders are determined by analyzing the plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF). The ACF plot measures the correlation between the time-series data and its time lags, and is used to indicate the autoregressive (AR) orders (p, P).
The PACF plot measures the correlation between variables and a time lag after removing the linear dependence of the shorter lags, and is used to indicate the moving average (MA) orders (q, Q). The orders (d, D) are determined by the number of differencing passes performed to make the data stationary, while the order s is determined by the frequency of the data used: weekly, monthly, yearly, and so on. The order values can be read from the ACF and PACF plots through the dies down and cut off patterns. A dies down pattern occurs when the values decrease toward 0 slowly, while a cut off pattern occurs when the values drop drastically toward 0 at the initial lags. The conditions for determining order values from the ACF and PACF plots are summarized in Table 1.

After plotting the ACF and PACF of each dataset, a white noise test is performed with the Ljung-Box statistic to determine whether residual autocorrelation remains between lags. If the resulting p-value is greater than α = 0.05, the value meets the criteria of the white noise test. The Ljung-Box statistic is as follows [12].

Q = n(n + 2) Σ_{k=1}^{m} r_k² / (n − k) (4)

The candidate models that pass the white noise test are then compared using Akaike's information criterion (AIC), where L is the maximized likelihood and V is the number of model parameters [13].

AIC = −2 log L + 2V (5)

The candidate with the smallest AIC value is selected as the SARIMA model for the forecasting process.

E. Testing model for prediction and evaluation

After obtaining the SARIMA model candidates, the next step is to test each model. The testing process is divided into two stages: forecasting and evaluation. The models built in this experiment forecast imports over different horizons: 6 (six) periods, 12 periods, and 24 periods.
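The Ljung-Box statistic in (4) can be computed directly from the residual autocorrelations r_k. This is a hypothetical sketch; the study obtains the p-value with R's Box.test(), and the chi-square step that converts Q into a p-value is omitted here.

```python
def ljung_box_q(residuals, m):
    # Q = n(n + 2) * sum_{k=1..m} r_k^2 / (n - k), per (4).
    n = len(residuals)
    mean = sum(residuals) / n
    c0 = sum((x - mean) ** 2 for x in residuals)
    q = 0.0
    for k in range(1, m + 1):
        r_k = sum((residuals[t] - mean) * (residuals[t - k] - mean)
                  for t in range(k, n)) / c0
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q

# Strongly autocorrelated residuals give a large Q (hence a small
# p-value), failing the white noise criterion.
print(ljung_box_q([float(t) for t in range(50)], 5))
```

The p-value then comes from comparing Q against a chi-square distribution; p-values above α = 0.05 satisfy the white noise assumption.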
Then, the evaluation calculates the error rate of each forecasting model using the mean absolute percentage error (MAPE) [14].

Table 1. ACF and PACF plot criteria

Model | ACF trend | PACF trend
AR(p) | Decreases exponentially | Drastically decreased at a certain lag
MA(q) | Drastically decreased at a certain lag | Decreases exponentially
ARMA(p,q) | Decreases exponentially | Decreases exponentially

MAPE is an alternative method used to measure the accuracy of a forecasting model as a percentage. It is the average of the percentage errors between the actual data and the forecast data; a low MAPE value indicates forecasts close to the actual values. The MAPE formula is shown in (6) [15], where At is the actual value, Ft is the forecast value, and n is the number of observations.

MAPE = (1/n) Σ_{t=1}^{n} |At − Ft| / At × 100 % (6)

Testing was conducted using a trial and error method: the import dataset of 211 records was divided into a training dataset and a testing dataset (shown in Table 2).

F. Best model determination

The MAPE scores of all forecasting model candidates act as the selection criteria for the forecasting model. The best model is the one with the lowest MAPE score (error rate), i.e., the highest accuracy.

III. Results and Discussions

A. Preprocessing

1) Attribute removal

At this stage, only two of the five attributes are used: the year and the total, while the consumption goods, raw material support, and capital goods attributes are dismissed from the forecasting process. The removal itself is trivial, but the choice of attributes was based on the forecasting target: the amount of yearly imports. The year attribute was reordered to be consistent with the time-series x-axis and reformatted from mm-yyyy to dd-mm-yyyy. A peek at the final import dataset can be seen in Table 3.
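The MAPE defined in (6) above can be sketched in a few lines. The experiments themselves use the MAPE() function of the MLmetrics package in R; the values below are made-up toy numbers, not data from the study.

```python
def mape(actual, forecast):
    # MAPE = (100/n) * sum(|A_t - F_t| / A_t), per (6).
    n = len(actual)
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / n

# Toy example: errors of 10%, 10%, and 0% average to about 6.67%.
print(round(mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0]), 2))  # 6.67
```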
2) Stationarity test

A stationarity test can be done in two ways: first by looking at the plot of the original data, and second by viewing the ACF plot of the data. Figure 2 indicates that the data are not stationary, because the graph does not fit a straight line. From the image, the ACF also shows values that exceed the significance line at the initial lags and decrease very slowly. From both tests, it can be concluded that the data are not yet stationary.

Table 3. Import dataset after the attribute removal process

Date | Value
1/1/2002 | 2,087.90
1/2/2002 | 2,182.30
1/3/2002 | 2,362.71
1/4/2002 | 2,382.90
1/5/2002 | 2,498.09
1/6/2002 | 2,438.90
1/7/2002 | 2,646.30
1/8/2002 | 2,823.70
1/9/2002 | 2,860.20
1/10/2002 | 3,104.80
1/11/2002 | 2,955.90
1/12/2002 | 2,945.20

Table 2. Data distribution scenario

Period | Training data | Testing data
6 | January 2002 to January 2019 (205 data) | February 2019 to July 2019 (6 data)
12 | January 2002 to July 2018 (199 data) | August 2018 to July 2019 (12 data)
24 | January 2002 to July 2017 (187 data) | August 2017 to July 2019 (24 data)

3) Differencing

The next step is to difference the data using the diff() function of the timeSeries package in R. The differencing process was done once, and Figure 3 shows the resulting ACF plot: the import dataset is now in a stationary form. The ACF plot indicates significant changes, shown by the boxed values, but a seasonal pattern repeats. On the ACF plot, the seasonal pattern recurs at increments of about 12, so the value s = 12 is used. The seasonal lag was then differenced again to determine the candidate orders of the seasonal pattern. From the data graph and the ACF/PACF plots in Figure 4, the data have been changed to stationary: the data graph shows a flat line around the value 0.
This shows that the data are stationary, and the ACF plot has also undergone significant changes without exceeding the line limit.

B. Model candidate determination

The next stage is the determination of the candidate orders p, q, P, and Q by observing the ACF and PACF plots in Figure 4. Candidate orders for the non-seasonal pattern are determined by looking at the initial lags (lag 1, 2, 3, and so on), while candidates for the seasonal pattern are read at lags 12, 24, and 36. Both the seasonal and non-seasonal data needed only one differencing pass, so d = D = 1. Based on the ACF/PACF plots for the non-seasonal pattern, cut offs occur at lags 1, 2, and 10, and the PACF plot shows a dies down pattern. Meanwhile, the ACF plot for the seasonal pattern shows no lag exceeding the line, and on the PACF plot there is a line exceeding the limit at the 12th lag. From these results, the candidates for the orders p and q in the non-seasonal part are 1, 2, and 3, while the candidate for the orders P and Q in the seasonal part is 1.

Fig. 2. Graph and ACF plot of the import dataset

Fig. 3. ACF plot after differencing the import dataset

From the candidate orders p, d, q, P, D, Q, and s, the combinations produce forecasting model candidates in the form (1,1,0)(1,1,0)12, (2,1,0)(1,1,0)12, (3,1,0)(1,1,0)12, (1,1,0)(0,1,1)12, (2,1,0)(0,1,1)12, (3,1,0)(0,1,1)12, (0,1,1)(1,1,0)12, (0,1,2)(1,1,0)12, (0,1,3)(1,1,0)12, (0,1,1)(0,1,1)12, (0,1,2)(0,1,1)12, and (0,1,3)(0,1,1)12.
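The twelve candidates above follow a simple pattern: a pure-AR or pure-MA non-seasonal part of order 1 to 3 (with d = 1), crossed with a seasonal part of (1,1,0) or (0,1,1) at s = 12. The enumeration can be sketched as follows; the variable names are illustrative.

```python
from itertools import product

# Non-seasonal candidates: (p,1,0) for p = 1..3 and (0,1,q) for q = 1..3.
non_seasonal = [(p, 1, 0) for p in (1, 2, 3)] + [(0, 1, q) for q in (1, 2, 3)]
# Seasonal candidates with D = 1 at period s = 12.
seasonal = [(1, 1, 0), (0, 1, 1)]

candidates = [(ns, se, 12) for ns, se in product(non_seasonal, seasonal)]
print(len(candidates))  # 12
```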
1) White noise test

The white noise assumption is met when the p-value resulting from the Ljung-Box test is greater than α = 0.05 [16]. The p-values for the import dataset can be seen in Table 4. All candidates have a p-value that exceeds α = 0.05, so the white noise assumption is fulfilled. The Ljung-Box p-values were displayed using the Box.test() function in the RStudio application.

2) Akaike's information criterion (AIC)

The best models are those with the smallest AIC value among all existing candidates [17]. A comparison of the values for each candidate model can be seen in Table 5. From both stages of selection, it can be concluded that the best ARIMA model is ARIMA(0,1,3)(0,1,1)12, because it has the smallest AIC value.

Table 4. Ljung-Box p-values of the ARIMA model candidates on the import dataset

ARIMA model | Ljung-Box
(1,1,0)(1,1,0)12 | 0.3167
(2,1,0)(1,1,0)12 | 0.7164
(3,1,0)(1,1,0)12 | 0.5898
(1,1,0)(0,1,1)12 | 0.1658
(2,1,0)(0,1,1)12 | 0.6858
(3,1,0)(0,1,1)12 | 0.6807
(0,1,1)(1,1,0)12 | 0.09715
(0,1,2)(1,1,0)12 | 0.4701
(0,1,3)(1,1,0)12 | 0.9819
(0,1,1)(0,1,1)12 | 0.07206
(0,1,2)(0,1,1)12 | 0.5218
(0,1,3)(0,1,1)12 | 0.9662

Fig. 4. Data graph and ACF/PACF plots of the import dataset after differencing

C. Testing model for forecasting

Testing was conducted with the training and test data specified in Table 2. The testing process is divided into two parts: forecasting, and calculating the error rate of the forecasting results. Forecasting is conducted to obtain import forecasts over several horizons; the error rate of each forecast is then calculated using MAPE, computed with (6) and assisted by the MAPE() function of the MLmetrics package in RStudio. The first testing phase is testing the models for forecasting the import results.
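Picking the smallest-AIC candidate, as described above, reduces to a minimum over the candidates' scores. A sketch using three of the AIC values from Table 5 for illustration:

```python
# A subset of Table 5: AIC per candidate model.
aic = {
    "(1,1,0)(1,1,0)12": 3274.87,
    "(0,1,2)(0,1,1)12": 3256.99,
    "(0,1,3)(0,1,1)12": 3255.22,
}

# The best model minimizes AIC.
best = min(aic, key=aic.get)
print(best)  # (0,1,3)(0,1,1)12
```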
Sample testing was done on the model ARIMA(1,1,0)(1,1,0)12. The function used to perform the fitting is arima(x, order = c(p,d,q), seasonal = c(P,D,Q)), and the results can be seen in Figure 5. The model generates two AR coefficient values and one seasonal AR coefficient value; these coefficient values are then used for forecasting the subsequent periods using (7). The forecasting results of model ARIMA(1,1,0)(1,1,0)12 for the 6 future periods can be seen in Figure 6. Forecasting results are displayed with the forecast() function, whose output is the forecast value together with its lower and upper limits. The error rate of the forecasts against the actual values is then calculated using MAPE; the calculation uses the MAPE() function, and the result can be seen in Figure 7. The same testing and evaluation methods were applied to the other model candidates.

Table 5. AIC values of the ARIMA model candidates on the import dataset

ARIMA model | AIC
(1,1,0)(1,1,0)12 | 3274.87
(2,1,0)(1,1,0)12 | 3272.56
(3,1,0)(1,1,0)12 | 3268.29
(1,1,0)(0,1,1)12 | 3271.9
(2,1,0)(0,1,1)12 | 3263.48
(3,1,0)(0,1,1)12 | 3262.28
(0,1,1)(1,1,0)12 | 3282.45
(0,1,2)(1,1,0)12 | 3262.51
(0,1,3)(1,1,0)12 | 3260.28
(0,1,1)(0,1,1)12 | 3276.74
(0,1,2)(0,1,1)12 | 3256.99
(0,1,3)(0,1,1)12 | 3255.22

Fig. 5. ARIMA(1,1,0)(2,1,0)12 coefficient values

Fig. 6. ARIMA(1,1,0)(1,1,0)12 forecasting results

D. Evaluation of forecasting results

After testing each model for prediction and obtaining the predicted results for each period, the next step is to calculate the prediction error rate using MAPE.
The final result obtained for each model can be seen in Table 6 for the 6-period forecast, Table 7 for the 12-period forecast, and Table 8 for the 24-period forecast. Forecasting the import dataset over 6 periods resulted in ARIMA(0,1,3)(0,1,1)12 as the best model, with a MAPE value of 7.516 % or an accuracy rate of 92.484 %. Forecasting over 12 periods resulted in a different model from the 6-period case: two best models, ARIMA(1,1,0)(0,1,1)12 and ARIMA(2,1,0)(0,1,1)12, which produce the same MAPE value of 16.029 %, or an accuracy rate of 83.971 %. Forecasting over 24 periods again resulted in a different model from the 6- and 12-period cases: ARIMA(0,1,3)(1,1,0)12, with a MAPE value of 9.526 %, or an accuracy rate of 90.474 %. From the test results, increasing the number of forecast periods increases the MAPE value compared with the short term, as shown in Table 9. This proves that the SARIMA method can perform short-term forecasting with a high degree of accuracy.

Table 6. MAPE values on the import dataset for 6-period forecasting

ARIMA model | MAPE
(1,1,0)(1,1,0)12 | 10.301
(2,1,0)(1,1,0)12 | 11.688
(3,1,0)(1,1,0)12 | 8.759
(1,1,0)(0,1,1)12 | 11.973
(2,1,0)(0,1,1)12 | 11.973
(3,1,0)(0,1,1)12 | 9.279
(0,1,1)(1,1,0)12 | 12.913
(0,1,2)(1,1,0)12 | 8.847
(0,1,3)(1,1,0)12 | 8.807
(0,1,1)(0,1,1)12 | 14.038
(0,1,2)(0,1,1)12 | 8.307
(0,1,3)(0,1,1)12 | 7.516

Table 7. MAPE values on the import dataset for 12-period forecasting

ARIMA model | MAPE
(1,1,0)(1,1,0)12 | 23.530
(2,1,0)(1,1,0)12 | 22.658
(3,1,0)(1,1,0)12 | 22.085
(1,1,0)(0,1,1)12 | 16.029
(2,1,0)(0,1,1)12 | 16.029
(3,1,0)(0,1,1)12 | 17.452
(0,1,1)(1,1,0)12 | 24.445
(0,1,2)(1,1,0)12 | 22.360
(0,1,3)(1,1,0)12 | 23.374
(0,1,1)(0,1,1)12 | 19.267
(0,1,2)(0,1,1)12 | 18.445
(0,1,3)(0,1,1)12 | 19.875

Fig. 7. ARIMA(1,1,0)(2,1,0)12 MAPE value
E. Discussions

Testing the 12 model candidates to forecast Indonesia's imports over 6 periods, 12 periods, and 24 periods produced interesting error rates. The 6-period forecast model is the best one, with the smallest MAPE of 7.210 %, achieved by the ARIMA(0,1,3)(0,1,1)12 model. Interestingly, the 12-period forecast models (ARIMA(1,1,0)(0,1,1)12 and ARIMA(2,1,0)(0,1,1)12) have MAPE values much larger than those of the shorter and longer forecast horizons. The 12-period forecast experienced a greater increase in MAPE because the dataset has high values in its most recent observations. From these experiments, SARIMA is superior for short-term forecasting; this result is consistent with previous research [18] stating that having more periods available from the dataset leads to higher forecasting accuracy. Forecasting imports over only 6 periods (months) therefore left more data for fitting and produced better accuracy.

IV. Conclusion

This research produced a forecasting model for Indonesia's imports. In the experiments with months as the forecast period, the best result was obtained when forecasting imports 6 periods ahead. The best model for forecasting imports is ARIMA(0,1,3)(0,1,1)12, because it produces the smallest MAPE and AIC values: a MAPE of 7.210 %, i.e., an accuracy of 92.79 %, and an AIC of 3255.22. This research also showed that forecasting with the SARIMA method is best used for short-term future trends of Indonesia's imports. In this regard, the 6-period forecast should make the government more aware in its development planning and better prepared with contingency planning in import policy, so that the strategic plan to improve local businesses can be accommodated by effective yet efficient imports in a supporting role.
In this research, the forecasting model development applied a hold-out validation method in which the test set was the time series of the last period; hence, it may not be the most generic method. There is therefore an open challenge to improve this research by applying a cross validation method, which is expected to yield a more generic forecasting model.

Table 8. MAPE values on the import dataset for 24-period forecasting

ARIMA model | MAPE
(1,1,0)(1,1,0)12 | 10.035
(2,1,0)(1,1,0)12 | 9.982
(3,1,0)(1,1,0)12 | 9.930
(1,1,0)(0,1,1)12 | 15.322
(2,1,0)(0,1,1)12 | 15.322
(3,1,0)(0,1,1)12 | 15.175
(0,1,1)(1,1,0)12 | 10.376
(0,1,2)(1,1,0)12 | 9.663
(0,1,3)(1,1,0)12 | 9.526
(0,1,1)(0,1,1)12 | 13.579
(0,1,2)(0,1,1)12 | 15.383
(0,1,3)(0,1,1)12 | 15.236

Table 9. MAPE result comparison

Period | MAPE
6 periods | 7.210 %
12 periods | 16.029 %
24 periods | 9.526 %

Acknowledgement

We thank everyone who contributed to the completion of this paper in one way or another. First of all, we thank God for the ability to do this work. We are also very grateful to our informants; their identities cannot be published, but we recognize and appreciate their support. We are likewise grateful to our fellow students, whose struggles and constructive criticism accompanied the search for new ideas. Lastly, we would like to thank PUI Disruptive Learning Innovation, Universitas Negeri Malang, for the intensive support and guidance that allowed this research to run well.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.

References

[1] W. Ma, X. Zhu, and M. Wang, "Forecasting iron ore import and consumption of China using grey model optimized by particle swarm optimization algorithm," Resour. Policy, vol. 38, no. 4, pp. 613–620, 2013.
[2] T. Khan, "Identifying an appropriate forecasting model for forecasting total import of Bangladesh," Int. J. Trade, Econ. Financ., vol. 2, no. 3, pp. 242–246, 2011.
[3] Y. Ibrahim, Nanthakumar, and Loganathan, "Forecasting international tourism demand in Malaysia using Box-Jenkins SARIMA application," South Asian J. Tour. Herit., vol. 3, no. 2, pp. 50–60, 2010.
[4] T. S. Rao and M. M. Gabr, "A test for linearity of stationary time series," J. Time Ser. Anal., vol. 1, no. 2, pp. 145–158, 1980.
[5] R. J. Hyndman, Forecasting: Principles & Practice, 2014.
[6] E. B. Dagum, The X-11-ARIMA Seasonal Adjustment Method. Ottawa: Statistics Canada, 1980.
[7] G. Box, "Box and Jenkins: time series analysis, forecasting and control," in A Very British Affair, pp. 161–215. London: Palgrave Macmillan, 2013.
[8] A. Qonita, A. G. Pertiwi, and T. Widiyaningtyas, "Prediction of rupiah against US dollar by using ARIMA," Int. Conf. Electr. Eng. Comput. Sci. Informatics, vol. 4, pp. 746–750, 2017.
[9] K. K. Sumer, O. Goktas, and A. Hepsag, "The application of seasonal latent variable in forecasting electricity demand as an alternative method," Energy Policy, vol. 37, no. 4, pp. 1317–1322, 2009.
[10] K. Y. Chen and C. H. Wang, "A hybrid SARIMA and support vector machines in forecasting the production values of the machinery industry in Taiwan," Expert Syst. Appl., vol. 32, no. 1, pp. 254–264, 2007.
[11] F. M. Tseng and G. H. Tzeng, "A fuzzy seasonal ARIMA model for forecasting," Fuzzy Sets Syst., vol. 126, no. 3, pp. 367–376, 2002.
[12] W. W. S.
Wei, Time Series Analysis: Univariate and Multivariate Methods, 2nd ed. New York: Pearson Addison Wesley, 2006.
[13] E. J. Wagenmakers and S. Farrell, "AIC model selection using Akaike weights," Psychon. Bull. Rev., vol. 11, no. 1, pp. 192–196, 2004.
[14] M. V. Shcherbakov, A. Brebels, N. L. Shcherbakova, A. P. Tyukov, T. A. Janovsky, and V. A. E. Kamaev, "A survey of forecast error measures," World Appl. Sci. J., vol. 24, no. 24, pp. 171–176, 2013.
[15] A. de Myttenaere et al., "Mean absolute percentage error for regression models," Neurocomputing, vol. 192, pp. 38–48, 2016.
[16] R. Serra and A. C. Rodríguez, "The Ljung-Box test as a performance indicator for VIRCs," Int. Symp. Electromagnetic Compatibility (EMC Europe), IEEE, pp. 1–6, 2012.
[17] T. W. Arnold, "Uninformative parameters and model selection using Akaike's information criterion," J. Wildl. Manage., vol. 74, no. 6, pp. 1175–1178, 2010.
[18] T. Widiyaningtyas, Muladi, and A. Qonita, "Use of ARIMA method to predict the number of train passengers in Malang city," Proc. 2019 Int. Conf. Artif. Intell. Inf. Technol. (ICAIIT), pp. 359–364, 2019.

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 3, No 1, July 2020, pp. 50–59 eISSN 2597-4637
https://doi.org/10.17977/um018v3i12020p50-59
©2020 Knowledge Engineering and Data Science | W : http://journal2.um.ac.id/index.php/keds | E : keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Opinion Analysis for Emotional Classification on Emoji Tweets Using the Naïve Bayes Algorithm

Siti Sendari a, 1, *, Ilham Ari Elbaith Zaeni a, 2, Dian Candra Lestari a, 3, Hanny Prasetya Hariyadi b, 4

a Department of Electrical Engineering, Universitas Negeri Malang, Jl. Semarang No.
5 Malang 65145, Indonesia
b Graduate School of Information, Production and Systems (IPS), Waseda University, 1-104 Totsukamachi, Shinjuku City, Tokyo 169-8050, Japan
1 siti.sendari.ft@um.ac.id *; 2 ilham.ari.ft@um.ac.id; 3 dnivers@gmail.com; 4 prasetya.hanny@fuji.waseda.jp
* corresponding author

I. Introduction

Social media are tools used to interact or communicate digitally, accessible whenever connected to the internet. Examples include Twitter, Facebook, and Instagram. Twitter is an open social media platform, so developers can do research and development on it [1][2]. Quoted from the online media eBizMBA, Twitter was ranked 4th with 375 million active users in the September 2019 period [3]. Twitter has several features, namely tweets, hashtags, and emojis; of these, one feature that is quite interesting to study is the emoji. Emojis are the latest generation of emoticons. Their use emerged in the late 20th century, introduced by Shigetaka Kurita with the aim of beautifying messages. In other words, emojis are graphic symbols included in Unicode, used to express facial expressions or to represent an object as a simple illustration when conveying an idea [4][5]. Using an emoji separately from a message can lead to miscommunication; however, attached within a message, it can maximize understanding between writer and reader [6]. Often, opinions in the form of text suffer from ambiguity of the emotions conveyed, including the emojis contained in them. Sentiment analysis needs to be done to see the tendency of users' opinions on a problem [7]; this affects the psychology of users interacting through social media. In the book The Emoji Code, Vyvyan Evans (a cognitive linguist) states that emojis convey non-verbal language in non-face-to-face interactions [8].
Therefore, emojis play a role in

Article Info

Article history:
Received 28 July 2020
Revised 9 August 2020
Accepted 11 August 2020
Published online 17 August 2020

Abstract

Opinion analysis is a research study needed for social media, since its content can become a trending topic and have a significant impact on social life. One social media platform with a big contribution to cyberspace and the development of information is Twitter. In the Twitter application, users can insert images that represent emotions, facial expressions, or icons. An emoji is a graphic symbol in the form of an image used to express something; with emojis, a text can be read and understood according to its meaning because the image represents it. On these grounds, the researchers studied the classification of tweet content based on emoji use. This study aims to determine the emotions of Twitter users in one period. Every tweet on the Twitter timeline containing both text and emojis is classified into several categories. The algorithm used was naïve Bayes; it calculates the probability of an emoji tweet to classify the text together with its emojis. The classified emotions are grouped into three categories, namely "angry," "joy," and "sad"; the results show that the category "joy" became the emotional trend of Twitter users, with the emoji (x1F60A) dominating. Meanwhile, the accuracy of the algorithm reached 90 % with a 70:30 holdout technique. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: opinion analysis, Twitter, emoji, classification, naïve Bayes

S. Sendari et al.
/ Knowledge Engineering and Data Science 2020, 3 (1): 50–59

representing mood and emotion within communication. Emojis can also add the user's personality to text and generate empathy in the reader, which is important for producing effective communication. Thus, it is essential to find the deeper meaning of emojis within the field of sentiment analysis. Sentiment analysis is a study of opinion mining carried out to obtain information about opinions and emotions on a topic. Emotions can be used as a benchmark of society's happiness and as a consideration in decision making. Emotion detection can be used to check message content so as to minimize misunderstanding between reader and sender [9][10]. An opinion can represent feelings and emotions, so emotion classification is needed to see the emotional tendency of the content implied in the opinion. Emotion classification is a form of sentiment classification that focuses on the emotional meaning of the content. Several aspects of study serve as reference points for this research, namely the naïve Bayes algorithm, the exploration of emojis on Twitter for opinion analysis, and the exploration of emotion classification for emoji users. Relevant examples of research include a study of sentiment classification with emojis using heuristic training [11][12], research on multilingual emoji prediction [13][14], differences in perception when using emojis [15][16][17], and sentiment analysis with emojis [18][19].

II. Methods

This research established a system for classifying emotions based on tweets that contain emojis. The study applied the naïve Bayes classifier (NBC) algorithm for text-based emotion classification, while the emojis contained in a tweet are identified as terms; thus, text and emojis are treated alike as terms.
The experiments were carried out under two conditions, namely tweets with emojis ignored and tweets with emojis processed. The design of the research flow is shown in Figure 1. The processes in developing the system consist of: (1) collecting Twitter data; (2) pre-processing the data to fit the research boundaries; (3) the classification process, to identify emotions and measure the performance of the algorithm on the opinion data; (4) evaluation, by examining the accuracy of the algorithm; and (5) finally, analysis and visualization of the frequency of the texts and emojis that appear most often.

At the Twitter data retrieval stage, the method used was crawling each tweet on the timeline. The crawling process was done using the R language in the R-Studio software, which is flexible and adaptable to other applications [20]. Then, at the pre-processing stage, there are several steps, such as data preparation, cleaning, stopword removal, and stemming [21]. Data preparation selects the collected data to fit the research limitations and ease the workload of the system [22]. It was done manually by selecting only tweets that the users posted themselves (not retweets), tweets containing both text and emojis, and tweets in English. Next, the cleaning stage deletes components of the tweet that are not needed in the classification process [23]; these components include @username, punctuation, and numbers. The stopword removal stage deletes conjunctions and other non-unique words [24]. The last step, stemming, turns every word in a sentence into a base word that matches the dictionary in the stemmer [25].

Fig. 1. Emotional classification research design flow
In the classification process that follows, text and emojis (each considered a term) were counted into word polarities to determine the emotional class [26][27]. After the classification process was complete, the evaluation process was carried out on the naïve Bayes algorithm to see the accuracy of the system in processing opinion data [28]. Finally, the results were analyzed and the term and emoji frequencies visualized [29].

III. Results and Discussions

In the data pre-processing stage, data preparation, case folding, cleaning, stopword removal, and stemming were performed. Pre-processing turns the raw crawled data into data ready for the classification process. Data preparation selected the data: tweets that are not retweets, tweets that contain emojis, and tweets in English. This process was done manually by scanning the data. In this process, the emotion class labels were also assigned manually, verified through an expert review by psychology staff of the Faculty of Psychology Education, Universitas Negeri Malang (State University of Malang). The selection yielded 305 tweets. In this process, the attributes to be used in the research were also selected: unused attributes were deleted and ignored, and supporting attributes for dataset processing were included. Table 1 presents the attributes used in the research. These attributes were chosen because each is a factor needed in the classification of emotions. The text and x1F600–x1F637 attributes were counted to find the probability values that determine the tendency of the emotion classes, while the emoji count attribute is used to see which emojis are used most often.
Then, the no, id, and emoji attributes were used to identify the text and x1f600–x1f637 attributes.

Table 1. Dataset details
No  Attribute name   Data type  Explanation
1   no               numeric    data sequence number
2   id               numeric    tweet identity
3   text             string     tweet
4   emoji            character  emojis (Unicode) in the form of characters
5   x1f600–x1f637    binary     emoji contained in tweets
6   emoji count      numeric    total emojis contained in one tweet

The cleaning process removed punctuation marks, numbers (0 to 9), links (http/https), and usernames (e.g., @user1) from the tweets, because these do not provide informative messages in terms of emotions. Figure 2 shows an example of a tweet reduced in the cleaning process. The process produced the following tweet: "to every sunrise and sunset and everything in between of we're excited for you (beer)." The stopword removal step removed conjunctions and words that are not unique. Tweets were scanned against a database containing conjunctions; if a word in the tweet matched a word in the database, the word was deleted. Applying stopword removal to the example above, "to every sunrise and sunset and everything in between of we're excited for you (beer)" changed into "every sunrise, sunset everything excited you (beer)." The stemming process changed each word into a base word: words containing prefix and suffix affixes were changed to their word stem. The collection of word stems was stored in a database called the Porter stemmer. Every word in the tweet was matched against the Porter stemmer dictionary, so that words containing affixes were changed according to the dictionary. The following is an example of the stemming process: "every sunrise, sunset everything excited you (beer)".
After going through the stemming process, it becomes: "every sunrise, sunset everything excites you (beer)". In calculating the probability using the Naïve Bayes algorithm, a comparison of values is performed: the emotion category with the highest probability value is taken as the dominant category. Probability describes an event whose occurrence can be predicted by looking at the pattern of previous events based on facts; put simply, probability is the chance of an event occurring based on previous events. The program measures emotions based on the text variables and the emoji variables, the latter initialized with hexadecimal codes. From these variables, the probability of the emotion categories (joy, anger, sadness) and the probability of each word given each emotion are obtained, while the variable n denotes the number of words/terms. After the variables are formed, the Naive Bayes algorithm computes the emotional probability based on the text and returns the identity of the resulting emotion. The calculation steps for obtaining these quantities are detailed below. The probability of an event can be written as in (1)

P(A) = n / m    (1)

where P(A) is the probability of an event, n is the number of occurrences of the event, and m is the size of the sample space. Conditional probability concerns an event that occurs given that another event has occurred; more precisely, an event conditioned on another event that affects it. For example, P(B|A) reads as the probability of event B given condition A. At classification time, the algorithm looks for the highest probability value among all the categories tested [30].
The basic Naïve Bayes theorem is described in (2)

Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)    (2)

where Pr(B|A) is the probability of class B given object A, Pr(A|B) is the probability of object A occurring given class B, Pr(B) is the probability of class B occurring in the data, and Pr(A) is the probability of object A. Table 2 lists four sample training tweets with emotional category labels, plus a new tweet whose category is to be determined. Using the probability calculation formulas, the tendency of the new tweet's emotion category is obtained from the probability values. The example calculation is presented as follows. First, the probability of each category is determined from the training data as in (3)

Pr(c_i) = N_{d,i} / N_d    (3)

where Pr(c_i) is the probability of the appearance of class i, N_{d,i} is the amount of data in class i, and N_d is the total amount of training data.

Fig. 2. Example tweets before the cleaning process

To obtain the value of Pr(c_i), the amount of data in class i is divided by the total amount of training data; from the four labeled tweets this gives joy = 2/4, sad = 1/4, and angry = 1/4. Next, the probability of each word in the tweet serving as the testing data was calculated as in (4) [31]:

Pr(w_k|c_i) = (n_k + 1) / Σ_w n_w    (4)

The probability of a word given a category was obtained by taking the number of occurrences of the word in that category (n_k) plus 1, divided by the total number of words in that category (Σ_w n_w). Before doing these calculations, the following totals are needed:
• total words in category joy = 17 words
• total words in category sad = 21 words
• total words in category angry = 7 words
To make the calculation easier to follow, the values are presented in Table 3.
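Equations (3) and (4) can be checked against the running example. The sketch below (Python with exact fractions, ours rather than the authors' tooling) derives the priors and the smoothed word probabilities directly from the four labeled tweets of Table 2:

```python
from fractions import Fraction

# The four labeled training tweets (Table 2, rows 1-4); emojis kept as hex terms.
training = [
    ("exhaust good i work paycheck just collect x1f60a".split(), "joy"),
    ("sun shine i work home can see daylight x1f60d".split(), "joy"),
    ("good day move forward something stuck hand feel exhaust mind many "
     "hour work hard try plate always full x1f629 x1f629 x1f629".split(), "sad"),
    ("just lazy people want remote work x1f603".split(), "angry"),
]

def prior(cls):
    """Eq. (3): number of tweets in the class over total training tweets."""
    return Fraction(sum(1 for _, c in training if c == cls), len(training))

def word_prob(word, cls):
    """Eq. (4): (count of word in class + 1) / (total words in class)."""
    words = [w for ws, c in training if c == cls for w in ws]
    return Fraction(words.count(word) + 1, len(words))

print(prior("joy"), word_prob("work", "joy"))
```

With 17, 21, and 7 words in the joy, sad, and angry classes respectively, `word_prob` reproduces the entries of Table 3, e.g. Pr(work|joy) = (2+1)/17.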
After obtaining the probability values of the words and emojis, the next step was to calculate the probability of the tweet in each category, in order to find the highest value for each category given the tweet. The calculation follows (5)

Pr(c_i|t) = Pr(c_i) × ∏_{k=1..n} Pr(w_k|c_i)    (5)

where Pr(c_i|t) is the category probability given the tweet, Pr(c_i) is the prior probability of category c_i, n is the number of words in the tweet, and Pr(w_k|c_i) is the probability of word w_k given category c_i.

Table 2. Examples of labeled tweets for emotional classification
No  Tweet                                                             Emotional category
1   exhaust good i work paycheck just collect x1f60a                  joy
2   sun shine i work home can see daylight x1f60d                     joy
3   good day move forward something stuck hand feel exhaust mind
    many hour work hard try plate always full x1f629 x1f629 x1f629    sad
4   just lazy people want remote work x1f603                          angry
5   will remote work another company x1f60a                           ?

Table 3. Probability value of word with emoji
Word     Joy        Sad        Angry
will     (0+1)/17   (0+1)/21   (0+1)/7
remote   (0+1)/17   (0+1)/21   (1+1)/7
work     (2+1)/17   (1+1)/21   (1+1)/7
another  (0+1)/17   (0+1)/21   (0+1)/7
company  (0+1)/17   (0+1)/21   (0+1)/7
x1f60a   (1+1)/17   (0+1)/21   (0+1)/7

The Naïve Bayes algorithm is used to find the class with the highest probability value for a tweet. The prior probability of a category, Pr(c_i), was obtained by dividing the number of tweets in that category by the total number of tweets, while the probability of the tweet given a category was computed as the product of the word probabilities.
Pr(joy | will remote work another company x1f60a) = Pr(joy) × Pr(will|joy) × Pr(remote|joy) × Pr(work|joy) × Pr(another|joy) × Pr(company|joy) × Pr(x1f60a|joy) = (2/4) × (1/17) × (1/17) × (3/17) × (1/17) × (1/17) × (2/17) = 1242.9 × 10^-10

Pr(sad | will remote work another company x1f60a) = Pr(sad) × Pr(will|sad) × Pr(remote|sad) × Pr(work|sad) × Pr(another|sad) × Pr(company|sad) × Pr(x1f60a|sad) = (1/4) × (1/21) × (1/21) × (2/21) × (1/21) × (1/21) × (1/21) = 58.3 × 10^-10

Pr(angry | will remote work another company x1f60a) = Pr(angry) × Pr(will|angry) × Pr(remote|angry) × Pr(work|angry) × Pr(another|angry) × Pr(company|angry) × Pr(x1f60a|angry) = (1/4) × (1/7) × (2/7) × (2/7) × (1/7) × (1/7) × (1/7) = 8499.86 × 10^-9

After obtaining the probability values of the tweet, the three categories were compared to find the one with the highest probability value, classifying the tweet into one of the emotional categories joy, sad, or angry. The results of the probability calculation show that the value for angry outperformed the joy and sad categories; thus, tweet number 5 belongs to the angry emotional category.

Holdout evaluation is an evaluation method that divides the data into training and testing data according to a specified percentage. With a 70%:30% holdout, the accuracy of the emotion classification by the system is 90%. To verify and strengthen this result, manual calculations (accuracy, precision, recall, and specificity) were then performed based on the confusion matrix. The training data comprised 214 records and the testing data 91 records; of the testing data, 28 records were prediction errors. This can be illustrated using the confusion matrix shown in Figure 3. Overall accuracy is calculated as

Accuracy = (Σ correct predictions / Σ all data) × 100% = 90.81%

Fig. 3. Confusion matrix of training data
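The three posterior calculations above can be reproduced exactly. In the sketch below (ours, not the authors' code), the priors, per-class word totals, and raw word counts are taken from Tables 2 and 3, and exact fractions avoid any rounding:

```python
from fractions import Fraction

# Priors from the four labeled tweets, and total words per class (17, 21, 7).
priors = {"joy": Fraction(2, 4), "sad": Fraction(1, 4), "angry": Fraction(1, 4)}
totals = {"joy": 17, "sad": 21, "angry": 7}
# Raw counts of each test-tweet word per class (Table 3, before add-one smoothing).
counts = {
    "joy":   {"will": 0, "remote": 0, "work": 2, "another": 0, "company": 0, "x1f60a": 1},
    "sad":   {"will": 0, "remote": 0, "work": 1, "another": 0, "company": 0, "x1f60a": 0},
    "angry": {"will": 0, "remote": 1, "work": 1, "another": 0, "company": 0, "x1f60a": 0},
}

def posterior(cls, words):
    """Eq. (5): prior times the product of smoothed word probabilities."""
    p = priors[cls]
    for w in words:
        p *= Fraction(counts[cls][w] + 1, totals[cls])
    return p

tweet = "will remote work another company x1f60a".split()
scores = {c: posterior(c, tweet) for c in priors}
print(max(scores, key=scores.get))  # the predicted emotional category
```

With exact arithmetic, the angry posterior (1/117649 ≈ 8.5 × 10^-6) clearly dominates the joy and sad posteriors, matching the classification of tweet number 5 as angry.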
The algorithm performance calculation was a test on the emoji tweet data. The performance of the algorithm on the 91 testing records was 69%, obtained from the confusion matrix shown in Figure 4. Testing accuracy is calculated as

Accuracy = (Σ correct predictions / Σ testing data) × 100% = 69%

To see the performance of the Naïve Bayes algorithm for the classification of emotions on tweet data under different percentages of training and testing data, an experiment was conducted using the holdout method with three schemes, namely 70:30, 80:20, and 90:10. The results of the three schemes are as follows. The first scheme (70:30) split the 305 records into 214 training records and 91 testing records. From these data, a Naïve Bayes calculation was made for the classification of emotions. The resulting accuracy calculated over the overall data is 90%, while the accuracy of the Naïve Bayes algorithm based on the testing data is 69%. The second scheme (80:20) divided the 305 records into 244 training records and 61 testing records; the accuracy over the overall data is 93%, while the accuracy based on the testing data is 67%. The last scheme (90:10) divided the 305 records into 275 training records and 30 testing records; the accuracy over the overall data is 95%, while the accuracy based on the testing data is 53%. Based on the three schemes, the results show that the amount of training data affects the level of accuracy.
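The three split sizes can be reproduced with integer arithmetic; the rounding rule in the sketch below is our assumption, since the paper only reports the resulting sizes:

```python
def holdout_sizes(n_records, train_pct):
    """Split a record count into (training, testing) sizes for a train_pct holdout."""
    n_train = (n_records * train_pct + 50) // 100  # round to the nearest record
    return n_train, n_records - n_train

for pct in (70, 80, 90):
    train, test = holdout_sizes(305, pct)
    print(f"{pct}:{100 - pct} -> {train} training, {test} testing")
```

Applied to the 305 tweets, this yields the 214/91, 244/61, and 275/30 splits reported for the three schemes.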
With more training data, the word-with-emoji probabilities are also higher, because the frequency of words with emojis affects the algorithm's calculations. This is evidenced by the comparison of the 70:30, 80:20, and 90:10 schemes, whose overall accuracies are 90%, 93%, and 95%, respectively, while the testing-data accuracies are 69%, 67%, and 53%, respectively. For tweets without emojis, 45 of the testing records were prediction errors; the accuracy over the overall data is 85%, while the performance of the Naïve Bayes algorithm based on the testing data is 50%. The comparison between text tweets and emoji tweets was therefore made on the testing data of the first scheme, 70:30. Comparing the overall-data accuracy of text tweets (85%) with that of emoji tweets (90%), the accuracy increased by 5%; comparing the testing-data accuracy of text tweets (50%) with that of emoji tweets (69%), the increase is 19%.

Fig. 4. Confusion matrix of testing data

Figure 5 shows the result of visualization in the form of a word cloud, where the word "day" is the center of the word cloud [32]. That word has the highest frequency compared to the other words, so "day" dominates. The left (a) is a square-shaped word cloud, and the right (b) is a circular word cloud; both convey the same information, only the word cloud model differs. Figure 6 shows a histogram of the words in order of frequency, highest on the left. The horizontal line (x-axis) identifies the words/terms, and the vertical line (y-axis) shows the frequency of each word on the x-axis. The word "day" has the highest frequency, with a value of 39.
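The frequency counts behind the word cloud and histogram amount to a simple token count. A sketch with a hypothetical token stream (the study's actual 305-tweet token stream is not reproduced here; in the study, "day" had the highest frequency at 39):

```python
from collections import Counter

# Hypothetical token stream after pre-processing; stands in for the real corpus.
tokens = ["day", "work", "day", "sunrise", "day", "work", "home"]

freq = Counter(tokens)  # word -> frequency, as plotted in the histogram
for word, count in freq.most_common(3):
    print(word, count)
```

The `most_common` ordering is exactly the highest-frequency-first ordering of the histogram, and the same counts can be fed to a word cloud renderer for Figure 5-style output.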
IV. Conclusion

In pre-processing, the stemming stage uses the logic of the Porter stemmer, which produces base, single words. However, the stemmer's limited sensitivity and adaptation to words means that some of the resulting word changes are less precise. The probability of a word acquiring a high value depends on the frequency of the word within each emotional category. Testing of the Naïve Bayes algorithm using the holdout method was done by splitting the 305 records into 70% training data and 30% testing data, i.e., 214 training and 91 testing records, which yielded 90% accuracy. Precision was 0.99 for joy, 0.90 for sad, and 0.72 for angry; recall was 0.88 for joy, 0.91 for sad, and 0.98 for angry. Calculating the probability of tweets with emojis increased the testing accuracy by 19%: the testing accuracy for text tweets with emojis is 69%, while for text tweets without emojis it is 50%. This is reflected in the prediction errors: 28 errors on the emoji text data versus 45 errors on the text data without emojis. Future work can implement other classification algorithms to compare classification performance and emoji-handling methods. The stemming pre-processing stage can use other logic to convert words into base, single words that match the actual base words. Punctuation can also affect the emotional state of a text; hence, subsequent studies can be extended to include punctuation marks to see their effect on tweet emotions.

Fig. 5. (a) Square word cloud and (b) circle word cloud
Fig. 6. Word frequency histogram

Acknowledgement

This research was supported by Universitas Negeri Malang and Waseda University.
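The per-class precision and recall figures quoted above come from the confusion matrices; the computation can be sketched generically. The matrix below is hypothetical (the cell values of the paper's Figures 3 and 4 are not reproduced in the text):

```python
def precision_recall(cm, labels):
    """Per-class (precision, recall) from a confusion matrix cm[true][predicted]."""
    out = {}
    for c in labels:
        tp = cm[c][c]
        pred_c = sum(cm[t][c] for t in labels)  # column sum: predicted as c
        true_c = sum(cm[c][p] for p in labels)  # row sum: actually c
        out[c] = (tp / pred_c if pred_c else 0.0,
                  tp / true_c if true_c else 0.0)
    return out

# Hypothetical 3-class confusion matrix: rows are true classes, columns predictions.
cm = {
    "joy":   {"joy": 8, "sad": 1, "angry": 1},
    "sad":   {"joy": 1, "sad": 9, "angry": 0},
    "angry": {"joy": 0, "sad": 0, "angry": 10},
}
print(precision_recall(cm, ["joy", "sad", "angry"]))
```

Feeding the actual testing-data confusion matrix into `precision_recall` would reproduce the per-class precision and recall values reported in this conclusion.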
We thank our colleagues from both institutions who provided insight and expertise that greatly assisted the research, although they may not agree with all of the interpretations/conclusions of this paper. We thank Dr. Aji P. Wibawa for assistance with suggestions on the methodology and for comments that greatly improved the manuscript.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This paper is a part of research supported by a DRPM research grant of the Indonesian government.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.

References

[1] N. Alias, M. S. Sabdan, K. A. Aziz, M. Mohammed, I. S. Hamidon, and N. Jomhari, "Research trends and issues in the studies of Twitter: a content analysis of publications in selected journals (2007–2012)," Procedia Soc. Behav. Sci., vol. 103, pp. 773–780, 2013.
[2] A. Uhl, N. Kolleck, and E. Schiebel, "Twitter data analysis as contribution to strategic foresight - the case of the EU research project 'Foresight and Modelling for European Health Policy and Regulations' (FRESHER)," Eur. J. Futur. Res., vol. 5, no. 1, 2017.
[3] Statista Research Department, "Twitter: number of users in Indonesia 2019 | Statista," 2019. [Online]. Available: https://www.statista.com/statistics/490591/twitter-users-malaysia/.
[4] P. K. Novak, J. Smailović, B. Sluban, and I. Mozetič, "Sentiment of emojis," PLoS One, vol. 10, no. 12, pp. 1–21, 2015.
[5] Y. Tang and K. F. Hew, "Emoticon, emoji, and sticker use in computer-mediated communication: a review of theories and research findings," Int. J. Commun., vol. 13, pp. 2457–2483, 2019.
[6] H. Miller, D. Kluver, J. Thebault-Spieker, L. Terveen, and B. Hecht, "Understanding emoji ambiguity in context: the role of text in emoji-related miscommunication," Proc. 11th Int. Conf. Web Soc.
Media, ICWSM 2017, pp. 152–161, 2017.
[7] D. Bandorski et al., "Contraindications for video capsule endoscopy," World J. Gastroenterol., vol. 22, no. 45, pp. 9898–9908, 2016.
[8] E. Vyvyan, The Emoji Code: The Linguistics Behind Smiley Faces and Scaredy Cats, handbook, 2017.
[9] I. Ileri and P. Karagoz, "Detecting user emotions in Twitter through collective classification," IC3K 2016 - Proc. 8th Int. Jt. Conf. Knowl. Discov. Knowl. Eng. Knowl. Manag., vol. 1, pp. 205–212, 2016.
[10] M. S. Asriadie, M. S. Mubarok, and Adiwijaya, "Classifying emotion in Twitter using Bayesian network," in Journal of Physics: Conference Series, 2018, vol. 971, no. 1.
[11] F. Hallsmar and J. Palm, "Multi-class sentiment classification on Twitter using an emoji training heuristic," pp. 1–27, 2016.
[12] S. Narr, M. Hulfenhaus, and S. Albayrak, "Language-independent Twitter sentiment analysis," Knowl. Discov. Mach. Learn. (KDML), LWA, pp. 12–14, 2012.
[13] F. Barbieri et al., "SemEval 2018 Task 2: multilingual emoji prediction," pp. 24–33, 2018.
[14] H. W. Raj and S. Balachandran, "Future emoji entry prediction using neural networks," Journal of Computer Science, vol. 16, no. 2, pp. 150–157, Feb. 2020.
[15] J. Berengueres and D. Castro, "Differences in emoji sentiment perception between readers and writers," Proc. 2017 IEEE Int. Conf. Big Data, pp. 4321–4328, 2018.
[16] S. Lau, "The effect of smiling on person perception," J. Soc. Psychol., vol. 117, no. 1, pp. 63–67, 1982.
[17] J. Berengueres and D. Castro, "Sentiment perception of readers and writers in emoji use," 2017.
[18] G. Guibon, M. Ochs, and P. Bellot, "From emojis to sentiment analysis," 2016.
[19] S. Ayvaz and M. O. Shiha, "The effects of emoji in sentiment analysis," Int. J. Comput. Electr. Eng., vol. 9, no. 1, pp. 360–369, 2017.
[20] S. Khalil and M. Fakir, "RCrawler: an R package for parallel web crawling and scraping," SoftwareX, vol. 6, pp. 98–106, 2017.
[21] M.
Desai and M. A. Mehta, "Techniques for sentiment analysis of Twitter data: a comprehensive survey," Proc. IEEE Int. Conf. Comput. Commun. Autom. (ICCCA) 2016, pp. 149–154, 2017.
[22] A. S. Raamkumar, M. Erdt, H. Vijayakumar, E. Rasmussen, and Y. L. Theng, "Understanding the Twitter usage of humanities and social sciences academic journals," Proc. Assoc. Inf. Sci. Technol., vol. 55, no. 1, pp. 430–439, 2018.
[23] V. A. and S. S. Sonawane, "Sentiment analysis of Twitter data: a survey of techniques," Int. J. Comput. Appl., vol. 139, no. 11, pp. 5–15, 2016.
[24] J. K. and J. R., "Stop-word removal algorithm and its implementation for Sanskrit language," Int. J. Comput. Appl., vol. 150, no. 2, pp. 15–17, 2016.
[25] M. Adriani, J. Asian, B. Nazief, S. M. M. Tahaghoghi, and H. E. Williams, "Stemming Indonesian," ACM Transactions on Asian Language Information Processing, vol. 6, no. 4, pp. 1–33, Dec. 2007.
[26] H. Pajupuu, R. Altrov, and J. Pajupuu, "Identifying polarity in different text types," Folklore: Electronic Journal of Folklore, vol. 64, pp. 125–142, Jun. 2016.
[27] G. Yurtalan, M. Koyuncu, and Ç. Turhan, "A polarity calculation approach for lexicon-based Turkish sentiment analysis," Turkish J. Electr. Eng. Comput. Sci., vol. 27, no. 2, pp. 1325–1339, 2019.
[28] F. C. Permana, Y. Rosmansyah, and A. S. Abdullah, "Naive Bayes as opinion classifier to evaluate students satisfaction based on student sentiment in Twitter social media," Journal of Physics: Conference Series, vol. 893, p. 012051, Oct. 2017.
[29] E. Hauthal, D. Burghardt, and A. Dunkel, "Analyzing and visualizing emotional reactions expressed by emojis in location-based social media," ISPRS Int. J. Geo-Information, vol. 8, no. 3, 2019.
[30] Li-Guo Duan, D. Peng, and Ai-Ping Li, "A new naive Bayes text classification algorithm," TELKOMNIKA Indonesian Journal of Electrical Engineering, vol. 12, no. 2, Feb. 2014.
[31] M. S. Saputri, R. Mahendra, and M.
Adriani, "Emotion classification on Indonesian Twitter dataset," in International Conference on Asian Language Processing, 2018.
[32] B. Tessem, S. Bjørnestad, W. Chen, and L. Nyre, "Word cloud visualisation of locative information," J. Locat. Based Serv., vol. 9, no. 4, pp. 254–272, 2015.

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 3, No 1, July 2020, pp.
1–10 eISSN 2597-4637 https://doi.org/10.17977/um018v3i12020p1-10
©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Flood Prediction Using Artificial Neural Networks: Empirical Evidence from Mauritius as a Case Study

A. Z. Dhunny a,1,*, R. H. Seebocus b,2, Z. Allam c,3, M. Y. Chuttur d,4, M. Eltahan e,5, H. Mehta a,6

a Cyberange Global Holdings Pte, 68 Circular Road, #02-01, 049422, Singapore
b Department of Physics, Faculty of Science, University of Mauritius, 4th Floor NAC Building, Reduit 80837, Mauritius
c Curtin University Sustainability Policy Institute, Curtin University, Building 209, Level 1, Room 133, Kent St, Bentley WA 6102, Perth, Australia
d Dept. of Software & Information Systems, Faculty of Information, Communication & Digital Tech., University of Mauritius, 2nd Floor Phase II Building, Reduit 80837, Mauritius
e Aerospace Department, Faculty of Engineering, Cairo University, 1 Gamaa Street, Giza 12613, Egypt
1 zaynah.d@cyberange.io*; 2 reenahansaseebocus@gmail.com; 3 zaheerallam@gmail.com; 4 y.chuttur@uom.ac.mu; 5 muhammedsamireltahan@gmail.com; 6 harsh@cyberange.io
* corresponding author

I. Introduction

The average temperature of the Earth is increasing at an alarming rate and is projected to rise by about 1.4 to 5.8 degrees Celsius by the year 2100 [1]. An increase in atmospheric temperature entails the occurrence of many extreme events, such as stronger heat waves, the formation of intense cyclones, unprecedented flash floods, and severe drought events [2], which are set to have a great impact on both the global economy and society.
Among the various natural disasters affecting mankind, flash floods have been reported to cause the most casualties in terms of economic loss, death tolls, and infrastructural damage. Flooding has become a recurrent phenomenon in the recent decade, accounting for about 73% of the damage caused by natural disasters, which in turn results in an overall loss of about $30 billion [3]. Flash floods are thus a global phenomenon affecting major parts of the world [4][5], as illustrated by the year 2018, which was marked by several deadly flash floods in Kerala, France, and Vietnam [6]. In this study, we focus our attention on Mauritius, a small island located in the Indian Ocean off the east coast of Africa and Madagascar. The morphological landscape of Mauritius consists of highlands and coastal regions in a relatively small geographical area of 1865 km², such that it is typical for the island to experience several microclimates on the same day in different regions. Our study is especially motivated by the occurrence of a series of flash floods in Mauritius

Article Info

Article history: received 23 July 2020; revised 26 July 2020; accepted 9 August 2020; published online 17 August 2020

Abstract: Artificial neural networks (ANN) have been well studied for flood prediction. However, there is not enough empirical evidence to generalize ANN applicability to small countries with microclimates prevailing in a small geographical space. In this paper, we focus on the climatic conditions of Mauritius, for which we investigate the accuracy of using ANN to predict flooding using locally collected data from 11 meteorological stations spread across the country. The ANN model for flood prediction presented in this work is trained using 20,000 climate data records, collected over a period of two years for Mauritius. Our input climate features are minimum temperature, maximum temperature, rainfall, and humidity, and our output decision is 'flood' or 'no flood'.
Using ANN, we achieved an accuracy of 98% for flood prediction, and hence we conclude that ANN is indeed a good predictor of flood occurrence, even for regions with predominantly microclimatic conditions. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: flood prediction; artificial neural networks (ANN); microclimate; Mauritius; case study

during past years. The island has been subjected to some major flash flood incidents since the year 2008. These events have become quite recurrent, have caused much infrastructural damage, and, in some unfortunate occurrences, have led to the loss of human lives. For instance, in March 2013, 152 mm of rainfall was experienced in Port Louis, causing massive flooding that claimed the lives of at least 11 people, besides causing unprecedented chaos and a traffic crisis in the city. It is also believed that the recurrent flooding events in several parts of Mauritius can be attributed to the rapid urbanization occurring in various parts of the country, which has resulted in the obstruction of existing water evacuation channel beds for construction purposes [4][5][7]. Given that Mauritius is a developing country with a high likelihood of increased urbanization projects in the future, there is a strong need to develop weather forecasting tools that can accurately predict the occurrence of floods. Weather forecasting is defined as the process of identification and prediction of climatic conditions (e.g., temperature, wind, humidity) to a certain degree of precision.
The results of accurate weather forecasting can then be used to predict correlated conditions such as flood occurrences. Different types of forecasting methods can be used for weather prediction, such as the naïve approach, the judgmental approach, quantitative and qualitative methods, causal or economic forecasting methods, time series methods, and artificial intelligence (AI) methods [8]. Of these, the two most commonly used are statistical methods, which assume linear data, and artificial intelligence, which handles nonlinear data. Statistical methods are not a good alternative, since weather variables exhibit stochastic behavior and are mostly non-linear in nature [9][10]. AI methods, in contrast, have been used extensively in the literature for modeling weather forecasts from non-linear data [11][12]. Artificial intelligence algorithms fall under three main categories: genetic algorithms, neuro-fuzzy logic, and neural networks, and each of these can be used on its own for weather forecasting. For the purpose of this study, we use the neural network category, as it caters for the complex nature of weather, which is defined by several parameters such as temperature, humidity, rainfall amount, cloud cover, wind speed, and associated direction, all of which vary continuously over time [8]. In fact, artificial neural networks have been found to produce results with a higher level of accuracy and precision [9]. We give a brief outline of the different works that have found ANN suitable for weather forecasting. For a detailed overview of the application of ANN to weather forecasting, readers are invited to consult the work of Nayak et al. [13] and Mosavi et al. [14]. Ustaoglu et al.
[15] modeled the maximum and minimum temperature in turkey using three different ann methods, namely feed-forward back propagation, radial basis function, and generalized regression neural networks, together with linear regression, to determine the best model for forecasting. results from the neural network analyses and the linear regression were compared, and all the models studied were found to be good predictors for weather forecasting, with ann performing slightly better than the others. ann was also used to forecast rainfall predictability in the semi-arid khorasan province of iran. the rainfall data used were derived from a digital elevation model [16], taking different climatic variables such as sea surface temperature, sea level pressure, and relative humidity, which are factors responsible for the formation of the active clouds that produce rainfall [17]. here also, results indicated that the ann model provided fairly good accuracy in predicting rainfall. abhishek et al. [18][19] have also built a prototype ann model to forecast different weather variables such as temperature, humidity, wind, and rainfall. they studied the effect of increasing the number of hidden layers on the results generated by the model and deduced that increasing the number of samples and neurons, without exceeding an optimum value, increases the model's precision. narvekar et al. [6] and nayak et al. [13] also performed thorough literature surveys of the different algorithmic techniques that can be used with neural networks for forecasting. their analyses concluded that the multi-layer perceptron network (mlp), bpn, radial basis function network (rbfn), som, and svm are all suitable rainfall predictors. hardwinarto [20] used ann with the back propagation neural network and three different epoch settings.
the mean square error (mse) was used to measure the accuracy of the results, which indicated that the back propagation neural network produced outputs with greater accuracy. a similar analysis was done in [8], reviewing different algorithms, namely the back propagation network, ensemble neural network, artificial neural network, radial basis function network, and general regression neural network. that study concluded, in accordance with the work of [20], that the neural network with the back propagation algorithm produced results with the least error. abhishek et al. [9] further explain that ann's capability is robust even in the case of nonlinear statistics, especially those generated by weather data, where it is able to make predictions with minimal error. abhishek et al. [18] also argue that artificial intelligence (ai) technologies such as ann have numerous advantages over traditional methods of weather prediction. previous findings therefore suggest that ann stands as a good candidate for weather prediction compared to other forecasting techniques. at the same time, there is no empirical evidence to suggest that ann will still achieve a good level of prediction accuracy in areas where microclimates co-exist, as in the case of mauritius. we expect that such evidence will not only add to the literature regarding the suitability of the ann model for weather prediction, here flood occurrence, but will also shed light on the behavior of the ann prediction model when real data from regions experiencing microclimates are taken into consideration. the main goal of this paper is therefore to address the lack of empirical evidence, for small geographical regions experiencing microclimates, that can be used to support the applicability of ai models for accurate flood prediction.
to this end, the following two objectives are set: 1) apply an artificial neural network (ann) to daily climate data (minimum and maximum temperature, rainfall, and humidity) for the small island of mauritius to develop a flood forecasting model, and 2) evaluate the effectiveness of the developed model for flood prediction. the paper is organized as follows: section 2 describes the study area and the ann algorithm; section 3 presents the results and discussion; and conclusions are drawn in section 4.

ii. methodology

the study area for this work is mauritius, a small island forming part of the african small islands developing states (sids) network, situated in the south-west indian ocean at latitude 20.2 degrees south and longitude 57.3 degrees east. the mauritian republic consists of several other islands and about 49 islets surrounding the main island, which form part of mauritian sovereignty (statistics mauritius, 2013). the main island of mauritius has a complex topography with a total surface of about 1,865 km². its orography consists of broken chains of mountain ranges in the western parts, flat lowlands, and a central plateau at an altitude of about 400–500 m, which represents a former caldera [21]. the island has a tropical maritime climate with two seasons, summer (november to late april) and winter (june to september); may and october are considered transition months. the summer season is normally hot and humid, with a mean temperature of 24.7 degrees celsius and a higher probability of rainfall, while the winter season is cool and dry, with a mean temperature of 20.4 degrees celsius favoring very little rainfall. the summer season, especially the months of january to march, accounts for about 40% of the seasonal rainfall amount due to the southward convergence of the itcz towards the subtropical latitudes and to cyclonic activity.
the winter season generates little rainfall, which can be attributed to anticyclones and active trade winds that bring stable atmospheric air [22]. the total annual rainfall for mauritius is approximately 2010 mm. this value is subject to orographic influence and varies from 1400 mm in the eastern coastal lowlands to about 4000 mm on the central plateau and about 800 mm on the western coasts [23]. the precipitation pattern is modulated on an inter-annual time scale (greater than one year) by the influence of large-scale circulation patterns such as enso, the iod, and tropical cyclones (tc). given the landscape morphology of mauritius, it is very common for the country to experience several microclimates in different regions during the same day, making weather prediction, and consequently flood prediction, a challenging task. as seen in previous studies, ann is considered one of the most successful machine learning methods for flood prediction. here, a prediction model was developed using the theano python library to predict flood possibility using neural networks with a tanh activation function. the three basic elements constituting the neural network are the set of connecting links, the activation function, and the bias. a schematic overview of the ann architecture used in this study is shown in figure 1. the three-layer neural network (input, hidden, and output layers) can be represented by the mathematical expression given in equation (1) [24].
$\hat{y}_k = f_0\left[\sum_{j=1}^{m} w_{kj}\, f_h\left(\sum_{i} w_{ji} x_i + w_{jb}\right) + w_{kb}\right], \quad k = 1, \dots, n$ (1)

where ŷ_k is the forecasted kth output value, f_0 is the activation function for the output neurons, n is the number of output neurons, w_kj is the weight connecting the jth neuron in the hidden layer to the kth neuron in the output layer, f_h is the activation function for the hidden neurons, m is the number of hidden neurons, w_ji is the weight connecting the ith neuron in the input layer to the jth neuron in the hidden layer, x_i is the ith input variable, w_jb is the bias for the jth hidden neuron, and w_kb is the bias for the kth output neuron. the input layer represents the data to be analysed, and it is divided into two groups: a training dataset used to estimate the weights and a test dataset used to determine the behavior of the ann model. the input dataset was split as follows: 23 months of data for training and 1 month of data for testing. the input independent variables (input features) of the model were four parameters: rainfall, humidity, minimum temperature, and maximum temperature. these daily data were collected over a time frame of two years, from 1 january 2017 to 30 december 2018, from the mauritius meteorological stations (mms) for 20 stations, as indicated by the regions in figure 2. out of the 20 stations, data from only 11 stations were used, as there were missing data for the other 9. the daily data included the four input features required to train the predictive model. the exact coordinates of each of the 11 mms locations used to collect input data for the ann model are given in table 1. feature scaling was applied to all four input features (rainfall, humidity, minimum temperature, and maximum temperature) to optimize the performance of the ann algorithm. feature scaling is done using mean normalization according to equation (2):

$x' = \dfrac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$ (2)

moreover, to avoid overfitting the ann model, extreme data points were removed from our input dataset.
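as a concrete illustration, the mean normalization of equation (2) can be sketched in a few lines of python; the rainfall values below are hypothetical, not taken from the mms records:

```python
import numpy as np

def mean_normalize(x):
    """Mean normalization as in equation (2):
    (x - mean(x)) / (max(x) - min(x))."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

# hypothetical daily rainfall values in mm (not actual MMS records)
rainfall = [0.0, 2.5, 10.0, 150.0, 4.0]
scaled = mean_normalize(rainfall)
```

each scaled feature is centered on zero with a total range of one, so the four inputs enter the network on comparable scales.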
fig. 1. ann architecture adapted from [10]

table 1. list of the 11 mms used for the study, with exact coordinates

mms                  longitude   latitude
albion               57.408627   -20.206125
baie du cap          57.378771   -20.485812
belle mare           57.777398   -20.199583
m. loisir rouillard  57.684835   -20.12059
p. aux canonniers    57.56176    -20.00641
plaisance (airport)  57.678997   -20.433194
port louis           57.502388   -20.161998
providence           57.621084   -20.249614
quatre bornes        57.478959   -20.267172
rose-belle           57.606683   -20.400288
vacoas               57.495288   -20.291098

the weights calculated from the training dataset are multiplied by each input node value and then sent to the second layer, known as the hidden layer, which can be thought of as an intermediate between the input and the output. the aim of the hidden layer is to capture the underlying complexity in the model by analysing the variations in the data. at the hidden layer, an activation function is applied before passing the data to the output layer, where a second activation function is applied before generating the final output [25]. there are different types of activation functions, such as the log-sigmoid, tan-sigmoid, and pure-linear functions [18][26][27]. for this study, the tanh activation function was used for each neuron within the network. the first hidden layer contained 50 neurons, with 10 neurons in each subsequent hidden layer, and the last layer had two outputs, for a yes (flood predicted) or a no (no flood predicted). once constructed successfully, the ann model should be able to accurately forecast climatic variables, as explained in the work of [18][19]. in this study, we evaluate the accuracy of our ann model in predicting flood occurrence using real-world microclimatic data collected from different regions.

iii. results and discussions

time series for the four input features (rainfall, humidity, maximum temperature, and minimum temperature) collected from the 11 mms are shown in figures 3 to 6.
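the network shape described in section ii (four scaled inputs, a first hidden layer of 50 tanh neurons, subsequent hidden layers of 10 neurons, and two outputs) can be sketched as an untrained forward pass following the nested form of equation (1); this is an illustrative outline with random stand-in weights, not the authors' theano implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(n_in, n_out):
    # small random weights and zero biases (untrained stand-ins)
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# 4 scaled inputs -> 50 -> 10 -> 2 outputs (flood / no flood)
sizes = [4, 50, 10, 2]
layers = [make_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """One forward pass with tanh activation at every layer,
    following the nested sums of equation (1)."""
    a = np.asarray(x, dtype=float)
    for w, b in layers:
        a = np.tanh(a @ w + b)
    return a

# scaled rainfall, humidity, min. temperature, max. temperature
scores = forward([0.3, -0.1, 0.5, 0.2])
```

in the trained model, the weights would instead be the values learned through backpropagation.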
the data are classified based on the eleven locations selected for this case study (refer to table 1). despite the close geographical proximity of the stations, significant variation in climate parameters is recorded, confirming the existence of microclimatic conditions across different regions in mauritius. brief explanations of each plot are given below. for the minimum and maximum temperatures, a similar trend is noticed for all regions across mauritius. the temperature peaks slightly around january, then drops gradually towards august, and rises again to another peak, with the cycle repeating. the temperature graph shows a sinusoidal trend reflecting temperature variation over the two seasons of mauritius (winter and summer). winter usually occurs around may–october and summer from november to april, as demonstrated by figures 3 and 4. the maximum temperature observed is around 35 degrees celsius and the minimum around 13 degrees celsius. in contrast to the expected increase in temperature [28], there is no noticeable increasing temperature trend over the two years under study (2017 and 2018) for mauritius.

fig. 2. geographic distribution of meteorological stations under analysis

with regard to the rainfall data collected (figure 5), we notice a net difference in the amount of rainfall in 2018 compared to the same period in 2017. in 2017, the month of february was marked by the highest recorded rainfall for regions like providence, quatre bornes, and vacoas, all of which are centrally located on the highlands of mauritius. similarly, the highest rainfall was observed for the same regions in the months of march, may, august, september, and december.
an exception is noted for the month of november, when the southern region of mauritius, indicated by rose belle, recorded the highest rainfall compared to the other regions. in 2018, the beginning of the year was marked by a net increase in rainfall for almost all parts of mauritius. recorded rainfall skyrocketed for the central regions (providence, quatre bornes, and vacoas), while other regions recorded higher rainfall than for the same period in 2017. in general, for the first three months of 2018 (january to march), heavy rainfall was recorded across almost all of the country. a flattening of the plot from april to october is then observed, with another peak of rainfall in december 2018.

fig. 3. time series for minimum temperature
fig. 4. time series for maximum temperature

the trend observed for 2018 is thus quite different from that of 2017. in fact, 2018 was marked by heavier rainfall over longer periods. at the same time, more regions were subjected to more rainfall than in the previous year. this observation contrasts with results obtained in [29], where rainfall data show a declining trend in the southern african region. such observations highlight the specificity of climatic conditions in a region and the need for actual field evidence to better understand environmental phenomena. as seen in figure 6, the humidity level varied across the country according to region (highlands versus lowlands). stations located in the highlands (e.g., providence, vacoas, quatre bornes) registered high humidity compared to stations located near sea level (e.g., albion, baie du cap, port louis, p. aux cannoniers). in some cases, outlier values were noted, but those values were not included in the training dataset. the typical observed trend in humidity was also sinusoidal.
however, there seems to be no correlation with the season (winter versus summer): regardless of season or temperature, humidity shows several peaks and troughs throughout the year. our results corroborate observations reported in [30], where the authors found significant variability in the best-fit trend of relative humidity over land, for which further investigation is required.

fig. 5. time series for rainfall
fig. 6. time series for humidity

a. training the ann regression model for flood prediction

the output feature from our data-driven model is the decision shown in figure 7. based on the different locations for which climate data were collected, the decision is either [yes=1], meaning there is flood, or [no=0], meaning there is no flood, for a specific time period at each location. as per figure 7, for the period 1/1/2017 to 12/30/2018, there were several occurrences of flood across different regions of mauritius, as indicated by the colored bars. those decisions, along with rainfall, maximum and minimum temperature, and humidity as input features, were fed to our ann model for training. our training dataset contained a total of 20,000 input values relevant to data collected from the 11 mms under study, spread over different regions of mauritius. the hypothesis function for the logistic regression is given in equation (3), and the cost function j(θ), named cross entropy (also known as log loss), is shown in equation (4):

$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^{T} x}}$ (3)

$J(\theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)\right]$ (4)

where m here is the number of training examples and y^(i) is the observed decision for the ith example. minimization of the cost function was achieved by running the gradient descent algorithm to find the best estimates for the θ parameters. the weights for each neuron of the hidden layers were adjusted based on the error propagated through backpropagation until the predicted output, as indicated by the decision data of figure 7, was reached.
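equations (3) and (4), together with gradient descent on θ, can be sketched on a toy dataset; the data below are synthetic, and the learning rate and step count are illustrative assumptions, not values used in the study:

```python
import numpy as np

def sigmoid(z):
    # hypothesis of equation (3) applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    """J(theta) per equation (4): mean log loss over the m examples."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient_descent(X, y, lr=0.5, steps=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ theta)
        theta -= lr * X.T @ (h - y) / len(y)  # gradient of equation (4)
    return theta

# toy, linearly separable data: a bias column plus one feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
```

after training, the fitted θ drives the cost well below its starting value of log 2, and thresholding h_θ(x) at 0.5 recovers the labels of this toy set.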
the error graph, i.e., the value of the cost function at every iteration, obtained during the training phase is shown in figure 8. as illustrated, the error converges, and the minimization loop had little effect after around 200 iterations. at this point, we considered the ann model to have been optimized for producing appropriate predictions of flood occurrence based on the input parameters: humidity, temperature, and rainfall.

fig. 7. time series for the output feature decision ([yes=1] / [no=0])
fig. 8. cost function for the flood data-driven logistic regression model

b. validating the ann regression model for flood prediction

once our ann model was trained, we proceeded to test its accuracy. for this purpose, we used another part of our dataset. to recall, out of the two years of collected data for humidity, minimum temperature, maximum temperature, and rainfall from the 11 mms spread across mauritius, 23 months of data were used to train our ann model and 1 month was reserved for testing. it was ensured that the month chosen for testing had occurrences of flood in one or more regions of mauritius. the accuracy of our model was computed using equation (5):

$\text{accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (5)

where tp and tn are the true positives and true negatives, and fp and fn are the false positives and false negatives. the period of data used for testing and validating the model was 11/16/2018 to 12/16/2018. using equation (5), we obtained an accuracy of 98%, which corroborates the findings in [31], where the authors determined that flood prediction using neural networks performed with higher accuracy than other machine learning algorithms.
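equation (5) amounts to the share of correct predictions among all predictions; a minimal sketch, with hypothetical confusion counts for one month of daily, per-station decisions:

```python
def accuracy(tp, tn, fp, fn):
    """Equation (5): correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# hypothetical confusion counts (not the study's actual test results)
acc = accuracy(tp=12, tn=330, fp=4, fn=3)
```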
our work differs from [31] in that those authors used only temperature and rainfall and achieved an accuracy of 91.185, whereas here we included humidity as a third input parameter and considered data from different regions exhibiting microclimatic conditions.

iv. conclusion

this study applied ann to flood prediction using daily climate data in mauritius. the results obtained indicated high accuracy in flood prediction, and this work thus adds to the body of literature supporting the application of ann to flood forecasting. a logistic regression classifier was used as the core algorithm, which processed data collected from 11 meteorological stations scattered at different altitudes throughout the island of mauritius. a key contribution of this work is the empirical evidence obtained to support the accuracy of ann even for regions like mauritius, which experience microclimatic weather conditions in different regions over the same day. as part of future work on enhancing the efficiency of the ann model for flood forecasting in mauritius, further investigation is warranted into the impact of the number of hidden layers, the size of the dataset, and the number of regions considered on the performance of the developed model.

declarations

author contribution: all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement: this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest: the authors declare no conflict of interest.

additional information: no additional information is available for this paper.

references

[1] e. ashraf, a. sarwar, m. junaid, m. b. baig, h. k. shurjeel, and r. k. barrick, “an assessment of in-service training needs for agricultural extension field staff in the scenario of climate change using borich needs assessment model,” sarhad journal of agriculture, vol. 36, no. 2, 2020. [2] q. h.
nguyen et al., “land-use dynamics in the mekong delta: from national policy to livelihood sustainability,” sustainable development, vol. 28, no. 3, pp. 448–467, 2020. [3] y. zhao, x. zou, l. cao, y. yao, and g. fu, “spatiotemporal variations of potential evapotranspiration and aridity index in relation to influencing factors over southwest china during 1960–2013,” theoretical and applied climatology, vol. 133, no. 3–4, pp. 711–726, 2018. [4] z. allam, “building a conceptual framework for smarting an existing city in mauritius: the case of port louis,” journal of biourbanism, vol. 6, no. 1 & 2, pp. 103–121, 2017. [5] z. allam and d. jones, “promoting resilience, liveability and sustainability through landscape architectural design: a conceptual framework for port louis, mauritius; a small island developing state,” in ifla world congress singapore, 2018, pp. 1599–1611. [6] p. luo et al., “flood inundation assessment for the hanoi central area, vietnam under historical and extreme rainfall conditions,” scientific reports, vol. 8, no. 1, pp. 1–11, 2018. [7] z. ahmed, d. r. m. rao, k. r. m. reddy, and y. e. raj, “urban flooding – case study of hyderabad,” global journal of engineering, design and technology, vol. 2, no. 4, pp. 63–66, 2013. [8] m. narvekar and p. fargose, “daily weather forecasting using artificial neural network,” international journal of computer applications, vol. 121, no. 22, 2015. [9] r.
nayak, “artificial neural network model for weather prediction,” international journal of applied information systems (ijais), 2015. [10] m. p. darji, v. k. dabhi, and h. b. prajapati, “rainfall forecasting using neural network: a survey,” in 2015 international conference on advances in computer engineering and applications, 2015, pp. 706–713. [11] g. zhang, b. e. patuwo, and m. y. hu, “forecasting with artificial neural networks: the state of the art,” international journal of forecasting, vol. 14, no. 1, pp. 35–62, 1998. [12] a. tealab, h. hefny, and a. badr, “forecasting of nonlinear time series using ann,” future computing and informatics journal, vol. 2, no. 1, pp. 39–47, 2017. [13] d. r. nayak, a. mahapatra, and p. mishra, “a survey on rainfall prediction using artificial neural network,” international journal of computer applications, vol. 72, no. 16, 2013. [14] a. mosavi, p. ozturk, and k. chau, “flood prediction using machine learning models: literature review,” water, vol. 10, no. 11, p. 1536, 2018. [15] b. ustaoglu, h. cigizoglu, and m. karaca, “forecast of daily mean, maximum and minimum temperature time series by three artificial neural network methods,” meteorological applications, vol. 15, no. 4, pp. 431–445, 2008. [16] n. dempsey, g. bramley, s. power, and c. brown, “the social dimension of sustainable development: defining urban social sustainability,” sustainable development, vol. 19, no. 5, pp. 289–300, 2011. [17] g. a. fallah-ghalhary, m. mousavi-baygi, and m. habibi-nokhandan, “seasonal rainfall forecasting using artificial neural network,” journal of applied sciences, vol. 9, no. 6, pp. 1098–1105, 2009. [18] k. abhishek, a. kumar, r. ranjan, and s. kumar, “a rainfall prediction model using artificial neural network,” 2012 ieee control and system graduate research colloquium, jul. 2012. [19] k. abhishek, m. p. singh, s. ghosh, and a.
anand, “weather forecasting model using artificial neural network,” procedia technology, vol. 4, pp. 311–318, 2012. [20] mislan, haviluddin, s. hardwinarto, sumaryono, and m. aipassa, “rainfall monthly prediction based on artificial neural network: a case study in tenggarong station, east kalimantan indonesia,” procedia computer science, vol. 59, pp. 142–151, 2015. [21] c. g. staub, f. r. stevens, and p. r. waylen, “the geography of rainfall in mauritius: modelling the relationship between annual and monthly rainfall and landscape characteristics on a small volcanic island,” applied geography, vol. 54, pp. 222–234, 2014. [22] c. mcsweeney, m. new, g. lizcano, and x. lu, “the undp climate change country profiles: improving the accessibility of observed and projected climate information for studies of climate change in developing countries,” bulletin of the american meteorological society, vol. 91, no. 2, pp. 157–166, 2010. [23] d. senapathi, f. underwood, e. black, m. a. nicoll, and k. norris, “evidence for long-term regional changes in precipitation on the east coast mountains in mauritius,” international journal of climatology, vol. 30, no. 8, pp. 1164–1177, 2010. [24] j. lee, c.-g. kim, j. e. lee, n. w. kim, and h. kim, “application of artificial neural networks to rainfall forecasting in the geum river basin, korea,” water, vol. 10, no. 10, p. 1448, 2018. [25] c. w. dawson and r. wilby, “an artificial neural network approach to rainfall-runoff modelling,” hydrological sciences journal, vol. 43, no. 1, pp. 47–66, 1998. [26] i. maqsood, m. r. khan, and a. abraham, “an ensemble of neural networks for weather forecasting,” neural computing & applications, vol. 13, no. 2, pp. 112–122, 2004. [27] a. kaur, j. k. sharma, and s. agrawal, “artificial neural networks in forecasting maximum and minimum relative humidity,” international journal of computer science and network security, vol. 11, no. 5, pp. 197–199, 2011. [28] c. r. sunstein, s. bobadilla-suarez, s. c.
lazzaro, and t. sharot, “how people update beliefs about climate change: good news and bad news,” cornell l. rev., vol. 102, p. 1431, 2016. [29] k. dube and g. nhamo, “evidence and impact of climate change on south african national parks. potential implications for tourism in the kruger national park,” environmental development, vol. 33, p. 100485, 2020. [30] m. p. byrne and p. a. o'gorman, “trends in continental temperature and humidity directly linked to ocean warming,” proceedings of the national academy of sciences, vol. 115, no. 19, pp. 4863–4868, 2018. [31] s. sankaranarayanan, m. prabhakar, s. satish, p. jain, a. ramprasad, and a. krishnan, “flood prediction based on weather parameters using deep learning,” journal of water and climate change, 2019.
knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp.
117–127 eissn 2597-4637 https://doi.org/10.17977/um018v4i22021p117-127 ©2021 knowledge engineering and data science | w: http://journal2.um.ac.id/index.php/keds | e: keds.journal@um.ac.id. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). keds is a sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by the indonesian ministry of education, culture, research, and technology.

recognition of handwritten javanese script using backpropagation with zoning feature extraction

anik nur handayani a, 1, *, heru wahyu herwanto a, 2, katya lindi chandrika a, 3, kohei arai b, 4
a department of electrical engineering, universitas negeri malang, jl. semarang 5, malang 65145, indonesia
b department of information science, saga university, saga shi honjou machi honjou, 840-8502, japan
1 aniknur.ft@um.ac.id*; 2 heru_wh@um.ac.id; 3 katyachandrika@gmail.com; 4 arai@is.saga-u.ac.jp
* corresponding author

i. introduction

the backpropagation method is one of the approaches used in artificial neural networks (ann), whose architecture is separated into three layers: input, hidden, and output. backpropagation can solve complicated problems, since it consumes less memory than other algorithms and produces solutions with a low error rate in less time [1]. moreover, this method is preferred because of its ability to recognize incomplete or noisy input patterns. the backpropagation training process comprises three phases: the forward pass, the backward (error propagation) pass, and the weight modification phase. as a result, backpropagation is frequently used in machine learning for a variety of tasks, including classification [2], prediction [3], forecasting [4], and image pattern recognition [5]. backpropagation in image pattern recognition can be utilized to preserve cultures in diverse parts of the world, for example through handwriting recognition of the different regional languages of a country, particularly in asia.
Related studies discuss the identification and recognition of printed Chinese characters using projection and zoning feature extraction [6]. In another study, the 990 most commonly used syllables were used to recognize traditional Korean script, or Hangul [7]. In Thailand, a study identified Thai letters covering 77 different character patterns [8]. In Japan, a study proposed a procedure to distinguish between Kanji and Kana writing styles for character recognition [9]. Moreover, owing to the complexity of printed and handwritten Arabic letters, research on Arabic characters has sought to synthesize the essential aspects of the Arabic writing style [10]. Because of the wide variety of writing styles, handwriting recognition has been the subject of extensive and fascinating research over the last few decades. Accordingly, this study focuses on using backpropagation to recognize the image patterns of Javanese script handwriting in Indonesia. Several similar studies have been conducted, including identifying each character of the Javanese script using horizontal and vertical image extraction transformed through Fourier, which produced an accuracy of 59.5% [11]. Another study using a backpropagation ANN achieved an accuracy of 61% [12], and research on the Hanacaraka Javanese script with a backpropagation ANN reached 74% [13]. None of those studies combined backpropagation with the zoning feature-extraction strategy developed by Elima Hussain [14]. Therefore, this study discusses the recognition of handwritten Javanese script using backpropagation with zoning feature extraction. Compared with prior studies that did not apply zoning feature extraction, the current approach is expected to yield a higher accuracy value.

Article Info. Article history: Submitted 22 October 2021; Revised 5 November 2021; Accepted 21 November 2021; Published online 31 December 2021. Keywords: backpropagation; feature extraction; image processing; Javanese script; pattern recognition.

Abstract. Backpropagation is part of supervised learning, in which the training process requires a target; the resulting error is transmitted back to the units below during training. Backpropagation can solve complicated problems because it consumes less memory than other algorithms, and it can produce solutions with a low error rate in less execution time. In image pattern recognition, backpropagation can support cultural preservation in many places worldwide, including Indonesia, where it is used to recognize patterns in Javanese script writings. This study concludes that the zoning feature-extraction approach combined with backpropagation can be used to distinguish handwritten Javanese characters. The best accuracy attained is 77.00%, with a network architecture comprising 64 input neurons, 40 hidden neurons, a learning rate of 0.003, a momentum of 0.03, and 5000 iterations. This is an open access article under the CC BY-SA license.

II. Method

This project begins with data collection. The obtained data then enters the second step, data pre-processing, which prepares the data for the feature-extraction procedure [15]. In the third step, the image's features are extracted and stored.
After normalization, the data are used for testing with the selected algorithms, followed by evaluation and validation of the findings. Figure 1 depicts the progression of the research stages. The following subsections describe each step in greater depth.

Fig. 1. Research flow

A. Data Collection

Images of the Javanese nglegena characters were used in this study. The data were obtained by distributing forms to respondents. Children, adolescents, and adults who have studied Javanese script were chosen as respondents according to age category: children aged 5 to 11, adolescents aged 12 to 25, and adults aged 26 to 45 years. Each age group had ten respondents, bringing the total to 30. Each respondent wrote 20 nglegena Javanese characters, so 600 images were collected in all. Figure 2 illustrates a sample of the scanned Javanese script written by respondents.

B. Data Pre-processing

Before further processing, the picture data are converted into an image suitable for feature extraction [16]. The stages of this process are: converting the image to a gray image (grayscaling), converting the gray image to a binary image (binarization), cleaning noise (noise removal), equalizing position (crop edge), and equalizing image size (resizing).

• Grayscaling is the initial level of image processing in this study. A weighted percentage of each pixel's R, G, and B values is summed to convert the color image to a gray image [17]. Equation (1) shows the weight of each value:

Gray = Y = (0.2989 · R) + (0.5870 · G) + (0.1140 · B)  (1)

• Binarization: a threshold value determines whether a grayscale pixel is mapped to 0 or 255, i.e., black or white [18]. The value chosen as the threshold is 128 [19], obtained by halving the gray-level range of 255.

• Noise removal eliminates noise from marker ink and dirt scanned together with the form. This process uses the Wiener median filtering method, as shown in Equation (2) [20]:

W(f1, f2) = H*(f1, f2) Sxx(f1, f2) / ( |H(f1, f2)|² Sxx(f1, f2) + Sηη(f1, f2) )  (2)

• Crop edge removes unneeded margins around the Javanese characters before selection. Character cropping is done by searching for the highest and lowest x and y coordinates that contain black pixels [21].

• Resizing: at this stage, the image size is equalized to 120×120 pixels [22]. Figure 3 presents the difference between the raw image and the processed image.

Fig. 2. Samples of scanned data sheets for Javanese script: (a) children; (b) adolescents; and (c) adults
Fig. 3. Differences in character image: (a) before image pre-processing; (b) after grayscaling; (c) after binarization; (d) after crop edge; (e) after resizing; and (f) final resized image

C. Feature Extraction

In this study, feature extraction uses the zoning method developed by Elima Hussain [14]. The 120×120 character image is divided into 16, 25, 36, and 64 zones. These divisions were chosen because they divide the image size evenly. The zoning feature produces numerical data: the number of black pixels in each zone. Figure 4 shows an illustration of the zone division.

D. Normalization

Normalization is the process of scaling attribute values to fit within a given range [23]. Since the range of the data acquired in this study is large, normalization is required. For example, a zone of 10 by 10 pixels has a black-pixel count ranging from 0 to 100.
Several normalization strategies exist; the techniques considered in this study are:

• Min-max normalization scales data from one range to another [24], changing the data using the formula in Equation (3):

normalized(x) = minRange + (x − minValue)(maxRange − minRange) / (maxValue − minValue)  (3)

• Z-score normalization subtracts the data's mean from each value and divides the result by the standard deviation [25], as shown in Equation (4):

new value = (old value − mean) / stdev  (4)

• Decimal scaling normalization divides each value by a power of 10 [26], with the exponent j chosen so that the largest absolute result falls below 1, as shown in Equation (5):

new value = old value / 10^j  (5)

E. Classification

A backpropagation ANN is the classification algorithm employed in this study. Backpropagation is part of supervised learning because the training procedure requires a target; it is named after the way the resulting error propagates back through the units below it during training [27]. The network architecture used here is a multi-layer net: an input layer, a hidden layer, and an output layer. The input layer has 16 neurons holding the values retrieved from the features. The number of neurons in the hidden layer is varied to find the best outcome [8]. The output layer has 20 neurons, matching the 20 Javanese characters to be recognized. The architecture employed in this study is illustrated in Figure 5, and backpropagation's pseudocode is shown in the code snippet below.

Fig. 4. Division of 16 zones on the character "ca" and the calculation results of black pixels for each zone
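Taken together, the pre-processing, zoning, and normalization steps above can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's implementation; function names are assumptions:

```python
# Grayscale (Eq. 1), binarize at threshold 128, count black pixels per zone,
# then z-score-normalize the resulting feature vector (Eq. 4).

def grayscale(rgb):
    return [[0.2989*r + 0.5870*g + 0.1140*b for (r, g, b) in row] for row in rgb]

def binarize(gray, threshold=128):
    return [[0 if p < threshold else 255 for p in row] for row in gray]

def zoning(binary, zones_per_side):
    # Split a square binary image into zones_per_side^2 equal zones and
    # count the black pixels (value 0) in each zone.
    step = len(binary) // zones_per_side
    feats = []
    for zy in range(zones_per_side):
        for zx in range(zones_per_side):
            feats.append(sum(1 for y in range(zy*step, (zy+1)*step)
                               for x in range(zx*step, (zx+1)*step)
                               if binary[y][x] == 0))
    return feats

def z_score(feats):
    mean = sum(feats) / len(feats)
    var = sum((f - mean)**2 for f in feats) / len(feats)
    return [(f - mean) / var**0.5 for f in feats]

# Toy 4x4 image whose top-left 2x2 quadrant is black ink.
rgb = [[(0, 0, 0) if (x < 2 and y < 2) else (255, 255, 255)
        for x in range(4)] for y in range(4)]
features = zoning(binarize(grayscale(rgb)), 2)
print(features)  # -> [4, 0, 0, 0]
```

In the study, the real inputs are 120×120 binary images, and the per-zone counts (16, 25, 36, or 64 values per image) form the feature vector fed to the input layer. The paper's backpropagation pseudocode follows.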
from math import exp

# Calculate neuron activation for an input
def activate(weights, inputs):
    activation = weights[-1]  # bias term
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

# Transfer neuron activation (binary sigmoid / logsig)
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

# Forward propagate input to a network output
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

# Calculate the derivative of a neuron output
def transfer_derivative(output):
    return output * (1.0 - output)

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        if i != len(network) - 1:
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += neuron['weights'][j] * neuron['delta']
                errors.append(error)
        else:
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(expected[j] - neuron['output'])
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

F. Testing

During the testing phase, the collected data are separated into two categories: training data and testing data. The k-fold cross-validation approach is used to divide the data, with k values of 5, 10, and 15 [28]. In this experiment, the 600 image data are separated into ten groups of 60 each; nine groups are used as training data, while one group is used as test data. Figure 6 presents the process of partitioning the dataset using the k-fold cross-validation approach.

G. Evaluation

The confusion matrix approach is used to evaluate the backpropagation classification. This evaluation approach is usually presented for two classes, but it can be extended to handle multi-class classification. Table 1 shows the confusion matrix for 20 classes [29]. The accuracy formula in Equation (6) [30] is used to determine the correctness of the classification:

Accuracy = Σ N_correct / Σ N  (6)

where Σ N_correct is the number of image data classified correctly, and Σ N is the total number of available image data.

Fig. 5. Backpropagation neural network architecture for Javanese script recognition
Fig. 6. Illustration of k-fold cross-validation with k = 10
Table 1. Confusion matrix for handling 20 classes (rows: actual classes K1–K20; columns: classification results K1–K20; correct classifications lie on the diagonal)

III. Results and Discussions

The backpropagation parameters used in the training process include the architecture, number of neurons, activation function, learning rate, momentum, maximum iterations, and learning algorithm. Table 2 lists these parameter settings. The experiments proceeded in stages: the first stage determines which normalization results are best; based on those findings, the next stage determines the best learning rate and momentum values; the third stage determines the number of neurons in the hidden layer; and the fourth stage varies the number of input neurons to discover the optimal zoning.

A. Determination of the Best Normalization Results

Each experiment at this stage uses a fixed design: 16 neurons in the input layer, 20 neurons in the hidden layer, and 20 neurons in the output layer, with a maximum of 5000 iterations. The number of zones formed equals the number of input neurons.
A total of 16 zones were employed in this experiment; the zone count (input-layer size) is adjusted in the fourth stage. Each experiment varies the parameter values and uses four data variants: raw data, data from min-max normalization, data from z-score normalization, and data from decimal-scaling normalization. The experiments at this stage, and through the fourth stage, use k-fold cross-validation with k = 10. Table 3 shows the findings of the first experiment.

According to Table 3, data normalized with the z-score approach had the highest overall accuracy. The accuracy obtained with z-score-normalized data ranges from 56.33% to 60.00%, higher than the results produced from raw data, min-max-normalized data, or decimal-scaling-normalized data.

B. Determination of the Best Learning Rate and Momentum Values

Each experiment uses the same architecture and settings as the previous stage, with data normalized using the z-score method. Table 4 shows the results of determining the best learning rate and momentum values: the best accuracy, 60.00% recognition of Javanese characters, is obtained with a learning rate of 0.003 and a momentum of 0.03. These two values are used in the subsequent stage on the number of hidden neurons.

Table 2. Specifications of the backpropagation neural network
Characteristic | Specification
Input neurons | 16
Hidden neurons | 20, 30, and 40
Output neurons | 20
Activation function | binary sigmoid (logsig)
Learning rate | 0.003; 0.005; 0.008
Momentum | 0.005; 0.01; 0.03; 0.05; 0.08; 0.1
Maximum iterations | 5000

Table 3. Best normalization results
No. | Learning rate | Momentum | Iterations | Raw | Min-max | Z-score | Decimal
1 | 0.003 | 0.03 | 5000 | 43.33% | 23.00% | 60.00% | 9.17%
2 | 0.003 | 0.005 | 5000 | 43.67% | 20.50% | 59.33% | 9.83%
3 | 0.003 | 0.05 | 5000 | 46.83% | 22.00% | 59.67% | 9.67%
4 | 0.005 | 0.1 | 5000 | 40.17% | 44.83% | 57.33% | 10.17%
5 | 0.008 | 0.01 | 5000 | 29.83% | 55.17% | 57.33% | 27.00%
6 | 0.008 | 0.08 | 5000 | 32.67% | 56.17% | 56.17% | 29.83%

C. Determination of the Number of Hidden Neurons

The third stage determines how many neurons are in the hidden layer, varying among 20, 30, and 40, with a learning rate of 0.003, a momentum of 0.03, and 5000 iterations. Table 5 shows the results: the maximum accuracy of 64.00% is achieved with 40 neurons in the hidden layer. The test results across these network architectures are still low. Increasing the number of input neurons can help improve backpropagation testing outcomes [31]; in this scenario, increasing the number of input neurons means increasing the number of zones, which provides the network with more precise information.

D. Determination of the Number of Input Neurons

In this experiment, the backpropagation architecture uses 40 hidden neurons, with 16, 25, 36, and 64 input neurons according to the zone partition plan, a learning rate of 0.003, and a momentum of 0.03. Table 6 shows the results: accuracy improves as the number of input neurons increases up to 64 neurons, which attains the highest accuracy of 77.00%; when the number of input neurons is increased to 100, accuracy drops to 73.00%. The remaining misclassifications can be attributed to a lack of diversity in the handwritten Javanese writing patterns used as training data.

E. Determination of the Number of k in k-fold Cross-Validation

The last experiment identifies the value of k in k-fold cross-validation, with k set to 5, 10, and 15. Table 7 shows the results: the test with k = 5 has 75.17% accuracy and the test with k = 15 has 76.67% accuracy, both lower than the 77.00% achieved with k = 10.

Table 4. The best learning rate and momentum values
No. | Learning rate | Momentum | Iterations | Accuracy
1 | 0.003 | 0.03 | 5000 | 60.00%
2 | 0.003 | 0.005 | 5000 | 59.33%
3 | 0.003 | 0.05 | 5000 | 59.67%
4 | 0.005 | 0.1 | 5000 | 57.33%
5 | 0.008 | 0.01 | 5000 | 57.33%
6 | 0.008 | 0.08 | 5000 | 56.17%

Table 5. The best hidden neuron number results
No. | Learning rate | Momentum | Hidden neurons | Iterations | Accuracy
1 | 0.003 | 0.03 | 20 | 5000 | 60.10%
2 | 0.003 | 0.03 | 30 | 5000 | 62.50%
3 | 0.003 | 0.03 | 40 | 5000 | 64.00%

Table 6. The best number of input neurons
No. | Input neurons | Learning rate | Momentum | Hidden neurons | Iterations | Accuracy
1 | 16 | 0.003 | 0.03 | 40 | 5000 | 64.00%
2 | 25 | 0.003 | 0.03 | 40 | 5000 | 64.83%
3 | 36 | 0.003 | 0.03 | 40 | 5000 | 76.17%
4 | 64 | 0.003 | 0.03 | 40 | 5000 | 77.00%
5 | 100 | 0.003 | 0.03 | 40 | 5000 | 73.00%

F. Evaluation

Based on the previous experimental stages, the maximum accuracy is obtained using a network architecture comprising 64 input neurons, 40 hidden neurons, and 20 output neurons, with a learning rate of 0.003, a momentum of 0.03, 5000 iterations, and k-fold cross-validation with k = 10. The confusion matrix in Figure 7 is used to analyze the classification results from the testing stage. The accuracy for the 20 classes is calculated by summing all correctly predicted data and dividing by the number of data tested, using Equation (6).
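Equation (6) amounts to summing the confusion-matrix diagonal and dividing by the total number of tested samples. A minimal sketch (function name and toy matrix are illustrative, not the paper's data):

```python
# Accuracy from a confusion matrix: correct predictions lie on the diagonal.

def accuracy(confusion):
    # Sum of the diagonal: correctly classified samples ...
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    # ... divided by the total number of tested samples, Eq. (6).
    total = sum(sum(row) for row in confusion)
    return correct / total

# Toy 3-class example: 55 + 58 + 49 = 162 correct out of 180 tested.
matrix = [
    [55, 3, 2],
    [1, 58, 1],
    [4, 7, 49],
]
print(accuracy(matrix))  # -> 0.9
```

With the paper's best configuration, the 20-class matrix sums to 462 correct out of 600 tested, i.e., 462/600 = 77.00%.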
As a result of the evaluation, the resulting accuracy is 462/600 × 100% = 77.00%, with a 64-40-20 neuron architecture.

IV. Conclusion

Building the backpropagation architecture involves multiple stages: identifying the appropriate learning rate and momentum values, the number of hidden neurons, and the number of input neurons. The network architecture with 64 input neurons, 40 hidden neurons, and 20 output neurons achieves the highest accuracy of 77.00%, with a learning rate of 0.003, a momentum of 0.03, and 5000 iterations. Recognition accuracy is affected by increasing the number of input neurons (in this case, adding zones) in the backpropagation architecture. Future research can expand the variety of Javanese script handwriting patterns used as training data and develop the research data toward Javanese script writing in the form of words or sentences.

Table 7. Results for the number of k in k-fold cross-validation
No. | Number of k | Input neurons | Learning rate | Momentum | Hidden neurons | Iterations | Accuracy
1 | 5 | 64 | 0.003 | 0.03 | 40 | 5000 | 75.17%
2 | 10 | 64 | 0.003 | 0.03 | 40 | 5000 | 77.00%
3 | 15 | 64 | 0.003 | 0.05 | 40 | 5000 | 76.67%

Fig. 7. Confusion matrix

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] A. Suliman and Y. Zhang, "A review on back-propagation neural networks in the application of remote sensing image classification," J. Earth Sci. Eng., vol. 5, no. 1, Jan. 2015.
[2] M. Muladi, D. Lestari, D. T. Prasetyo, A. P. Wibawa, T. Widiyaningtiyas, and U. Pujianto, "Classification of locally grown apple based on its decent consuming using backpropagation artificial neural network," in 2019 International Conference on Electrical, Electronics and Information Engineering (ICEEIE), 2019, pp. 96–100.
[3] H. Aini and H. Haviluddin, "Crude palm oil prediction based on backpropagation neural network approach," Knowl. Eng. Data Sci., vol. 2, no. 1, p. 1, 2019.
[4] P. Purnawansyah, H. Haviluddin, H. Darwis, H. Azis, and Y. Salim, "Backpropagation neural network with combination of activation functions for inbound traffic prediction," Knowl. Eng. Data Sci., vol. 4, no. 1, p. 14, Aug. 2021.
[5] S. Afroge, B. Ahmed, and F. Mahmud, "Optical character recognition using back propagation neural network," in 2016 2nd International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE), 2016, pp. 1–4.
[6] A. Khawaja, S. Tingzhi, N. M. Memon, and A. Rajpar, "Recognition of printed Chinese characters by using neural network," in 2006 IEEE International Multitopic Conference, 2006, pp. 169–172.
[7] S.-B. Cho and J. H. Kim, "Recognition of large-set printed Hangul (Korean script) by two-stage backpropagation neural classifier," Pattern Recognit., vol. 25, no. 11, pp. 1353–1360, Nov. 1992.
[8] B. Kijsirikul and S. Sinthupinyo, "Approximate ILP rules by backpropagation neural network: a result on Thai character recognition," in 9th International Workshop on Inductive Logic Programming, 2003, pp. 162–173.
[9] S. D. Budiwati, J. Haryatno, and E. M. Dharma, "Japanese character (kana) pattern recognition application using neural network," in Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, 2011, pp. 1–6.
[10] H. A. A., "Back propagation neural network Arabic characters classification module utilizing Microsoft Word," J. Comput. Sci., vol. 4, no. 9, pp. 744–751, Sep. 2008.
[11] I. Prihandi, I. Ranggadara, S. Dwiasnati, Y. S. Sari, and Suhendra, "Implementation of backpropagation method for identified Javanese scripts," J. Phys. Conf. Ser., vol. 1477, pp. 1–6, Mar. 2020.
[12] N. Nurmila, A. Sugiharto, and E. A. Sarwoko, "Algoritma back propagation neural network untuk pengenalan pola karakter huruf Jawa" [Back propagation neural network algorithm for Javanese character pattern recognition], J. Masy. Inform., vol. 1, no. 1, pp. 1–10, 2010.
[13] A. Setiawan, A. S. Prabowo, and E. Y. Puspaningrum, "Handwriting character recognition Javanese letters based on artificial neural network," Int. J. Comput. Netw. Secur. Inf. Syst., vol. 1, no. 1, pp. 39–42.
[14] H. W. Herwanto, A. N. Handayani, K. L. Chandrika, and A. P. Wibawa, "Zoning feature extraction for handwritten Javanese character recognition," in 2019 International Conference on Electrical, Electronics and Information Engineering (ICEEIE), 2019, pp. 264–268.
[15] S. Khalid, T. Khalil, and S. Nasreen, "A survey of feature selection and feature extraction techniques in machine learning," in 2014 Science and Information Conference, 2014, pp. 372–378.
[16] G. Kumar and P. K. Bhatia, "A detailed review of feature extraction in image processing systems," in 2014 Fourth International Conference on Advanced Computing & Communication Technologies, 2014, pp. 5–12.
[17] T. Kumar and K. Verma, "A theory based on conversion of RGB image to gray image," Int. J. Comput. Appl., vol. 7, no. 2, pp. 5–12, Sep. 2010.
[18] K. Y. Kok and P. Rajendran, "A descriptor-based advanced feature detector for improved visual tracking," Symmetry, vol. 13, no. 8, p. 1337, Jul. 2021.
[19] M. H. Ali, S. Kurokawa, and K. Uesugi, "Vision based measurement system for gear profile," in 2013 International Conference on Informatics, Electronics and Vision (ICIEV), 2013, pp. 1–6.
[20] A. K. Ghosh and A. A. Ansari, "To analysis and implement image de-noising using fuzzy and Wiener filter in wavelet domain," Int. J. Trend Res. Dev., vol. 8, no. 3, pp. 320–373, 2021.
[21] S. Zhu, S. Dianat, and L. K. Mestha, "End-to-end system of license plate localization and recognition," J. Electron. Imaging, vol. 24, no. 2, p. 023020, Mar. 2015.
[22] R. Samad and H. Sawada, "Edge-based facial feature extraction using Gabor wavelet and convolution filters," in Proc. 12th IAPR Conference on Machine Vision Applications (MVA), 2011, pp. 430–433.
[23] J. C. Caicedo et al., "Data-analysis strategies for image-based cell profiling," Nat. Methods, vol. 14, no. 9, pp. 849–863, Sep. 2017.
[24] T. Jayalakshmi and A. Santhakumaran, "Statistical normalization and back propagation for classification," Int. J. Comput. Theory Eng., vol. 3, no. 1, pp. 89–93, 2011.
[25] D. Singh and B. Singh, "Investigating the impact of data normalization on classification performance," Appl. Soft Comput., vol. 97, p. 105524, Dec. 2020.
[26] A. S. Eesa and W. K. Arabo, "A normalization methods for backpropagation: a comparative study," Sci. J. Univ. Zakho, vol. 5, no. 4, p. 319, Dec. 2017.
[27] A. P. Markopoulos, S. Georgiopoulos, and D. E. Manolakos, "On the use of back propagation and radial basis function neural networks in surface roughness prediction," J. Ind. Eng. Int., vol. 12, no. 3, pp. 389–400, Sep. 2016.
[28] S. Hulu, P. Sihombing, and Sutarman, "Analysis of performance cross validation method and k-nearest neighbor in classification data," Int. J. Res. Rev., vol. 7, no. 4, pp. 69–73, 2020.
[29] G. Bueno et al., "Automated diatom classification (Part A): handcrafted feature approaches," Appl. Sci., vol. 7, no. 8, p. 753, Jul. 2017.
[30] A. Bogoliubova and P. Tymków, "Accuracy assessment of automatic image processing for land cover classification of St. Petersburg protected area," Acta Sci. Pol. Geod. Descr. Terrarum, vol. 13, pp. 5–22, 2014.
[31] O. Krestinskaya, K. N. Salama, and A. P. James, "Learning in memristive neural network architectures using analog backpropagation circuits," IEEE Trans. Circuits Syst. I Regul. Pap., vol. 66, no. 2, pp. 719–732, Feb. 2019.
2, pp. 5–12, sep. 2010. [18] k. y. kok and p. rajendran, “a descriptor-based advanced feature detector for improved visual tracking,” symmetry (basel)., vol. 13, no. 8, p. 1337, jul. 2021. [19] m. h. ali, s. kurokawa, and k. uesugi, “vision based measurement system for gear profile,” in 2013 international conference on informatics, electronics and vision (iciev), 2013, pp. 1–6. [20] a. k. ghosh and a. a. ansari, “to analysis and implement image de-noising using fuzzy and wiener filter in wavelet domain,” int. j. trend res. dev., vol. 8, no. 3, pp. 320–373, 2021. [21] s. zhu, s. dianat, and l. k. mestha, “end-to-end system of license plate localization and recognition,” j. electron. imaging, vol. 24, no. 2, p. 023020, mar. 2015. [22] r. samad and h. sawada, “edge-based facial feature extraction using gabor wavelet and convolution filters,” proc. 12th iapr conf. mach. vis. appl. mva 2011, pp. 430–433, 2011. [23] j. c. caicedo et al., “data-analysis strategies for image-based cell profiling,” nat. methods, vol. 14, no. 9, pp. 849–863, sep. 2017. [24] t. jayalakshmi and a. santhakumaran, “statistical normalization and back propagationfor classification,” int. j. comput. theory eng., vol. 3, no. 1, pp. 89–93, 2011. [25] d. singh and b. singh, “investigating the impact of data normalization on classification performance,” appl. soft comput., vol. 97, p. 105524, dec. 2020. [26] a. s. eesa and w. k. arabo, “a normalization methods for backpropagation: a comparative study,” sci. j. univ. zakho, vol. 5, no. 4, p. 319, dec. 2017. [27] a. p. markopoulos, s. georgiopoulos, and d. e. manolakos, “on the use of back propagation and radial basis function neural networks in surface roughness prediction,” j. ind. eng. int., vol. 12, no. 3, pp. 389–400, sep. 2016. [28] s. hulu, p. sihombing, and sutarman, “analysis of performance cross validation method and k-nearest neighbor in classification data,” int. j. res. rev., vol. 7, no. april, pp. 69–73, 2020. [29] g. 
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 3, No 2, December 2020, pp. 67–76 eISSN 2597-4637
https://doi.org/10.17977/um018v3i22020p67-76
©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

A Review of Accessing Big Data with Significant Ontologies

Jumah Y. J. Sleeman 1,*, Jehad A. H. Hammad 2
Department of Computer Information Systems, Al-Quds Open University, Beit Jalla, the Main Road-Khallat Al Badd, Bethlehem, Palestine
1 jsulaiman@qou.edu *; 2 jhammad@qou.edu; * corresponding author

I. Introduction

Accessing and managing information in big data scenarios is extremely difficult because of the multiple dimensions of big data: (1) volume, which concerns the size of the data — non-traditional sources alone can produce terabytes of data within minutes; (2) variety, which refers to the range of data types and formats involved; (3) velocity, which refers to the speed at which data streams arrive, for example from social media; and (4) value, which refers to the valuable information hidden in non-traditional data. Ontology-based data access (OBDA) is a promising paradigm for solving the problem of accessing these massive amounts of accumulated data and for designing effective data-access platforms [1].
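Before the components are described in detail, the overall OBDA idea can be made concrete with a minimal, purely illustrative sketch. Everything here is invented for illustration (the `ObdaSystem` class, the `employees` table, the concept names); real OBDA platforms are far richer, but the division of labour is the same: the ontology supplies a conceptual vocabulary, the mappings define each concept as a query over the source, and the source data stay where they are.

```python
# Illustrative OBDA skeleton with the three layers discussed in this review:
# an ontology (conceptual view), a mapping layer, and untouched data sources.
# All names and data below are invented for illustration.

class ObdaSystem:
    def __init__(self, ontology, mappings, source):
        self.ontology = ontology    # concept inclusions: {sub_concept: super_concept}
        self.mappings = mappings    # ontology concept -> query (view) over the source
        self.source = source        # the relational data, left as-is

    def answer(self, concept):
        """Answer an instance query: expand the concept with every sub-concept
        the ontology declares (ontology rewriting), then evaluate each concept
        via its mapping over the source (mapping rewriting)."""
        concepts = {concept} | {sub for sub, sup in self.ontology.items()
                                if sup == concept}
        answers = set()
        for c in concepts:
            if c in self.mappings:
                answers |= self.mappings[c](self.source)
        return answers

source = {"employees": [("ada", "engineer"), ("bob", "manager")]}
system = ObdaSystem(
    ontology={"Engineer": "Employee"},
    mappings={
        "Engineer": lambda s: {n for n, role in s["employees"] if role == "engineer"},
        "Employee": lambda s: {n for n, _ in s["employees"]},
    },
    source=source,
)
print(system.answer("Employee"))  # every employee, engineers included
```

The end-user asks only about `Employee`; the system, not the user, knows that `Engineer` instances also qualify and which source queries retrieve them.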
Figure 1 shows the characteristic OBDA architecture, which consists of: 1) an ontology, representing a conceptual view of the data for a domain of interest; 2) a mapping layer, which resolves the differences between the basic elements managed by the data sources and the elements managed by the ontology; and 3) the data sources, the repositories used in organizations by different services and applications [1][2][3][4]. An OBDA system thus behaves as a form of information integration that replaces the global schema with a general, ontology-based, end-user-oriented query interface over diverse data sources. The ontology, together with the corresponding mappings to the data sources, provides the specification needed to collect the correct data to be returned to the client. OBDA specifications focus on query answering and ensure that equivalent specifications give the same answers to the considered queries for all possible extensions of the data sources [4].

The life cycle of an OBDA system starts when end-users pose their SPARQL queries through a visual interface to the ontology layer, without any knowledge of the actual structure of the data. The query is first rewritten using one of the description-logic notations that exists behind the ontology, and then rewritten again with respect to the mapping assertions over the data sources to obtain the answer. In this scenario, end-users and domain experts can access big data without asking IT experts.

Article Info
Article history: received 27 August 2020; revised 18 November 2020; accepted 20 December 2020; published online 31 December 2020

Abstract — Ontology-based data access (OBDA) is a recently proposed approach that provides a conceptual view of relational data sources. It addresses the problem of direct access to big data by providing end-users with an ontology that sits between users and sources, in which the ontology is connected to the data via mappings.
We introduce the languages used to represent the ontologies and the mapping-assertion technique that derives query answering from the sources. Query answering is divided into two steps: (i) ontology rewriting, in which the query is rewritten with respect to the ontology into a new query; and (ii) mapping rewriting, in which the query obtained from the previous step is reformulated over the data sources using the mapping assertions. In this survey, we aim to study the earlier works done by other researchers in the fields of ontology, mapping, and query answering over data sources.

Keywords: ontology; big data; mapping rewriting; ontology rewriting

To make this idea clearer, let us assume that the ontology T is given by a set of axioms expressed in a description logic (DL), D is a relational database compatible with the data sources S, and M is a set of mapping assertions, each of the form φ(x⃗) → ψ(x⃗), where φ(x⃗) is a query over S returning tuples of values for x⃗, and ψ(x⃗) is a query over T whose free variables are from x⃗ [2]. Later in this review, we will see how the ontology and the mappings, taken as inputs, can help end-users compute a query that can be executed over the data sources.

II. Motivation

When the sources of data are uniform, query results can be retrieved within minutes or seconds across the different sources. Otherwise, end-users need to collaborate with IT-skilled experts to develop the queries that retrieve the required data.
In that scenario, the turnaround time between asking a question and retrieving the results may be in the range of days or more. So the challenge here is how end-users and domain experts can access big data without asking IT experts. OBDA is a recently proposed approach that addresses the problem of direct access to data integrated from several sources: it avoids this bottleneck by automating the query-translation process, and it can be considered a virtual approach that tells us exactly where the data reside. OBDA also solves the problem of structural heterogeneity, in which different information systems store their data in different structures, and of semantic heterogeneity, which refers to the content of information items and their intended meanings [5]. There are several features of a successful OBDA implementation that lead us to believe it is the right approach for end-users to access big data [2][4][5]:

• Ontologies: the objective of an ontology in an OBDA system is to describe the domain, classifying and categorizing the elements contained within it.
• Mapping assertions: the ontology plays an important role in information integration, bringing together information in different formats; to support data integration, mappings connect the ontology with the data sources.
• Query answering: the database queries used in OBDA are typically conjunctive queries in first-order logic. These queries fall into two categories: (i) instance queries (IQs), which ask for the instances of a single concept, and (ii) unions of conjunctive queries (UCQs), which ask for a set of conjunctive queries over the OBDA specification.

Fig. 1. OBDA characteristic

In order for end-users to create value from rapidly growing data, OBDA also offers the following points: (1) it is declarative, so neither end-users nor IT experts need to write special-purpose program code;
(2) relational databases can remain as they are, so there is no need to move large and complex data sets; (3) OBDA adapts to data scalability, so data retrieval remains stable; (4) OBDA hides the complexity of the data sources from the end-users; (5) the relationship between the ontology concepts and the data sources provides a means for the experts (database administrators) to make their knowledge available to the end-user.

III. Problem Statements

A. Data Sources and Big Data

Data sources can be designated as structured or unstructured. The term "structured data" refers to data with an identifiable structure, stored according to a methodology of columns and rows and organized for human readers in such a way that the data becomes searchable by type within its content. The term "unstructured data" refers to any data that has no identifiable structure, such as videos, emails, documents, and texts, each of which has its own structure or format. Big data is an expression that refers to a collection of enormous and complex data sets generated and accumulated at three levels: the employees in companies who enter data into computer systems; the users, a far larger group, who may generate incorrect data when signing up to websites such as Facebook; and thirdly the data accumulated from machines (satellites, sensors, robots, etc.). Together, these three levels produce big data, which has three main characteristics: volume, velocity, and variety. However, [6] adds one more characteristic, value: the justification is that a lot of information is hidden in larger bodies of non-traditional data, so the challenge is to identify what is valuable, and then transform and extract the relevant data for analysis [7].

B.
Ontology Rules

Ontologies are structural frameworks for organizing information, represented as a formal definition of the types, properties, and interrelationships of the entities that exist in some domain. However, ontologies also take on additional tasks, as discussed in the following sections.

1) Content explication

Single-ontology approaches [2][5][8]: in Figure 2, a single global ontology provides a shared vocabulary; all information sources are related to one global ontology and mapped to the local data sources for information retrieval. This approach is not effective if one information source has a different view of the domain, and it is sensitive to changes in the information sources: any change implies changes to the global ontology and to the source mappings.

Multi-ontology approaches [2][5][8], shown in Figure 3: 1) each information source is described by its own ontology; 2) each source ontology can be developed without regard to other sources or their ontologies; 3) this can simplify the integration task; 4) it is not effective for comparing different source ontologies, due to the lack of a common vocabulary.

Hybrid ontology approaches [2][5][8] are built on a global shared vocabulary that makes the source ontologies comparable. In Figure 4: 1) the semantics of each source is described by its own ontology; 2) no modification of the mappings or the shared vocabulary is needed when new sources are added; 3) it is extremely hard to reuse existing ontologies, because all sources refer to the shared vocabulary.

Fig. 2. Single ontology approach

2) Ontology knowledge

Description logics (DLs) are logics specifically designed to represent structured knowledge about a domain composed of objects, structured into: (i) concepts, which correspond to classes and denote sets of objects; and (ii) roles, which correspond to (binary) relationships and denote binary relations on objects.
Web Ontology Language (OWL) is a rich vocabulary-description language for describing properties and classes. The formal underpinning of OWL is based on description-logic (DL) knowledge-representation formalisms with well-understood computational properties [9]. A DL ontology consists of a terminological box (TBox) and an assertion box (ABox): the TBox describes a system in terms of a controlled vocabulary, such as a set of classes and properties, while the ABox is a set of assertions stated in the TBox vocabulary; together, the TBox and ABox represent the knowledge base (KB). DLs are a family of logics concerned with knowledge representation; they are decidable fragments of first-order logic (FOL) associated with a set of automatic reasoning procedures. The basic constructs of a DL are the notion of concept and the notion of relationship; complex concept and relationship expressions can be constructed from atomic concepts and relationships with suitable constructors. Since the ontology is a model of (some aspect of) the world, it can introduce vocabulary relevant to the domain with a specific meaning (semantics); for instance, "a happy cat owner owns a cat and all cats he cares for are healthy" can be formalized in a suitable description logic (DL) as

HappyCatOwner ⊑ ∃owns.Cat ⊓ ∀caresFor.Healthy (1)

The best-known description logics are [10]:
• FL⁻: the simplest and least expressive DL, with concepts C, D → A | C ⊓ D | ∀R.C | ∃R
• ALC: a more practical and expressive DL, with C, D → A | ⊤ | ⊥ | ¬A | C ⊓ D | ∀R.C | ∃R.⊤
• SHOIN: a very popular DL; it is the logic underlying OWL.
• The DL-Lite_{A,id} family: a very expressive DL capable of representing most database constructs.

Fig. 3. Multiple ontology approach
Fig. 4. Hybrid ontology approach
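The set-based semantics of these constructors can be checked on a toy interpretation. The sketch below is illustrative only (the individuals and the `owns`/`cares_for` relations are invented); it evaluates the existential and universal restrictions from the HappyCatOwner axiom in (1) as operations on sets and pairs:

```python
# Toy interpretation (all names invented): concepts are sets of objects,
# roles are sets of pairs, as in the DL semantics described above.
cat = {"tom", "felix"}
healthy = {"tom", "felix"}
owns = {("anna", "tom")}
cares_for = {("anna", "tom"), ("anna", "felix"), ("bob", "rex")}

def exists_r(role, concept, x):
    """x satisfies EXISTS role.concept: some role-successor of x is in concept."""
    return any(a == x and b in concept for (a, b) in role)

def forall_r(role, concept, x):
    """x satisfies FORALL role.concept: every role-successor of x is in concept."""
    return all(b in concept for (a, b) in role if a == x)

def happy_cat_owner(x):
    # HappyCatOwner == EXISTS owns.Cat AND FORALL caresFor.Healthy, cf. (1)
    return exists_r(owns, cat, x) and forall_r(cares_for, healthy, x)

print(happy_cat_owner("anna"))  # -> True: anna owns a cat; all she cares for are healthy
print(happy_cat_owner("bob"))   # -> False: bob owns no cat, and rex is not healthy
```

Note how the universal restriction is vacuously true for individuals with no role successors, which is exactly the FOL reading of ∀R.C.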
A DL knowledge base Σ is normally separated into two parts, Σ = (TBox, ABox): the TBox describes the structure of a domain through axioms of the form C ⊑ D and C ≡ D, and the ABox is a set of axioms of the form C(a) and R(a, b) describing the data. Further details can be found in [3][4][10][11]. The following example shows a DL knowledge base.

TBox example:
T = { Student ≡ Person ⊓ ∃name.String ⊓ ∃address.String ⊓ ∃enrolled.Course,
      Student ⊑ ∃enrolled.Course,
      ∃teacher.Course ⊑ ¬Undergrad ⊓ Professor }

ABox example:
A = { Student(john), enrolled(john, cs124), (Student ⊔ Professor)(paul) }

C. Mapping

The purpose of mapping is to reconcile the heterogeneity that arises from differently designed schemas, even when people or organizations model the same domain; these problems mostly occur between the mediated schema and the schemas of the data sources. In Figure 5, schema mappings describe the relation in which instances of the mediated schema are consistent with the current instances of the data sources [12]. Let I(G) (respectively I(Sᵢ)) denote the set of possible instances of the mediated schema G (respectively of source schema Sᵢ). A mapping is then a relation

M_R ⊆ I(G) × I(S₁) × I(S₂) × … × I(Sₙ) (2)

The mapping M_R represents all possible instances of the mediated schema G given instances of the sources I(S₁), I(S₂), …, I(Sₙ). In other words, a mapping assertion specifies the semantic relationship between elements of a DL TBox ontology and elements of the data sources [4]. Many OBDA studies have focused on understanding which languages for the ontology and the mappings allow query answering to be performed, taking into account inconsistency and redundancy in the mappings [3]. Query execution can be performed if (1) the ontology is expressed in a DL-Lite
family ontology language, and (2) the mappings are of type (a) global-as-view (GAV), in which the mediated schema is defined as a set of views over the data sources, so each entity of the global ontology is mapped to a query over the original sources; (b) local-as-view (LAV), in which the data sources are defined as views over the mediated schema, so each entity of the original sources is mapped to a query over the global ontology; or (c) GLAV, the combination of the two. Mapping analysis in OBDA aims to provide the designer with useful services that produce a well-founded OBDA specification; two important points should be considered: (1) a mapping M is inconsistent with respect to an ontology O and a source schema S when data retrieval leads to an inconsistent OBDA specification even though the schema S is non-empty — in other words, no data are retrieved, or the data are mismatched; (2) M is subsumed by M′ (M ⊑ M′) with respect to O and S, meaning that the specifications over O and S with mappings M ∪ M′ and with M′ are equivalent. These properties are very important during the life of an OBDA system to avoid the above problems, especially when executing hundreds of queries [11][13].

Fig. 5. Semantics of schema mappings

IV. Methodology of OBDA

In Figure 6, the query obtained from the end-user via the visual query system is processed in two steps: (i) ontology rewriting, in which the query is rewritten with respect to the ontology into a new query; and (ii) mapping rewriting, in which the query so obtained is reformulated over the data sources using the mapping assertions [14]. An OBDA specification is a triple J = (O, S, M), where O is a description-logic TBox ontology, S is a source schema with integrated integrity constraints, and M is a mapping between the two consisting of assertions of the form

φ(x) → ψ(x) (3)

where φ(x) is a query over the sources and ψ(x) is a query over the ontology [11][13].
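The GAV mapping-rewriting step can be sketched with real SQL over an in-memory database. The schema and data below are invented for illustration (not the paper's example): each assertion of the form φ(x) → ψ(x) pairs an SQL body with a single ontology predicate as head, and unfolding an instance query groups all bodies sharing that head into one UNION query.

```python
import sqlite3

# A toy source database (schema and rows invented for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person_tab (code INTEGER, name TEXT, kind TEXT)")
db.executemany("INSERT INTO person_tab VALUES (?, ?, ?)",
               [(1, "ada", "student"), (2, "bob", "staff")])

# GAV mapping assertions phi(x) -> psi(x): each SQL body (phi) populates
# one ontology predicate (psi). Several assertions may share one head.
mapping_assertions = [
    ("SELECT code FROM person_tab WHERE kind = 'student'", "Student"),
    ("SELECT code FROM person_tab WHERE kind = 'student'", "Person"),
    ("SELECT code FROM person_tab WHERE kind = 'staff'",   "Person"),
]

def unfold(concept):
    """Mapping rewriting for an instance query: group all SQL bodies that
    map to the same ontology predicate into a single UNION query."""
    bodies = [phi for phi, psi in mapping_assertions if psi == concept]
    sql = " UNION ".join(bodies)
    return {row[0] for row in db.execute(sql)}

print(unfold("Person"))  # codes of everyone mapped to Person, via one UNION query
```

The source schema never changes; only the generated SQL does, which is why OBDA can leave relational data where it is.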
We denote by Σ_O the signature of the ontology O and by L_O its description-logic language, while S has signature Σ_S and language L_S; x is the tuple of arguments passed between the two levels. A mapping assertion m ∈ M of the form of equation (3) [15] consists of:

• ψ(x), a query over the signature Σ_O, represented by head(m);
• φ(x), a query over the signature Σ_S, represented by body(m);

where mᵢ ∈ M and i = {1, 2, 3, …}. In this setting, GAV mapping rewriting consists of grouping all SQL queries that map the same ontology role to the database into a single query [3][4].

Example: the schema S of the database is represented by two tables, HumanTab and AreaTab, handling information about humans and their strain; the underlined attributes represent the primary keys of the tables, and the attribute Area is a foreign key relating the two tables:

HumanTab(HumanCode, Name, Strain, Area)
AreaTab(AreaCode, AreaName)

The ontology O is as follows:

O = { African ⊑ Human, Asian ⊑ Human, African ⊑ ¬Asian,
      Human ⊑ ∃Name, Human ⊑ ∃Location, ∃Location ⊑ Code, Code ⊑ Name }

In words, O specifies Africans and Asians as humans, states that an African cannot be an Asian, and says that every human has a name and is located in a location that has a code; moreover, every code has a name. The mapping M between O and S is as follows:

m1: SELECT HumanCode AS x, Name AS y FROM HumanTab → Human(x) ∧ Name(x, y)
m2: SELECT HumanCode AS x, Name AS y FROM HumanTab WHERE Strain = 'African' → African(x) ∧ Location(x, y)
m3: SELECT HumanCode AS x, Name AS y FROM HumanTab WHERE Strain = 'Asian' → Asian(x) ∧ Location(x, y)
m4: SELECT HumanCode AS x, Area AS y FROM HumanTab → Location(x, y)
m5: SELECT AreaCode AS x, AreaName AS y FROM AreaTab → Code(x) ∧ Name(x, y)

Fig. 6. OBDA query system

As for the semantics of an OBDA specification J with respect to S, a source instance D is legal if

I_D ⊨ S (4)

where I_D is a set of facts over Σ_S.
In other words, a legal instance always exists for each S. In equation (5), every mapping assertion denotes the existential arguments in its head(m) [3][4][11][13]:

φ_m(x) → ψ_m(x), for each m ∈ M (5)

V. Discussions and Evaluations

A. Evaluation

The main aim of ontology rewriting is to solve the problem of answering the query that comes from the end-user. The idea is to transform the given query and TBox into an expanded query that contains all relevant information captured in the TBox, and then to evaluate the expanded query over the ABox only. The expanded version takes the form of a union of conjunctive queries (UCQs), which avoids keeping large ABoxes in memory [16]. Another issue is the size of the rewritten query over the ontology, which grows with the size of the TBox and of the original query; in that case, the UCQ may contain hundreds or thousands of queries, which affects the performance of information retrieval. Two types of problems may appear in an OBDA system: (1) syntax errors — the ontology TBox, represented in the DL-Lite family, must be semantically well formed, and the mapping assertions must not contain misspellings; (2) semantic problems — the ontology must not contain unsatisfiable concepts, roles, or attributes. For the mappings, an assertion m ∈ M is semantically anomalous if the answer to either its head query or its body query is empty: if the body query (SQL over the database) is empty, the assertion is useless, while if the head query (a conjunctive query) is empty and the body is not, the assertion may lead to a contradiction [15].

B. Table Summary

In this section, we present a discussion related to the OBDA systems introduced above. First, we compare different systems that use OBDA for the integration of heterogeneous information sources: we compare the ontology languages as well as how each system connects the ontology with the sources via mappings.
From Table 1 we find that the ontology is formulated using the DL-Lite family [17][18][19][20] or the DL behind OWL, as shown in (1) [14][21][22][23][24], and that most of the presented platforms use GAV mapping rewriting. The table also shows the methodology with which each system implements its OBDA specification, and some important points that shed light on how these systems derive answers from the data sources. In Table 2, we present a discussion of how the mappings connect to the information sources: (1) the straightforward approach connects the ontology to the data schema as a one-to-one copy of the structure of the database, encoded in a language that makes automated reasoning possible; (2) the definition approach does not correspond to the structure of the database — the ontology is linked to the information only by the terms that are defined; (3) structure enrichment combines the two previous approaches, covering both the structure and the information source; (4) meta-annotation adds semantic information to an information source, as found on the World Wide Web [5]. Table 3 summarizes the standard languages and the query models used in this review: GAV, in which the ontology is defined as a set of views over the data sources; GLAV, in which each mapping rule is a conjunctive query written in the global schema associated with a conjunctive query written in the source schemas; and R2RML, a mapping language that connects relational databases to RDF datasets through logical tables used to retrieve data from the input database. The standard languages also include the DL-Lite family and the OWL 2 QL ontology languages, which have formally defined meanings [3][4][11][13][25]. Table 3 also shows that query answering may use unions of conjunctive queries (UCQs) [3] or standalone conjunctive queries (CQs) over the ontology.

Table 1.
Comparison between different platforms that use OBDA, in terms of ontology language, mapping assertions, methodology, and other important points:

• Optique (visual query interface) — ontology language: DL-Lite family; mappings: the LogMap system discovers ontology-to-ontology mappings; methodology: every many-to-many table in the database is mapped to one class in the ontology, every data attribute to one data property, and every foreign key to one object property; other points: derives the ontology from the database schema (reverse engineering) [17][18][19].
• SmartDairyFarming project — ontology language: TopBraid (common ontology using RDFS and OWL); mappings: the LogMap system; methodology: sensor equipment collects the data sources, and a smart ontology is set up and queried using SPARQL; other points: the project is in an experimental phase [21].
• MASTRO project (Java tool) — ontology language: DL-Lite family behind OWL; mappings: GAV; methodology: adds view inclusions to the OBDA specification, eliminating sub-queries contained in other sub-queries of the rewritten queries; other points: this study focuses on the case where the data are stored in the ABox [14][22][23].
• Clipper system — ontology language: DL-Lite family; methodology: presents a rewriting-based algorithm for conjunctive query answering over ontologies; other points: the experiments used TBox taxonomies and query reasoning with Horn-SHIQ (DL) [20].
• Ontop — ontology language: OWL 2 QL, a profile of OWL 2 designed to support rewriting of conjunctive queries (CQs) over ontologies into first-order (FO) queries; mappings: M as a set of GAV rules; methodology: the OBDA mechanism with ontologies given in OWL 2 QL; other points: OBDA is achievable in practice when applied to real-world ontologies, queries, and data stored in relational databases.

Table 2.
Comparison in terms of mapping connection to information sources:

• Straightforward approach — copies the structure of the database [4][13].
• Definition approach — linked to the information only by the terms that are defined [11][14][17][18][19].
• Structure enrichment — copies both the structure and the information of the database [3][9][22][23][24].
• Meta-annotation — adds semantic information to the sources [20][21].

Table 3. Comparison in terms of standard languages and query model:

• GAV for mapping assertions; PerfectMap algorithm — union of first-order rewritable conjunctive queries (UCQs) [3].
• GAV and GLAV for mapping assertions; TBox formulated in DL-Lite_R — conjunctive queries (CQs) and instance queries (IQs) [4].
• OWL 2 QL for ontology definition; SPARQL for query specification; R2RML for mapping assertions — conjunctive queries (CQs) over ontologies, specified in SPARQL [14][22][23].
• DL-Lite family; R2RML for mapping assertions — the rewritings over the ABoxes turn out to be unions of queries whose size does not exceed the size of the original query [17][18][19][21].
• GAV and GLAV for mapping assertions — conjunctive queries (CQs) [11][13].

VI. Conclusion

A promising OBDA system is able to solve many challenges related to end-user data access, especially for big data. The approach presented here bases query answering on two steps: (i) ontology rewriting and (ii) mapping rewriting over the data sources. A successful OBDA implementation can solve the problem of accessing big data as follows: (1) neither end-users nor IT experts need to write special-purpose code; (2) the data can be left in the relational database; (3) it provides a flexible query language suited to end-users; (4) the ontology hides the complexity of the source schema from the end-user;
(5) database expert’s knowledge will be available to end-users because of the relationship between the ontology and the sources via mapping. from this survey we have found that most of the researchers’ efforts studying how to extract implicit knowledge from big data based on the use of ontologies and the declarative mappings between data and ontology schema’s. also, researchers introduced existing platforms and under constructing ones based on obda systems to give end users the ability to access big data through visual interfaces to write queries. declarations author contribution all authors contributed equally as the main contributor of this paper. all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no conflict of interest. additional information no additional information is available for this paper. references [1] k. t. wassif, “a survey on using ontology for addressing end user access to big data,” int. j. comput. syst., vol. 02, no. 08, pp. 363–372, 2015. [2] m. giese et al., “scalable end-user access to big data,” in big data computing, chapman and hall/crc, 2013, pp. 205–244. [3] f. di pinto et al., “optimizing query rewriting in ontology-based data access,” proc. 16th int. conf. ext. database tech. edbt ’13, 2013, p. 561. [4] m. bienvenu and r. rosati, “query-based comparison of obda specifications,” proc. 28th int. work. descr. logics (dl 2015), 2015. [5] h. wache et al., “ontology-based integration of information a survey of existing approaches,” ifcai-01 work. ontol. inf. shar., pp. 108–117, 2001. [6] j. p. dijcks, oracle: big data for the enterprise. oracle white paper, 2012. [7] s. . jeong and i. ghani, “semantic computing for big data: approaches, tools, and emerging directions (2011-2014),” ksii trans. internet inf. syst., vol. 8, no. 6, pp. 2022–2042, jun. 2014. [8] l. 
zuo, "a semantic and agent-based approach to support information retrieval, interoperability and multi-lateral viewpoints for heterogeneous," phd thesis, university of london, 2006.
[9] b. c. grau, e. kharlamov, and d. zheleznyakov, "how to contract ontologies (statement of interest)," pp. 1–5, 2012.
[10] m. jarrar, "towards methodological principles for ontology engineering," phd thesis, vrije universiteit brussel, 2005.
[11] d. lembo, j. mora, r. rosati, d. f. savo, and e. thorstensen, "mapping analysis in ontology-based data access: algorithms and complexity," lect. notes comput. sci., pp. 217–234, 2015.
[12] v. jahns, "principles of data integration by anhai doan, alon halevy, zachary ives," acm sigsoft softw. eng. notes, vol. 37, no. 5, pp. 43–43, sep. 2012.
[13] d. lembo, j. mora, r. rosati, d. f. savo, and e. thorstensen, "towards mapping analysis in ontology-based data access," int. conf. web reason. rule syst., pp. 108–123, 2014.
[14] n. antonioli et al., "ontology-based data access: the experience at the italian department of treasury," ceur workshop proc., vol. 1017, pp. 9–16, 2013.
[15] d. lembo, r. rosati, m. ruzzi, d. f. savo, and e. tocci, "visualization and management of mappings in ontology-based data access (progress report)," ceur workshop proc., vol. 1193, pp. 595–607, 2014.
[16] h. pérez-urbina, e. rodríguez-díaz, m. grove, g. konstantinidis, and e. sirin, "evaluation of query rewriting approaches for owl 2," ceur workshop proc., vol. 943, pp. 32–44, 2012.
[17] m. giese et al., "optique: zooming in on big data," computer, vol. 48, no. 3, pp. 60–67, mar.
2015.
[18] d. calvanese et al., "the optique project: towards obda systems for industry," ceur workshop proc., vol. 1080, 2013.
[19] d. calvanese et al., "optique: obda solution for big data," in the semantic web: eswc 2013 satellite events, springer, pp. 293–295, 2013.
[20] t. eiter, m.
ortiz, m. šimkus, t. k. tran, and g. xiao, "query rewriting for horn-shiq plus rules," proc. natl. conf. artif. intell., vol. 1, pp. 726–733, 2012.
[21] j. p. c. verhoosel, m. van bekkum, and f. k. van evert, "ontology matching for big data applications in the smart dairy farming domain," ceur workshop proc., vol. 1545, pp. 55–59, 2015.
[22] d. f. savo et al., "mastro at work: experiences on ontology-based data access," ceur workshop proc., vol. 573, pp. 20–31, 2010.
[23] d. calvanese et al., "the mastro system for ontology-based data access," semant. web, vol. 2, no. 1, pp. 43–53, 2011.
[24] r. kontchakov, m. rodríguez-muro, and m. zakharyaschev, "ontology-based data access with databases: a short course," lect. notes comput. sci., 2013, pp. 194–229.
[25] l. e. t. neto, v. m. p. vidal, m. a. casanova, and j. m. monteiro, "r2rml by assertion: a semi-automatic tool for generating customised r2rml mappings," lect. notes comput. sci., 2013, pp. 248–252.
knowledge engineering and data science (keds) pissn 2597-4602 vol 3, no 1, july 2020, pp. 28–39 eissn 2597-4637 https://doi.org/10.17977/um018v3i12020p28-39 ©2020 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

earthquake magnitude and grid-based location prediction using backpropagation neural network

bagus priambodo a, 1, *, wayan firdaus mahmudy a, 2, muh arif rahman a, 3
a faculty of computer science, brawijaya university, jl. veteran no. 8, malang 65145, indonesia
1 baguspria@student.ub.ac.id *; 2 wayanfm@ub.ac.id; 3 m_arif@ub.ac.id
* corresponding author

i. introduction

natural disasters are inevitable: they may come without prior notice and have been responsible for deaths on a massive scale [1]. the centre for research on the epidemiology of disasters (cred) reported an average of 77,144 deaths per year caused by natural disasters from 2000 to 2017 [2]. natural disasters caused by seismic activity (earthquakes, tsunamis, and volcanic activity) disrupted 3.4 million lives in 2018 [2]. earthquakes have caused the most deaths every year compared to other types of natural disasters, such as drought, flood, landslide, and wildfire, with a toll of 46,173 lives [2]. even though earthquakes are inevitable, their damage and casualties can be anticipated and minimized. past research has been conducted to predict the level of impact caused by earthquakes in real time [2]. one example is an early earthquake warning system (eews), which gives an alert when it detects an earthquake [3]. numerous architectures and algorithms have been developed in those studies.
among this research is a neural network trained with the backpropagation algorithm and optimized using the levenberg-marquardt optimization method (lom) to predict the hypocenter location, moment magnitude, and expansion of an earthquake [4]; a modification of lom to minimize error in an eews [3]; and a neural tree to predict p and s waves [5]. to the knowledge of the authors, no study so far has used the backpropagation (bp) algorithm to predict earthquake magnitude and grid-based location in indonesia. this algorithm is chosen because it has been proven to perform well on broad types of problems, such as regression, pattern recognition, and prediction [6][7][8][9]. this study aims to measure the performance of a neural network trained using the backpropagation algorithm in predicting earthquake magnitude and grid-based location, based on earthquake magnitude and location data recorded in indonesia from 2000 to 2019.

article info

article history: received 30 june 2020; revised 2 july 2020; accepted 15 july 2020; published online 17 august 2020

abstract: earthquakes, a type of inevitable natural disaster, are responsible for the highest average death toll per year compared to other types of natural disaster. even though they are inevitable, their damage and casualties can be anticipated and minimized, for example by predicting the earthquake's magnitude using a neural network. in this study, a backpropagation algorithm is used to train a multilayer neural network to predict, weekly, the average magnitude of earthquakes in grid-based locations in indonesia. based on the findings of this research, the neural network is able to predict the magnitude of earthquakes in grid-based locations across indonesia with a minimum error rate of 0.094 in 34.475 seconds.
this best result is achieved when the neural network is trained for 210 epochs, with 16 neurons each in the input and output layers, one hidden layer of 5 neurons, and a learning rate of 0.1. this result shows that backpropagation has fairly good generalization capability for mapping the relations between variables when a mathematical function is not explicitly available. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: neural network; resilient backpropagation; prediction; magnitude; earthquake

b. priambodo et al. / knowledge engineering and data science 2020, 3 (1): 28–39 29

ii. methods

numerous studies on the application of neural networks to predict incoming natural disasters have been conducted. one of them predicts seismic activity (magnitude and other seismic activity on a large scale, above 5 on the richter scale) [6]. the findings of that research show acceptable performance, in terms of accuracy, for a 3-layer perceptron neural network model trained using backpropagation, which achieved an accuracy of 80.55%. other research addressed tsunami forecasting [10] using multilayer perceptron neural networks and the backpropagation algorithm. that study shows the backpropagation algorithm is much faster than many other conventional models, and produced high accuracy in predicting the height and travel time of a tsunami based on earthquake location and size. neural networks are often hybridized with other techniques or algorithms. one example is a study done in 2006 [5], which used a neural tree to pick up p and s waves faster and more accurately, achieving a precision score of 0.96. numerous studies have demonstrated the performance of neural networks in predicting earthquakes.
neural networks have been applied to early earthquake warning systems (eews), trained using backpropagation with a modified levenberg-marquardt algorithm to minimize the error rate of the eews [3]. the error being minimized was the error of seismic-data amplitude prediction based on the 1999 chi-chi earthquake in taiwan. other studies showed increased reliability and responsiveness of an eews when a neural network is applied [11]. a theoretical study [4] showed that a prediction of the hypocenter location, moment magnitude, and rupture size of an earthquake can be generated almost instantaneously as soon as the p wave is picked up.

a. neural network and backpropagation

a neural network is a network of connected neurons (processing inputs) whose job is to produce an activation value [12]. the network mimics the structure and workings of the human brain, in the hope that it can be an artificial intelligence almost as intelligent as a human: receiving, adapting, and transferring known and new knowledge and skills, as a lifelong learning action [13]. a neural network consists of an input layer, zero or more hidden layers, and an output layer. information on neurons is propagated through each layer, starting from the input layer all the way to the output layer [14][15]. the propagation is done by calculating the weighted sum on each neuron, which is then used as the input value for the activation function; the result of the activation function is then propagated to the neurons in the next layer as inputs [11]. common examples of activation functions are the sigmoid function (1) and the hyperbolic tangent (tanh) function (2) [16].

\sigma(x) = \frac{1}{1 + e^{-x}} \quad (1)

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (2)

backpropagation helps a neural network learn the relationship between variables without explicitly defining the mathematical function for that relationship [17]. a neural network trained using backpropagation is based on gradient descent [14].
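the two activation functions in (1) and (2) can be sketched in plain python as follows (an illustrative sketch, not the authors' code; the standard library also provides `math.tanh` directly):

```python
from math import exp

def sigmoid(x):
    # equation (1): 1 / (1 + e^(-x))
    return 1.0 / (1.0 + exp(-x))

def tanh(x):
    # equation (2): (e^x - e^(-x)) / (e^x + e^(-x))
    return (exp(x) - exp(-x)) / (exp(x) + exp(-x))
```

both functions squash an unbounded weighted sum into a bounded range: sigmoid into (0, 1) and tanh into (-1, 1), which is why inputs are later normalized to match them.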
in the backpropagation algorithm, each neuron in the hidden and output layers processes its own inputs by taking the sum product of its input values and weights, which is then passed through an activation function (most popularly the sigmoid function) [17]. the error is then calculated backward from the output layer to the input layer for the weight-update process; this is what is called backpropagation. the weight-update process usually uses gradient descent to direct the weight changes towards the minimum error [18]. during the weight update, a parameter called the learning rate (η) determines the "step width" of the update, which finally updates the weights in order to improve the neural network's generalization capability. in the backpropagation process, the derivatives of the activation function and of the error function are needed to calculate the weight change. equations (3) and (4) are the derivatives of the sigmoid and tanh functions, respectively.

\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \quad (3)

\tanh'(x) = 1 - \tanh^{2}(x) \quad (4)

the algorithm starts with the initialization phase. in this phase, all of the weights between neurons are randomly initialized (between 0 and 1). the learning rate is also initialized, usually set to 0.1. after the initialization phase, the first data record is used as input to the neural network and feedforward is performed: calculating the weighted sum as input to the activation function in the next layer. this process is repeated until the output layer has produced the predicted value. after the output layer has produced an output, the neural network begins its backpropagation phase. during the backpropagation phase, the partial derivative of the error function with respect to each weight is calculated, as the weights contribute directly to the error.
from the output layer's point of view, the weights from the hidden layer to the output layer contribute to the input of the activation function in the output layer, and therefore to the error. from the hidden layer's point of view, the weights from the input layer to the hidden layer contribute to the input of the activation function in the hidden layer, which finally contributes to the error. therefore, (5) is used to calculate the gradient (partial derivative) of the error with respect to the weights between the hidden and output layers, and (6) for the partial derivative with respect to the weights between the input and hidden layers. here x_i is the ith input, h_j the output of hidden neuron j, v_{ij} and w_{jk} the input-to-hidden and hidden-to-output weights, net the weighted sum entering a neuron, o_k the kth output, and \varphi the activation function.

\frac{\partial E}{\partial w_{jk}} = h_{j}\,\varphi'(net_{k})\,\frac{\partial E}{\partial o_{k}} \quad (5)

\frac{\partial E}{\partial v_{ij}} = x_{i}\,\varphi'(net_{j}) \sum_{k=1}^{m} w_{jk}\,\varphi'(net_{k})\,\frac{\partial E}{\partial o_{k}} \quad (6)

in (5), the partial derivative of the error with respect to a weight between the hidden and output layers consists of three factors: the partial derivative of the input to the output layer (the weighted sum) with respect to the weight, the partial derivative of the activation function in the output layer with respect to its input, and the partial derivative of the error with respect to the output. when each component is broken down, the final result is as in (5): the product of the output of the hidden layer (h_j), the derivative of the activation function in the output layer (for example the sigmoid derivative, \sigma(net_k)(1-\sigma(net_k))), and the derivative of the error function (7). equation (6) is similar to (5), but gives the partial derivative for the weights between the input and hidden layers. when the formula is broken down, its components are the input value from the training data (x_i), the partial derivative of the activation function in the hidden layer (\varphi'(net_j)), and the partial derivative of the error with respect to the output of the hidden layer, which needs a further calculation step: it is the sum, over all m output neurons, of the partial derivative of the error with respect to each output neuron taken through that hidden neuron.
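before continuing, the gradients in (5) and (6) can be sketched as follows, assuming sigmoid activations (so that the derivative can be computed from the stored outputs) and taking the error derivative per output as given; all names are illustrative, not the authors' implementation:

```python
def gradients(x, h, o, dE_do, w):
    """gradients per equations (5) and (6), sigmoid activations assumed.

    x: inputs, h: hidden-layer outputs, o: network outputs,
    dE_do: derivative of the error w.r.t. each output neuron,
    w: hidden->output weights w[j][k].
    """
    # delta_k = phi'(net_k) * dE/do_k; for sigmoid, phi'(net_k) = o_k * (1 - o_k)
    delta = [o[k] * (1 - o[k]) * dE_do[k] for k in range(len(o))]
    # equation (5): dE/dw_jk = h_j * delta_k
    grad_w = [[h[j] * delta[k] for k in range(len(o))] for j in range(len(h))]
    # equation (6): dE/dv_ij = x_i * phi'(net_j) * sum_k w_jk * delta_k
    grad_v = [[x[i] * h[j] * (1 - h[j]) * sum(w[j][k] * delta[k] for k in range(len(o)))
               for j in range(len(h))] for i in range(len(x))]
    return grad_v, grad_w
```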
thus, this last component is the product of the weights between the hidden and output layers with the corresponding output-neuron term from (5). the partial derivatives for the biases are calculated likewise (as in (5) and (6)), but the hidden-layer output term in (5) and the input term in (6) are replaced by 1, because the derivative is taken with respect to the bias. the derivative of the error function used in this study (mae) is shown in (7), where o_k is the predicted value and t_k the target value of output neuron k.

\frac{\partial E}{\partial o_{k}} = \begin{cases} 1, & o_{k} \geq t_{k} \\ -1, & o_{k} < t_{k} \end{cases} \quad (7)

this function is differentiable only when the difference between the predicted and target values is not 0. in the context of (5), only one prediction and one target value are used, because (5) is differentiated with respect to a single output neuron. for programming purposes, to eliminate the possibility of an unhandled case, the derivative is taken as 1 when the prediction is greater than or equal to the target value. after the partial derivative for each weight (biases included) is obtained, the weights are updated following (8): the new weight for the next iteration, w(t+1), is the current weight w(t) reduced by the product of the learning rate (η) and the partial derivative of the error.

w(t+1) = w(t) - \eta\,\frac{\partial E}{\partial w} \quad (8)

this feedforward, backpropagation, and weight-update process is repeated until the maximum number of epochs has been reached, where one epoch is counted when all of the training data has been fed to the neural network.

b. feature extraction of earthquakes

an earthquake is a geological phenomenon that happens because of the shifting of the earth's plates, caused by excessive pressure that the earth's crust cannot handle [19]. this excessive pressure results in an energy release in the form of waves that propagate through the earth's crust, causing shockwaves people can feel [20]. those waves are picked up by seismographs.
there are two types of waves: the p wave (the fastest wave), which is received first by the seismograph, followed by the s wave, which is stronger but slower than the p wave [21]. in a seismograph, these two waves produce data in 3 components: vertical, north-south, and east-west motion [22]. based on data on the p wave, s wave, epicenter, magnitude, and peak ground acceleration (pga), an early warning can be issued to civilians and emergency action can be taken earlier; this is done with a tool called an early earthquake warning system (eews). in this study, the magnitude and location of earthquakes that happened in indonesia from january 1 2000 until december 31 2019 are used as input features. they were chosen based on a previous study [23], which used four features: earthquake number, location (represented by numbers corresponding to grids), magnitude, and hypocenter depth. date and time are not used in that study because a preliminary statistical analysis of the data showed that date and time are not representative enough as features; in fact, they carry too much unrelated information. detailed location data (coordinates) is simplified into grids representing an area that has been divided into 16 parts, so the location feature is an integer ranging from 0 to 15. based on this, the current study also divides the location data into 16 grids, represented by integers from 0 to 15, as seen in figure 1. the area is divided into 16 rectangles, each with an almost equal area, arranged in 2 rows and 8 columns: the latitude range from -10.909° to 5.907° is divided into 2 sectors, and the longitude range from 95.206° to 140.976° is divided into 8 sectors. in another study, the magnitude of past earthquakes is used to predict the earthquake magnitude for the following day [6].
in that study, location is not used as a feature because it was assumed to be known beforehand. therefore, the magnitude and location (represented as grid numbers) of an earthquake are used as input features in this study. magnitude is normalized using minmax normalization to match the characteristics of the sigmoid and tanh functions, giving a more appropriate value range; the result can then be denormalized to get the actual magnitude value predicted by the bp algorithm. in (9), x' denotes the normalized value.

x' = \frac{x - \min(x)}{\max(x) - \min(x)} \quad (9)

fig. 1. grid-numbering as location feature

in (9), the minmax normalization equation is shown. variable x is the magnitude value to be normalized, min(x) is the minimum magnitude over all the data, and max(x) is the maximum magnitude over all the data. based on (9), the denormalization equation can be inferred as in (10).

x = x'\,[\max(x) - \min(x)] + \min(x) \quad (10)

c. evaluation metrics

there are many methods that can be used to evaluate the performance of a neural network. in this study, the mean absolute error (mae) is used as the evaluation metric. this metric calculates the distance between the predicted value and the target value, and it is used based on the studies in [6] and [23]. the equation for mae can be seen in (11).

\text{mae} = \frac{1}{n} \sum_{i=1}^{n} \left| t_{i} - y_{i} \right| \quad (11)

in (11), n represents the feature count, t_i represents the target value of the ith feature, and y_i represents the predicted value of the ith feature.

d. proposed method

the data is queried from the united states geological survey (usgs) website and obtained as a comma-separated values (csv) file. the data collected are earthquake events that happened in indonesia, specifically inside these boundaries: latitude -10.909° to 5.907°, and longitude 95.206° to 140.976°. earthquake events from january 1 2000 up to january 1 2019 are used as training data (36,453 records), and earthquake events from january 2 2019 up to december 31 2019 are used as testing data (2,358 records).
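the feature-engineering pieces described above (the 2 x 8 grid mapping, minmax normalization in (9)-(10), and mae in (11)) can be sketched as follows. this is an illustrative sketch; in particular, the row-major grid numbering is an assumption, since the paper does not spell out the ordering:

```python
LAT_MIN, LAT_MAX = -10.909, 5.907     # 2 rows of grids
LON_MIN, LON_MAX = 95.206, 140.976    # 8 columns of grids

def to_grid(lat, lon):
    # map a coordinate to a grid number 0..15 (2 rows x 8 columns);
    # clamping keeps boundary points inside the grid
    row = min(max(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * 2), 0), 1)
    col = min(max(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * 8), 0), 7)
    return row * 8 + col

def minmax_normalize(x, x_min, x_max):
    # equation (9)
    return (x - x_min) / (x_max - x_min)

def minmax_denormalize(x_norm, x_min, x_max):
    # equation (10), the inverse of (9)
    return x_norm * (x_max - x_min) + x_min

def mae(targets, predictions):
    # equation (11): mean absolute error
    return sum(abs(t - y) for t, y in zip(targets, predictions)) / len(targets)
```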
the data obtained (a total of 38,811 records) come as seismic data with 22 columns/attributes, 4 of which are taken as features in this study. those 4 attributes are detailed in table 1. after the data is obtained, two phases of preprocessing are done. first, an 'id' attribute is added to index each record. then, the 'time' column is changed to 'date', because only the dates are taken as a feature. the 'latitude' and 'longitude' attributes are then mapped into grid numbers (as explained before) and represented as the 'grid' feature. the last feature is taken as-is: 'mag', which represents the magnitude of the event. a snippet of the first 5 records after this first phase of preprocessing can be seen in table 2. after the preliminary phase of preprocessing, the final phase creates a dataset that is ready to be used by the neural network. in this phase, the events are grouped into weekly periods, and for each week the average magnitude is calculated for each grid; this is the 'avg_mag' feature. it is then normalized using minmax normalization. the final feature added is the 'target' feature, which is the 'avg_mag' of the following week.

table 1. data attributes details

no | name | description | data type | value
1 | time | date and time in milliseconds & utc when the event occurred | long integer | [2000-01-02t12:46:58.770z, 2019-12-31t09:50:41.876z]
2 | latitude | decimal degrees latitude | decimal | [-90.0, 90.0]
3 | longitude | decimal degrees longitude | decimal | [-180.0, 180.0]
4 | mag | magnitude of the event | decimal | [-1.0, 10.0]

table 2. preliminary phase of preprocessing on the first 5 records

id | date | grid | mag
1 | 1/2/2000 | 6 | 4.9
2 | 1/2/2000 | 6 | 4.4
3 | 1/3/2000 | 6 | 4.7
4 | 1/4/2000 | 14 | 4.4
5 | 1/4/2000 | 5 | 3.9
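the weekly grouping and per-grid averaging of the final preprocessing phase can be sketched as follows (an illustrative sketch; the record layout follows table 2, and the helper names are not from the paper):

```python
from collections import defaultdict
from datetime import date

def weekly_grid_averages(records, start, n_grids=16):
    """average magnitude per grid per week.

    records: iterable of (event_date, grid, magnitude) tuples,
    start: first day of week 0.
    returns {week_index: [avg_mag per grid]}, with 0.0 for quiet grids.
    """
    sums = defaultdict(lambda: [0.0] * n_grids)
    counts = defaultdict(lambda: [0] * n_grids)
    for d, grid, mag in records:
        week = (d - start).days // 7
        sums[week][grid] += mag
        counts[week][grid] += 1
    return {
        w: [s / c if c else 0.0 for s, c in zip(sums[w], counts[w])]
        for w in sums
    }
```

the per-grid averages can then be fed through minmax normalization to form the 'avg_mag' vectors, with the following week's vector as the 'target'.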
a snippet of the first week's data after preprocessing (earthquake events from january 2 2000 to january 8 2000) can be seen in table 3. the neural network built consists of 16 neurons each in the input and output layers, and one hidden layer. only one hidden layer is used, based on numerous studies that have shown good approximation results on any continuous function with a single hidden layer [17]. each neuron in the input and output layers represents the average magnitude in one grid: the input layer receives the average magnitude of earthquake events in a week, and the output layer predicts the average magnitude for each grid in the following week. each neuron in the hidden and output layers has a bias with a uniform value of 1.

iii. results and discussions

in order to achieve the best result, four components of the neural network configuration are tested: first the best maximum number of epochs allowed, then the learning rate, then the number of neurons needed in the hidden layer, and lastly which activation function to use, sigmoid or tanh. in the maximum-epochs testing, the maximum number of epochs allowed is tested from 50, then increased by 10 until no significant error-rate change occurs. for this test, the neural network uses 5 neurons in the hidden layer, is trained using the sigmoid function with a learning rate of 0.1, and all of the weights (except the biases) are initialized with a uniform value of 0.5. each number of maximum epochs is tested ten times, and the average is taken. the average error rate and training duration are measured to determine how many epochs produce the best result. figure 2 shows the lowest average error rate achieved is 0.093 when the neural network is trained for 210 epochs, while the highest average error rate achieved is 0.094 when it is trained for 160 epochs.
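the network configuration described in this section (16 input, 5 hidden, and 16 output neurons; biases fixed at 1; weights drawn uniformly from [0, 1]; learning rate 0.1; mae sign derivative) can be sketched end to end as follows. this is an illustrative reconstruction under those stated assumptions, not the authors' implementation; for brevity the biases are kept fixed rather than trained:

```python
import random
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

class Network:
    def __init__(self, n_in=16, n_hid=5, n_out=16, lr=0.1, seed=0):
        rng = random.Random(seed)
        # weights initialized uniformly in [0, 1]; biases have uniform value 1
        self.v = [[rng.random() for _ in range(n_hid)] for _ in range(n_in)]
        self.w = [[rng.random() for _ in range(n_out)] for _ in range(n_hid)]
        self.bh = [1.0] * n_hid
        self.bo = [1.0] * n_out
        self.lr = lr

    def forward(self, x):
        # feedforward: weighted sums passed through the sigmoid, layer by layer
        h = [sigmoid(self.bh[j] + sum(x[i] * self.v[i][j] for i in range(len(x))))
             for j in range(len(self.bh))]
        o = [sigmoid(self.bo[k] + sum(h[j] * self.w[j][k] for j in range(len(h))))
             for k in range(len(self.bo))]
        return h, o

    def train(self, data, epochs):
        # data: list of (input_vector, target_vector) pairs; one epoch = one pass
        for _ in range(epochs):
            for x, t in data:
                h, o = self.forward(x)
                # output delta with the mae sign derivative, equation (7)
                delta = [o[k] * (1 - o[k]) * (1.0 if o[k] >= t[k] else -1.0)
                         for k in range(len(o))]
                # error propagated back to the hidden layer, as in equation (6)
                back = [sum(self.w[j][k] * delta[k] for k in range(len(o)))
                        for j in range(len(h))]
                # weight updates, equation (8)
                for j in range(len(h)):
                    for k in range(len(o)):
                        self.w[j][k] -= self.lr * h[j] * delta[k]
                for i in range(len(x)):
                    for j in range(len(h)):
                        self.v[i][j] -= self.lr * x[i] * h[j] * (1 - h[j]) * back[j]
        return self
```

a short run on a single normalized week vector shows the error shrinking as the weights adapt, mirroring the epoch-by-epoch improvement discussed below.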
table 3. final phase of preprocessing on the first week's data

grid | avg_mag | target
1 | 0.699 | 0.000
2 | 0.000 | 0.000
3 | 0.000 | 0.000
4 | 0.000 | 0.000
5 | 0.582 | 0.000
6 | 0.627 | 0.594
7 | 0.658 | 0.603
8 | 0.000 | 0.637
9 | 0.000 | 0.000
10 | 0.767 | 0.616
11 | 0.676 | 0.621
12 | 0.651 | 0.000
13 | 0.000 | 0.000
14 | 0.589 | 0.548
15 | 0.562 | 0.644
16 | 0.616 | 0.000

figure 3 shows that training for 50 epochs took only 12.93 seconds, while training for 500 epochs took 134.49 seconds. another parameter that may contribute to the generalization capability of the neural network is the learning rate: if it is too big, the optimum may be missed; if it is too small, the learning process is too slow. in the learning-rate testing, the learning rate is tested from 0.1, then increased by 0.1 until no significant error-rate change occurs. for this test, the neural network is trained using the sigmoid function for 210 epochs (the best result from the previous test) with 5 neurons in the hidden layer, and all of the weights (except the biases) are initialized with a uniform value of 0.5. each learning rate is tested ten times, and the average is taken. figure 4 shows the lowest average error rate achieved is 0.093 when the
For this test, the neural network is trained with the sigmoid function for 210 epochs (the best result from the previous test) with a learning rate of 0.1, and all weights (except the biases) initialized with the uniform value 0.5. Each neuron count is tested ten times and the results are averaged. The average error rate and duration are measured to determine how many neurons in the hidden layer produce the best result. Figure 5 shows that the lowest average error rate, 0.093, is achieved with 5 neurons in the hidden layer, while the highest, 0.095, occurs with 23 neurons. The average training duration, shown in Figure 6, follows the same pattern as the previous test: as the number of neurons increases, the training duration also increases. The best training time is 57.23 seconds with only 5 neurons, and the worst is 75.06 seconds with 23 neurons.

Fig. 2. The average testing error rate for each maximum-epochs value
Fig. 3. Average training duration for each maximum-epochs value
Fig. 4. The average testing error rate for each learning-rate value
Fig. 5. Average testing accuracy for each neuron count in the hidden layer
Fig. 6. Average training duration for each neuron count in the hidden layer

The activation-function testing compares the sigmoid function and the tanh function. For this test, the neural network is trained for 210 epochs with a learning rate of 0.1 and 5 neurons in the hidden layer (the best results from the previous tests), but all weights (except the biases) are initialized with random values.
Each configuration is tested ten times and the results are averaged. The sigmoid function produces an average error rate of 0.094, which is better than the tanh function's 0.881. The sigmoid function is also slightly better in training duration, taking 60.81 seconds compared to 60.91 seconds for tanh. Thus, the sigmoid function is used as the activation function. Based on the testing results, the neural network achieves its best performance, an error rate of 0.094 in 60.81 seconds, when trained with the sigmoid activation function for 210 epochs using 5 hidden neurons and a learning rate of 0.1. As shown in Figure 2, there is an early tendency for the average accuracy to improve as the maximum number of epochs increases. This is also visible in Figure 7, which shows the error rate of the neural network on the training data at each epoch: as the network learns the data repeatedly, its error rate improves. This indicates that the backpropagation algorithm adapts to the nature of the data more accurately as it receives more training [24]. This is where the weight-adaptation process in backpropagation takes part: through it, the network finally produces weights that are well suited to the training data and reaches better generalization capability when tested against new data in the testing phase. As the learning rate increases, the performance of the neural network worsens; this may happen because the step width becomes so large that the local minimum is missed. It is also shown that increasing the number of neurons in the hidden layer does not improve performance.
This may be the result of overfitting, a condition in which the generalization capability of the neural network weakens and good results are achieved only on the training data. Consequently, when the network is tested on the testing data, the error rate is higher. The duration needed for the neural network to predict earthquake magnitudes (for events in 2019) is the total duration from the training phase through the testing phase, plus the prediction duration. The average prediction duration is 0.005 seconds, so the overall duration is 60.81 seconds (average training duration) plus 0.005 seconds: 60.815 seconds in total. This is the total time needed to predict 51 rows of earthquake magnitudes based on training on 988 rows of data with 16 features each. Table 4 shows a snippet of the prediction-versus-target comparison for the first two weeks. In the prediction results, 12.99% of the predictions (106 events) miss the target by between 1 and 5.2, and the rest (about 87.01%, 710 events) miss the target by less than 1, with a minimum recorded error of 2.56 × 10⁻⁴.

Fig. 7. Error rate during training at each epoch

IV. Conclusion

There are two key findings in this study. First, to build input features based on the magnitude and location of earthquake events, detailed location information (latitude and longitude) needs to be mapped into grids, and the magnitudes then averaged weekly for each grid number. These weekly per-grid average magnitudes form the input features.
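The feature construction summarized in the first key finding can be sketched as below. This is an illustrative sketch only: the 4×4 grid layout and the bounding-box coordinates are assumptions for demonstration, not the paper's actual study-area coordinates.

```python
import numpy as np

# Hypothetical bounding box of the study area (NOT the paper's coordinates).
LAT_MIN, LAT_MAX = -11.0, 6.0
LON_MIN, LON_MAX = 95.0, 141.0

def grid_number(lat, lon, rows=4, cols=4):
    """Map an epicenter to a grid cell numbered 1..16."""
    r = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * rows), rows - 1)
    c = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * cols), cols - 1)
    return r * cols + c + 1

def weekly_features(events, n_grids=16):
    """Average magnitude per grid for one week's events: 16 input features."""
    sums = np.zeros(n_grids)
    counts = np.zeros(n_grids)
    for lat, lon, mag in events:
        g = grid_number(lat, lon) - 1
        sums[g] += mag
        counts[g] += 1
    # Grids with no events that week get 0, as in Table 3.
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

week = [(-6.2, 106.8, 4.5), (-6.3, 106.9, 5.1), (3.5, 98.6, 4.2)]
feats = weekly_features(week)
print(len(feats))  # 16
```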
Second, the lowest error rate achieved by the backpropagation algorithm (trained for 210 epochs with the sigmoid activation function, 5 neurons in the hidden layer, and a learning rate of 0.1) in predicting the magnitude of earthquake events in the following week is 0.094, with 60.815 seconds needed for the neural network to learn from 988 rows of data and predict 51 rows of data. Based on these key findings, further study is recommended on the importance and impact of other input features, on the configuration of the neural network, and on hybridization with other algorithms, which could further improve the performance in predicting earthquakes more accurately and within a small time window, so as to increase preparedness.

Table 4. Comparison of prediction and target produced by the resilient backpropagation algorithm

Date       Grid  Prediction  Target
1/9/2019    1    4.524       4.767
1/9/2019    2    0.001       0.000
1/9/2019    3    0.001       0.000
1/9/2019    4    0.001       0.000
1/9/2019    5    4.398       4.200
1/9/2019    6    4.571       4.467
1/9/2019    7    0.004       4.400
1/9/2019    8    0.005       0.000
1/9/2019    9    0.002       0.000
1/9/2019   10    4.510       5.000
1/9/2019   11    4.285       4.300
1/9/2019   12    4.317       0.000
1/9/2019   13    4.482       4.540
1/9/2019   14    4.350       4.488
1/9/2019   15    4.308       4.467
1/9/2019   16    4.515       0.000
1/16/2019   1    4.519       4.300
1/16/2019   2    0.001       4.300
1/16/2019   3    0.001       0.000
1/16/2019   4    0.001       0.000
1/16/2019   5    4.386       4.433
1/16/2019   6    4.568       4.688
1/16/2019   7    0.003       0.000
1/16/2019   8    0.004       0.000
1/16/2019   9    0.001       0.000
1/16/2019  10    4.542       4.350
1/16/2019  11    4.302       4.300
1/16/2019  12    4.410       4.200
1/16/2019  13    4.499       4.975
1/16/2019  14    4.364       4.443
1/16/2019  15    4.316       4.450
1/16/2019  16    4.475       0.000

Declarations

Author contribution
All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest
The authors declare no conflict of interest.

Additional information
No additional information is available for this paper.

References

[1] D. Guha-Sapir and F. Vos, "Earthquakes, an epidemiological perspective on patterns and trends," Advances in Natural and Technological Hazards Research, pp. 13–24, Dec. 2010, doi: 10.1007/978-90-481-9455-1_2.
[2] Centre for Research on the Epidemiology of Disasters (CRED), "Press release: embargo 11.00 CET, January 24," 2019.
[3] J. W. Lin, C. T. Chao, and J. S. Chiou, "Backpropagation neural network as earthquake early warning tool using a new modified elementary Levenberg-Marquardt algorithm to minimise backpropagation errors," Geosci. Instrum. Methods Data Syst., vol. 7, no. 3, pp. 235–243, 2018, doi: 10.5194/gi-7-235-2018.
[4] M. Böse, F. Wenzel, and M. Erdik, "PreSEIS: a neural network-based approach to earthquake early warning for finite faults," Bull. Seismol. Soc. Am., vol. 98, no. 1, pp. 366–382, 2008, doi: 10.1785/0120070002.
[5] S. Gentili and A. Michelini, "Automatic picking of P and S phases using a neural tree," J. Seismol., vol. 10, no. 1, pp. 39–63, 2006, doi: 10.1007/s10950-006-2296-6.
[6] M. Moustra, M. Avraamides, and C. Christodoulou, "Artificial neural networks for earthquake prediction using time series magnitude data or seismic electric signals," Expert Syst. Appl., vol. 38, no. 12, pp. 15032–15039, Nov. 2011, doi: 10.1016/j.eswa.2011.05.043.
[7] N. R. Sari, W. F. Mahmudy, and A. P. Wibawa, "Backpropagation on neural network method for inflation rate forecasting in Indonesia," Int. J. Adv. Soft Comput. Its Appl., vol. 8, no. 3, 2016.
[8] F. A. Huda, W. F. Mahmudy, and H. Tolle, "Android malware detection using backpropagation neural network," Indones. J. Electr. Eng. Comput. Sci., vol. 4, no. 1, 2016, doi: 10.11591/ijeecs.v4.i1.pp240-244.
[9] H. Aini and H. Haviluddin, "Crude palm oil prediction based on backpropagation neural network approach," Knowl. Eng. Data Sci., vol. 2, no. 1, pp. 1–9, 2019, doi: 10.17977/um018v2i12019p1-9.
[10] M. Romano et al., "Artificial neural network for tsunami forecasting," J. Asian Earth Sci., vol. 36, no. 1, pp. 29–37, 2009, doi: 10.1016/j.jseaes.2008.11.003.
[11] C. J. Lin, Z. Shen, and S. Huang, "Predicting structural response with on-site earthquake early warning system using neural networks," no. 226, 2011.
[12] J. Schmidhuber, "Deep learning in neural networks: an overview," Neural Networks, vol. 61, pp. 85–117, 2015, doi: 10.1016/j.neunet.2014.09.003.
[13] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: a review," Neural Networks, vol. 113, pp. 54–71, May 2019, doi: 10.1016/j.neunet.2019.01.012.
[14] G. T. Hicham, E. A. Chaker, and E. Lotfi, "Comparative study of neural networks algorithms for cloud computing CPU scheduling," Int. J. Electr. Comput. Eng., vol. 7, no. 6, pp. 3570–3577, 2017, doi: 10.11591/ijece.v7i6.pp3570-3577.
[15] C. Dewi, S. Sundari, and M. Mardji, "Texture feature on determining quantity of soil organic matter for patchouli plant using backpropagation neural network," J. Inf. Technol. Comput. Sci., vol. 4, no. 1, pp. 1–14, 2019, doi: 10.25126/jitecs.20194168.
[16] K. Chandrasekaran and S. P. Simon, "Binary/real coded particle swarm optimization for unit commitment problem," in Int. Conf. on Power, Signals, Controls and Computation, Jan. 2012, no. 3, pp. 1–6, doi: 10.1109/epscicon.2012.6175240.
[17] A. T. C. Goh, "Back-propagation neural networks for modeling complex systems," Artif. Intell. Eng., vol. 9, no. 3, pp. 143–151, Jan. 1995, doi: 10.1016/0954-1810(94)00011-S.
[18] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: the RPROP algorithm," in IEEE Int. Conf. on Neural Networks, 1993, pp. 586–591, doi: 10.1109/icnn.1993.298623.
[19] K. Mogi, "Earthquake prediction in Japan," J. Phys. Earth, vol. 43, no. 5, pp. 533–561, 1995, doi: 10.4294/jpe1952.43.533.
[20] U.S. Geological Survey, "What is an earthquake and what causes them to happen?," U.S. Department of the Interior, 2019.
[21] Incorporated Research Institutions for Seismology (IRIS), "Seismic wave behavior — effect on buildings."
[22] Incorporated Research Institutions for Seismology (IRIS), "3-component seismograph," 2017.
[23] A. S. N. Alarifi, N. S. N. Alarifi, and S. Al-Humidan, "Earthquakes magnitude predication using artificial neural network in northern Red Sea area," J. King Saud Univ. Sci., vol. 24, no. 4, pp. 301–313, Oct. 2012, doi: 10.1016/j.jksus.2011.05.002.
[24] I. Wahyuni, N. R. Adam, W. F. Mahmudy, and A. Iriany, "Modeling backpropagation neural network for rainfall prediction in Tengger, East Java," in Proc. 2017 Int. Conf. on Sustainable Information Engineering and Technology (SIET), 2018, doi: 10.1109/siet.2017.8304130.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637
Vol 3, No 2, December 2020, pp. 77–88, https://doi.org/10.17977/um018v3i22020p77-88
©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Convolutional Neural Network on Tanned and Synthetic Leather Textures

Faadihilah Ahnaf Faiz 1, Ahmad Azhari 2, *
Department of Informatics, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
1 fadhfaiz98@gmail.com; 2 ahmad.azhari@tif.uad.ac.id *
* corresponding author

I. Introduction

Leather can be processed into various leather goods and handicraft products such as bags, shoes, wallets, belts, and keychains. Before it becomes a leather product, the raw material has to undergo a process called tanning. Leather tanning is an important step carried out on a raw animal skin or hide [1]. Tanning processes are divided into two methods: the first is a chemical process called chrome tanning [2]; the other is a natural process called vegetable tanning [3].
Animal skins or hides have to be tanned to make the leather pliable and soft to use, and to prevent it from defect or rot. The vegetable-tanning process also makes it possible to color the leather in any color. Tanned leather is certainly not cheap; the price depends on the tanning method, and tanning is done by people who are experts in the process [4]. The high price of tanned leather is one reason for the production of synthetic or imitation leather. Synthetic leather is normally made of polyurethane (PU) or polyvinyl chloride (PVC), at a lower price than leather tanned from animal skin [5]. Another reason is that the texture of synthetic leather looks similar to the result of the animal-leather tanning process. As imitation-leather technology develops, it has become quite difficult to distinguish real animal skin from imitation leather by sight alone. The situation differs while the two materials are still in sheet form: there is some possibility of judging their authenticity, but once they have become goods or products it is difficult to distinguish which are made of genuine animal leather and which of imitation leather. One visible characteristic of skin is its surface texture. Genuine tanned leather surfaces tend to have natural micro-textures and random patterns, so they hold a lot of information that can be analyzed [6]. Real animal skins also have varying textures and colors after going through the tanning process [7]. Meanwhile, synthetic leather generally tends to have a repetitive surface texture pattern.

Article history: received 29 August 2020, revised 02 December 2020, accepted 22 December 2020, published online 31 December 2020.

Abstract: Tanned leather is the output of a complex process called tanning. Leather tanning is an important step used to protect the fiber or protein structure of an animal's skin. Another purpose of the tanning process is to prevent the animal's skin from any defect or rot. After tanning is complete, the leather can be used to produce a wide variety of leather products; leather prices are therefore usually higher because the process takes a long time. A cheaper alternative is non-animal leather, usually known as synthetic or imitation leather. The purpose of this paper is to classify tanned leather and synthetic leather using a convolutional neural network (CNN). The tanned leathers consist of cow, goat, and sheep leathers, so the proposed method classifies four classes: cow, goat, sheep, and synthetic. This research uses 1280 training images of 448×448 pixels as input. With the CNN method, this research shows a good accuracy of about 92.1%.

Keywords: tanned leather, synthetic leather, classification, deep learning, convolutional neural network.

In previous research on leather texture classification [8], a backpropagation neural network (BPNN) was used to classify animal skin textures and synthetic (fake leather) materials with two segmentation models. That study showed the highest performance for animal skin, while the synthetic materials showed the poorest rate. Other research [9] classified leather surface defects using morphological processing and thresholding to improve segmentation performance.
Based on the previous literature and research, the purpose of this research is to see how well the convolutional neural network (CNN) method can classify and find similarities in skin surfaces or textures, especially between genuine (animal) leather and synthetic leather.

II. Methods

A. Research Pseudocode

This research method consists of four main sections. The first is collecting the dataset, known as image acquisition: the leather images are taken with a smartphone camera and divided into four categories or classes (cow, goat, sheep, and synthetic leather). In the next section, the raw images from the acquisition are preprocessed by resizing them to 448×448 pixels and converting them from RGB into grayscale and Gaussian images, to be used as input data for the third section. The third section is the core of the research: the convolutional neural network architecture, built in two parts, a feature-extraction stage and a fully-connected layer. The last section trains on the dataset and classifies the results. After the CNN produces results and successfully classifies the leather types, a confusion matrix is used to evaluate the method. The main flow of this research can be seen in the following pseudocode.
1) Variables

    var class_name, train_images_x, train_images_y
    var train_labels
    var images_size
    var extend_train_images, extend_train_labels
    def preprocess_grayscale(), def preprocess_gaussian()

2) Algorithm

    begin
      % image acquisition
      load images from drive, then extract label from file_name
      % image preprocessing
      if file_name == "sheep" then return 0
      else if file_name == "synthetic" then return 1
      else if file_name == "goat" then return 2
      else return 3
      endif
      preprocess_grayscale(images_size, 448)
      preprocess_gaussian(images_size, 448)
      % image segmentation
      for x in preprocess_grayscale()
        write to train_images_x
      endfor
      for x in preprocess_gaussian()
        write to train_images_y
      endfor
      % extend the training data
      extend_train_images ← train_images_x
      extend_train_images ← train_images_y
      % training process
      samples: 1280 images, 80% for training and 20% for validation
      training and validation data: extend_train_images
      training and validation labels: extend_train_labels
      % build the convolutional neural network architecture
      1280 images of 448x448 pixels as input of the CNN
      set 6 convolutional layers and 6 max-pooling layers
      set ReLU as the activation function of each convolutional layer
      1 output layer with 4 classes (neurons)
      set softmax as the activation function of the output layer
      % set the model
      learning_rate ← 0.0001
      number of epochs ← 259
      batch_size ← 64
      % perform training, then classify and identify the results
      evaluate the test data (32 different images per class) with a confusion matrix
    end

B. Image Acquisition and Preprocessing

The image acquisition in this research used a smartphone camera to capture the leather images at a distance of about 10–20 centimeters between the object and the camera lens, with an external flash lamp added to help increase the sharpness of the leather images.
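The preprocess_grayscale and preprocess_gaussian steps named in the pseudocode above can be sketched with plain NumPy. This is a hedged illustration under assumptions: the paper does not give its exact threshold or smoothing parameters, so the threshold value 128 and the Gaussian kernel settings below are illustrative, and the "Gaussian" variant is approximated here as Gaussian smoothing followed by a binary threshold.

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance conversion of an RGB image (H, W, 3) -> (H, W)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def threshold(gray, t=128):
    """Binary (grayscale) threshold; t=128 is an illustrative value."""
    return (gray > t).astype(np.uint8) * 255

def gaussian_threshold(gray, t=128, sigma=1.0, radius=2):
    """Gaussian-smoothed variant: separable blur first, then threshold."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, "same"), 1, gray)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, "same"), 0, blurred)
    return (blurred > t).astype(np.uint8) * 255

# Random stand-in for one 448x448 leather photo.
img = np.random.default_rng(0).integers(0, 256, (448, 448, 3)).astype(float)
gray = to_grayscale(img)
print(gray.shape, threshold(gray).shape, gaussian_threshold(gray).shape)
```

Applying both variants to each of the 640 photos is what doubles the dataset to 1280 training images.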
The general purpose of image acquisition is to transform an optical image into an array of numerical data that a computer can manipulate [10]. After the dataset is captured, the leather images are preprocessed by resizing them to 448×448 pixels in order to simplify the computation in the CNN architecture. This preprocessing step also shows how much the dataset affects the accuracy results [11]. This research collected about 640 primary images, 160 per class, and set a label on each image. Some example leather images from the dataset are shown in Figure 1, where L1 are images of cow leathers, L2 of goat leathers, L3 of sheep leathers, and L4 of synthetic leather.

Fig. 1. Leather images dataset

C. Image Segmentation

Image segmentation in this research is used to change the RGB images into grayscale images. It can also be used as a technique to increase the training data, which aims to bring out additional information contained in the digital images. The main aim of segmentation is to recognize the objects in an image [12]. This research uses two segmentation methods, grayscale thresholding and Gaussian thresholding, both of which are used as input data for the CNN models. By using both segmentation methods, the dataset is increased to 1280 leather images: about 640 images with a grayscale threshold and 640 with a Gaussian threshold as training data. The dataset totals can be seen in Table 1, and examples of the segmentation results are shown in Figure 2.

D. Convolutional Neural Network

A convolutional neural network, also known as ConvNet or CNN, is a class of deep learning classification methods. It uses the architecture of a multi-layer perceptron (MLP) approach [13].
A CNN uses the matrix of pixels from a digital image as the main object of its input operations [14]. The training data in this method is a labeled dataset, i.e., supervised learning. CNNs are widely used in computer vision technology [15]. In addition to digital image processing, CNNs can also be applied to text datasets (natural language processing) [16][17][18] and video [19]. A convolutional neural network architecture has several layers: an input layer; a feature-extraction stage consisting of convolutional layers, each followed by an activation function and in some cases by a subsampling or pooling layer [20]; and, as the last layer, a fully-connected layer, in which the last feature-map matrix is flattened into a vector [21] and the classification results are produced. CNNs show good results in classifying cases in sectors such as health [13][14][22][23], social applications [24][25][26][27], and research [20][28].

E. Input Layer

The first layer in a convolutional neural network architecture is the input layer, which represents the input images to the CNN. In this research the input data are grayscale images, which have only one channel with values between 0 and 255. This differs from RGB input images, which have three channels representing red, green, and blue.

Fig. 2. Sample of original goat leather (left) and after segmentation: grayscale threshold image (center) and Gaussian threshold image (right)

Table 1. Image datasets

Leather types       Grayscale threshold  Gaussian threshold  Total
Cow leather         160                  160                 320
Goat leather        160                  160                 320
Sheep leather       160                  160                 320
Synthetic leather   160                  160                 320
Total               640                  640                 1280

F. Convolutional Layer

The convolutional layer is the core building block and computation of a convolutional neural network architecture [29].
The first convolutional layer extracts generic motifs as features from the input layer, treated as a matrix [30]. This layer performs an operation called convolution, which uses a small matrix known as a kernel. The kernel, or kernel matrix, has a size of width × height, with the kernel height equal to the kernel width. Every convolutional operation produces an output called a feature map or activation map, which is typically passed through an activation function; the activation function enables the learning of nonlinear decision boundaries [31]. This research applies the ReLU activation function after each convolutional layer, since this activation function speeds up training and improves model performance [32]. The convolutional operation is illustrated in Figure 3. As an example, take an input matrix of 5×5 pixels, with hyperparameters of padding, a 3×3 kernel, and a single stride (stride = 1). The stride is the parameter that governs the movement of the kernel over the matrix. The operation yields an output matrix of size 5×5; the padding is computed with equation (1) and the output size with equation (2). The size is unchanged because padding, also known as zero padding, adds zero values on each side of the matrix [33]. Figure 4 illustrates the kernel movements over the matrix of pixels.

Padding = (Kernel − 1) / 2 (1)

Output = (Input − Kernel + 2 × Padding) / Stride + 1 (2)

Fig. 3. Convolutional operation

Fig. 4. Kernel movements on a convolutional matrix with padding

G. Pooling or Subsampling Layer
Pooling operations are applied [23] to reduce the dimensions [34], i.e. the number of neuron parameters in the feature map produced by the convolutional layer [24]. This layer can also reduce overfitting.
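Equations (1) and (2) can be checked directly in code. The sketch below reproduces the paper's example of a 5×5 input with a 3×3 kernel and stride 1, where zero padding keeps the output at 5×5.

```python
# Equations (1) and (2) as code, using the 5x5-input / 3x3-kernel example.

def same_padding(kernel):
    # Equation (1): Padding = (Kernel - 1) / 2, for an odd kernel size.
    return (kernel - 1) // 2

def conv_output(inp, kernel, padding, stride):
    # Equation (2): Output = (Input - Kernel + 2*Padding) / Stride + 1
    return (inp - kernel + 2 * padding) // stride + 1

p = same_padding(3)
print(p)                        # 1
print(conv_output(5, 3, p, 1))  # 5  (zero padding preserves the 5x5 size)
print(conv_output(5, 3, 0, 1))  # 3  (without padding the map shrinks)
```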
Although it reduces the dimensions, pooling does not discard the information stored in the convolutional feature map. CNN architectures use two common types of pooling layer: max pooling and average pooling [16][35]. Max pooling takes the maximum value in each kernel window of the feature map, while average pooling takes the average value. This research uses max pooling; equation (3) gives its formula, where b is a bias value added to each element Cn of the feature maps.

Max Pooling = { max(C1 + b1), max(C2 + b2), …, max(Cn + bn) } (3)

Figure 5 illustrates the max pooling operation, where the input matrix comes from the output of a convolutional layer (a feature map) of size 5×5. The parameters in this pooling example are a 2×2 kernel and a single stride (stride = 1), which gives an output of size 4×4 pixels. The result of a pooling layer is used as the input matrix of the next convolution operation, or it is flattened into a vector if it is the last pooling layer in the feature-extraction stage.

H. Fully-Connected Layer
In this layer, all the final neurons from the feature-extraction layers are flattened with a flatten function. The fully-connected layer is used at the end of the CNN architecture, after the feature-extraction layers [36]. Flattening converts the three-dimensional output of the previous stage into a one-dimensional vector; for example, a 6×6×192 feature-map output is converted into a vector of size 6912. A dropout function may be applied to reduce overfitting and improve training performance by randomly disabling neurons in each layer [37]; it is best used between the fully-connected layers and after the last pooling layer. To classify the final image categories, a softmax activation function is used; this activation is commonly used to classify multiclass categorical data [38].
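The max pooling operation of equation (3) and Figure 5 above can be sketched in a few lines of pure Python, assuming a zero bias (b = 0). The feature-map values are illustrative; the point is that a 2×2 kernel with stride 1 maps a 5×5 input to a 4×4 output.

```python
# Max pooling (equation (3) with b = 0): take the maximum over each
# kernel window of the feature map.

def max_pool(fmap, k=2, stride=1):
    h = (len(fmap) - k) // stride + 1
    w = (len(fmap[0]) - k) // stride + 1
    return [[max(fmap[y * stride + dy][x * stride + dx]
                 for dy in range(k) for dx in range(k))
             for x in range(w)]
            for y in range(h)]

fmap = [[1, 3, 2, 0, 1],      # illustrative 5x5 feature map
        [4, 6, 5, 1, 2],
        [7, 8, 9, 2, 3],
        [1, 0, 2, 4, 5],
        [3, 2, 1, 6, 7]]
out = max_pool(fmap)
print(len(out), len(out[0]))  # 4 4  (5x5 input -> 4x4 output)
print(out[0])                 # [6, 6, 5, 2]
```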
The final result of this research is a classification into four categories of leather types.

III. Results and Discussions

A. Training, Validation and Testing Data Split
The dataset in this research is split into training, validation, and testing data; the splitting ratios are given in Table 2. The 1280 images described in the image segmentation section above are used as training and validation data, with an automatic split of around 80% training data and 20% validation data. The testing data consist of 32 new images per category, kept separate from the 1280 training images.

Fig. 5. Max pooling operation with single stride

B. Build a Convolutional Neural Network Architecture
This research uses a sequential CNN architecture that was built and modified by trial and error until a high accuracy result was obtained. Figure 6 shows the feature-extraction layers of the architecture and Figure 7 the fully-connected layers. The architecture consists of an input layer, six convolutional layers with ReLU as the activation function, six max pooling layers, a fully-connected stage with a softmax activation function, and an output layer with four class categories. Table 3 summarizes both Figure 6 and Figure 7, from the input size, filters, and activation function to the output size of each layer. In total, this model architecture has 1,566,724 parameters.

Table 2. Splitting dataset

Image type          Training data  Validation data  Testing data  Total
Cow leather         263            57               32            352
Goat leather        254            66               32            352
Sheep leather       249            71               32            352
Synthetic leather   258            62               32            352
Total               1024           256              128           1408

Fig. 6. Architecture of input layer and feature-extraction layers

Fig. 7. Architecture of the fully-connected layers

C. Training the CNN Architecture Model
After building and defining the architecture of the convolutional neural network, the next step is to define the model compilation: the optimizer, the loss function, and the accuracy metric. This research uses the Adam algorithm as the optimizer, because this method combines two popular optimizers, AdaGrad and RMSProp [39]. Sparse categorical crossentropy is used as the loss function and sparse categorical accuracy as the accuracy metric, because the data are converted into categorical form. Furthermore, to fit the model, the batch size and the number of epochs must be set. Table 4 summarizes the model compilation and model fit settings used before training starts. After training completed, the validation accuracy was around 92.1% and the validation loss around 0.82, as shown in Figure 8; in these plots, the orange lines denote the validation data and the green lines the training data.

D. Evaluate the Model
This research uses a confusion matrix as the performance measurement; it can evaluate classification problems whose output has two or more categories. Figure 9 shows the final measurement from the confusion matrix, with the validation data as the evaluation data. From Figure 9, the confusion matrix shows that the sheep, goat, and synthetic leather in the validation data have a high match between the actual and the predicted values. For the data testing, this research added 32 unlabeled images per category; as an example, Figure 10 shows the testing of a folder containing 32 unlabeled images of goat leather.
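The loss and metric named in the compile settings above can be illustrated in pure Python. The softmax outputs below are hypothetical, and a real model would use the framework's own implementations rather than these sketches; the sketch only shows what the two quantities compute when labels are integer class ids (0..3 for the four leather classes).

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability of the true class, where y_true
    holds integer class ids rather than one-hot vectors."""
    return -sum(math.log(p[t]) for t, p in zip(y_true, y_prob)) / len(y_true)

def sparse_categorical_accuracy(y_true, y_prob):
    """Fraction of samples whose argmax prediction matches the label."""
    hits = sum(1 for t, p in zip(y_true, y_prob) if p.index(max(p)) == t)
    return hits / len(y_true)

# Hypothetical softmax outputs for two samples over four classes.
y_true = [2, 0]
y_prob = [[0.1, 0.1, 0.7, 0.1],
          [0.6, 0.2, 0.1, 0.1]]
print(sparse_categorical_accuracy(y_true, y_prob))  # 1.0
```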
Table 3 presents the CNN architecture summary:

Table 3. Summary of the CNN architecture

Layer  Type                     Filters  Kernel  Output shape     Parameters
0      Input                    1        3×3     (448, 448, 1)    0
1      Convolutional (Conv2D)   64       3×3     (448, 448, 64)   640
2      Pooling (MaxPooling2D)   64       3×3     (223, 223, 64)   0
3      Convolutional (Conv2D)   64       3×3     (223, 223, 64)   36,928
4      Pooling (MaxPooling2D)   64       3×3     (111, 111, 64)   0
5      Convolutional (Conv2D)   96       3×3     (111, 111, 96)   55,392
6      Pooling (MaxPooling2D)   96       3×3     (55, 55, 96)     0
7      Convolutional (Conv2D)   128      3×3     (55, 55, 128)    110,720
8      Pooling (MaxPooling2D)   128      3×3     (27, 27, 128)    0
9      Convolutional (Conv2D)   160      3×3     (27, 27, 160)    184,480
10     Pooling (MaxPooling2D)   160      3×3     (13, 13, 160)    0
11     Convolutional (Conv2D)   192      3×3     (13, 13, 192)    276,672
12     Pooling (MaxPooling2D)   192      3×3     (6, 6, 192)      0
13     Flatten                  —        —       6912             0
14     Dense                    —        —       128              884,864
15     Dropout                  —        —       128              0
16     Dense                    —        —       128              16,512
17     Output                   —        —       4                516
Total                                                             1,566,724

Table 4. Summary of model compile and model fit

Keras attribute  Argument         Value
Model compile    Optimizer        Adam
                 Loss function    Sparse categorical crossentropy
                 Accuracy metric  Sparse categorical accuracy
                 Learning rate    0.0001
Model fit        Epochs           259
                 Batch size       64

Fig. 8. Validation result: (a) accuracy and (b) loss

Fig. 9. Confusion matrix results from validation

The testing result shows that 29 images are correctly detected as goat leather, two images are detected as sheep leather, and one image as cow leather. Table 5 details the evaluation of the data testing with the confusion matrix. As can be seen from Table 5, the accuracy on the data testing is calculated by summing the correctly predicted labels and dividing by the total number of testing samples. The correctly classified images lie on the diagonal of the confusion matrix, from the upper left to the lower right, and the resulting accuracy on the data testing is about 89.0%.
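The reported figures can be re-derived by hand. The sketch below recomputes the Conv2D and Dense parameter counts of Table 3 (kernel weights plus one bias per output unit) and the data-testing accuracy from the diagonal of the Table 5 confusion matrix, which gives 115/128 ≈ 89.8%, in line with the reported value of about 89%.

```python
# Sanity check of the reported figures in Tables 3 and 5.

def conv2d_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out   # weights + one bias per filter

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out           # weights + one bias per unit

conv = [conv2d_params(3, c_in, c_out)
        for c_in, c_out in [(1, 64), (64, 64), (64, 96),
                            (96, 128), (128, 160), (160, 192)]]
total = (sum(conv) + dense_params(6912, 128)
         + dense_params(128, 128) + dense_params(128, 4))
print(total)  # 1566724, matching the Table 3 total

# Table 5 confusion matrix (rows: true sheep / synthetic / goat / cow).
cm = [[31, 0, 0, 1],
      [0, 25, 3, 4],
      [2, 0, 29, 1],
      [2, 0, 0, 30]]
correct = sum(cm[i][i] for i in range(4))
print(correct, round(100 * correct / 128, 1))  # 115 89.8
```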
The new images from the data testing are then uploaded and classified with the weights from the data training, which were saved as an h5 model. Figure 11 compares the accuracy of the training, validation, and testing results.

Table 5. Confusion matrix result for data testing

True label \ Predicted  Sheep  Synthetic  Goat  Cow  Total
Sheep                   31     0          0     1    32
Synthetic               0      25         3     4    32
Goat                    2      0          29    1    32
Cow                     2      0          0     30   32
Total                                                128

Fig. 10. Sample testing of about 32 unlabeled images of goat leather

Fig. 11. Visualization of the accuracy results for training, validation, and testing

IV. Conclusion
The convolutional neural network method in this research shows good performance in classifying genuine leather, consisting of cow, goat, and sheep leather, against synthetic leather. As the final confusion matrix shows, goat leather has a slight similarity to sheep leather; in addition to the goat and sheep similarity, several synthetic leather images were classified as cow leather. Although the accuracy of the CNN models is excellent, the loss value of the model is still high. Future research can hopefully build on this work by exploring datasets of other leather types and increasing the number of leather images per type.

Declarations

Author contribution
All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest
The authors declare no conflict of interest.

Additional information
No additional information is available for this paper.

References
[1] L. Wang and C.
Liu, “Tanning leather classification using an improved statistical geometrical feature method,” 2007 Int. Conf. Mach. Learn. Cybern., pp. 19–22, Aug. 2007.
[2] C. Wu, W. Zhang, X. Liao, Y. Zeng, and B. Shi, “Transposition of chrome tanning in leather making,” J. Am. Leather Chem. Assoc., vol. 109, no. 6, pp. 176–183, 2014.
[3] H. Mahdi, K. Palmina, A. Gurshi, and D. Covington, “Potential of vegetable tanning materials and basic aluminum sulphate in Sudanese leather industry,” J. Eng. Sci. Technol., vol. 4, no. 1, pp. 20–31, 2009.
[4] M. Jawahar, N. K. C. Babu, and K. Vani, “Leather texture classification using wavelet feature extraction technique,” 2014 IEEE Int. Conf. Comput. Intell. Comput. Res. (ICCIC 2014), pp. 6–9, 2015.
[5] N. Purwaningsih, “Penerapan multilayer perceptron untuk klasifikasi jenis kulit sapi tersamak,” J. TeknoIf, vol. 4, no. 1, pp. 1–7, 2016.
[6] H. Chen, “The research of leather image segmentation using texture analysis techniques,” vol. 1032, pp. 1846–1850, 2014.
[7] S. Winiarti, A. Prahara, Murinto, and D. P. Ismi, “Pre-trained convolutional neural network for classification of tanning leather image,” Int. J. Adv. Comput. Sci. Appl., vol. 9, no. 1, pp. 212–217, 2018.
[8] S. A. M. Hashim, N. Jamaluddin, and A. Hasbullah, “Automatic classification of animal skin for leather products using backpropagation neural network,” 4th Natl. Conf. Res. Educ., 2018.
[9] C. Kwak, J. A. Ventura, and K. Tofang-Sazi, “Automated defect inspection and classification of leather fabric,” Intell. Data Anal., vol. 5, no. 4, pp. 355–370, 2001.
[10] D. Sugimura, T. Mikami, H. Yamashita, and T. Hamamoto, “Enhancing color images of extremely low light scenes based on RGB/NIR images acquisition with different exposure times,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3586–3597, 2015.
[11] K. S. Sudeep and K. K. Pal, “Preprocessing for image classification by convolutional neural networks,” 2016 IEEE Int. Conf. Recent Trends Electron. Inf. Commun. Technol. (RTEICT 2016), pp. 1778–1781, 2017.
[12] N. H. Dar and H. R. Ramya, “Image segmentation techniques and its applications,” 2020.
[13] K. Pai and A. Giridharan, “Convolutional neural networks for classifying skin lesions,” IEEE Region 10 Annu. Int. Conf. (TENCON), vol. 2019-Oct., pp. 1794–1796, 2019.
[14] M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou, “Lung pattern classification for interstitial lung diseases using a deep convolutional neural network,” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1207–1216, 2016.
[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 3rd Int. Conf. Learn. Represent. (ICLR 2015), pp. 1–14, 2015.
[16] A. Severyn and A. Moschitti, “Twitter sentiment analysis with deep neural networks,” Proc. 38th Int. ACM SIGIR Conf. Res. Dev. Inf. Retr., pp. 959–962, 2016.
[17] M. Alali, N. Mohd Sharef, M. A. Azmi Murad, H. Hamdan, and N. A. Husin, “Narrow convolutional neural network for Arabic dialects polarity classification,” IEEE Access, vol. 7, pp. 96272–96283, 2019.
[18] C. N. dos Santos and M. Gatti, “Deep convolutional neural networks for sentiment analysis of short texts,” COLING 2014, 25th Int. Conf. Comput. Linguist., pp. 69–78, 2014.
[19] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, “Temporal pyramid pooling-based convolutional neural network for action recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 12, pp. 2613–2622, 2017.
[20] C. K. Dewa, A. L. Fadhilah, and A. Afiahayati, “Convolutional neural networks for handwritten Javanese character recognition,” IJCCS (Indonesian J. Comput. Cybern. Syst.), vol. 12, no. 1, p. 83, 2018.
[21] J. Gu et al., “Recent advances in convolutional neural networks,” Pattern Recognit., vol. 77, pp. 354–377, 2018.
[22] H. Chougrad, H. Zouaki, and O. Alheyane, “Convolutional neural networks for breast cancer screening: transfer learning with exponential decay,” NIPS, 2017.
[23] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, “Medical image classification with convolutional neural network,” 2014 13th Int. Conf. Control Autom. Robot. Vision (ICARCV 2014), pp. 844–848, 2014.
[24] L. Pigou, S. Dieleman, P. J. Kindermans, and B. Schrauwen, “Sign language recognition using convolutional neural networks,” Lect. Notes Comput. Sci., vol. 8925, pp. 572–578, 2015.
[25] A. Verma, P. Singh, and J. S. Rani Alex, “Modified convolutional neural network architecture analysis for facial emotion recognition,” Int. Conf. Syst. Signals Image Process., vol. 2019-June, pp. 169–173, 2019.
[26] K. Yanai and Y. Kawano, “Food image recognition using deep convolutional network with pre-training and fine-tuning,” IEEE Int. Conf. Multimed. Expo Workshops, pp. 1–6, 2014.
[27] G. Levi and T. Hassncer, “Age and gender classification using convolutional neural networks,” IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, vol. 2015-Oct., pp. 34–42, 2015.
[28] J. Zhou, D. Xiao, and M. Zhang, “Feature correlation loss in convolutional neural networks for image classification,” Proc. 2019 IEEE 3rd Inf. Technol. Netw. Electron. Autom. Control Conf. (ITNEC 2019), pp. 219–223, 2019.
[29] T. Guo, J. Dong, H. Li, and Y. Gao, “Simple convolutional neural network on image classification,” 2017 IEEE 2nd Int. Conf. Big Data Anal. (ICBDA), Mar. 2017.
[30] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the recent architectures of deep convolutional neural networks,” Artif. Intell. Rev., pp. 1–70, 2020.
[31] J. Kang, H. S. Choi, and H. Lee, “Deep recurrent convolutional networks for inferring user interests from social media,” J. Intell. Inf. Syst., vol. 52, no. 1, pp. 191–209, 2019.
[32] T. F. Gonzalez, “Handbook of approximation algorithms and metaheuristics,” pp. 1–1432, 2007.
[33] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Evolving deep convolutional neural networks for image classification,” IEEE Trans. Evol. Comput., vol. 24, no. 2, pp. 394–407, 2020.
[34] L. Kang, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for no-reference image quality assessment,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1733–1740, 2014.
[35] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” pp. 1–11, 2015.
[36] J. Bernal et al., “Deep convolutional neural networks for brain image analysis on magnetic resonance imaging: a review,” Artif. Intell. Med., vol. 95, pp. 64–81, 2019.
[37] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” pp. 1–18, 2012.
[38] R. Hu, B. Tian, S. Yin, and S. Wei, “Efficient hardware architecture of softmax layer in deep neural network,” Int. Conf. Digit. Signal Process.
(DSP), vol. 2018-Nov., pp. 323–326, 2019.
[39] D. P. Kingma and J. L. Ba, “Adam: a method for stochastic optimization,” 3rd Int. Conf. Learn. Represent. (ICLR 2015), pp. 1–15, 2015.
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 3, No 2, December 2020, pp. 89–98 eISSN 2597-4637
https://doi.org/10.17977/um018v3i22020p89-98
©2020 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Simple Modification for an Apriori Algorithm with Combination Reduction and Iteration Limitation Technique

Adie Wahyudi Oktavia Gama a, 1, *, Ni Made Widnyani b, 2
a Department of Information Technology, Universitas Pendidikan Nasional, Bedugul St Number 39, Denpasar, 80224, Indonesia
b Department of Digital Business, Universitas Bali Internasional, Seroja St Jeruk Block Number 9, Denpasar, 80239, Indonesia
1 gama.adiewahyudi@gmail.com *; 2 nimadewidnyani90@gmail.com
* corresponding author

I. Introduction
Management information systems, and systems that handle transactions, produce data that grow every time a process is carried out. This continual data growth is not matched by a corresponding gain of information for decision support: the information produced is usually monotonous, in the form of daily, weekly, or annual reports. This phenomenon is often referred to as "data rich, information poor," meaning that the increase in the amount of data is not matched by the information obtained from it, due to a lack of analysis of the accumulated data. Data mining is the solution to this phenomenon. Data mining is a method that applies data analysis and algorithms to create a specific identification of designs or models over the data [1]; it analyses large data to extract valuable new information or knowledge.
one of the data mining techniques that can be used to uncover new knowledge, in the form of combinations of items hidden in a database, is association analysis. the relationships found can be represented as association rules [2][3]. association analysis measures the relationship between two or more items hidden in the database. an association rule takes the form "if antecedent then consequent", expressing how strongly the purchase of one product is tied to the purchase of other products. the strength of an associative rule is measured by two parameters called support and confidence. the support value is the percentage of transactions in the database in which a combination of items occurs. the confidence value, or value of certainty, reflects the strength of the relationship between the items forming a combination in an associative rule.

article info: article history: received 19 september 2020; revised 13 october 2020; accepted 03 november 2020; published online 31 december 2020.

abstract: the apriori algorithm is one of the methods for forming association rules in data mining. this algorithm uses knowledge from previously formed itemsets with frequent occurrence to form the next itemset. the apriori algorithm generates combinations by iteration: repeatedly scanning the database, pairing one product with another product, and recording the number of occurrences of each combination against minimum limits on the support and confidence values. the apriori algorithm slows down on an expanding database during the search for frequent itemsets to form association rules. modification techniques are needed to optimize the performance of the apriori algorithm so as to obtain frequent itemsets, and from them association rules, in a short time. the modifications in this study are obtained by combining the combination reduction and iteration limitation techniques.
testing is done by comparing the time and the quality of the rules formed by database scanning with the apriori algorithm with and without modification. the test results show that the modified apriori algorithm, tested with data samples of up to 500 transactions, forms rules faster while the quality of the rules is maintained.

keywords: data mining; association rules; apriori algorithm; frequent itemset; apriori optimization

the apriori algorithm is one of the algorithms for forming association rules in data mining. the initial research conducted by agrawal in 1993, titled "mining association rules between sets of items in large databases", was the beginning of the development of association methods using apriori algorithms [4]. in 1994, agrawal and srikant continued this line of work with "fast algorithms for mining association rules" [5], which refined the previously developed algorithm; from there, apriori became known as one of the association-rule-forming algorithms. the apriori algorithm takes an iterative approach: the k-itemsets are used to form the next (k + 1)-itemsets. the principle of the apriori algorithm is that if an itemset appears frequently, then all subsets of that itemset must also appear frequently in the transactions stored in the database [2]. in this algorithm, a candidate (k + 1)-itemset is generated by combining two itemsets of size k.
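this candidate-generation step, together with the pruning it enables, can be sketched in a few lines; the following is an illustrative python sketch (not code from the paper), using the frequent 2-itemsets of the paper's running clothing example:

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """join step: merge pairs of frequent k-itemsets whose union has size k + 1."""
    k1 = len(next(iter(frequent_k))) + 1
    return {a | b for a in frequent_k for b in frequent_k if len(a | b) == k1}

def prune(candidates, frequent_k):
    """apriori principle: keep a candidate only if all of its k-subsets are frequent."""
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, len(c) - 1))}

# frequent 2-itemsets from the clothing example
f2 = {frozenset(p) for p in [("jacket", "t-shirt"), ("t-shirt", "shirt"),
                             ("t-shirt", "trousers"), ("shirt", "trousers")]}
c3 = generate_candidates(f2)          # 3 raw 3-itemset candidates
kept = prune(c3, f2)                  # only {t-shirt, shirt, trousers} survives
print(sorted(map(sorted, kept)))
```

here pruning discards {jacket, t-shirt, shirt} and {jacket, t-shirt, trousers} because {jacket, shirt} and {jacket, trousers} are not frequent 2-itemsets.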
candidate (k + 1)-itemsets containing a subset that appears rarely, i.e. below the threshold, are trimmed and not used in determining association rules [2]. in accordance with the association-rule framework, apriori algorithms also use minimum support and minimum confidence to determine which itemset rules are suitable for use in decision making. the 1-itemsets are used to find the 2-itemsets, which are combinations of 2 items, for example "if buy shirt then buy long pants". the 2-itemsets are then used to find the 3-itemsets, which are combinations of 3 items, for example "if buy shirt and buy pen then buy long pants", and so on until no more k-itemsets can be found in the transaction database [6]. apriori reasoning thus uses prior knowledge of itemsets with frequent occurrence, in an iterative approach where the k-itemsets are used to explore the (k + 1)-itemsets [6]. there is a relatively large amount of research on apriori algorithms [7][8][9][10][11][12][13]. studies related to the application of apriori algorithms that are used as references in this study are as follows: 1. the application of the apriori algorithm as previously developed, without optimization techniques, to obtain association rules [14]. 2. an improvement of the apriori algorithm that determines a "set size" and a "set size frequency": the set size is the number of items per transaction, while the set size frequency is the number of transactions that have at least "set size" items; these are used to eliminate insignificant candidates [15]. 3. an optimization of the apriori algorithm that reduces or prunes the number of frequent-itemset candidates in the candidate set ck [16]. 4.
an improvement of the apriori algorithm that reduces the number of transactions (transaction reduction) whose item count does not meet a specified limit; reducing these transactions improves efficiency when scanning the database [17]. 5. the utilization of apriori algorithms to establish customer segmentation in the smes sector [18]. 6. the application of apriori algorithms to form associations in sales databases [19][20][21]. the essence of all research on optimization of apriori algorithms is to limit the frequent-itemset candidates that are generated by bypassing unwanted transactions, so that database scanning is not repeated excessively and better association rules are produced faster. the apriori algorithm has the disadvantage of being less efficient on larger databases: its performance slows down because it has to scan a large database with a large number of transactions, and iteration is repeated to get the frequent-itemset combinations that form the right association rules. modification techniques are needed to optimize the performance of apriori algorithms so as to get the frequent itemsets that form association rules in a short time [22][23][24][25][26][27][28][29]. the modifications in this study are obtained by combining the combination reduction and iteration limitation techniques.

ii. method

a. association analysis

the association method is often used to analyse the contents of a consumer's shopping cart in a transaction process [30][31]; it is also known as market basket analysis. a simple example of an application of the association method is the analysis of products purchased at a clothing store. the analysis yields, for example, the degree of likelihood that consumers buy trousers and clothes together.
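that degree of likelihood is exactly what the support and confidence measures defined below quantify. a minimal python illustration, using the five clothing transactions of the paper's running example (a sketch only, not the authors' implementation):

```python
def support(itemset, transactions):
    """fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """strength of "if antecedent then consequent": support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# the five clothing transactions used as the running example
transactions = [{"jacket", "t-shirt"},
                {"t-shirt", "shirt", "trousers"},
                {"shirt", "trousers"},
                {"shirt", "shorts", "trousers"},
                {"shirt", "trousers", "jacket", "t-shirt"}]

print(support({"shirt", "trousers"}, transactions))       # 4/5 = 0.8
print(confidence({"shirt"}, {"trousers"}, transactions))  # 4/4 = 1.0
```

the two printed values correspond to the rule "if shirt then trousers": it holds in 4 of 5 transactions (support 80%) and in every transaction that contains a shirt (confidence 100%).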
the application of the association method in this example can later help the shop owner arrange the placement of goods and the inventory, or run a promotion with special discounts for combinations of items that are often purchased together. association analysis can be described as a process of exploring association rules that meet minimum support and minimum confidence requirements, where support and confidence are defined as follows:

1. analysis of high-frequency patterns: this stage looks for item combinations that meet the minimum support requirement in the database. the support value of a single item is obtained by the following formula:

$\mathrm{Support}(A) = \dfrac{\text{number of transactions containing item } A}{\text{total number of transactions}}$ (1)

the support value of 2 items is given by the formulas below:

$\mathrm{Support}(A, B) = P(A \cap B)$ (2)

$\mathrm{Support}(A, B) = \dfrac{\sum \text{transactions containing items } A \text{ and } B}{\sum \text{transactions}}$ (3)

2. formation of association rules: after all high-frequency patterns have been found, the rules that meet the minimum confidence requirement are sought by calculating the confidence value of each associative rule a → b. the confidence of a rule a → b is obtained from the following formula:

$\mathrm{Confidence}(A \rightarrow B) = P(B \mid A) = \dfrac{\sum \text{transactions containing items } A \text{ and } B}{\sum \text{transactions containing item } A}$ (4)

the following is an example of clothing sales data; each transaction is written as in table 1. the sales data in table 1 are translated into the tabular 1-itemset form shown in table 2; this translation is used to form the next (k + 1)-itemset candidates. a combination of 2-itemsets is obtained by pairing one product with another product from table 2, then counting the number of occurrences of each pair in the transactions by scanning the database. the resulting combinations are written as in table 3.

table 1.
sales table

id   date         clothing name
1    2017-08-01   jacket, t-shirt
2    2017-08-01   t-shirt, shirt, trousers
3    2017-08-01   shirt, trousers
4    2017-08-01   shirt, shorts, trousers
5    2017-08-01   shirt, trousers, jacket, t-shirt

table 2. description of transactions to form 1-itemset

no      jacket   t-shirt   shirt   trousers   shorts
1       1        1         0       0          0
2       0        1         1       1          0
3       0        0         1       1          0
4       0        0         1       1          1
5       1        1         1       1          0
total   2        3         4       4          1

table 3 shows the prospective 2-itemset candidates. if the threshold value (min_support) = 2 is applied to the candidates in table 3, the frequent 2-itemsets are: f2 = {jacket, t-shirt}, {t-shirt, shirt}, {t-shirt, trousers}, {shirt, trousers}. the frequent 3-itemset candidates are formed in the same way, pairing one item with other items to form the 3-itemset candidates shown in table 4. with the predetermined threshold (min_support) = 2, the frequent 3-itemset obtained from table 4 is: f3 = {t-shirt, shirt, trousers}. when no further (k + 1)-itemset can be formed, the support and confidence values for each frequent itemset combination are calculated. association rules are formed from the selected frequent (k + 1)-itemsets: the selected rules are those with a confidence value greater than or equal to min_confidence, here set to 80%. table 7 lists the final association rules formed from the candidate rules in table 5 and table 6. the final rules in table 7 serve to choose the most suitable rules as a guide for improving decision making and marketing strategies. this stage outputs the frequent itemset or rule with the highest product of support and confidence values; the final conclusion of the apriori process is that the association rule with the strongest influence is the one with the highest product of support and confidence. table 3.
prospective 2-itemset candidates

combination           number
jacket, t-shirt       2
jacket, shirt         1
jacket, trousers      1
jacket, shorts        0
t-shirt, shirt        2
t-shirt, trousers     2
t-shirt, shorts       0
shirt, trousers       4
shirt, shorts         1
trousers, shorts      1

table 4. prospective 3-itemset candidates

combination                  number
jacket, t-shirt, shirt       1
jacket, t-shirt, trousers    1
jacket, t-shirt, shorts      0
jacket, shirt, trousers      1
jacket, shirt, shorts        0
jacket, trousers, shorts     0
t-shirt, shirt, trousers     2
t-shirt, shirt, shorts       0
t-shirt, trousers, shorts    0
shirt, trousers, shorts      1

the apriori algorithm uses all items in the database transactions every time the scanning process generates combinations. this is very time-inefficient, because items that rarely appear are still used in forming combinations. figure 1 shows the flowchart of the apriori algorithm, which can be described as follows: 1. determine the minimum support and minimum confidence values, using approximate values found by trial and error; in this research, minimum support = 2 and minimum confidence = 80%. 2. the apriori algorithm uses the iterative approach: the k-itemsets are generated to form the next (k + 1)-itemsets. 3. (k + 1)-itemset candidates whose frequency of appearance in the database is below the threshold (min_support) are eliminated and not used in determining association rules. 4. the 1-itemsets are formed by scanning the database and counting the number of occurrences of each item in each transaction. 5. the 1-itemsets are then used to form the 2-itemsets: candidate 2-itemsets are formed by pairing one item with another item. 6. the number of occurrences of each formed 2-itemset is then counted over every transaction.
the threshold (min_support) is applied to eliminate candidates that are not frequent. 7. the support and confidence values of the qualifying 2-itemsets are then calculated; 2-itemsets whose support and confidence values are greater than or equal to min_support and min_confidence will be used to form association rules. 8. the iteration is then repeated, using the formed 2-itemsets to find 3-itemsets, and so on until no more frequent (k + 1)-itemsets are left. 9. after all association rules from the frequent (k + 1)-itemsets are formed, their support and confidence values are calculated; the rule with the highest product of support and confidence values is the best association rule over all transactions in the database.

table 5. prospective association rules of f2

if antecedent, then consequent   support        confidence
if jacket, then t-shirt          2 / 5 = 40%    2 / 2 = 100%
if t-shirt, then jacket          2 / 5 = 40%    2 / 3 = 66.7%
if t-shirt, then shirt           2 / 5 = 40%    2 / 3 = 66.7%
if shirt, then t-shirt           2 / 5 = 40%    2 / 4 = 50%
if t-shirt, then trousers        2 / 5 = 40%    2 / 3 = 66.7%
if trousers, then t-shirt        2 / 5 = 40%    2 / 4 = 50%
if shirt, then trousers          4 / 5 = 80%    4 / 4 = 100%
if trousers, then shirt          4 / 5 = 80%    4 / 4 = 100%

table 6. candidate association rules of f3

if antecedent, then consequent        support      confidence
if t-shirt and shirt, then trousers   2/5 = 40%    2/2 = 100%
if t-shirt and trousers, then shirt   2/5 = 40%    2/2 = 100%
if shirt and trousers, then t-shirt   2/5 = 40%    2/4 = 50%

table 7. rules of final association

if antecedent, then consequent                support        confidence      support * confidence
if buy jacket, then buy t-shirt               2 / 5 = 40%    2 / 2 = 100%    0.4
if buy shirt, then buy trousers               4 / 5 = 80%    4 / 4 = 100%    0.8
if buy trousers, then buy shirt               4 / 5 = 80%    4 / 4 = 100%    0.8
if buy t-shirt and shirt, then buy trousers   2 / 5 = 40%    2 / 2 = 100%    0.4
if buy t-shirt and trousers, then buy shirt   2 / 5 = 40%    2 / 2 = 100%    0.4
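the nine steps above can be sketched end to end; the following compact python implementation is illustrative only (not the authors' code), and on the five sample transactions it reproduces exactly the five rules of table 7:

```python
from itertools import combinations

def apriori_rules(transactions, min_sup=2, min_conf=0.8):
    """unmodified apriori: grow frequent k-itemsets, then form rules
    whose confidence meets min_conf."""
    n = len(transactions)

    def count(s):
        return sum(1 for t in transactions if s <= t)

    # frequent 1-itemsets (step 4)
    frequent = [frozenset([i]) for i in {i for t in transactions for i in t}
                if count(frozenset([i])) >= min_sup]
    all_frequent = []
    while frequent:                                   # steps 5-8: iterate k -> k + 1
        all_frequent += frequent
        k = len(frequent[0]) + 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = [c for c in candidates if count(c) >= min_sup]

    rules = []                                        # step 9: form and score rules
    for s in all_frequent:
        for r in range(1, len(s)):
            for ante in map(frozenset, combinations(sorted(s), r)):
                conf = count(s) / count(ante)
                if conf >= min_conf:
                    rules.append((set(ante), set(s - ante), count(s) / n, conf))
    return rules

transactions = [{"jacket", "t-shirt"},
                {"t-shirt", "shirt", "trousers"},
                {"shirt", "trousers"},
                {"shirt", "shorts", "trousers"},
                {"shirt", "trousers", "jacket", "t-shirt"}]
for ante, cons, sup, conf in apriori_rules(transactions):
    print(f"if {sorted(ante)} then {sorted(cons)}: support={sup:.0%}, confidence={conf:.0%}")
```

with min_support = 2 and min_confidence = 80%, the five rules produced are jacket → t-shirt, shirt → trousers, trousers → shirt, t-shirt and shirt → trousers, and t-shirt and trousers → shirt, matching table 7.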
the apriori algorithm has the disadvantage of being less efficient on larger databases. its performance slows down because it has to perform extensive database scanning over a large number of transactions, with repeated iterations, to get the frequent-itemset combinations that form the right association rules. these weaknesses can be overcome by applying modification techniques to the formation of the frequent-itemset combination candidates.

b. combination reduction

the modified algorithm in this study employs a combination reduction method, i.e. a reduction of the generated combinations. combination reduction takes the frequent itemsets resulting from the previous database scan to form the next itemset candidates, so that every generated combination contains a frequent itemset from the previous scan. the combinations formed by this method are certainly fewer than those formed by the unmodified apriori method, and they have a greater chance of becoming frequent itemsets, because the combinations used to form the next itemsets are themselves frequent. the unmodified apriori method consumes more time because of repeated scanning to generate all combinations without regard to the previously found frequent itemsets.

fig. 1. flowchart of apriori algorithm [11] (nodes: input transaction database; set min_sup and min_conf; scan database to generate 1-itemset; iterate k = k + 1; scan database to generate candidate k-itemsets; scan the database to count the occurrences (s) of each k-itemset; delete k-itemsets with s < min_sup; when the generated set is null, calculate support and confidence for each k-itemset, delete k-itemsets below min_conf, and output the association rules)
1) specifying the items used to generate combinations (1-itemsets)

finding the 1-itemsets has to be completed before generating the possible combinations. the 1-itemsets must meet the minimum support for occurrence to be used in forming combinations during the search for frequent itemsets. the 1-itemsets are found by scanning the database and accumulating the number of occurrences of each item over all transactions. items whose occurrence counts are less than the minimum support are not used in determining the (k + 1)-itemset combinations, while qualifying items are used as combination pairs in forming the next itemsets.

2) generating itemset combinations based on previous frequent itemsets

after obtaining the frequent 2-itemsets from the initial database scan, the 3-itemset candidates are generated by simply pairing each frequent 2-itemset with other items that meet the minimum support. 3-itemset candidates that do not contain a frequent 2-itemset, or that include items not qualifying for the minimum support, do not need to be generated. this results in considerable time saving and low computation, and avoids exhausting the memory allocation. for example, the shorts item in table 8 is removed because its occurrence count is less than the minimum support of 2. after the combination-forming process of the apriori algorithm, the frequent 2-itemsets obtained from the database scan are: f2 = 1. {jacket, t-shirt}; 2. {t-shirt, shirt}; 3. {t-shirt, trousers}; 4. {shirt, trousers}. the frequent 2-itemsets are then used to make the 3-itemset candidates: the iteration proceeds as before, but pairs only combinations that include a frequent 2-itemset with one other item that meets the minimum support.
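this reduced generation step can be sketched as follows (illustrative python, assuming f2 and the list of individually frequent items as inputs; not the authors' code):

```python
def reduced_candidates(frequent_k, frequent_items):
    """combination reduction: extend each frequent k-itemset only with items
    that individually meet min_support, instead of pairing all items."""
    return {s | {item} for s in frequent_k for item in frequent_items if item not in s}

f2 = [frozenset(p) for p in [("jacket", "t-shirt"), ("t-shirt", "shirt"),
                             ("t-shirt", "trousers"), ("shirt", "trousers")]]
frequent_items = ["jacket", "t-shirt", "shirt", "trousers"]   # shorts is dropped

c3 = reduced_candidates(f2, frequent_items)
print(len(c3))   # 4 candidates, versus the 10 full 3-item combinations of table 4
```

the four candidates generated this way are exactly the four rows of table 9 below.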
the 3-itemset candidates obtained are shown in table 9, which illustrates that combinations are generated only when they contain a frequent 2-itemset paired with another item that qualifies for the minimum support; the unqualified items have been removed. this combination reduction lowers the computation needed to form combinations, saving time and accelerating the apriori algorithm in finding association rules.

in the unmodified apriori algorithm, iteration is not limited until all combinations of generated itemsets in the transaction data have been examined, up to as many items as a transaction contains. the iteration limitation applied here restricts the repetition of database scanning when generating the (k + 1)-itemset combinations. it uses the mode formula to find out how many items are most often purchased in one transaction.

table 8. items that meet the minimum support

no      jacket   t-shirt   shirt   trousers
1       1        1         0       0
2       0        1         1       1
3       0        0         1       1
4       0        0         1       1
5       1        1         1       1
total   2        3         4       4

table 9. prospective 3-itemset candidates

combination                  number
jacket, t-shirt, shirt       1
jacket, t-shirt, trousers    1
jacket, shirt, trousers      1
t-shirt, shirt, trousers     2

as an example, with 100 transaction samples and 25 items that meet the minimum support, suppose the most frequent number of items per transaction is 2: most consumers buy 2 items in one transaction. this can be used as an iteration delimiter, stopping at frequent 2-itemsets in accordance with the consumers' habitual basket size, which makes the process faster and more efficient. based on the transaction data in table 10, the set sizes that appear most often are set size = 2 and set size = 4; the value used is then the largest, k = 4, because the possibility of obtaining the best association rules becomes greater.
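the set-size mode computation can be sketched as follows (illustrative python; `statistics.multimode` returns all modes, from which the largest is taken, as the text does for set sizes 2 and 4):

```python
from statistics import multimode

def iteration_limit(transactions):
    """iteration limitation: cap k at the most frequent transaction set size,
    taking the largest value when the mode is not unique."""
    return max(multimode(len(t) for t in transactions))

# set sizes of table 10: 2, 3, 2, 4, 4 -> modes are 2 and 4 -> limit k = 4
transactions = [{"jacket", "t-shirt"},
                {"t-shirt", "shirt", "trousers"},
                {"shirt", "trousers"},
                {"shirt", "shorts", "t-shirt", "trousers"},
                {"shirt", "trousers", "jacket", "t-shirt"}]
print(iteration_limit(transactions))   # 4
```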
the iteration to search for (k + 1)-itemsets with the apriori algorithm is then halted either when no more frequent itemsets exist or when the iteration limit k <= 4 is reached.

iii. results and discussions

this study was conducted to compare the apriori algorithm without modification against the modified apriori algorithm. the modified apriori algorithm is expected to generate association rules faster, making it more time-efficient. the comparison is measured in terms of the time required by the apriori algorithm without modification and by the modified apriori algorithm. both algorithms are exercised on several database samples with a growing number of transactions, and the time each experiment requires to establish the association rules is calculated. the required time is obtained from the start and end times of the algorithm's execution, in accordance with the following formula:

$\mathrm{reqtime} = t_{\mathrm{endtime}} - t_{\mathrm{starttime}}$ (5)

the required-time comparison is recorded in table 11, which shows that on the 400- and 500-transaction samples the apriori algorithm without modification failed because the database server timed out: the available memory could not accommodate the large number of iterations over the data. the measurements of time against the number of transactions are graphed in figure 2, which shows that the modified apriori algorithm is more time-efficient in obtaining association rules. the horizontal axis shows the number of transactions, while the vertical axis shows the time required to get the association rules. the red line represents the results of apriori with modification, while the blue line represents the results of apriori without modification.
the apriori without modification, represented by the blue line, shows a sharp increase: as the data grow, the computation in the combination-forming process rises, and more time is needed to obtain the frequent itemsets. the red line shows an increase that is not too sharp and tends to be flat: even though the transaction data continue to increase, the required time stays proportional to that increase.

table 10. transaction data with set size for iteration limitation

tid   date         item name                          set size
1     2013-06-10   jacket, t-shirt                    2
2     2013-06-10   t-shirt, shirt, trousers           3
3     2013-06-10   shirt, trousers                    2
4     2013-06-10   shirt, shorts, t-shirt, trousers   4
5     2013-06-10   shirt, trousers, jacket, t-shirt   4

table 11. comparison of apriori algorithm time with multiple sample transactions

no.   sample transactions   required time (in microseconds)
                            apriori    apriori modification
1     100                   6.16       0.81
2     200                   144.90     5.16
3     300                   942.87     18.08
4     400                   failed     36.71
5     500                   failed     75.64

the results of several trials with several transaction samples show that the quality of the association rules obtained by the modified apriori algorithm is no different from that of the unmodified algorithm: the association rules obtained from both are the same across several attempts, so there is no quality degradation in the established association rules.

iv. conclusion

the apriori algorithm is suitable for application to transactions in a large database to find frequent itemsets. association rules that result from the frequent itemsets can then be used to improve decisions in organizing item displays, arranging inventory, or promotion strategies, for example by applying discounts to combinations of items that often appear together in transactions according to the established association rules.
the apriori performance that slows down on larger databases can be optimized by using the modification method. the apriori algorithm modified with the combination reduction and iteration limitation techniques has proven more time-efficient than the unmodified algorithm in generating association rules. the quality of the resulting rules is also unchanged; in other words, the apriori algorithm without modification and the modified apriori algorithm yield similar results.

fig. 2. time comparison chart of apriori algorithms

declarations

author contribution: all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement: this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest: the authors declare no conflict of interest.

additional information: no additional information is available for this paper.

references

[1] u. fayyad, g. p. shapiro, and p. smyth, "from data mining to knowledge discovery in databases," ai mag., vol. 17, no. 3, pp. 37–54, 1996.
[2] p. n. tan, m. steinbach, and v. kumar, introduction to data mining. united states of america: pearson addison-wesley, 2006.
[3] j. pamungkas and y. handrianto, "assosiation rules for product sales data analysis using the apriori algorithm," sink. j. penelit. tek. inform., vol. 5, no. 1, p. 84, 2020.
[4] r.
agrawal, "mining association rules between sets of items in large databases," in proceedings of the 1993 acm sigmod conference, washington dc, usa, 1993, pp. 1–10.
[5] r. agrawal and r. srikant, "fast algorithms for mining association rules," in proceedings of the 20th vldb conference, santiago, chile, 1994.
[6] j. han and m. kamber, data mining: concepts and techniques, second edition. united states of america: elsevier inc., 2006.
[7] l. f. panjaitan, y. handrianto, and a. nurhadi, "apriori algorithm on car rental analysis with the most popular brands," sink. j. penelit. tek. inform., vol. 4, no. 2, p. 47, 2020.
[8] e. irfiani, "application of apriori algorithms to determine associations in outdoor sports equipment stores," sink. j. penelit. tek. inform., vol. 3, no. 2, p. 218, 2019.
[9] g. danon, m. schneider, m. last, m. litvak, and a. kandel, "an apriori-like algorithm for extracting fuzzy association rules between keyphrases in text documents," cs.bgu.ac.il, 2006.
[10] luthfiah and k. ditha tania, "k-means and apriori algorithm for pharmaceutical care medicine (case study: eye hospital of south sumatera province)," in journal of physics: conference series, 2019, pp. 1–7.
[11] a. ezhilvathani and k. raja, "implementation of parallel apriori algorithm on hadoop cluster," int. j. comput. sci. mob. comput., vol. 2, no. 4, pp. 513–516, 2013.
[12] n. a. harun, m. makhtar, a. a. aziz, z. a. zakaria, f. s. abdullah, and j. a. jusoh, "the application of apriori algorithm in predicting flood areas," int. j. adv. sci. eng. inf. technol., vol. 7, no. 3, pp. 763–769, 2017.
[13] n. badal and s. tripathi, "frequent data itemset mining using vs_apriori algorithms," int. j. comput. sci. eng., vol. 2, no. 4, pp. 1111–1118, 2010.
[14] j. suresh and t. ramanjaneyulu, "mining frequent itemsets using apriori algorithm," int. j. comput. trends technol., vol. 4, no. 4, pp. 760–764, 2013.
[15] s. a.
abaya, "association rule mining based on apriori algorithm in minimizing candidate generation," int. j. sci. eng. res., vol. 3, no. 7, pp. 1–4, 2012.
[16] j. yabing, "research of an improved apriori algorithm in data mining association rules," int. j. comput. commun. eng., vol. 2, no. 1, pp. 25–27, 2013.
[17] j. singh, h. ram, and j. s. sodhi, "improving efficiency of apriori algorithm using transaction reduction," int. j. sci. res. publ., vol. 3, no. 1, pp. 1–4, 2013.
[18] j. silva, n. varela, l. a. b. lópez, and r. h. r. millán, "association rules extraction for customer segmentation in the smes sector using the apriori algorithm," in procedia computer science, 2019, pp. 1207–1212.
[19] a. w. o. gama, i. k. g. d. putra, and i. p. a. bayupati, "implementasi algoritma apriori untuk menemukan frequent itemset dalam keranjang belanja," teknologi elektro, vol. 15, no. 2, pp. 27–32, 2016.
[20] k. k. widiartha, d. putu, and d. kumala, "shopping cart analysis system in product layout management with apriori algorithm," int. j. appl. comput. sci. inform. eng., vol. 1, no. 2, pp. 53–64, 2019.
[21] k. s. raju, a. d. devi, and d. d. d. suribabu, "mining frequent item sets using apriori algorithm on shopping dataset," mukth shabd j., vol. 9, no. 5, pp. 6309–6320, 2020.
[22] b. patel, v. k. chaudhari, r. k. karan, and y. rana, "optimization of association rule mining apriori algorithm using aco," int. j. soft comput. eng., vol. 1, no. 1, pp. 24–26, 2011.
[23] m. f. akas, a. g. m. zaman, and a. khan, "combined item sets generation using modified apriori algorithm," in acm international conference proceeding series, 2020, pp. 4–6.
[24] h. yu, j. wen, h. wang, and j. li, "an improved apriori algorithm based on the boolean matrix and hadoop," procedia eng., vol. 15, pp. 1827–1831, 2011.
[25] z. jie and w. gang, "intelligence data mining based on improved apriori algorithm," j. comput., vol. 14, no. 1, pp. 52–62, 2019.
[26] x. liu, y. zhao, and m.
sun, "an improved apriori algorithm based on an evolution-communication tissue-like p system with promoters and inhibitors," discret. dyn. nat. soc., vol. 2017, 2017.
[27] r. sun and y. li, "applying prefixed-itemset and compression matrix to optimize the mapreduce-based apriori algorithm on hadoop," in acm international conference proceeding series, 2020, pp. 89–93.
[28] x. yuan, "an improved apriori algorithm for mining association rules," in aip conference proceedings, 2017, pp. 1–6.
[29] d. t. larose, an introduction to data mining, vol. 134. canada: john wiley & sons, inc., 2005.
[30] y. kurnia, y. isharianto, y. c. giap, a. hermawan, and riki, "study of application of data mining market basket analysis for knowing sales pattern (association of items) at the o! fish restaurant using apriori algorithm," in journal of physics: conference series, 2019, pp. 1–6.
[31] j. r. delos arcos and a. a. hernandez, "analyzing online transaction data using association rule mining: misumi philippines market basket analysis," in acm international conference proceeding series, 2019, pp. 45–49.
references
[1] u. fayyad, g. p. shapiro, and p. smyth, “from data mining to knowledge discovery in databases,” ai mag., vol. 17, no. 3, pp. 37–54, 1996.
[2] p. n. tan, m. steinbach, and v. kumar, introduction to data mining. united states of america: pearson addison-wesley, 2006.
[3] j. pamungkas and y. handrianto, “assosiation rules for product sales data analysis using the apriori algorithm,” sink. jurnal penelit. tek. inform., vol. 5, no. 1, p. 84, 2020.
[4] r.
agrawal, “mining association rules between sets of items in large databases,” in proceedings of the 1993 acm sigmod conference, washington dc, usa, 1993, pp. 1–10.
[5] r. agrawal and r. srikant, “fast algorithms for mining association rules,” in proceedings of the 20th vldb conference, santiago, chile, 1994.
[6] j. han and m. kamber, data mining: concepts and techniques, second edition. united states of america: elsevier inc., 2006.
[7] l. f. panjaitan, y. handrianto, and a. nurhadi, “apriori algorithm on car rental analysis with the most popular brands,” sink. jurnal penelit. tek. inform., vol. 4, no. 2, p. 47, 2020.
[8] e. irfiani, “application of apriori algorithms to determine associations in outdoor sports equipment stores,” sink. jurnal penelit. tek. inform., vol. 3, no. 2, p. 218, 2019.
[9] g. danon, m. schneider, m. last, m. litvak, and a. kandel, “an apriori-like algorithm for extracting fuzzy association rules between keyphrases in text documents,” cs.bgu.ac.il, 2006.
[10] luthfiah and k. ditha tania, “k-means and apriori algorithm for pharmaceutical care medicine (case study: eye hospital of south sumatera province),” in journal of physics: conference series, 2019, pp. 1–7.
[11] a. ezhilvathani and k. raja, “implementation of parallel apriori algorithm on hadoop cluster,” int. j. comput. sci. mob. comput., vol. 2, no. 4, pp. 513–516, 2013.
[12] n. a. harun, m. makhtar, a. a. aziz, z. a. zakaria, f. s. abdullah, and j. a. jusoh, “the application of apriori algorithm in predicting flood areas,” int. j. adv. sci. eng. inf. technol., vol. 7, no. 3, pp. 763–769, 2017.
[13] n. badal and s. tripathi, “frequent data itemset mining using vs_apriori algorithms,” int. j. comput. sci. eng., vol. 2, no. 4, pp. 1111–1118, 2010.
[14] j. suresh and t. ramanjaneyulu, “mining frequent itemsets using apriori algorithm,” int. j. comput. trends technol., vol. 4, no. 4, pp. 760–764, 2013.
[15] s. a.
abaya, “association rule mining based on apriori algorithm in minimizing candidate generation,” int. j. sci. eng. res., vol. 3, no. 7, pp. 1–4, 2012.
[16] j. yabing, “research of an improved apriori algorithm in data mining association rules,” int. j. comput. commun. eng., vol. 2, no. 1, pp. 25–27, 2013.
[17] j. singh, h. ram, and j. s. sodhi, “improving efficiency of apriori algorithm using transaction reduction,” int. j. sci. res. publ., vol. 3, no. 1, pp. 1–4, 2013.
[18] j. silva, n. varela, l. a. b. lópez, and r. h. r. millán, “association rules extraction for customer segmentation in the smes sector using the apriori algorithm,” in procedia computer science, 2019, pp. 1207–1212.
[19] a. w. o. gama, i. k. g. d. putra, and i. p. a. bayupati, “implementasi algoritma apriori untuk menemukan frequent itemset dalam keranjang belanja,” teknologi elektro, vol. 15, no. 2, pp. 27–32, 2016.
[20] k. k. widiartha, d. putu, and d. kumala, “shopping cart analysis system in product layout management with apriori algorithm,” int. j. appl. comput. sci. inform. eng., vol. 1, no. 2, pp. 53–64, 2019.
[21] k. s. raju, a. d. devi, and d. d. d. suribabu, “mining frequent item sets using apriori algorithm on shopping dataset,” mukth shabd j., vol. 9, no. 5, pp. 6309–6320, 2020.
[22] b. patel, v. k. chaudhari, r. k. karan, and y. rana, “optimization of association rule mining apriori algorithm using aco,” int. j. soft comput. eng., vol. 1, no. 1, pp. 24–26, 2011.
[23] m. f. akas, a. g. m. zaman, and a. khan, “combined item sets generation using modified apriori algorithm,” in acm international conference proceeding series, 2020, pp. 4–6.
[24] h. yu, j. wen, h. wang, and j. li, “an improved apriori algorithm based on the boolean matrix and hadoop,” procedia eng., vol. 15, pp. 1827–1831, 2011.
[25] z. jie and w. gang, “intelligence data mining based on improved apriori algorithm,” j. comput., vol. 14, no. 1, pp. 52–62, 2019.
[26] x. liu, y. zhao, and m.
sun, “an improved apriori algorithm based on an evolution-communication tissue-like p system with promoters and inhibitors,” discret. dyn. nat. soc., vol. 2017, 2017.
[27] r. sun and y. li, “applying prefixed-itemset and compression matrix to optimize the mapreduce-based apriori algorithm on hadoop,” in acm international conference proceeding series, 2020, pp. 89–93.
[28] x. yuan, “an improved apriori algorithm for mining association rules,” in aip conference proceedings, 2017, pp. 1–6.
[29] d. t. larose, an introduction to data mining, vol. 134. canada: john wiley & sons, inc., 2005.
[30] y. kurnia, y. isharianto, y. c. giap, a. hermawan, and riki, “study of application of data mining market basket analysis for knowing sales pattern (association of items) at the o! fish restaurant using apriori algorithm,” in journal of physics: conference series, 2019, pp. 1–6.
[31] j. r. delos arcos and a. a. hernandez, “analyzing online transaction data using association rule mining: misumi philippines market basket analysis,” in acm international conference proceeding series, 2019, pp. 45–49.

knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 1, december 2022, pp. 1–16 eissn 2597-4637 https://doi.org/10.17977/um018v5i12022p1-16 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

non-gaussian analysis of herbarium specimen damage to optimize specimen collection management

aris yaman a, 1, *, yulia aris kartika a, 2, ariani indrawati a, 3, zaenal akbar a, 4, lindung p. manik b, 5, wita wardani c, 6, tutie djarwaningsih c, 7, taufik mahendra d, 8, dadan r. saleh a, 9
a research center for computing, national research and innovation agency, kawasan cisitu bandung, jl.
sangkuriang, dago, coblong, bandung, jawa barat 40135, indonesia
b research center for data and information sciences, national research and innovation agency, kawasan cisitu bandung, jl. sangkuriang, dago, coblong, bandung, jawa barat 40135, indonesia
c research center for biosystematics and evolution, national research and innovation agency, cibinong science center, jl. raya jakarta-bogor, pakansari, cibinong, bogor, jawa barat 16915, indonesia
d directorate of scientific collection management, national research and innovation agency, gedung inacc, cibinong science center, jl. raya jakarta-bogor, cibinong, bogor, jawa barat 16915, indonesia
1 aris.yaman@brin.go.id *; 2 yulia.aris.kartika@gmail.com; 3 indrawati.ariani@gmail.com; 4 zaenal.akbar@gmail.com; 5 lindung.manik@gmail.com; 6 wt.wardani@gmail.com; 7 tutie_teresia@yahoo.com; 8 taufikmahendra337@gmail.com; 9 dadan.rs@gmail.com
* corresponding author

i. introduction

the herbarium is no longer merely a place to store preserved and classified plant samples; it has become an important supporting facility that provides valuable information on preserved flora specimen collections for many uses, especially in biodiversity. extinct, uncommon, endemic, and common plant species are preserved in herbarium collections to serve as a reference for future study. herbarium collections are used in a remarkable number of ways: to identify and discover species [1][2], to study specific biological events in the past [3][4], to understand ecological interactions [5][6], to learn about the benefits of flora such as for medication [7][8], to investigate biomolecules based on dna [9][10], and many more. a herbarium has to protect its specimens against loss or damage. it must provide a safe and secure environment for all specimen collections and guarantee that the collection's condition is well maintained according to conservation standards.
article info. article history: received 16 november 2021; revised 17 march 2022; accepted 4 june 2022; published online 7 november 2022. abstract: damage to specimen collections occurs in practically every herbarium across the world. hence, some precautions must be taken, such as investigating the factors that cause specimen damage in their collections and evaluating their herbarium collection handling and usage policy. however, manual investigation of the causes of herbarium collection damage requires a lot of effort and time. only a few studies have attempted to investigate the causes of herbarium collection damage. so far, the non-gaussian approach to detecting the causes of damage to herbarium specimens has not been studied before. this study attempted to explore the effect of species type, time, location, storage, and remounting status on the level of damage to herbarium specimens, especially those in the genus excoecaria. gaussian modeling is not good enough to model the counted data phenomenon (the amount of damage to herbarium specimens). negative binomial regression (nbr) provides a better model when compared to generalized poisson regression and ordinary gaussian regression approaches. nbr detects non-uniformity in the storage process, causing damage to herbarium specimens. natural damage to herbarium specimens is caused by differences in species and the origin of specimens. keywords: counted data; damage analysis; herbarium specimen; nbr; poisson regression.

however, unfortunately, pests, poor storage conditions, irresponsible handling, and other factors have significantly harmed the herbarium collection over the years. damage to the herbarium collection can be seen in figure 1.
these circumstances may cause bias in herbarium specimen data and uncertainty in decision-making and study outcomes. herbarium bogoriense (bo) is the largest herbarium center in southeast asia and one of the top three in the world. this herbarium comprises a comprehensive collection of flowering plants, gymnosperms, ferns and lycophytes, mosses, liverworts, fungi, and many more. it holds nearly one million specimens from the malesian region (indonesia, singapore, malaysia, brunei darussalam, timor-leste, papua new guinea, and the philippines), obtained through field expeditions and through gifts or exchanges between herbaria around the world [11]. the herbarium specimens, both dry and wet collections, are stored and arranged in the space provided by the curator. collections are classified according to their respective taxa, with monocot and dicot collections placed separately and arranged alphabetically by family, genus, species, and site. specimen sheets use acid-free paper, species folders, and genus maps. the placement of type specimens is separated from the general collection [11]. bo, one of the main reference centers for research on tropical plant taxonomy, ecology, ethnobiology, physiology, morphogenetics, and phytochemistry in the malesian region, must ensure that all its collections are always of good quality and minimize the possibility of damage. keeping the herbarium collection in good condition throughout the process, from specimen collecting to storage, is challenging for the curator. in some cases, the herbarium sheet itself represents the plant, as all the plants may have been lost in that place. so, protecting the sheets from fungal and insect pests is an important step. after the collection has been preserved, it should be checked regularly to ensure that the plants are healthy and free of insects or excessive dampness. insects have the potential to destroy herbarium collections.
insects will inevitably attack the species, even with the most meticulous care and the best equipment. the curators also routinely check the specimens [12][13] to see if any specimens are damaged, especially damage caused by fungi or insects. although preventive measures have been taken to eliminate insects and fungi that could damage the specimens, the curators still found some damaged specimens. the specimens most damaged by insects or fungi were from the genus excoecaria. so, they took the initiative to investigate the factors that cause the specimens' damage in their collections. several studies have investigated such damage. meineke used digital herbarium specimens to study long-term insect-plant interactions [14]. for phenological research, pearson used machine learning on digital herbarium specimens [15].

fig. 1. herbarium collection damage caused by natural damage, the mounting or remounting process, and insects

it is a vital strategy to review and evaluate the policy of herbarium collection handling and usage. however, manual investigation of the causes of herbarium collection damage requires a lot of effort and time. only a few studies have attempted to investigate the causes of herbarium collection damage. many metadata-based studies have been carried out before. studies have been conducted to discover time series patterns and specimen distributions of genetic changes in a specimen. studies link herbarium specimen metadata to climate change patterns [16][17][18]. on the other hand, this study looks at how labels on herbarium specimen metadata affect the damage to herbarium specimens. the curator assesses specimen damage. if the specimen is damaged, the curator will mark the damaged area in the photo and offer details on the source of the damage. the damage marker box size varies and depends on the specimen's damage. one specimen sheet can have several flaws from various sources.
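the annotation step above yields, for every specimen sheet, a set of damage marks each labeled with its source. a minimal sketch of turning such records into per-specimen, per-source counts (with hypothetical specimen ids and the source labels bp/ip/insect, not the study's actual data) might look like:

```python
from collections import Counter

# Hypothetical annotation records: one (specimen_id, damage_source) pair per
# marked damage spot. "bp" = before processing (natural damage),
# "ip" = in-processing (collecting/remounting), "insect" = storage insect damage.
annotations = [
    ("BO-001", "bp"), ("BO-001", "bp"), ("BO-001", "insect"),
    ("BO-002", "ip"),
]

# Tally damage spots per specimen and per source; counts of this kind are
# the count-valued response variables that the study models.
counts = Counter(annotations)
```

one sheet can then contribute counts to several sources at once, which is exactly why each source gets its own response variable.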
herbarium specimens are damaged in three ways: before processing (bp), in-processing (ip), and caused by insects. the first category includes damage that occurred before collection (i.e., damage caused by natural forces in nature). the second category includes damage that occurred during the collection or remounting of herbarium specimens (in-process collecting damage). insect damage is the last type of damage that can occur to herbarium specimens. damage identification in a herbarium specimen is based on the number of damaged spots and the source of damage (bp, ip, or insect). thus, the study's response variable is counted data, so linear regression cannot be used to model the phenomena in this investigation. the generalized linear model (glm) can model data with non-linear characteristics. glm modeling requires three essential components: random, systematic, and link functions [19]. non-linear regression with counted data is achievable using generalized poisson and negative binomial regression [20]. generalized poisson regression (gpr) is suitable for modeling counted data [20]. the generalized poisson distribution (gpd) is used to distribute the response variables in the gpr model. this gpd can model overdispersion and underdispersion well [20][21]. negative binomial regression can also be used to model counted data. the negative binomial distribution is a poisson-gamma mixed function. it can accommodate overdispersion in poisson regression because it does not require equidispersion [20][22].

ii. methods

the stages of analysis in this study are depicted in figure 2. the first step is the quantification of herbarium specimen damage. at this stage, we annotate each type of damage per herbarium specimen. in the second stage, we evaluate whether the three types of damage are multivariate phenomena (identification through the correlation value of each pair of types of damage).
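the poisson-gamma mixture behind the negative binomial distribution can be illustrated with a short stdlib-only simulation (a sketch, not the study's code): gamma-distributed rates fed into a poisson sampler yield counts whose variance exceeds their mean, i.e., the overdispersion that an ordinary poisson model cannot capture.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
shape, scale = 2.0, 1.5  # gamma parameters: mean rate = shape * scale = 3.0
counts = []
for _ in range(5000):
    lam = rng.gammavariate(shape, scale)     # heterogeneous rate per specimen
    counts.append(poisson_sample(lam, rng))  # count given that rate

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# Negative binomial theory: mean = shape*scale = 3.0 and
# variance = mean * (1 + scale) = 7.5, so variance > mean (overdispersion).
```

a plain poisson model forces variance = mean (equidispersion), which is why the gamma-mixed (negative binomial) family is the natural candidate here.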
multivariate modeling will be carried out if there is a significant correlation between each pair of types of damage. otherwise, univariate modeling will be carried out. the next stage is modeling with non-gaussian regression. at this stage, two types of non-gaussian regression (nbr and poisson) are fitted. as a comparison, gaussian regression modeling is also carried out. in the last stage, we evaluate the models based on the results obtained from the previous stage. the aic parameter is used to select the best type of modeling.

fig. 2. research analysis flow chart

a. specimen overview

recently, the scientific curator of bo reported that his collection was damaged. several genera were damaged, such as antidesma, baccaurea, breynia, excoecaria, etc. however, the most damage occurred in the genus excoecaria. in that genus, curators found 2,146 defects in 175 excoecaria specimens. these defects span all three damage types: damage from nature, damage from mounting or remounting, and damage caused by insects. excoecaria is a genus of plants in the euphorbiaceae family [23]. excoecaria is derived from the latin word excaeco, which means "to blind," and refers to the sap of some plants that can induce temporary blindness [24]. excoecaria species are shrubs or trees with milky latex, glabrous, monoecious, or dioecious. leaves alternate with two glands at the petiole-lamina junction. inflorescences have a spike or raceme with flowers clustered in the axils of bracts; female inflorescences are shorter than males. perianth segments 2 or 3. stamens 2 or 3, filaments basally fused. ovary 2- or 3-locular, solitary ovule in each loculus; style 3, linear, free. 3-lobed capsules. the milky latex irritates the skin and can cause injury and blindness if applied to the eye. distribution and frequency of occurrence: 40 species worldwide, from tropical africa to malaysia and australia [25].
there can be only one cause of damage on a specimen, but there can also be more than one source of damage. examples of specimens that suffered damage caused by a single source can be seen in figure 1.

b. specimen herbarium damage quantification

we quantified herbivory on specimens of the genus excoecaria collected in indonesia, new guinea, malaysia, and the philippines and preserved within the herbarium bogoriense. we chose the genus excoecaria because its specimens were the most damaged in the herbarium bogoriense. the curator assesses specimen damage. if there is damage to the specimen, the curator will put a checkmark on the damaged part in the specimen photo and provide information on the source of the damage. the size of the damage marker box is not uniform; it depends on the size of the damaged part of the specimen. one specimen sheet can consist of one or more defects with different sources of damage. the causes of damage to herbarium specimens are classified into three categories. first, damage that occurred before the specimen collection process, i.e., damage caused by natural factors in nature (natural damage). the second cause was identified as damage caused during the collection process or remounting of herbarium specimens (in-process collecting damage). the last cause or source of damage is herbarium specimen damage caused by insects at the specimen storage location (damage by insects). differentiating between pre-collection and post-collection herbivory on herbarium specimens is a challenge. pre-collection herbivory on the leaves of some plant species can be distinguished by the presence of a thin and darkening contour around the damaged area. it means the plant was still alive when the herbivory killed the cells in a specific area [6]. if localized cell death does not occur surrounding the injured area, post-collection herbivory or storage-related damage is assumed [26].
we discovered that leaf damage morphology in excoecaria was similar before and after collection, so we used the same method to distinguish pre-collection herbivory and used the curator's opinion to differentiate pre- and post-collection damage. specimen damage due to the mounting or remounting process is usually indicated by the presence of an envelope attached to the specimen sheet. the envelope helps accommodate broken stems or torn leaf pieces. process-damaged leaves are often seen at the leaf tips or margins, not on the inside of the leaves. one of the causes of leaf damage during the process is leaf folding during the drying process, which causes the leaf shape to become imperfect. in addition, the leaves and stems are ripped or broken during the transfer procedure from the old specimen paper to the new specimen paper because of their fragility.

c. statistical analysis

this study was divided into three causes of damage to herbarium specimens (as response variables). first, damage that occurred before the specimen collection process or damage caused by natural factors in nature (natural damage/bp). the second cause was identified as damage caused during the collection process or remounting of herbarium specimens (in-process collecting damage/ip). the last cause or source of damage is herbarium specimen damage caused by insects at the specimen storage location (damage by insects). systematic identification of damage in a herbarium specimen is based on the number of damage spots along with identifying the source of damage (bp, ip, or caused by insect). based on this, the response variable in the study is counted data. the kolmogorov-smirnov test was applied to assess distribution fit inferentially [27]. hence, the usual linear regression approach cannot be used to model the phenomena in this study. the generalized linear model (glm) approach can model data whose parameters are not linear.
modeling with glm requires three main components: a random component, a systematic component, and a link function [19]. there are at least two non-linear regression approaches for counted data responses: generalized poisson regression and negative binomial regression [20]. generalized poisson regression (gpr) has been proven to be good at modeling response variables in the form of counted data [20]. as the name implies, the response variables in the gpr model are distributed according to the generalized poisson distribution (gpd). this gpd is good at modeling overdispersed and under-dispersed data conditions [20][21]. another approach to modeling counted data is negative binomial regression. in this study, the negative binomial distribution is a mixed poisson-gamma function. the gamma distribution can accommodate overdispersion in poisson regression because it does not assume equi-dispersion conditions in its application [20][22]. this study attempted to explore the effect of species type, time, location, storage, and remounting status on the level of damage to herbarium specimens (especially those in the genus excoecaria). in all models, the response was the total number of spots with bp, ip, and insect damage to herbarium specimens (hs). the models were defined as:

number of damage spots before the collecting process (bp):
log(bp) = α + β1(species) + β2(age) + β3(origin) (1)

number of damage spots caused by the collecting process (ip):
log(ip) = α + β1(species) + β2(age) + β3(origin) + β4(storage) + β5(bp) + β6(insect) (2)

number of damage spots caused by insects at storage collection (insect):
log(insect) = α + β1(species) + β2(age) + β3(origin) + β4(storage) + β5(remounting) + β6(bp) + β7(ip) (3)

as shown in the above equations, there are three models of the level of damage to herbarium specimens. the first model, log(bp), is a function of an intercept α, species type (categorical variable), age of collection (numeric variable), and origin of species (categorical variable).
the second model, log(ip), the level of damage due to the collection/remounting process, is a function of an intercept, species type (categorical variable), age of collection (numeric variable), origin of species (categorical variable), collection storage location (categorical variable), the number of damage spots caused before collection (bp), and the number of damage spots caused by insects in storage collection (insect). precisely for this second model, the samples used in the modeling are herbarium specimens that have undergone a remounting process. the third model, log(insect), is a function of an intercept, species type (categorical variable), age of collection (numeric variable), origin of species (categorical variable), collection storage location (categorical variable), remounting status, the number of damage spots caused before the collecting process (bp), and the number of damage spots caused by the collecting/remounting process (ip). this study observed four species belonging to the genus excoecaria, namely: excoecaria agallocha, excoecaria cochinchinensis, excoecaria humilis, and excoecaria oppositifolia. the origin of the specimens in the study was spread across nine locations, including borneo, celebes, java, kawasan_ii, malaypen, moluccas, new guinea, the philippines, and sumatra. meanwhile, there are nine different collection storage locations in the focus of this research. the explanatory variable for remounting status states whether or not a specimen has previously undergone remounting. this study's explanatory variables are descriptions or labels (metadata) on a herbarium specimen. the data cleansing stage produced 175 herbarium specimen records that could be further analyzed. furthermore, this study's entire sample of specimens was modeled with the three models described previously.
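as a concrete illustration of fitting one count model of this kind (a stdlib-only sketch with made-up numbers, not the study's implementation or data), a poisson regression with a log link and a single numeric predictor such as collection age can be fit by newton-raphson, and its aic computed for model comparison:

```python
import math

def fit_poisson_glm(x, y, iters=50):
    """Newton-Raphson fit of a Poisson GLM with log link: log(mu) = b0 + b1*x."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only MLE
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information (negative Hessian) for the canonical log link
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def poisson_aic(x, y, b0, b1):
    """AIC = 2k - 2*logL; lower is better when comparing candidate models."""
    ll = sum(yi * (b0 + b1 * xi) - math.exp(b0 + b1 * xi) - math.lgamma(yi + 1)
             for xi, yi in zip(x, y))
    return 2 * 2 - 2 * ll  # k = 2 fitted parameters (b0, b1)

# Hypothetical data: damage-spot counts versus collection age (in decades)
ages = [0.0, 1.0, 2.0, 3.0, 4.0]
damage = [1, 1, 2, 4, 6]
b0, b1 = fit_poisson_glm(ages, damage)
aic = poisson_aic(ages, damage, b0, b1)
```

the same score/information recursion extends to more predictors, and a negative binomial fit adds a dispersion parameter alongside the coefficients; comparing the resulting aic values is how the best of the candidate models is chosen.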
a pre-analysis was conducted to see the relationship pattern between the response variables (bp, ip, and insect). if there is a significant correlation between them, it is necessary to do multivariate modeling. on the other hand, if there is no significant correlation between the response variables, it is sufficient to do univariate modeling (partial modeling for each response variable). after assessing the closeness of the relationship between the response variables, statistical analysis proceeds with several modeling schemes, including modeling based on gpr or negative binomial regression. as a comparison, modeling based on multiple linear regression is also carried out. the aic (akaike information criterion) parameter is used to assess which model best captures the phenomena in this study. the lower the aic value, the better the resulting model [22]. after obtaining the best model based on the lowest aic value, the next stage tests which explanatory variables significantly affect the fitted model. this study applies a partial f test to see which explanatory variables significantly impact the model. the partial f test compares the full model (a model with all explanatory variables) with a partial model (a model without the one explanatory variable being tested). the logic is to observe the change in model goodness of fit if one of the explanatory variables is omitted [28]. however, the wald test was used for categorical variables to see which level of the categorical variables had the most significant impact on the damage to herbarium specimens [29].

iii. results and discussion

a. exploratory data analysis

in this study, the causes of damage were divided into three categories: firstly, the cause of damage is natural processes that occur while the specimen is still in nature (natural damage/before the collecting process).
secondly, the damage caused during the specimen collection process (in-process damage), and thirdly, the damage to herbarium specimens caused by insects at the collection storage location (preservation damage by insects). in order to determine the modeling procedure later, the first step is to evaluate the correlations among the various causes of damage. this evaluation is intended to determine whether there is a correlation between the sources of damage. when there is a significant relationship between response variables, it is better to carry out a multivariate analysis procedure. on the other hand, if there is no correlation between the response variables (the sources of the damage to the specimen), then partial modeling (univariate analysis) is carried out. table 1 shows the correlation between sources causing damage to herbarium specimens, with p-values exceeding α (5%), which indicates no significant correlation between the response variables. so, a partial analysis procedure (univariate analysis) was applied in this study.

table 1. correlation between response variables (source of damage); each cell shows correlation / p-value
                     num_damage_bp     num_damage_ip
num_damage_ip        0.146 / 0.104     -
num_damage_insect    0.069 / 0.440     0.069 / 0.441

figure 3 shows a comparison plot of the number of damage events for each pair of sources causing damage to herbarium specimens: (a) between before-process (natural damage) and in-process damage; (b) between natural damage and preservation damage by insects; and (c) between in-process and preservation damage by insects. the picture shows the number of damage points on the herbarium specimens. the distribution pattern of damage points due to the collection process looks the same as the distribution pattern of damage points due to natural factors (natural damage).
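the significance screen behind table 1 can be mimicked without distributional assumptions; the sketch below (with illustrative numbers, not the study's data) computes a pearson correlation and a permutation p-value for one pair of damage counts, and the same decision rule applies: pairs whose p-value exceeds 5% justify univariate modeling.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def permutation_pvalue(xs, ys, n_perm=199, seed=0):
    """Two-sided permutation test: how often does shuffling ys give a
    correlation at least as extreme as the observed one?"""
    rng = random.Random(seed)
    r_obs = abs(pearson_r(xs, ys))
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= r_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Illustrative damage counts for two hypothetical sources on ten specimens
bp_counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ip_counts = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # strongly related to bp_counts
p = permutation_pvalue(bp_counts, ip_counts)
# A small p-value would call for multivariate treatment of the pair;
# p-values above 0.05 (as in Table 1) support univariate modeling instead.
```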
insects in the storage of herbarium collections caused less damage to the collections than the other two causes.

specimen distribution

collection dates for excoecaria specimens in the herbarium bogoriense (bo) span 154 years, from 1866 through 2020. most collections (36, 27, and 24 out of 175) were collected from java island, sumatra island, and the moluccas, and only five samples came from the malaysia-peninsula. the low sample size for the malaysia-peninsula region could have caused a bias due to the non-representation of the region [26]. excoecaria agallocha is the most abundant species in the collection analyzed in this study (121 out of 175), while there were only four samples of excoecaria oppositifolia. because these specimens were given to the herbarium bogoriense by other researchers, the number of specimens from specific species and places is low.

fig. 3. scatterplot for each pair of causes of damage to the herbarium specimens: (a) natural damage & in-process damage, (b) natural damage & damage by insect, (c) in-process damage & damage by insect

figure 4 shows that the existing data are not normally distributed. figure 5 shows the distribution of damage for each of the analyzed species. figure 5a shows that the highest level of damage before the collection process occurred in excoecaria cochinchinensis. however, we cannot conclude that this species was the most severely damaged before the collection process, because in the box plot its damage range overlaps with those of excoecaria agallocha and excoecaria humilis. in contrast, for the pattern of damage caused by the remounting process (figure 5b), the greatest damage clearly occurred in excoecaria oppositifolia. a pattern that tends to be homogeneous occurs in the damage caused by insects in the collection storage area (figure 5c): visually, the level of damage tends to be the same for each species.
these visual findings need to be confirmed inferentially to obtain valid conclusions. figure 6 explores visually whether differences in specimen origin affect the damage to herbarium specimens: it shows no pronounced differences between the origin of the specimen and the degree of damage in figure 6a and figure 6c. figure 6b is different; there it can be seen that specimens from the malaysia-peninsula have the highest level of damage compared to specimens from the other origins. as with the species variable, the specimen origin variable needs to be tested inferentially to obtain a valid level of significance for the damage level of the specimens. the influence of the other variables on specimen damage also needs to be clarified at the modeling stage.

b. model fitting

the normality test for each damage cause is a critical step in selecting a suitable model for the analysis. because the distribution of damage occurrences for all causes of specimen damage is not normal, as shown in figure 4, poisson or negative binomial models can be utilized in this investigation. table 2 shows the kolmogorov–smirnov distribution-fit test results for those models. the p-value for the negative binomial exceeds 5% for all sources of damage, indicating that the negative binomial is the best model to study the factors that cause specimen damage. the aic comparison between multiple linear regression, generalized poisson regression, and negative binomial regression confirms it: the negative binomial regression approach obtains the optimal (lowest) aic score, as shown in table 3.

table 2. goodness-of-fit test for distribution (p-values)

response variable                       normality test   poisson test    negative binomial test
num. of damage before process           0.0032           4.536627e-24    0.363
num. of damage in process collection    0.0006           5.398994e-42    0.248
num. of damage by insect                1.829e-09        2.897935e-36    0.437

alternative hypothesis: data are not distributed as the tested distribution (p-value > α: the tested distribution is not rejected)

table 3. goodness-of-fit of the models (aic)

response variable                       multiple linear regression   generalized poisson regression   negative binomial regression
num. of damage before process           728.87                       768.85                           681.24
num. of damage in process collection    752.10                       789.58                           678.58
num. of damage by insect                623.68                       537.20                           426.52

fig. 4. histogram of each response variable: (a) natural damage, (b) in-process damage, (c) damage by insect

fig. 5. distribution of damage for each species: (a) natural damage, (b) in-process damage, (c) damage by insect

fig. 6. distribution of damage for each origin of specimen: (a) natural damage, (b) in-process damage, (c) damage by insect

c. statistical modeling

the modeling in table 4 (partial f-test) shows that the explanatory variables of specimen origin and species significantly affect the level of specimen damage before the collection process. the wald test, shown in table 5, was carried out to see which group within each explanatory variable significantly affected the level of specimen damage before the collection process; this test also shows the direction of influence of each explanatory variable. excoecaria cochinchinensis is a species that significantly affects the damage to herbarium specimens (bp). the positive estimated coefficient of this variable indicates that this species is more vulnerable to damage than the other three species: natural damage was more common in specimens of e. cochinchinensis than in the other three species. table 6 shows that the difference in storage places significantly affects the damage during the remounting process, indicating that different storage locations can affect the level of specimen damage due to this technical factor (remounting). no_ph7 has a higher level of damage due to remounting than the other storage areas (see table 7). modeling with the response variable of the level of damage due to insects at the storage location shows that only the explanatory variables of the storage area and the level of natural damage have a significant effect (table 8). the wald test in table 9 shows the direction of the influence of the level of natural damage and the specimen's storage place: the more damaged a specimen is due to natural factors, the higher the level of damage due to insects at the storage location. meanwhile, locations no_ph10 and no_ph15 had a significant negative effect on the level of specimen damage due to insects, meaning that both storage areas have a lower level of damage than the other storage areas.

table 4. partial f-test effect for predictor variables (response variable: natural damage before collecting process/bp)

model   predictors                              theta   resid. df   2 x log-lik.
1       age_specimen + species                  2.59    121         -667.05
2       origin_spec + species                   3.02    114         -653.29
3       origin_spec + age_specimen              2.67    116         -664.22
4       origin_spec + age_specimen + species    3.02    113         -653.24

test     df   lr stat.   pr(chi)
1 vs 4   8    13.81      0.09
2 vs 4   1    0.06       0.81
3 vs 4   3    10.98      0.01

alternative hypothesis: the variable has a significant effect (p-value/pr(chi) < α: accept the alternative hypothesis)

table 5. wald test for response variable: natural damage before collecting process

coefficients                         estimate    std. error   z value   pr(>|z|)
(intercept)                           1.416783   0.398122      3.559    0.000373 ***
origin_speccelebes                   -0.325470   0.422238     -0.771    0.440806
origin_specjava                       0.277354   0.347138      0.799    0.424306
origin_speckawasan_ii                 0.149753   0.521106      0.287    0.773826
origin_specmalaypen                   0.505655   0.617376      0.819    0.412764
origin_specmolucas                    0.188239   0.333341      0.565    0.572276
origin_specnew_guinea                 0.371863   0.415613      0.895    0.370929
origin_specphilipphine                0.150636   0.410971      0.367    0.713964
origin_specsumatra                   -0.360010   0.345490     -1.042    0.297398
age_specimen                          0.000612   0.002635      0.232    0.816367
speciesexcoecaria cochinchinensi      0.409421   0.196950      2.079    0.037635 *
speciesexcoecaria humilis             0.367186   0.247606      1.483    0.138090
speciesexcoecaria oppositifolia      -0.716610   0.529615     -1.353    0.176028

signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

d. discussion

excoecaria cochinchinensis is a species that significantly affects the damage to herbarium specimens (bp); it has the highest level of damage before the collection process compared to the other species (the highest level of natural damage). the specimen's origin also significantly determines the level of susceptibility to damage before the specimen undergoes the collection process. therefore, specimens from the identified locations and of excoecaria cochinchinensis need to be treated more carefully in subsequent collection processes.

table 6. partial f-test effect for predictor variables (response variable: damage caused by remounting process/ip)

model   predictors                                                                              theta   resid. df   2 x log-lik.
1       age_specimen + no_ph + species + num_damage_bp + num_damage_insect                      3.68    53          -301.92
2       origin_spec + no_ph + species + num_damage_bp + num_damage_insect                       4.74    48          -290.73
3       origin_spec + age_specimen + species + num_damage_bp + num_damage_insect                3.52    51          -303.27
4       origin_spec + age_specimen + no_ph + num_damage_bp + num_damage_insect                  4.53    48          -291.27
5       origin_spec + age_specimen + no_ph + species + num_damage_insect                        4.41    48          -292.07
6       origin_spec + age_specimen + no_ph + species + num_damage_bp                            4.91    48          -289.80
7       origin_spec + age_specimen + no_ph + species + num_damage_bp + num_damage_insect        4.96    47          -289.77

test     df   lr stat.   pr(chi)
1 vs 7   6    12.15      0.059
2 vs 7   1    0.96       0.328
3 vs 7   4    13.50      0.009
4 vs 7   1    1.50       0.220
5 vs 7   1    2.30       0.129
6 vs 7   1    0.03       0.852

table 7. wald test for response variable: damage caused by remounting process

coefficients                     estimate    std. error   z value   pr(>|z|)
(intercept)                       38.71      47450000      0        1
origin_speccelebes               -38.35      47450000      0        1
origin_specjava                  -38.52      47450000      0        1
origin_speckawasan_ii             -1.311     1.029        -1.274    0.2026
origin_specmalaypen               -1.046     1.097        -0.954    0.3402
origin_specmolucas                -1.491     0.7259       -2.053    0.04 *
origin_specnew_guinea             -1.562     1.002        -1.559    0.119
origin_specsumatra                -1.005     0.748        -1.343    0.1792
age_specimen                       0.004875  0.004903      0.994    0.3201
no_phph8                           1.188     0.4783        2.483    0.013 *
no_phph12                        -37.37      47450000      0        1
no_phph13                        -37.04      47450000      0        1
no_phph15                          0.7252    0.4267        1.7      0.0892
no_phph16                        -36.83      47450000      0        1
no_phph17                        -35.89      47450000      0        1
speciesexcoecaria humilis         -0.7847    0.641        -1.224    0.2209
num_damage_bp                      0.03377   0.02226       1.517    0.1292
num_damage_insect                 -0.008157  0.0422       -0.193    0.8467

signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

the damage caused by the remounting process on herbarium specimens is primarily due to the specimen storage area. there is a difference in the quality of the specimen storage areas, which indicates non-uniformity in the management of the storage media.
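the test machinery behind tables 4–9 can be reproduced from the reported columns: the lr statistic is the difference of the "2 x log-lik." values, its p-value comes from a chi-square reference, and each wald p-value is a two-sided normal tail for z = estimate / std. error. the sketch below is a minimal stdlib-python illustration; the aic helper assumes a known parameter count, which the tables do not report:

```python
import math

def lr_statistic(two_loglik_full, two_loglik_reduced):
    """likelihood-ratio statistic from the '2 x log-lik.' columns of tables 4, 6, 8."""
    return two_loglik_full - two_loglik_reduced

def chi2_sf(x, k):
    """p(chi2_k > x) via the regularized lower incomplete gamma series."""
    s, z = k / 2.0, x / 2.0
    term, total, n = 1.0 / s, 1.0 / s, 0
    while term > total * 1e-15:
        n += 1
        term *= z / (s + n)
        total += term
    lower = total * math.exp(-z + s * math.log(z))
    return 1.0 - lower / math.gamma(s)

def wald_p(estimate, std_error):
    """two-sided wald p-value with a standard normal reference."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def aic(two_loglik, n_params):
    """akaike information criterion from a reported '2 x log-lik.' value."""
    return 2 * n_params - two_loglik

# model 3 vs model 4 in table 4: df = 3, lr stat = 10.98, pr(chi) reported as 0.01
lr = lr_statistic(-653.24, -664.22)          # 10.98
p_lr = chi2_sf(lr, 3)                        # ~0.012
# speciesexcoecaria cochinchinensi row of table 5
p_wald = wald_p(0.409421, 0.196950)          # ~0.0376
```

the reproduced p-values match the rounded pr(chi) and pr(>|z|) columns, which confirms the tables' internal consistency.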
meanwhile, the damage caused by insects at the collection storage location is driven by where the specimen is stored and by the specimen's level of damage before the collection process (natural damage before the collecting process). the storage areas appear to affect the rate of insect damage significantly, clearly indicating poor quality in certain storage places; in other words, specimen management needs to be standardized. in addition, it can be seen that specimens that were already damaged before the collection process are more likely to be damaged by insects when stored.

table 8. partial f-test effect for predictor variables (response variable: preservation damage by insect)

model   predictors                                                                                       theta   resid. df   2 x log-lik.
1       age_specimen + stat_remounting + no_ph + species + num_damage_bp + num_damage_ip                 0.49    111         -389.76
2       origin_spec + stat_remounting + no_ph + species + num_damage_bp + num_damage_ip                  0.58    105         -379.52
3       origin_spec + age_specimen + no_ph + species + num_damage_bp + num_damage_ip                     0.58    105         -379.31
4       origin_spec + age_specimen + stat_remounting + species + num_damage_bp + num_damage_ip           0.46    110         -393.18
5       origin_spec + age_specimen + stat_remounting + no_ph + num_damage_bp + num_damage_ip             0.57    105         -381.10
6       origin_spec + age_specimen + stat_remounting + no_ph + species + num_damage_ip                   0.55    105         -383.67
7       origin_spec + age_specimen + stat_remounting + no_ph + species + num_damage_bp                   0.58    105         -379.12
8       origin_spec + age_specimen + stat_remounting + no_ph + species + num_damage_bp + num_damage_ip   0.59    104         -379.09

test     df   lr stat.   pr(chi)
1 vs 8   7    10.67      0.154
2 vs 8   1    0.44       0.509
3 vs 8   1    0.23       0.633
4 vs 8   6    14.09      0.029
5 vs 8   1    2.02       0.155
6 vs 8   1    4.58       0.032
7 vs 8   1    0.04       0.845

table 9. wald test for response variable: preservation damage by insect

coefficients                       estimate    std. error   z value   pr(>|z|)
(intercept)                         6.14e-01   9.51e-01      0.646    0.518
origin_speccelebes                  6.40e-01   8.68e-01      0.737    0.461
origin_specjava                     1.02e+00   1.03e+00      0.993    0.321
origin_speckawasan_ii              -3.57e+00   2.73e+00     -1.308    0.191
origin_specmalaypen                -4.12e+01   3.00e+07      0        1.000
origin_specmolucas                 -2.42e+00   1.75e+00     -1.383    0.167
origin_specnew_guinea              -1.93e+00   1.88e+00     -1.027    0.305
origin_specphilippine              -2.70e+00   1.98e+00     -1.367    0.172
origin_specsumatra                 -3.24e+00   1.85e+00     -1.756    0.079
age_specimen                       -5.58e-03   6.68e-03     -0.836    0.403
stat_remountingwith_remounting      2.34e-01   4.46e-01      0.524    0.600
no_phph8                           -1.53e+00   9.33e-01     -1.639    0.101
no_phph9                           -6.21e-01   9.65e-01     -0.643    0.520
no_phph10                          -2.76e+00   1.25e+00     -2.21     0.027 *
no_phph12                           7.37e-01   1.84e+00      0.402    0.688
no_phph13                           2.98e+00   1.92e+00      1.554    0.120
no_phph15                          -2.55e+00   7.63e-01     -3.341    0.001 ***
no_phph16                           1.95e+00   1.96e+00      0.995    0.320
no_phph17                           3.75e+00   3.22e+00      1.163    0.245
speciesexcoecaria humilis          -2.42e+00   1.73e+00     -1.404    0.160
num_damage_bp                       9.64e-02   4.00e-02      2.411    0.016 *
num_damage_ip                       7.55e-03   3.50e-02      0.216    0.829

signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

iv. conclusion

this study attempted to explore the effect of species type, time, location, storage, and remounting status on the level of damage to herbarium specimens (especially those in the genus excoecaria).
the response was the total number of spots with bp, ip, and insect damage on herbarium specimens (hs), modeled with negative binomial regression (nbr), poisson regression, and ordinary gaussian regression approaches. the experiments show that regression modeling based on the normal distribution was not effective in modeling the damage phenomenon in herbarium specimens. methods based on the distribution of the count data (the amount of damage to herbarium specimens), predominantly negative binomial regression, can model the phenomenon of damage to herbarium specimens better than gpr modeling and ordinary gaussian regression models. based on the negative binomial regression modeling, a non-uniformity in the storage process was detected. the storage location factor significantly positively affects damage to herbarium specimens (caused by insects and by the remounting process), so the procedure for storing herbarium specimens needs to be standardized. meanwhile, damage due to natural factors is driven by differences between species; bo management needs to pay particular attention to the excoecaria cochinchinensis species. this research is limited to the excoecaria species modeled here, and the results will probably differ for other species. new species will be added in the future to make the results more general than the existing model.

declarations

author contribution

all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement

this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest

the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information

reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
publisher’s note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references

[1] n. o. kin, n. y. demchenko, and s. n. ryabtsov, “rare plants of the voronezh region in ecosystems of khrenovsky pine forest,” iop conf. ser. earth environ. sci., vol. 817, no. 1, 2021. https://doi.org/10.1088/1755-1315/817/1/012048
[2] a. a. pinto, j. j. c. mont, d. e. m. jiménez, a. g. noriega, j. j. barrios, and a. c. mccormick, “characterization of riparian tree communities along a river basin in the pacific slope of guatemala,” forests, vol. 12, no. 7, pp. 1–12, 2021. https://doi.org/10.3390/f12070898
[3] t. k. miller, a. s. gallinat, l. c. smith, and r. b. primack, “comparing fruiting phenology across two historical datasets: thoreau’s observations and herbarium specimens,” ann. bot., vol. 128, no. 2, pp. 159–170, 2021. https://doi.org/10.1093/aob/mcab019
[4] a. g. auffret, “historical floras reflect broad shifts in flowering phenology in response to a warming climate,” ecosphere, vol. 12, no. 7, 2021. https://doi.org/10.1002/ecs2.3683
[5] l. a. jenny, l. r. shapiro, c. c. davis, j. davies, n. e. pierce, and e. k. meineke, “herbarium specimens reveal herbivory patterns across the genus cucurbita,” biorxiv, 2021. https://doi.org/10.1101/2021.07.21.452357
[6] e. k. meineke, a. t. classen, n. j. sanders, and t. jonathan davies, “herbarium specimens reveal increasing herbivory over the past century,” j. ecol., vol. 107, no. 1, pp. 105–117, 2019. https://doi.org/10.1111/1365-2745.13057
[7] w. milliken, b. e. walker, m. j. r. howes, f. forest, and e. nic lughadha, “plants used traditionally as antimalarials in latin america: mining the tree of life for potential new medicines,” j. ethnopharmacol., vol. 279, 2021. https://doi.org/10.1016/j.jep.2021.114221
[8] p. maher et al., “the value of herbarium collections to the discovery of novel treatments for alzheimer’s disease, a case made with the genus eriodictyon,” front. pharmacol., vol. 11, 2020. https://doi.org/10.3389/fphar.2020.00208
[9] s. acha, a. linan, j. macdougal, and c. edwards, “the evolutionary history of vines in a neotropical biodiversity hotspot: phylogenomics and biogeography of a large passion flower clade (passiflora section decaloba),” mol. phylogenet. evol., vol. 164, p. 107260, 2021. https://doi.org/10.1016/j.ympev.2021.107260
[10] n. forin, a. vizzini, f. fainelli, e. ercole, and b. baldan, “taxonomic re-examination of nine rosellinia types (ascomycota, xylariales) stored in the saccardo mycological collection,” microorganisms, vol. 9, no. 3, 2021. https://doi.org/10.3390/microorganisms9030666
[11] d. girmansyah, y. santika, rugayah, and j. s. rahajoe, index herbariorum indonesianum, 2018. https://lipipress.lipi.go.id/detailpost/index-herbariorum-indonesianum
[12] v. bestandssituation, “bärlappe in thüringen – verbreitung und bestandssituation,” landschaftspfl. und naturschutz thüringen, vol. 52, no. 2, pp. 51–54, 2015. https://nabu-gera-greiz.de/media/pages/aktuelles/landschaftspflege-und-naturschutz-in-thuringen-heft-2-2015-erschienen/ebe4a9ed5b-1606124733/151019-leseprobe.pdf
[13] a. güntsch, w. berendsohn, and p. mergen, “the biocase project – a biological collections access service for europe,” ferrantia, vol. 51, pp. 103–108, 2007. https://www.researchgate.net/publication/263083477_the_biocase_project_-_a_biological_collections_access_service_for_europe
[14] e. k. meineke, c. tomasi, s. yuan, and k. m. pryer, “applying machine learning to investigate long-term insect–plant interactions preserved on digitized herbarium specimens,” appl. plant sci., vol. 8, no. 6, pp. 1–11, 2020. https://doi.org/10.1002/aps3.11369
[15] k. d. pearson et al., “machine learning using digitized herbarium specimens to advance phenological research,” bioscience, vol. 70, no. 7, pp. 610–620, 2020. https://doi.org/10.1093/biosci/biaa044
[16] i. koh et al., “modeling the status, trends, and impacts of wild bee abundance in the united states,” proc. natl. acad. sci. u. s. a., vol. 113, no. 1, pp. 140–145, 2016. https://doi.org/10.1073/pnas.1517685113
[17] c. meyer, p. weigelt, and h. kreft, “multidimensional biases, gaps and uncertainties in global plant occurrence information,” ecol. lett., vol. 19, no. 8, pp. 992–1006, 2016. https://doi.org/10.1111/ele.12624
[18] m. a. jamieson, a. l. carper, c. j. wilson, v. l. scott, and j. gibbs, “geographic biases in bee research limits understanding of species distribution and response to anthropogenic disturbance,” front. ecol. evol., vol. 7, pp. 1–8, 2019. https://doi.org/10.3389/fevo.2019.00194
[19] a. agresti, c. franklin, and b. klingenberg, statistics: the art and science of learning from data, 4th ed. new york: pearson, 2012.
[20] p. c. consul and f. famoye, “generalized poisson regression model,” communications in statistics – theory and methods, vol. 21, no. 1, pp. 89–109, 1992. https://doi.org/10.1080/03610929208830766
[21] a. zeileis, c. kleiber, and s. jackman, “regression models for count data in r,” j. stat. softw., vol. 27, no. 8, pp. 1–25, 2008. https://doi.org/10.18637/jss.v027.i08
[22] j. m. hilbe, “modeling count data,” 2014. https://doi.org/10.1007/978-3-642-04898-2_369
[23] c. von linnei, systema naturae, editio decima. impensis direct. laurentii salvii, 1759. https://doi.org/10.5962/bhl.title.542
[24] flora fauna web, “excoecaria cochinchinensis lour.,” singapore national parks, 2019. https://www.nparks.gov.sg/florafaunaweb/flora/2/0/2010 (accessed jan. 28, 2021).
[25] t. a. james and g. j. harden, “genus excoecaria,” new south wales flora online, 2021. https://plantnet.rbgsyd.nsw.gov.au/cgi-bin/nswfl.pl?page=nswfl&lvl=gn&name=excoecaria
[26] e. k. meineke and b. h. daru, “bias assessments to expand research harnessing biological collections,” trends ecol. evol., pp. 1–12, 2021. https://doi.org/10.1016/j.tree.2021.08.003
[27] a. hazra, “an exact kolmogorov–smirnov test for the negative binomial distribution with unknown probability of success,” res. rev. j. stat., vol. 2, no. 1, pp. 1–13, 2013. https://sciencejournals.stmjournals.in/index.php/rrjost/article/view/2594
[28] m. jamshidian, r. i. jennrich, and w. liu, “a study of partial f tests for multiple linear regression models,” comput. stat. data anal., vol. 51, no. 12, pp. 6269–6284, 2007. https://doi.org/10.1016/j.csda.2007.01.015
[29] c. m. woods, l. cai, and m. wang, “the langer-improved wald test for dif testing with multiple groups: evaluation and comparison to two-group irt,” educ. psychol. meas., vol. 73, no. 3, pp. 532–547, 2013. https://doi.org/10.1177/0013164412464875

knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 1, december 2022, pp. 87–100 eissn 2597-4637 https://doi.org/10.17977/um018v5i12022p87-100
©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

the effect of resampling on classifier performance: an empirical study

utomo pujianto a,1,*, muhammad iqbal akbar a,2, niendhitta tamia lassela a,3, deni sutaji b,4
a department of electrical engineering, universitas negeri malang, jl. semarang no. 5, malang 65145, indonesia
b bilgisayar bilimleri (computer science), gazi üniversitesi, emniyet, milas sk. no:30, 06560 yenimahalle/ankara, turkey
1 utomo.pujianto.ft@um.ac.id*; 2 iqbal.akbar.ft@um.ac.id; 3 niendhittatamia.1605356@students.um.ac.id; 4 deni.sutaji@gazi.edu.tr
* corresponding author

i. introduction

classification is one of the activities in data mining that aims to group data into classes. in general, a dataset contains two or more class labels. however, most data in a dataset have an unbalanced amount of data between classes.
that means one of the classes in the dataset has more data than another class; this class is called the majority class, and its existence implies a minority class [1]. the minority class is a class that has less data than the majority class. the occurrence of a majority class and a minority class in the dataset is called class imbalance. when performing the classification process on a dataset with class imbalance, the majority class label dominates the predictions, so the performance of the classification algorithm decreases [2]. one of the solutions to overcome class imbalance in a dataset is a data-level approach, such as resampling and synthesizing data [3]. the purpose of resampling is to make minority data more recognizable by the algorithm by adjusting the distribution of the minority and majority classes. class balance in the dataset can be obtained by removing data from the majority class and adding data to the minority class.

article info

article history:
received 2 january 2021
revised 25 january 2021
accepted 4 june 2022
published online 7 november 2022

a b s t r a c t

an imbalanced class in a dataset is a common classification problem. using imbalanced class datasets can cause a decrease in the performance of the classifier, and resampling is one of the solutions to this problem. this study used 100 datasets from 3 websites: uci machine learning, kaggle, and openml. each dataset goes through 3 processing stages: the resampling process, the classification process, and the significance testing process between performance evaluation values of the combinations of classifier and resampling using the paired t-test. the resampling techniques used are random undersampling, random oversampling, and smote. the classifiers used in the classification process are the naïve bayes classifier, decision tree, and neural network. the resulting accuracy, precision, recall, and f-measure values are tested using paired t-tests to determine the significance of the classifier's performance between datasets that were not resampled and those that were. the paired t-test is also used to find combinations of classifier and resampling that give significant results. this study obtained two results. the first result is that resampling imbalanced class datasets can affect the classifier's performance substantially more than leaving the datasets without resampling. the second result is that the neural network algorithm without resampling provides significant results based on the accuracy value, while the neural network algorithm combined with the smote technique provides significant performance based on the precision, recall, and f-measure values. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords:
classifier performance
resampling techniques
paired t-test

u. pujianto et al. / knowledge engineering and data science 2022, 5 (1): 87–100

several examples of methods are included in the resampling technique, such as random undersampling, random oversampling, and smote [4]. undersampling is resampling by reducing the data in the majority class. this technique is effective in overcoming class imbalance because much of the majority class data is ignored, so the dataset becomes more balanced and the data training process becomes faster [5]. oversampling is a resampling technique that adds data to the minority class.
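the two random resampling schemes just described can be written in a few lines. the sketch below is a stdlib-python illustration with hypothetical class sizes and rows; it is a minimal sketch, not the exact procedure used in the study:

```python
import random

def random_undersample(majority, minority, seed=0):
    """randomly discard majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, seed=0):
    """randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

majority = [[x, 0] for x in range(90)]   # 90 hypothetical majority rows, label 0
minority = [[x, 1] for x in range(10)]   # 10 hypothetical minority rows, label 1
maj_u, min_u = random_undersample(majority, minority)
maj_o, min_o = random_oversample(majority, minority)
```

after either call both classes have the same number of rows, which is the 100% balancing ratio discussed later in the methods.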
oversampling can add necessary information to minority classes and prevent misclassification [6]. this study uses random undersampling, random oversampling, and smote as resampling techniques to overcome class imbalance in the datasets used. related research on resampling showed that random undersampling significantly improves the classification performance of imbalanced classes in medicare big data, obtaining an auc score of 97% [7]. other related research showed that random oversampling and the smote technique outperformed the other resampling methods [8]. majority-to-minority resampling (mmr), a hybrid approach that adaptively selects potential instances from the majority class to enhance the minority class, outperforms several strong baselines across standard metrics for imbalanced data [9]. similarity oversampling and undersampling preprocessing (soup), which resamples difficult cases, outperforms specialized preprocessing methods for multi-imbalance problems and competes with well-known decomposition ensembles on natural and artificial datasets [10]. borderline-smote, random oversampler, smote, smote-enn, svm-smote, and smote-tomek were used to handle imbalanced data and predict student success on two datasets with machine learning models such as random forest, k-nearest-neighbor, artificial neural network, xgboost, support vector machine (radial basis function), decision tree, logistic regression, and naïve bayes [11]. svm-smote outperformed the other resampling methods in the friedman statistical significance test, and random forest performed best after svm-smote resampling. this study's motivation and new contribution lie in evaluating the resampling algorithms with three different classifiers: naïve bayes classifier, decision tree, and backpropagation neural network (bpnn) on 100 public datasets.
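in contrast to plain duplication, smote interpolates new points on the segments between a minority sample and one of its nearest minority neighbours. the sketch below is a simplified stand-in written in stdlib python with hypothetical data; it is not the original smote implementation:

```python
import math
import random

def k_nearest(point, candidates, k):
    """indices of the k nearest candidates by euclidean distance."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]

def smote_like(minority, n_new, k=5, seed=0):
    """create n_new synthetic rows on segments between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(range(len(minority)))
        others = [p for i, p in enumerate(minority) if i != base]
        idx = rng.choice(k_nearest(minority[base], others, min(k, len(others))))
        neigh = others[idx]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a)
                          for a, b in zip(minority[base], neigh)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote_like(minority, n_new=4, k=2)
```

because each synthetic row lies between two existing minority points, the new samples stay inside the minority region instead of repeating existing rows verbatim.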
While resampling algorithms have been widely used in the literature, their effectiveness in improving classification performance on imbalanced datasets is still an open question, particularly when combined with different classifiers. Furthermore, this study goes beyond simply applying a resampling algorithm and evaluates its impact on classification metrics, including accuracy, precision, recall, and F-measure, which are particularly relevant in applications where false negatives and false positives can have significant consequences. Overall, this study aims to determine the significance of resampling by comparing the classifier's performance on resampled datasets with its performance on datasets without resampling. To achieve that goal, an empirical study was carried out on 100 datasets. Each dataset had three resampling techniques applied: random undersampling, random oversampling, and SMOTE. The datasets without resampling and the resampled datasets were then classified using three different classifiers. The classification metric values were tested using paired t-tests to find the significance of resampling in general and of each combination of classifier and resampling technique. These findings can inform the development of more effective and reliable machine-learning models for imbalanced classification. Section II of this article explains the methodologies utilized in this research. The findings are presented in Section III, with a discussion of those results and a comparison of relevant models. The conclusion can be found in Section IV.

II. Methods

A. Data Collection

This study used 100 datasets obtained online from three websites: UCI Machine Learning, Kaggle, and OpenML. Each dataset has numeric inputs and a binary class.
Most datasets have more than ten attributes and fewer than 1000 instances. The fields used in the datasets can be seen in Figure 1.

Fig. 1. The fields used in the dataset

B. Data Preprocessing

Two preprocessing steps are carried out in this phase: imputing missing values and resampling. Imputing missing values aims to replace the missing values in a dataset with new sample values. In this study, the method used to create new samples is the k-nearest neighbor (k-NN) with k = 10 neighbors and the Manhattan distance. This method looks for cases that are similar to the case with missing values; two cases are similar if each attribute of the two cases is close together. If a similar case is found, the attribute with missing values is filled with the attribute value from the similar case. The k-NN algorithm can provide more robust and more sensitive predictions of missing values [12]. Using k = 10 neighbors to overcome missing values can minimize the error rate in classification [13], and the Manhattan distance can give better results than other kinds of distance (Euclidean distance, correlation distance, and cosine distance) [14]. Resampling resamples each dataset using random undersampling, random oversampling, and SMOTE with a ratio of 100%. Random undersampling is a technique that removes some randomly selected data to decrease the majority class [15]. Random oversampling is an oversampling technique that duplicates randomly selected data to increase the minority class [16]. SMOTE is an oversampling technique that creates new synthetic data from some of the closest selected data using k-NN [17]. A SMOTE ratio of 100% means that new samples are created until the minority class has the same number of instances as the majority class.
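As a minimal sketch of the two random resampling strategies described above (the function names and the balancing-to-equal-size behavior are illustrative, not the authors' implementation; in practice libraries such as imbalanced-learn provide these):

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly drop rows from larger classes until every class has the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in by_class.values())
    Xr, yr = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, n_min):  # keep only n_min rows per class
            Xr.append(xi)
            yr.append(label)
    return Xr, yr

def random_oversample(X, y, seed=0):
    """Randomly duplicate rows of smaller classes until every class has the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_max = max(len(rows) for rows in by_class.values())
    Xr, yr = [], []
    for label, rows in by_class.items():
        picks = rows + [rng.choice(rows) for _ in range(n_max - len(rows))]
        for xi in picks:
            Xr.append(xi)
            yr.append(label)
    return Xr, yr
```

SMOTE differs from the oversampler above in that it interpolates new synthetic points between a minority instance and its k nearest minority neighbors rather than duplicating existing rows.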
C. Data Classification

Three classifiers are used in this study: Gaussian naïve Bayes, decision tree with the C4.5 algorithm, and BPNN. Gaussian naïve Bayes is a kind of naïve Bayes classifier that uses the Gaussian (normal) distribution to calculate the probability of each attribute. The naïve Bayes classifier is a classifier based on Bayes' theorem with an assumption of independence among features [18]. The normal distribution formula is shown in (1).

f(x) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))    (1)

where f(x) is the normal density of an attribute in a class, σ is the standard deviation of that attribute in that class, μ is the mean value of that attribute in that class, and x is the sample value of the attribute. Here is the pseudocode of Gaussian naïve Bayes.

Input: training dataset D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi is a feature vector and yi is the corresponding class label.
Output: a trained Gaussian naive Bayes classifier.
Start {
1. Calculate the prior probability of each class P(y) as the frequency of each class in the training dataset.
2. For each feature i, calculate the mean and standard deviation for each class j:
   • mean μij = mean(xi | y = j)
   • standard deviation σij = std(xi | y = j)
3. For a new sample x:
   a. For each class j:
      i. Calculate the likelihood P(x | y = j) using a Gaussian probability density function with mean μij and standard deviation σij for each feature i.
      ii. Calculate the posterior probability P(y = j | x) using Bayes' theorem: P(y = j | x) = P(x | y = j) * P(y = j) / P(x).
   b. Choose the class with the highest posterior probability as the predicted class for x.
} End

The decision tree classifier is a classifier that makes branching conditions based on specific attribute values, repeated until the branching process can no longer be done.
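The Gaussian naïve Bayes pseudocode above can be sketched directly in Python; this is a minimal illustration (function names are assumptions, and log-probabilities are used for numerical stability):

```python
import math
from collections import defaultdict

def gnb_fit(X, y):
    """Estimate per-class priors and per-attribute mean/std, as in steps 1-2."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    n = len(y)
    for label, rows in groups.items():
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        stds = [max(math.sqrt(sum((v - m) ** 2 for v in col) / len(rows)), 1e-9)
                for col, m in zip(cols, means)]
        model[label] = (len(rows) / n, means, stds)
    return model

def gnb_predict(model, x):
    """Step 3: pick the class with the highest log-posterior."""
    best, best_score = None, -math.inf
    for label, (prior, means, stds) in model.items():
        score = math.log(prior)
        for v, m, s in zip(x, means, stds):
            # log of the normal density in equation (1)
            score += -math.log(math.sqrt(2 * math.pi) * s) - (v - m) ** 2 / (2 * s ** 2)
        if score > best_score:
            best, best_score = label, score
    return best
```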
There are three parts to a decision tree: the root node, the main attribute that most influences determining the class [19]; the branch nodes, the attributes selected after the root node [20]; and the leaf nodes, the class labels of each branch that is passed [21]. The decision tree structure is thus similar to a tree structure. This study used the C4.5 algorithm to determine the branching. The C4.5 algorithm is a development of the ID3 method which can provide better accuracy than ID3 [22]. The formulas to calculate the gain ratio are shown in (2) to (6).

Gain Ratio(A) = Gain(A) / SplitInfo_A(D)    (2)

SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) log₂(|D_j| / |D|)    (3)

Gain(A) = Info(D) − Info_A(D)    (4)

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)    (5)

Info(D) = − Σ_{i=1}^{m} p_i log₂(p_i)    (6)

where Gain Ratio(A) is the gain ratio value for each split point, Gain(A) is the information gain value for each split point, SplitInfo_A(D) is the split info value for each split point, Info(D) is the overall entropy value of the dataset, Info_A(D) is the entropy value for each split point, |D_j| is the number of instances in partition j of the split point, |D| is the total number of instances, v is the number of partitions produced by the split point, m is the number of class labels, A is the split point, and p_i is the probability of each class. This is the pseudocode for the decision tree with the C4.5 algorithm.

Input: training dataset D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi is a feature vector and yi is the corresponding class label.
Output: a trained decision tree classifier.
Start {
1. If all samples in D belong to the same class y, then return a leaf node with class y.
2. If the set of features F is empty, then return a leaf node with the majority class in D.
3. Calculate the information gain ratio for each feature i in F:
   • Calculate the entropy H(D) of the current dataset D.
   • For each possible value v of feature i, calculate the entropy H(D | xi = v) of the subset of D with xi = v.
   • Calculate the information gain IGi = H(D) − Σv P(xi = v) * H(D | xi = v), where P(xi = v) is the proportion of samples with xi = v in D.
   • Calculate the split information SIi = − Σv P(xi = v) * log₂(P(xi = v)).
   • Calculate the information gain ratio IG_ratioi = IGi / SIi.
4. Choose the feature i with the highest information gain ratio IG_ratioi as the splitting feature.
5. Create a decision tree node with feature i and its possible values as children.
6. For each child node j of the current node:
   • Let Dj be the subset of D with xi = j.
   • If Dj is empty, create a leaf node with the majority class in D.
   • Otherwise, recursively build the subtree rooted at node j using Dj and the remaining features F \ {i}.
7. Return the root node of the decision tree.
} End

A neural network (NN) is a classification algorithm whose operation resembles the workings of the human nervous system: a collection of interconnected neurons used to perform complex learning repeatedly [23]. An NN contains a collection of inputs and outputs connected by weighted links. The weights are adjusted during the learning phase to help the network make correct class predictions from the input. NNs are suitable for applications that require complex learning, although an NN takes a long time to carry out repeated learning [24] and must be adjusted to empirically determined parameters and network designs [25]. In the NN model, the weights on each path are updated, so the NN can learn to handle noisy datasets. An NN conducts several rounds of training on each case, so the algorithm has a relatively small error rate and high accuracy [26]. The prediction results are influenced by the learning rate value, the target error, the amount of training data used, and the initial weights [27]. The NN model has three parts: the input layer, the hidden layer, and the output layer.
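The entropy and gain-ratio computations in (2) to (6), which drive the C4.5 split selection above, can be sketched as follows (function names are illustrative, for a categorical attribute):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i log2 p_i, equation (6)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio(A) for one candidate split attribute, equations (2)-(5)."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())  # (5)
    gain = entropy(labels) - info_a                                     # (4)
    split_info = -sum(len(p) / n * math.log2(len(p) / n)
                      for p in partitions.values())                     # (3)
    return gain / split_info if split_info > 0 else 0.0                 # (2)
```

A split that separates the classes perfectly yields a gain ratio of 1, while a split that carries no class information yields 0.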
The input layer contains a collection of input nodes holding the input attribute values. The hidden layer is the layer after the input layer, containing hidden nodes whose values are the input values processed with weight and bias values. The weight value states the priority of an input; the bias value is a constant contained in each hidden node. The output layer contains output nodes whose values, processed from several hidden nodes, become the predictive values. In this study, the learning method in the NN algorithm is the backpropagation method. Backpropagation adjusts the weights repeatedly for each tuple, with weight changes propagated backward from the output layer to the hidden layer. The purpose of this adjustment is to minimize the mean squared error (MSE); if the MSE is low, the predicted class is similar to the actual class. There are two processes in the backpropagation neural network (BPNN): the feedforward pass, in which each input attribute value is propagated forward from the input layer to the output layer, and the backward pass, in which the error between the output-layer value and the target value is calculated. The error value is used to adjust the weights until the error becomes smaller than the target error [28]. The first step of the BPNN is shown in (7).

I_j = Σ_i w_ij O_i + w_θj θ_j    (7)

Equation (7) is used to calculate the new input value for each unit in the hidden layer and the output layer, where I_j is the new input value for unit j in the hidden layer or output layer, w_ij is the weight from unit i in the previous layer to unit j, and O_i is the output value from the previous layer.
If the calculation is done for the first time, the value of O_i is the value of the input layer, w_θj is the bias weight for each unit, and θ_j is the bias value for each unit. Then the new output of each unit in the hidden layer and the output layer is calculated using the formula in (8).

O_j = 1 / (1 + e^(−I_j))    (8)

where O_j is the output value of unit j and I_j is the input value of unit j. Next, the error value used as a stopping condition is calculated using the MSE formula in (9).

MSE = (1/n) Σ_{i=1}^{n} (X_i − S_i)²    (9)

where X_i is the output value in the dataset, S_i is the output value calculated in the previous layer, and n is the number of classes. To perform the backward pass, the first step is to calculate the derivative of the total error with respect to the weight of each unit using the formula in (10).

∂E_T/∂w_ij = ∂E_T/∂O_j × ∂O_j/∂I_j × ∂I_j/∂w_ij    (10)

where ∂E_T/∂O_j is the derivative of the total error with respect to the output of each unit in the hidden and output layers, ∂O_j/∂I_j is the derivative of the output of each unit with respect to its input, and ∂I_j/∂w_ij is the derivative of the input of each unit in the output layer and the hidden layer with respect to the weight connected to that unit. Then the weight is updated using the formula in (11).

w_ij^new = w_ij^old − (l × ∂E_T/∂w_ij)    (11)

where w_ij^new is the new weight of connection ij, w_ij^old is the old weight of connection ij, ∂E_T/∂w_ij is the derivative of the total error with respect to the weight of each unit, and l is the learning rate. This is the pseudocode for the BPNN.

Input: training dataset D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi is a feature vector and yi is the corresponding class label.
Output: a trained BPNN classifier.
Start {
1. Initialize the weights and biases of the neural network randomly.
2. For each training sample (x, y) in D, do the following steps:
a.
Forward pass:
   i. Calculate the output y' of the neural network for input x by applying the weights and biases to each neuron using the activation function.
   ii. Calculate the error δ for each neuron in the output layer as δj = y'j (1 − y'j)(yj − y'j), where yj is the desired output for neuron j.
   iii. Calculate the error δ for each neuron in the hidden layers using the chain rule: δj = yj (1 − yj) Σ wjk δk, where wjk is the weight from neuron k to neuron j, and δk is the error for neuron k.
b. Backward pass:
   i. Update the weights and biases of the neural network using the error δ and the learning rate α as follows:
      • For each weight wjk from neuron k to neuron j: wjk = wjk + α δj yk
      • For each bias bj of neuron j: bj = bj + α δj
3. Repeat step 2 for a fixed number of epochs or until the error on the validation set stops improving.
4. Return the trained BPNN classifier.
} End

D. Evaluation

The first evaluation uses classification metrics computed from the confusion matrix. A confusion matrix is a table with as many row and column dimensions as the number of classes in the dataset, used to analyze the performance of a classification algorithm; it evaluates how good the quality of the classifier's performance is. The confusion matrix has four components: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP is the number of positive-class data classified correctly. TN is the number of negative-class data classified correctly. FP is the number of negative-class data incorrectly predicted as the positive class. FN is the number of positive-class data incorrectly predicted as negative [29]. These four values are used to compute the performance evaluation values: accuracy, precision, recall, and F-measure.
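Returning to the BPNN: the forward and backward passes in (7) to (11) and the pseudocode above can be sketched with a tiny one-hidden-layer network. This is a minimal illustration, assuming illustrative layer sizes, toy data, and a fixed seed; it is not the authors' network configuration:

```python
import math
import random

def sigmoid(I):
    return 1.0 / (1.0 + math.exp(-I))  # equation (8)

class TinyBPNN:
    def __init__(self, n_in, n_hid, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
        self.b1 = [rng.uniform(-1, 1) for _ in range(n_hid)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hid)]
        self.b2 = rng.uniform(-1, 1)

    def forward(self, x):
        # weighted sum (7) followed by the sigmoid activation (8)
        self.h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
                  for ws, b in zip(self.w1, self.b1)]
        self.o = sigmoid(sum(w * h for w, h in zip(self.w2, self.h)) + self.b2)
        return self.o

    def train_step(self, x, target, lr=0.5):
        o = self.forward(x)
        delta_o = o * (1 - o) * (target - o)              # output-layer error (step 2.a.ii)
        for j, h in enumerate(self.h):
            delta_h = h * (1 - h) * self.w2[j] * delta_o  # hidden-layer error (step 2.a.iii)
            self.w2[j] += lr * delta_o * h                # weight update, cf. (11)
            for i, v in enumerate(x):
                self.w1[j][i] += lr * delta_h * v
            self.b1[j] += lr * delta_h
        self.b2 += lr * delta_o

def mse(net, data):
    """Equation (9) over a dataset of (input, target) pairs."""
    return sum((t - net.forward(x)) ** 2 for x, t in data) / len(data)
```

Repeated calls to `train_step` over the training data drive the MSE down, which is the stopping criterion the paper describes.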
The formulas to calculate accuracy, precision, recall, and F-measure are shown in (12) to (15) [30].

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) × 100%    (12)

Precision = (TP / (TP + FP)) × 100%    (13)

Recall = (TP / (TP + FN)) × 100%    (14)

F-Measure = ((2 × precision × recall) / (precision + recall)) × 100%    (15)

The second evaluation uses the paired t-test, implemented in this study with SciPy, a Python-based library. A paired t-test is performed by calculating the t value between the compared values using the formula in (16) [31].

t = (d̄ / S_d) × √n    (16)

where t is the t-statistic used to determine the significance between two values, d is the difference between the two samples, S_d is the standard deviation of the differences, d̄ is the mean of the differences between the two samples, and n is the number of instances. The paired t-test has two hypotheses: H0, meaning there is no significant difference between the two compared values, and H1, meaning there is a significant difference between the two compared values. To decide between H0 and H1, an alpha value is required; the alpha value used in this study is 5%. If the p-value is less than 5%, then H0 is rejected and H1 is accepted. If the p-value is more than 5%, then H0 is accepted and H1 is rejected.

III. Results and Discussion

Two results are obtained from this study. The first is the performance evaluation of each combination of classifier and resampling technique: accuracy, precision, recall, and F-measure. The second is the paired t-test results for resampling in general and for combinations of classifier and resampling based on accuracy, precision, recall, and F-measure values. The performance evaluation results of each classifier combined with the resampling techniques are shown in Figure 2 to Figure 4.
Fig. 2. Evaluation of Gaussian naïve Bayes with each resampling technique, for each type of evaluation value, based on the mean

Fig. 3. Evaluation of the decision tree with the C4.5 algorithm with each resampling technique, for each type of evaluation value, based on the mean

Fig. 4. Evaluation of the backpropagation neural network with each resampling technique, for each type of evaluation value, based on the mean

Based on the results, classification without resampling gives better accuracy for each algorithm than the three resampling techniques. SMOTE gives better results based on the recall and F-measure values for all three classification algorithms. Based on the precision value, resampling gives different results for each algorithm: random undersampling gives the best precision with the Gaussian naïve Bayes algorithm, random oversampling gives the best precision with the C4.5 decision tree algorithm, and SMOTE gives the best precision with the BPNN algorithm. Combining the BPNN algorithm with SMOTE provides the best performance. This is because SMOTE provides new samples, so the classification algorithm can learn more data patterns. SMOTE provides better performance than random undersampling and random oversampling because in random undersampling there is a possibility that essential data will be lost in the random deletion process, so the classifier cannot recognize more varied patterns, which can decrease the classifier's performance.
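The four evaluation values compared in Figures 2 to 4 are direct transcriptions of equations (12) to (15); as a small sketch (the confusion-matrix counts below are illustrative only):

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F-measure (as percentages)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100           # (12)
    precision = tp / (tp + fp) * 100                           # (13)
    recall = tp / (tp + fn) * 100                              # (14)
    f_measure = 2 * precision * recall / (precision + recall)  # (15)
    return accuracy, precision, recall, f_measure
```

On an imbalanced dataset, accuracy can stay high while recall collapses, which is why the paper tracks all four values rather than accuracy alone.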
Meanwhile, random oversampling duplicates the same data randomly, which can lead to overfitting: the classifier appears to perform better because it predicts correctly on duplicated data, but when classifying new data it misclassifies because it does not recognize new data patterns, and this can decrease the classifier's performance. The classification results from the unresampled datasets give better accuracy but lower precision, recall, and F-measure than the three resampling techniques, because classification on unresampled datasets can produce overfit results: the algorithm appears to perform well because it predicts the majority class correctly, so when classifying new data that should belong to the minority class, it misclassifies it into the majority class. This is because the classification algorithm does not learn the minority data well and is better at recognizing data patterns in the majority class. The impact is a decrease in the algorithm's performance, as the FP and FN values become high due to classification errors. The BPNN gives better performance than Gaussian naïve Bayes and the decision tree with the C4.5 algorithm because, in Gaussian naïve Bayes, the probability of each attribute in each class is calculated with a Gaussian distribution, so class determination depends heavily on the mean and standard deviation of each attribute; if the mean and standard deviation values are larger in a class, the algorithm is more likely to assign a new instance to the class with the higher mean and standard deviation values.
Meanwhile, in the decision tree with the C4.5 algorithm, there is a possibility that an entropy value of 0 appears while calculating over the classes, so the decision tree model immediately determines the class based on the attribute that yields 0 at a split point, causing class determination at the very beginning of the branching. Class determination is then decided by only one attribute, which can reduce the algorithm's performance because other attributes that could influence class determination are not chosen. In the BPNN, the error between the prediction results and the actual class is calculated and the weights and biases are adjusted accordingly, which supports more optimal classification results. The t-test results for resampling in general are shown in Table 1. To determine the significance of the results, this study used a z value of 1.960 for the t-paired statistic and the α value for the p-paired value as thresholds for deciding the hypothesis. If the test between two resampling techniques has a t-paired value less than the z value and a p-paired value greater than α, the two resampling techniques do not provide significantly different results. Three scenarios in Table 1 (highlighted in the original table) have a t-paired value less than the z value and a p-paired value greater than α, so those three scenarios did not provide significant results. The next paired t-test is a test between two combinations of classification algorithm and resampling technique. In those results, the red columns show p-values greater than 5%, and the green columns show p-values less than 5%.
The following abbreviations are used for the combination names in the paired t-test results:
• NB NS: the naïve Bayes algorithm without resampling
• NB OS: the naïve Bayes algorithm with random oversampling
• NB SMOTE: the naïve Bayes algorithm with SMOTE
• NB US: the naïve Bayes algorithm with random undersampling
• DT NS: the decision tree algorithm without resampling
• DT OS: the decision tree algorithm with random oversampling
• DT SMOTE: the decision tree algorithm with SMOTE
• DT US: the decision tree algorithm with random undersampling
• NN NS: the neural network algorithm without resampling
• NN OS: the neural network algorithm with random oversampling
• NN SMOTE: the neural network algorithm with SMOTE
• NN US: the neural network algorithm with random undersampling

Table 1. T-test results for resampling in general (z = 1.960, α = 0.05)

T-test scenario | t-paired | p-paired
No sampling vs. oversampling, accuracy | 4.44832 | 1.22e-05
No sampling vs. SMOTE, accuracy | 2.948332 | 0.003447
No sampling vs. undersampling, accuracy | 9.447194 | 1.04e-18
Oversampling vs. no sampling, precision | 4.254278 | 2.81e-05
Oversampling vs. no sampling, recall | 2.080658 | 0.038316
Oversampling vs. no sampling, F-measure | 2.818378 | 0.005149
Oversampling vs. undersampling, accuracy | 8.783319 | 1.26e-16
Oversampling vs. undersampling, precision | 3.069031 | 0.002344
Oversampling vs. undersampling, recall | 0.102836 | 0.918162
SMOTE vs. no sampling, precision | 7.728664 | 1.66e-13
SMOTE vs. no sampling, recall | 6.42767 | 5.11e-10
SMOTE vs. no sampling, F-measure | 7.733038 | 1.62e-13
SMOTE vs. oversampling, accuracy | 3.050948 | 0.002486
SMOTE vs. oversampling, precision | 1.791596 | 0.074209
SMOTE vs. oversampling, recall | 4.182419 | 3.79e-05
SMOTE vs. oversampling, F-measure | 4.729757 | 3.47e-06
SMOTE vs. undersampling, accuracy | 9.088606 | 1.42e-17
SMOTE vs. undersampling, precision | 4.278243 | 2.54e-05
SMOTE vs. undersampling, recall | 5.115083 | 5.61e-07
SMOTE vs. undersampling, F-measure | 5.05701 | 7.44e-07
Undersampling vs. no sampling, precision | 2.405361 | 0.016764
Undersampling vs. no sampling, recall | 2.529396 | 0.01194
Undersampling vs. no sampling, F-measure | 3.487873 | 0.00056
Undersampling vs. oversampling, F-measure | 0.25086 | 0.802095

The paired t-test results between two combinations of classification algorithm and resampling technique are shown in Figure 5 to Figure 8.

Fig. 5. Result of the paired t-test between combinations of classifier and resampling based on accuracy values

In the paired t-test results based on accuracy values, as shown in Figure 5, the combination of the NN algorithm without resampling gives more significant results than the other 11 combinations. Meanwhile, Gaussian naïve Bayes with random undersampling does not give good results.
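The t-paired values in Table 1 come from the statistic in equation (16); a pure-Python sketch of that formula follows (the two score lists are illustrative, not values from the study; `scipy.stats.ttest_rel`, which the study used, computes the same statistic plus its p-value):

```python
import math

def paired_t(a, b):
    """t = (mean(d) / S_d) * sqrt(n) for paired samples a and b, equation (16)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    # sample standard deviation of the differences (n - 1 denominator)
    s_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / s_d * math.sqrt(n)
```

The resulting |t| is then compared against z = 1.960 (or the p-value against α = 0.05) to accept or reject H0.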
Figure 5 shows that 5 out of the 9 tests comparing classification results with and without the resampling technique for each algorithm give significant results.

Fig. 6. Result of the paired t-test between combinations of classifier and resampling based on precision values

In the paired t-test results based on precision values, as shown in Figure 6, the combination of the NN algorithm without resampling gives more significant results than the other 11 combinations. Meanwhile, Gaussian naïve Bayes without resampling does not give good results. Figure 6 shows that 7 out of the 9 tests comparing classification results with and without the resampling technique for each algorithm give significant results.

Fig. 7. Result of the paired t-test between combinations of classifier and resampling based on recall values

In the paired t-test results based on recall values, as shown in Figure 7, the combination of the NN algorithm without resampling gives more significant results than the other 11 combinations. Meanwhile, the decision tree with the C4.5 algorithm without resampling does not give good results. Figure 7 shows that 5 out of the 9 tests comparing classification results with and without the resampling technique for each algorithm give significant results.

Fig. 8. Result of the paired t-test between combinations of classifier and resampling based on F-measure values

In the paired t-test results based on F-measure values, as shown in Figure 8, the combination of the NN algorithm without resampling gives more significant results than the other 11 combinations. Meanwhile, Gaussian naïve Bayes without resampling does not give good results.
Figure 8 shows that 6 out of the 9 tests comparing classification results with and without the resampling technique for each algorithm give significant results.

IV. Conclusion

Based on the results and discussion of the research, it can be concluded that the BPNN with SMOTE performs best based on accuracy, precision, recall, and F-measure; its mean and paired t-test values are better than those of the other 11 combinations of classification algorithm and resampling technique. Some combinations of classification algorithm and resampling technique do not provide significant results for particular evaluation types: (1) based on accuracy, combining Gaussian naïve Bayes with random undersampling does not provide the most significant performance results; (2) based on precision and F-measure, Gaussian naïve Bayes without resampling does not provide the most significant performance results; and (3) based on recall, the decision tree with the C4.5 algorithm without resampling does not provide the most significant performance results. Using resampling can provide significant improvements in a classification algorithm's performance compared with its performance on the dataset without resampling: most of the tests comparing classification results from datasets with and without resampling techniques give significant results. However, combining multiple resampling techniques may improve classification performance even further; future research could explore the effectiveness of combining different resampling techniques and their impact on classification performance.

Declarations

Author contribution
All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest
The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information
Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] F. Thabtah, S. Hammoud, F. Kamalov, and A. Gonsalves, "Data imbalance in classification: experimental evaluation," Inf. Sci. (NY), vol. 513, pp. 429–441, Mar. 2020.
[2] A. Ali-Gombe and E. Elyan, "MFC-GAN: class-imbalanced dataset classification using multiple fake class generative adversarial network," Neurocomputing, vol. 361, pp. 212–221, Oct. 2019.
[3] U. Pujianto, "Random forest and novel under-sampling strategy for data imbalance in software defect prediction," Int. J. Eng. Technol., vol. 7, no. 4, pp. 39–42, 2018.
[4] T. Chen, Y. Lu, X. Fu, N. N. Sze, and H. Ding, "A resampling approach to disaggregate analysis of bus-involved crashes using panel data with excessive zeros," Accid. Anal. Prev., vol. 164, p. 106496, Jan. 2022.
[5] B. Mirzaei, B. Nikpour, and H. Nezamabadi-pour, "CDBH: a clustering and density-based hybrid approach for imbalanced data classification," Expert Syst. Appl., vol. 164, p. 114035, Feb. 2021.
[6] C. Zhang et al., "Over-sampling algorithm based on VAE in imbalanced classification," in Lecture Notes in Computer Science (LNISA, volume 10967), 2018, pp. 334–344.
[7] J. Hancock, T. M. Khoshgoftaar, and J. M.
johnson, “the effects of random undersampling for big data medicare fraud detection,” in 2022 ieee international conference on service-oriented system engineering (sose), aug. 2022, pp. 141–146. [8] r. zhou et al., “prediction model for infectious disease health literacy based on synthetic minority oversampling technique algorithm,” comput. math. methods med., vol. 2022, pp. 1–6, mar. 2022. [9] g. wang, j. wang, and k. he, “majority-to-minority resampling for boosting-based classification under imbalanced data,” appl. intell., vol. 53, no. 4, pp. 4541–4562, feb. 2022. [10] m. janicka, m. lango, and j. stefanowski, “using information on class interrelations to improve classification of multiclass imbalanced data: a new resampling algorithm,” int. j. appl. math. comput. sci., vol. 29, no. 4, pp. 769– 781, dec. 2019. [11] r. ghorbani and r. ghousi, “comparing different resampling methods in predicting students’ performance using machine learning techniques,” ieee access, vol. 8, pp. 67899–67911, 2020. [12] s. saeed, a. abdullah, n. z. jhanjhi, m. naqvi, and a. nayyar, “new techniques for efficiently k-nn algorithm for brain tumor detection,” multimed. tools appl., vol. 81, no. 13, pp. 18595–18616, may 2022. [13] h. xu and y. chen, “a block padding approach in multidimensional dependency missing data,” eng. appl. artif. intell., vol. 120, p. 105929, apr. 2023. [14] h. a. abu alfeilat et al., “effects of distance measure choice on k-nearest neighbor classifier performance: a review,” big data, vol. 7, no. 4, pp. 221–248, dec. 2019. [15] j. li, s. fong, s. hu, r. k. wong, and s. mohammed, “similarity majority under-sampling technique for easing imbalanced classification problem,” in communications in computer and information science, 2018, pp. 3–23. [16] j. fonseca, g. douzas, and f. bacao, “improving imbalanced land cover classification with k-means smote: detecting and oversampling distinctive minority spectral signatures,” information, vol. 12, no. 7, p. 266, jun. 
2021. [17] z. shi, “improving k-nearest neighbors algorithm for imbalanced data classification,” iop conf. ser. mater. sci. eng., vol. 719, no. 1, p. 012072, jan. 2020. [18] n. salmi and z. rustam, “naïve bayes classifier models for predicting the colon cancer,” iop conf. ser. mater. sci. eng., vol. 546, no. 5, p. 052068, jun. 2019. [19] s. wahyuni, “implementation of data mining to analyze drug cases using c4.5 decision tree,” j. phys. conf. ser., vol. 970, p. 012030, mar. 2018. [20] t. thomas, a. p. vijayaraghavan, and s. emmanuel, “applications of decision trees,” in machine learning approaches in cyber security analytics, singapore: springer singapore, 2020, pp. 157–184. [21] r. benkercha and s. moulahoum, “fault detection and diagnosis based on c4.5 decision tree algorithm for grid connected pv system,” sol. energy, vol. 173, pp. 610–634, oct. 2018. [22] g. s. reddy and s. chittineni, “entropy based c4.5-sho algorithm with information gain optimization in data mining,” peerj comput. sci., vol. 7, p. e424, apr. 2021. [23] i. gonzalez-fernandez, m. a. iglesias-otero, m. esteki, o. a. moldes, j. c. mejuto, and j. simal-gandara, “a critical review on the use of artificial neural networks in olive oil production, characterization and authentication,” crit. rev. food sci. nutr., vol. 59, no. 12, pp. 1913–1926, jul. 2019. [24] a. khan, a. sohail, u. zahoora, and a. s. qureshi, “a survey of the recent architectures of deep convolutional neural networks,” artif. intell. rev., vol. 53, no. 8, pp. 5455–5516, dec. 2020. [25] x. qi, g. chen, y. li, x. cheng, and c. li, “applying neural-network-based machine learning to additive manufacturing: current applications, challenges, and future perspectives,” engineering, vol. 5, no. 4, pp. 721–729, aug. 2019. [26] h. dagdougui, f. bagheri, h. le, and l. dessaint, “neural network model for short-term and very-short-term load forecasting in district buildings,” energy build., vol. 203, p. 109408, nov. 2019. [27] y. wu, r. 
gao, and j. yang, “prediction of coal and gas outburst: a method based on the bp neural network optimized by gasa,” process saf. environ. prot., vol. 133, pp. 64–72, jan. 2020. http://journal2.um.ac.id/index.php/keds https://doi.org/10.1016/j.ins.2019.11.004 https://doi.org/10.1016/j.ins.2019.11.004 https://doi.org/10.1016/j.neucom.2019.06.043 https://doi.org/10.1016/j.neucom.2019.06.043 https://doi.org/10.22219/kinetik.v5i2.897 https://doi.org/10.22219/kinetik.v5i2.897 https://doi.org/10.1016/j.aap.2021.106496 https://doi.org/10.1016/j.aap.2021.106496 https://doi.org/10.1016/j.eswa.2020.114035 https://doi.org/10.1016/j.eswa.2020.114035 https://doi.org/10.1007/978-3-319-94295-7_23 https://doi.org/10.1007/978-3-319-94295-7_23 https://doi.org/10.1109/sose55356.2022.00023 https://doi.org/10.1109/sose55356.2022.00023 https://doi.org/10.1109/sose55356.2022.00023 https://doi.org/10.1155/2022/8498159 https://doi.org/10.1155/2022/8498159 https://doi.org/10.1007/s10489-022-03585-2 https://doi.org/10.1007/s10489-022-03585-2 https://doi.org/10.2478/amcs-2019-0057 https://doi.org/10.2478/amcs-2019-0057 https://doi.org/10.2478/amcs-2019-0057 https://doi.org/10.1109/access.2020.2986809 https://doi.org/10.1109/access.2020.2986809 https://doi.org/10.1007/s11042-022-12271-x https://doi.org/10.1007/s11042-022-12271-x https://doi.org/10.1016/j.engappai.2023.105929 https://doi.org/10.1016/j.engappai.2023.105929 https://doi.org/10.1089/big.2018.0175 https://doi.org/10.1089/big.2018.0175 https://doi.org/10.1007/978-981-13-0292-3_1 https://doi.org/10.1007/978-981-13-0292-3_1 https://doi.org/10.3390/info12070266 https://doi.org/10.3390/info12070266 https://iopscience.iop.org/article/10.1088/1757-899x/719/1/012072 https://iopscience.iop.org/article/10.1088/1757-899x/719/1/012072 https://iopscience.iop.org/article/10.1088/1757-899x/546/5/052068 https://iopscience.iop.org/article/10.1088/1757-899x/546/5/052068 https://iopscience.iop.org/article/10.1088/1742-6596/970/1/012030 
https://iopscience.iop.org/article/10.1088/1742-6596/970/1/012030 https://doi.org/10.1007/978-981-15-1706-8 https://doi.org/10.1007/978-981-15-1706-8 https://doi.org/10.1016/j.solener.2018.07.089 https://doi.org/10.1016/j.solener.2018.07.089 https://doi.org/10.7717/peerj-cs.424 https://doi.org/10.7717/peerj-cs.424 https://doi.org/10.1080/10408398.2018.1433628 https://doi.org/10.1080/10408398.2018.1433628 https://doi.org/10.1080/10408398.2018.1433628 https://doi.org/10.1007/s10462-020-09825-6 https://doi.org/10.1007/s10462-020-09825-6 https://doi.org/10.1016/j.eng.2019.04.012 https://doi.org/10.1016/j.eng.2019.04.012 https://doi.org/10.1016/j.eng.2019.04.012 https://doi.org/10.1016/j.enbuild.2019.109408 https://doi.org/10.1016/j.enbuild.2019.109408 https://doi.org/10.1016/j.psep.2019.10.002 https://doi.org/10.1016/j.psep.2019.10.002 100 u. pujianto et al. / knowledge engineering and data science 2022, 5 (1): 87–100 [28] j. c. r. whittington and r. bogacz, “theories of error back-propagation in the brain,” trends cogn. sci., vol. 23, no. 3, pp. 235–250, mar. 2019. [29] d. chicco, n. tötsch, and g. jurman, “the matthews correlation coefficient (mcc) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation,” biodata min., vol. 14, no. 1, p. 13, feb. 2021. [30] j. miao and w. zhu, “precision–recall curve (prc) classification trees,” evol. intell., vol. 15, no. 3, pp. 1545–1569, sep. 2022. [31] g. mahalle, o. salunke, n. kotkunde, a. k. gupta, and s. k. singh, “neural network modeling for anisotropic mechanical properties and work hardening behavior of inconel 718 alloy at elevated temperatures,” j. mater. res. technol., vol. 8, no. 2, pp. 2130–2140, apr. 2019. 
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 4, No 1, July 2021, pp. 1–13. https://doi.org/10.17977/um018v4i12021p1-13
©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a SINTA 2 journal (https://sinta.ristekbrin.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Research & Technology.

Forecasting Stock Exchange Data Using Group Method of Data Handling Neural Network Approach

Marzieh Faridi Masouleh a, 1, *, Ahmad Bagheri b, 2

a Department of Computer Engineering, Ahrar Institute of Technology and Higher Education Rasht, Km 4 Rasht-Lakan Road, Seyed Ahmad Khomeini, Guilan, Iran
b Department of Dynamics, Control, and Vibrations, Faculty of Mechanical Engineering, University of Guilan, 5th Kilometer of Persian Gulf Highway, Rasht, Guilan, Iran
1 m.faridi@ahrar.ac.ir *; 2 iranbagheri@guilan.ac.ir
* corresponding author

Article Info
Article history: Received 4 March 2021; Revised 29 March 2021; Accepted 4 April 2021; Published online 17 August 2021.

Abstract
The increasing uncertainty of the natural world has motivated computer scientists to seek out the best approaches to technological problems. Nature-inspired problem-solving approaches include meta-heuristic methods focused on evolutionary computation and swarm intelligence. One of these problems with significant impact is forecasting an exchange index, a serious concern as stock grows and declines, since there are many reports of loss of financial resources or profitability. When the exchange includes an extensive set of diverse stock, particular concepts and mechanisms for physical security, network security, encryption, and permissions should guarantee and predict its future needs. This study aimed to show that it is efficient to use group method of data handling (GMDH)-type neural networks and their application for the classification of numerical results; such modeling serves to display the precision of GMDH-type neural networks. Following the US withdrawal from the Joint Comprehensive Plan of Action in April 2018, the behavior of the stock exchange data stream changed, and common algorithms have not been able to predict it correctly or fit the network satisfactorily. This paper demonstrates that group method of data handling is likely to improve inductive self-organizing approaches for addressing realistic severe problems such as the Iranian financial market crisis. A new trajectory is used to verify the consistency of the obtained equations and hence the models' validity.

Keywords: Exchange forecasting; Group method of data handling; Neural network

I. Introduction

In economics, there are several definitions of the market. Primarily, a market refers to a physical place where sellers and buyers come together to exchange goods and services [1], but a second definition holds that a market does not necessarily need a physical existence encompassing a specific space; rather, the market includes all the buyers and sellers exchanging particular goods or services. For example, the international stock exchange market is a market whose transactions occur through international communication networks and are not limited to a certain place. In fact, the market is a mechanism that allows buyers and sellers to exchange their properties. These properties can be physical (e.g., estates) or financial (e.g., stock share certificates) [2][3][4]. According to Dominick Salvatore, business is a location or circumstance where buyers and sellers trade products, facilities, and capital; any item, service, or resource that can be traded has a demand. However, Salvatore holds that the market can also be a non-physical location for trading, in addition to its physical form [5].

In terms of the type of properties, the market can be divided into the market of physical (real) properties and the market of financial properties. The market of physical properties is one where buyers and sellers exchange properties with a physical nature (such as automobiles, estates, and furniture). The phrase "market of financial properties" is used for a market in which people, both natural and legal, exchange financial securities, goods, and other properties that can be paid for with a small commission. Market prices depend on supply and demand. Financial securities, in turn, include shares, bonds, and some other goods (such as precious metals or agricultural produce) [6].

Communications and information technology have enabled easy online sales that reduce time and cost and avoid physical presence in crowded places. Today, one can easily do the desired shopping with a few simple clicks.
This technology has brought new challenges, including data management systems, data recommendation, data classification, and data security risk. Some studies [7][8] have proposed that forecasting and assurance should be considered a management issue. The main concern of online business organizations is the management of information security, which deals with data breaches, identity theft, and other online fraud. In terms of global security, data breaches are a major concern: they affect 93 percent of big businesses and 87 percent of small businesses in the United Kingdom [9][10]. In the United Kingdom, the total cost of a data loss is about 4.1 million dollars, and recovery lasts around nine months and three days [11]. Although various technical solutions for forecasting exchange have been proposed in recent years, and some are being upgraded, forecasting exchange is still considered a necessary approach [12].

Many approaches to forecasting stock exchange indices have used neural networks and other algorithms; one of the newest is proposed in [13]. They all have advantages and disadvantages, including computational complexity, high runtime, ungeneralizable algorithms, and insufficient accuracy. Several previous experiments used statistical technological metrics to forecast exchange rates, and some of the proposed approaches used soft computing tools as a forecasting scheme. Hann et al. proposed a new approach to predicting exchange rates based on neural networks versus linear models using monthly and weekly data in 1996 [14]. Mahnaz et al. introduced a Bayesian statistics algorithm for predicting exchange rates in 1997; the suggested solution may be used regardless of the economic model employed by forecasters. The international Fisher effect was used to show how the proposed model could be applied in practice and how its results differed from the mean squared method [15].
The neural networks proposed by Zhang et al. were used to predict the British pound/US dollar exchange rate. The study examined how the number of input and hidden nodes and the size of the training sample affected in-sample and out-of-sample performance. Neural networks outperformed linear models, mainly when the forecast horizon was short. Furthermore, the number of input nodes had a more significant effect on output than the number of hidden nodes, while a larger number of observations reduced prediction errors [16]. Rodríguez et al. proposed simultaneous nearest-neighbour methods [17], Leung et al. presented the general regression neural network (GRNN) algorithm [18], and Michael et al. introduced SETAR models for exchange rate forecasting [19]. Chen et al. published a Bayesian vector error correction model (BVECM) for exchange rate forecasting in 2003, using it to predict shifts in currency exchange rates one month ahead for three big Asia-Pacific economies [20]. Chen et al. [21] developed a regression neural network and used an adaptive forecasting method that combined the strengths of neural networks and multivariate econometric models to correct errors in foreign exchange forecasting and trading: a time series model estimated the exchange rates, and a general regression neural network corrected the estimation errors. Several experiments and statistical methods compared the consistency of the two-stage models (with neural network error correction) and the single-stage models (without it). In 2005, Yu et al. presented a novel nonlinear ensemble forecasting model incorporating GLAR and ANN [22]. In 2007, Preminger et al.
presented a robust regression approach [23], and in 2008 Wright et al. introduced Bayesian model averaging for foreign exchange [24]. In 2009, Carriero et al. presented a large Bayesian VAR for foreign exchange [25], and in 2012, Ye et al. introduced an RMB exchange rate forecast approach based on a BP neural network, using RMB exchange rate data from July 2005 to September 2010 to build a BP neural network model in MATLAB and forecast the future RMB exchange rate [26]. Korol et al. in 2014 proposed a fuzzy logic model [27]; in 2015, Shen et al. presented deep belief networks with a conjugate gradient method [28]; and in 2016, Abounoori et al. introduced a Markov-switching GARCH approach [29] for forecasting exchange rates. In 2017, Kolasa et al. [30] proposed DSGE models, and in 2018, Sun et al. introduced a new multiscale decomposition ensemble approach. In the latter approach, foreign exchange rates were divided into a limited number of subcomponents by variational mode decomposition (VMD); a support vector neural network (SVNN) modeled and forecast each subcomponent, and another SVNN integrated the subcomponent forecasts. The quality of the proposed approach was tested by comparing and evaluating four key exchange rates, and the experimental results showed that the forecasting accuracy and statistical tests of the proposed VMD-SVNN-SVNN multiscale decomposition ensemble approach were better than several benchmarks; it therefore proved superior for forecasting foreign exchange rates [31]. Variational mode decomposition with entropy theory was also proposed to forecast exchange rates [32]. Dzalbs et al.
proposed Cartesian genetic programming and artificial neural networks (ANN), and Amat et al. presented simple machine learning methods for forecasting exchange rates [33]. The methods they used were sequential ridge regression and the exponentially weighted average strategy; neither estimated an underlying model with discount factors, but instead combined the fundamentals to output forecasts directly [34]. Finally, in 2019, Wei et al. presented a decomposition clustering ensemble learning approach [35], Fu et al. introduced a support vector machine for the RMB [36], Ni et al. proposed deep learning [37], and Wang et al. proposed nonlinear Taylor-rule-based models for forecasting foreign exchange rates [38].

Among the newly developed methods is a self-organizing approach named the group method of data handling (GMDH) algorithm. This approach evaluates model accuracy on a group of multi-input single-output data pairs, eventually yielding more complex models. GMDH thereby helps create an analytical function in a feed-forward network, solving the problem of needing initial information about the system's mathematical model; it relies on a quadratic node transfer function whose coefficients are calculated using the regression technique [39][40][41]. Thus, this research uses the GMDH neural network algorithm to forecast stock exchange data. The remaining sections of the article introduce the method with a description of the proposed architecture, present the results and discussion, and finally conclude.

II. Method

An artificial neural network is an information processing system with features in common with natural neural networks. Neural networks are generalized mathematical models of human cognition grounded in biology, built on several assumptions, including the following [6][42]:
• Neurons perform the information processing operations.
• Signals are transferred between neurons in the network via their bonds or connections.
• Each bond has its own weight, which multiplies the signals transferred over it in common neural networks. Each neuron applies an activation function to the weighted sum of its input signals to produce its output signal.

Figure 1 shows the flowchart of the suggested method for forecasting currency. As shown in Figure 1, the exchange dataset is first introduced into the proposed system, and all data are preprocessed. After preprocessing of the exchange dataset, the missing values are removed, and the data received from the stock exchange are converted into a format acceptable to the simulation tools. In the next step, the data are normalized and the sampling process is performed: the train sample (80%) is used to generate the model with the GMDH neural network algorithm, and the test sample (20%) is used to evaluate the performance of the proposed method. The train data are applied to the GMDH neural network algorithm, and a model is produced after training. After model generation, the test data are applied to the model for prediction. Finally, the system checks whether all samples have been predicted; once all new samples are completed, the results are evaluated and calculated.

After the data enter the proposed system, they are preprocessed and the unused and useless samples are deleted. The cleaned data are then converted into a cohesive format acceptable to the simulation tools, usually an integrated Excel format. Various methods have been proposed for preprocessing the data:
• data cleaning
• data collection
• data transfer
• data reduction
Given the problem in this research, the present study uses only the data cleaning method. The proposed strategy analyzes the data and identifies whether a row or column has empty or unused values.
The valid values before and after each empty or unused entry are then examined and their average is computed; the null value is replaced with that average. In this way, no samples are lost and more consistent data are generated. Once the gaps have been filled, the data must be prepared: the preprocessed data are converted into a format acceptable to the simulation tools. The default format for the data is Excel, and the analysis therefore initially needs case-tested data.

In the preprocessing stage, the values of each attribute of the data are normalized to the range 0 to 1. The rows of the general data matrix are then rotated randomly so that the data leave their initially collected order. In addition, all data are mapped into matrix form, the matrix rows are shuffled, and then the normalization operation is performed; normalization is applied for higher precision. Equation (1) is used to normalize the values of each set of data:

$\mathrm{Normalize}(x) = \frac{x - X_{min}}{X_{max} - X_{min}}$ (1)

where $X_{max}$ and $X_{min}$ are the maximum and minimum values in the range of attribute $x$. After normalizing the data, the values of all attributes lie in the range [0, 1].

Fig. 1. The proposed method flowchart

It is assumed that the train samples are used to train the GMDH algorithm and that the resulting model is applied to the other data. Training samples are usually 80% of the total; test samples, representing 20% of the total, are used to evaluate and validate the method.
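The cleaning, normalization, and 80/20 split described above can be sketched in a few lines of numpy (a minimal illustration, not the authors' code; the toy `prices` column and helper names are invented):

```python
import numpy as np

def fill_missing(col):
    """Replace each NaN with the average of the nearest valid values
    before and after it, as in the data cleaning step above."""
    col = col.copy()
    valid = np.flatnonzero(~np.isnan(col))
    for i in np.flatnonzero(np.isnan(col)):
        before = valid[valid < i]
        after = valid[valid > i]
        neighbours = [col[before[-1]]] if before.size else []
        neighbours += [col[after[0]]] if after.size else []
        col[i] = np.mean(neighbours)
    return col

def min_max_normalize(col):
    """Equation (1): map the values of one attribute into [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

# toy daily-price column with one missing value
prices = np.array([10.0, np.nan, 14.0, 12.0, 20.0])
clean = fill_missing(prices)        # NaN -> (10 + 14) / 2 = 12
norm = min_max_normalize(clean)     # all values now in [0, 1]

# shuffle the rows, then split 80% train / 20% test
rng = np.random.default_rng(42)
idx = rng.permutation(len(norm))
cut = int(0.8 * len(norm))
train, test = norm[idx[:cut]], norm[idx[cut:]]
print(clean[1], norm.min(), norm.max(), len(train), len(test))
```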
These samples are used for testing and measuring the efficiency and validity of the proposed method. After the samples are divided into training (80%) and test (20%) subsets, the training samples are applied to the GMDH neural network algorithm as inputs.

A. Group Method of Data Handling (GMDH)

The group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models. GMDH is used in data mining, knowledge discovery, prediction, complex systems modeling, optimization, and pattern recognition. In 2017, Li et al. showed that a GMDH neural network outperformed classical forecasting algorithms such as single exponential smoothing, double exponential smoothing, ARIMA, and the back-propagation neural network. GMDH algorithms (Figure 2) are characterized by an inductive procedure that gradually sorts out complicated polynomial models and selects the best solution through a so-called external criterion [9][42].

A set of neurons can represent the standard GMDH algorithm: each layer contains different pairs of neurons joined by a quadratic polynomial, and each such connection produces a new neuron in the next layer. This representation can be used to model the mapping from inputs to outputs. The identification problem is formally defined as finding a function $\hat{f}$ that can be used approximately in place of the actual function $f$ to predict an output $\hat{y}$, for a given input vector $X = (x_1, x_2, x_3, \ldots, x_n)$, that is as close as possible to the actual output $y$. Thus, given $M$ samples of multi-input single-output data pairs, equation (2) defines

$y_i = f(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}), \quad i = 1, 2, \ldots, M$ (2)

A GMDH-type neural network may now be trained to predict the output values $\hat{y}_i$ for any given input vector $X = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in})$; that is,

$\hat{y}_i = \hat{f}(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}), \quad i = 1, 2, \ldots, M$ (3)

There is now the problem of determining a GMDH-type neural network that minimizes the squared difference between the predicted output and the actual one:

$\sum_{i=1}^{M} \left[ \hat{f}(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}) - y_i \right]^2 \rightarrow \min$ (4)

The general connection between the input and output variables can be displayed by a complicated polynomial known as the Ivakhnenko polynomial [43]:

$\hat{y} = a_0 + \sum_{i=1}^{m} a_i x_i + \sum_{i=1}^{m} \sum_{j=1}^{m} a_{ij} x_i x_j + \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} a_{ijk} x_i x_j x_k + \cdots$ (5)

Most applications, however, use the quadratic form of two variables to predict the output $y$:

$\hat{y} = G(x_i, x_j) = a_0 + a_1 x_i + a_2 x_j + a_3 x_i^2 + a_4 x_j^2 + a_5 x_i x_j$ (6)

Fig. 2. Structure of the GMDH network

Using regression techniques, the coefficients $a_i$ in equation (6) are calculated [43] so as to minimize the difference between the actual output $y$ and the computed one $\hat{y}$ for each pair of input variables $(x_i, x_j)$. In effect, a tree of polynomials is constructed from the quadratic form of equation (6), whose coefficients are obtained in a least-squares sense. The coefficients of each quadratic function $G_i$ are derived to fit the output optimally over the whole set of input-output pairs:

$r^2 = \frac{\sum_{i=1}^{M} (y_i - G_i)^2}{\sum_{i=1}^{M} y_i^2} \rightarrow \min$ (7)

The regression polynomial of equation (6) can fit the dependent samples $(y_i, \ i = 1, 2, \ldots, M)$ well in a least-squares sense. As a result, $\binom{n}{2} = \frac{n(n-1)}{2}$ neurons are constructed in the second layer of the feed-forward network from the samples $\{(y_i, x_{ip}, x_{iq}), \ i = 1, 2, \ldots, M\}$ for different $p, q \in \{1, 2, \ldots, n\}$ (Farlow, 1984). That is, $M$ data triples $\{(y_i, x_{ip}, x_{iq}), \ i = 1, 2, \ldots, M\}$ can be constructed from the samples using such $p, q \in \{1, 2, \ldots, n\}$:

$\begin{bmatrix} x_{1p} & x_{1q} & y_1 \\ x_{2p} & x_{2q} & y_2 \\ \vdots & \vdots & \vdots \\ x_{Mp} & x_{Mq} & y_M \end{bmatrix}$ (8)

Using the quadratic sub-expression of equation (6) for each row of the $M$ data triples, the following matrix equation is readily obtained:

$A a = Y$ (9)

where $a$ is the vector of unknown coefficients of the quadratic polynomial in equation (6),

$a = \{a_0, a_1, a_2, a_3, a_4, a_5\}$ (10)

and $Y$ is the vector of output values from the samples,

$Y = \{y_1, y_2, y_3, \ldots, y_M\}^T$ (11)

It follows that

$A = \begin{bmatrix} 1 & x_{1p} & x_{1q} & x_{1p} x_{1q} & x_{1p}^2 & x_{1q}^2 \\ 1 & x_{2p} & x_{2q} & x_{2p} x_{2q} & x_{2p}^2 & x_{2q}^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{Mp} & x_{Mq} & x_{Mp} x_{Mq} & x_{Mp}^2 & x_{Mq}^2 \end{bmatrix}$ (12)

The least-squares technique from multiple-regression analysis gives the solution of the normal equations as

$a = (A^T A)^{-1} A^T Y$ (13)

which yields the vector of the best coefficients of equation (6) for the whole set of $M$ data triples. However, this solution, derived directly from the normal equations, is prone to round-off errors and, more importantly, to the singularity of these equations.

B. Structural Identification of GMDH-Type Neural Networks

There are different approaches to the structural identification of GMDH-type networks [39][41]:
1. Increasing selection pressure approach (ISP). A single selection-pressure parameter is increased sequentially across layers, which determines the number of neurons in each layer and the number of layers in the network.
2. Pre-specified structural design approach (PSD). This approach prescribes the number of layers in the network and the number of neurons in each layer.
3. Error-driven structural approach (EDS). This approach uses a threshold error for equation (6) to determine the number of layers and neurons. It also differs in that some of the input variables, or neurons generated in different layers, may be included in subsequent layers, so the structure of such a network may be more complex than those generated by the other approaches.
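Equations (6) and (13) amount to an ordinary least-squares fit of a six-coefficient quadratic in two inputs. A minimal numpy sketch of fitting one GMDH neuron (illustrative only; the synthetic data and helper names are invented, and `lstsq` is used instead of forming the normal equations explicitly, precisely because of the round-off and singularity issues noted above):

```python
import numpy as np

def fit_gmdh_neuron(xp, xq, y):
    """Fit y ≈ a0 + a1*xp + a2*xq + a3*xp^2 + a4*xq^2 + a5*xp*xq
    by least squares, as in equations (6) and (13)."""
    # design matrix A from equation (12), one row per sample
    A = np.column_stack([np.ones_like(xp), xp, xq, xp**2, xq**2, xp * xq])
    a, *_ = np.linalg.lstsq(A, y, rcond=None)
    return a

def predict(a, xp, xq):
    """Evaluate the fitted quadratic neuron, equation (6)."""
    return a[0] + a[1]*xp + a[2]*xq + a[3]*xp**2 + a[4]*xq**2 + a[5]*xp*xq

# synthetic samples drawn from a known quadratic surface
rng = np.random.default_rng(0)
xp, xq = rng.random(50), rng.random(50)
y = 1.0 + 2.0*xp - 3.0*xq + 0.5*xp**2 + 0.25*xq**2 + 4.0*xp*xq

a = fit_gmdh_neuron(xp, xq, y)
print(np.round(a, 3))   # recovers [1. 2. -3. 0.5 0.25 4.]
```

In a full GMDH network, one such neuron would be fitted for every pair of inputs, the best neurons kept according to an external criterion such as equation (7), and their outputs fed to the next layer.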
III. Results and Discussion

This section describes the dataset and evaluation metrics and then discusses the results of the proposed method: Section III.A describes the dataset used to forecast the stock exchange data, Section III.B presents the evaluation metrics, and Section III.C analyzes the results. The software and hardware configuration of the experimental environment is shown in Table 1: the operating system (OS) is 64-bit Windows 10, the machine has 4 GB of RAM (3.06 GB usable), and the CPU is an Intel Core i7 Q720 @ 1.60 GHz.

Table 1. Experimental environment configuration

Configuration item    Configuration parameter
OS                    Windows 10
OS type               64-bit
CPU                   Intel Core i7 Q720 @ 1.60 GHz
RAM                   8 GB
HDD                   1 TB

A. Dataset

The Iranian stock exchange database was the primary data source for the daily exchange rates in this research project. The data were collected from 4 January 2015 to 28 February 2018, as shown in Figure 3, and include the general attributes of the dataset: the average price of gold and the dollar, volume, value, and the number of transactions. As Table 2 illustrates, the dataset was divided into an in-sample subset and an out-of-sample subset; the table does not cover the detailed data, which can be obtained directly from the authors or accessed in the Iranian stock exchange database.

Table 2. In-sample and out-of-sample datasets of the exchange rate series

Time series    Sample type      From              To                  Sample size
USD/dollar     In-sample        4 January 2015    28 February 2018    819
               Out-of-sample                                          240

Fig. 3. Daily exchange rates data from January 4, 2015 to February 28, 2018

Table 3 presents the descriptive statistics for the stock exchange data, describing the basic statistical characteristics of the exchange rate data: minimum, maximum, mean, standard deviation, skewness, and kurtosis. Skewness is employed as a measure of the symmetry of the dataset. Zero skewness represents a perfectly symmetric distribution, while negative and positive skewness indicate that the distribution is skewed to the left and to the right, respectively; the greater the absolute skewness, the more pronounced the asymmetry. Kurtosis, in turn, measures the extremities (i.e., the tails) of the data distribution and thus indicates the presence of outliers; the standard measure of kurtosis is based on a scaled version of the fourth moment of the data. A higher kurtosis results from infrequent extreme deviations (outliers) rather than from frequent, modestly sized deviations. The kurtosis of any univariate normal (standard Gaussian) distribution is 3. If the kurtosis is greater than 3, the observations are more concentrated in the peak, with longer, heavier tails than the normal distribution; if it is less than 3, the observations are less sharply concentrated and the tails are shorter, as in the rectangular uniform distribution. The results in Table 3 are normalized with equation (1).

B. Evaluation Metrics

To assess the level forecasting accuracy of the proposed fed-GMDH, three main evaluation metrics are used to compare the out-of-sample forecasting performance.
for level forecasting accuracy evaluation, the mean absolute error (mae), the root mean square error (rmse), and the mean absolute percentage error (mape) are selected, respectively, as follows:

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|y_t - y'_t\right| \quad (14)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - y'_t\right)^{2}} \quad (15)$$

$$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{y_t - y'_t}{y_t}\right| \times 100 \quad (16)$$

where T is the number of observations in the out-of-sample subset, and y_t and y'_t are the actual value and the forecast value at time t, respectively.

c. analyzing results

this section analyzes the simulation results of the proposed method for forecasting stock exchange data and discusses the evaluation metrics. the primary purpose is to design an intelligent system that can predict the unstructured pattern of dollar prices based on other financial market attributes. figure 4 and figure 5 show the fitting of the training and test data sets. several training runs have been performed using different numbers of groups. figure 4 shows the error distributions from the neural network, with the sample containing all 400 simulated patterns. the plots are quite encouraging in that there are no very badly reconstructed patterns, and the error distribution is reasonably uniform throughout the space; there is some evidence of a slight systematic misfitting of wide spectra. in figure 4(a), 750 training samples are shown: the dollar prices and the values predicted by the gmdh algorithm during training. as can be seen, the gmdh algorithm performs the training process with high accuracy and low error. in figure 4(b), the mae and rmse are calculated at the training stage, where mae = 5268.8127 and rmse = 72.5866. in figure 4(c), the error std and mape are calculated at the training stage: mape = 10.5582, error std = 7314.6658. figure 5 shows an example test with the gmdh neural network fit superimposed. in figure 5(a), 85 test samples are shown.
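for reference, the three metrics of equations (14)–(16) used throughout this section can be written directly in numpy. this is an illustrative sketch (the actual/forecast values below are hypothetical, not taken from the paper's experiments):

```python
import numpy as np

def mae(y, y_hat):
    # mean absolute error, equation (14)
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # root mean square error, equation (15)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    # mean absolute percentage error, equation (16)
    return np.mean(np.abs((y - y_hat) / y)) * 100

# a small worked example with hypothetical actual/forecast values
y = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 190.0, 400.0])
```

for these values, mae = 20/3 ≈ 6.67, rmse = √(200/3) ≈ 8.16, and mape = 5.0; note that mape is undefined whenever an actual value y_t is zero.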
in figure 5(a), the dollar prices and the values predicted by the gmdh algorithm during testing are shown. as can be seen, the gmdh algorithm performs the testing process with high accuracy and low error. in figure 5(b), the mae and rmse are calculated at the testing stage, where mae = 5837.4721 and rmse = 76.4034. in figure 5(c), the error std and mape are calculated at the testing stage: mape = 11.4883, error std = 9242.0662.

table 3. the descriptive statistics of these exchange rate series

time series   | minimum | maximum | mean  | std*  | skewness | kurtosis
usd/us dollar | 0.075   | 1       | 0.244 | 0.125 | 1.83     | 5.11

note: std* refers to the standard deviation

in figure 6(a), all 825 samples are shown: the dollar prices and the values predicted by the gmdh algorithm over the whole predicting process. as can be seen, the gmdh algorithm performs the predicting process with high accuracy and low error. in figure 6(b), the mae and rmse are calculated at the predicting stage, where mae = 5325.7481 and rmse = 72.9777. in figure 6(c), the error std and mape are calculated at the predicting stage: mape = 10.6513, error std = 7525.16. as shown in figure 7, the linear regression offers information at two levels: it provides a global appreciation of the accuracy (through the regression value r and through the slope and offset), and it compares the position of each generated data point with its target counterpart.

fig. 4. fitting of training set
fig. 5. fitting of testing set
fig. 6. fitting of all dataset
fig. 7. plot regression

iv. conclusion

the correct prediction of the price of the currency in the stock market and the banking system of any country is significant. with timely and accurate forecasting of the currency, significant improvements can be made in each country's economy and foreign exchange industry. in this paper, we have used the gmdh neural network algorithm to predict the price of a currency on the iranian stock exchange. the gmdh algorithm is a deep neural network that can predict high-order time series examples such as currency. the proposed method in this paper has several stages: data preprocessing, data preparation and data normalization, separation of training and testing samples, entering training samples into the gmdh neural network algorithm and generating a network model, applying test samples to the generated model, forecasting the price of the currency, and, finally, evaluating the desired criteria. in the simulation of the proposed method, the gmdh neural network algorithm predicted the price of the currency with a minimum amount of error in training, testing, and evaluation. therefore, this algorithm can be trusted and used to predict the currency's price on the iranian stock exchange. as a suggestion for extending this research, a combination of machine learning algorithms such as decision tree c5, svm-lib, and mlp in the form of a reinforcement learning system could be used to improve the results of the gmdh algorithm. the results could also be improved by combining optimization methods such as cats, dragonfly, ga, and aco.

declarations

author contribution
all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest
the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information
reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
publisher's note: department of electrical engineering, universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references
[1] a. n. singh, a. picot, j. kranz, m. p. gupta, and a. ojha, "information security management (ism) practices: lessons from select cases from india and germany," glob. j. flex. syst. manag., vol. 14, no. 4, pp. 225–239, dec. 2013, doi: 10.1007/s40171-013-0047-4.
[2] b. apolloni, s. bassis, j. rota, g. l. galliani, m. gioia, and l. ferrari, "a neurofuzzy algorithm for learning from complex granules," granul. comput., vol. 1, no. 4, pp. 225–246, dec. 2016, doi: 10.1007/s41066-016-0018-1.
[3] j. k. deane, d. m. goldberg, t. r. rakes, and l. p. rees, "the effect of information security certification announcements on the market value of the firm," inf. technol. manag., vol. 20, no. 3, pp. 107–121, sep. 2019, doi: 10.1007/s10799-018-00297-3.
[4] s. shan, z. hu, z. liu, j. shi, l. wang, and z. bi, "an adaptive genetic algorithm for demand-driven and resource-constrained project scheduling in aircraft assembly," inf. technol. manag., vol. 18, no. 1, pp. 41–53, mar. 2017, doi: 10.1007/s10799-015-0223-7.
[5] ernst & young llp, fighting to close the gap, nov. 2012.
[6] m. song and y. wang, "a study of granular computing in the agenda of growth of artificial neural networks," granul. comput., vol. 1, no. 4, pp. 247–257, dec. 2016, doi: 10.1007/s41066-016-0020-7.
[7] z. a. soomro, m. h. shah, and j. ahmed, "information security management needs more holistic approach: a literature review," int. j. inf. manage., vol. 36, no. 2, pp. 215–225, apr. 2016, doi: 10.1016/j.ijinfomgt.2015.11.009.
[8] m. siponen, m. a. mahmood, and s. pahnila, "technical opinion: are employees putting your company at risk by not following information security policies?," commun. acm, vol. 52, no. 12, pp. 145–147, dec. 2009, doi: 10.1145/1610252.1610289.
[9] t. ring, "a breach too far?," comput. fraud secur., vol. 2013, no. 6, pp. 5–9, jun. 2013, doi: 10.1016/s1361-3723(13)70052-6.
[10] m. r. sanaei and f. m. sobhani, "information technology and e-business marketing strategy," inf. technol. manag., vol. 19, no. 3, pp. 185–196, sep. 2018, doi: 10.1007/s10799-018-0289-0.
[11] w. ashford, "many uk firms underestimate cost of data breaches, study finds," computer weekly, 2012, https://www.computerweekly.com/news/2240171040/many-uk-firms-underestimate-cost-of-data-breaches-study-finds.
[12] s. chaigusin, c. chirathamjaree, and j. clayden, "the use of neural networks in the prediction of the stock exchange of thailand (set) index," in 2008 international conference on computational intelligence for modelling control & automation, 2008, pp. 670–673, doi: 10.1109/cimca.2008.83.
[13] s. galeshchuk, "neural networks performance in exchange rate prediction," neurocomputing, vol. 172, pp. 446–452, jan. 2016, doi: 10.1016/j.neucom.2015.03.100.
[14] t. h. hann and e. steurer, "much ado about nothing? exchange rate forecasting: neural networks vs. linear models using monthly and weekly data," neurocomputing, vol. 10, no. 4, pp. 323–339, apr. 1996, doi: 10.1016/0925-2312(95)00137-9.
[15] m. mahdavi, "a bayesian approach to foreign exchange forecasting," glob. financ. j., vol. 8, no. 1, pp. 15–31, mar. 1997, doi: 10.1016/s1044-0283(97)90003-x.
[16] g. zhang and m. y. hu, "neural network forecasting of the british pound/us dollar exchange rate," omega, vol. 26, no. 4, pp. 495–506, aug. 1998, doi: 10.1016/s0305-0483(98)00003-6.
[17] f. fernández-rodríguez, s. sosvilla-rivero, and j. andrada-félix, "exchange-rate forecasts with simultaneous nearest-neighbour methods: evidence from the ems," int. j. forecast., vol. 15, no. 4, pp. 383–392, oct. 1999, doi: 10.1016/s0169-2070(99)00003-5.
[18] m. t. leung, a.-s. chen, and h. daouk, "forecasting exchange rates using general regression neural networks," comput. oper. res., vol. 27, no. 11–12, pp. 1093–1110, sep. 2000, doi: 10.1016/s0305-0548(99)00144-6.
[19] m. p. clements and j. smith, "evaluating forecasts from setar models of exchange rates," j. int. money financ., vol. 20, no. 1, pp. 133–148, feb. 2001, doi: 10.1016/s0261-5606(00)00039-5.
[20] a.-s. chen and m. t. leung, "a bayesian vector error correction model for forecasting exchange rates," comput. oper. res., vol. 30, no. 6, pp. 887–900, may 2003, doi: 10.1016/s0305-0548(02)00041-2.
[21] a.-s. chen and m. t. leung, "regression neural network for error correction in foreign exchange forecasting and trading," comput. oper. res., vol. 31, no. 7, pp. 1049–1068, jun. 2004, doi: 10.1016/s0305-0548(03)00064-9.
[22] l. yu, s. wang, and k. k. lai, "a novel nonlinear ensemble forecasting model incorporating glar and ann for foreign exchange rates," comput. oper. res., vol. 32, no. 10, pp. 2523–2541, oct. 2005, doi: 10.1016/j.cor.2004.06.024.
[23] a. preminger and r. franck, "forecasting exchange rates: a robust regression approach," int. j. forecast., vol. 23, no. 1, pp. 71–84, jan. 2007, doi: 10.1016/j.ijforecast.2006.04.009.
[24] j. h. wright, "bayesian model averaging and exchange rate forecasts," j. econom., vol. 146, no. 2, pp. 329–341, oct. 2008, doi: 10.1016/j.jeconom.2008.08.012.
[25] a. carriero, g. kapetanios, and m. marcellino, "forecasting exchange rates with a large bayesian var," int. j. forecast., vol. 25, no. 2, pp. 400–417, apr. 2009, doi: 10.1016/j.ijforecast.2009.01.007.
[26] s. ye, "rmb exchange rate forecast approach based on bp neural network," phys. procedia, vol. 33, pp. 287–293, 2012, doi: 10.1016/j.phpro.2012.05.064.
[27] t. korol, "a fuzzy logic model for forecasting exchange rates," knowledge-based syst., vol. 67, pp. 49–60, sep. 2014, doi: 10.1016/j.knosys.2014.06.009.
[28] f. shen, j. chao, and j. zhao, "forecasting exchange rate using deep belief networks and conjugate gradient method," neurocomputing, vol. 167, pp. 243–253, nov. 2015, doi: 10.1016/j.neucom.2015.04.071.
[29] e. abounoori, z. (mila) elmi, and y. nademi, "forecasting tehran stock exchange volatility; markov switching garch approach," phys. a stat. mech. its appl., vol. 445, pp. 264–282, mar. 2016, doi: 10.1016/j.physa.2015.10.024.
[30] m. ca' zorzi, m. kolasa, and m. rubaszek, "exchange rate forecasting with dsge models," j. int. econ., vol. 107, pp. 127–146, 2017, doi: 10.1016/j.jinteco.2017.03.011.
[31] s. sun, s. wang, and y. wei, "a new multiscale decomposition ensemble approach for forecasting exchange rates," econ. model., vol. 81, pp. 49–58, sep. 2019, doi: 10.1016/j.econmod.2018.12.013.
[32] k. he, y. chen, and g. k. f. tso, "forecasting exchange rate using variational mode decomposition and entropy theory," phys. a stat. mech. its appl., vol. 510, pp. 15–25, nov. 2018, doi: 10.1016/j.physa.2018.05.135.
[33] i. dzalbs and t. kalganova, "forecasting price movements in betting exchanges using cartesian genetic programming and ann," big data res., vol. 14, pp. 112–120, dec. 2018, doi: 10.1016/j.bdr.2018.10.001.
[34] c. amat, t. michalski, and g. stoltz, "fundamentals and exchange rate forecastability with simple machine learning methods," j. int. money financ., vol. 88, pp. 1–24, nov. 2018, doi: 10.1016/j.jimonfin.2018.06.003.
[35] y. wei, s. sun, j. ma, s. wang, and k. k. lai, "a decomposition clustering ensemble learning approach for forecasting foreign exchange rates," j. manag. sci. eng., vol. 4, no. 1, pp. 45–54, mar. 2019, doi: 10.1016/j.jmse.2019.02.001.
[36] s. fu, y. li, s. sun, and h. li, "evolutionary support vector machine for rmb exchange rate forecasting," phys. a stat. mech. its appl., vol. 521, pp. 692–704, may 2019, doi: 10.1016/j.physa.2019.01.026.
[37] l. ni, y. li, x. wang, j. zhang, j. yu, and c. qi, "forecasting of forex time series data based on deep learning," procedia comput. sci., vol. 147, pp. 647–652, 2019, doi: 10.1016/j.procs.2019.01.189.
[38] r. wang, b. morley, and m. p. stamatogiannis, "forecasting the exchange rate using nonlinear taylor rule based models," int. j. forecast., vol. 35, no. 2, pp. 429–442, apr. 2019, doi: 10.1016/j.ijforecast.2018.07.017.
[39] a. bagheri, n. nariman-zadeh, a. s. siavash, and a. r. khoobkar, "gmdh type neural networks and their application to the identification of the inverse kinematic equations of robotic manipulators," int. j. eng., vol. 18, no. 2, pp. 135–143, 2005.
[40] a. bagheri, h. mohammadi peyhani, and m. akbari, "financial forecasting using anfis networks with quantum-behaved particle swarm optimization," expert syst. appl., vol. 41, no. 14, pp. 6235–6250, oct. 2014, doi: 10.1016/j.eswa.2014.04.003.
[41] s. dick and a. kandel, "granular computing in neural networks," studies in fuzziness and soft computing, pp. 275–305, 2001, doi: 10.1007/978-3-7908-1823-9_12.
[42] b. apolloni, s. bassis, j. rota, g. l. galliani, m. gioia, and l. ferrari, "a neurofuzzy algorithm for learning from complex granules," granul. comput., vol. 1, no. 4, pp. 225–246, dec. 2016, doi: 10.1007/s41066-016-0018-1.
[43] s. j. farlow, "the gmdh algorithm of ivakhnenko," am. stat., vol. 35, no. 4, pp. 210–215, nov. 1981, doi: 10.1080/00031305.1981.10479358.
knowledge engineering and data science (keds) pissn 2597-4602, eissn 2597-4637
vol 4, no 1, july 2021, pp. 38–48
https://doi.org/10.17977/um018v4i12021p38-48
©2021 knowledge engineering and data science | w: http://journal2.um.ac.id/index.php/keds | e: keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)
keds is a sinta 2 journal (https://sinta.ristekbrin.go.id/journals/detail?id=6662) accredited by the indonesian ministry of research & technology

indonesian sentence boundary detection using deep learning approaches

joan santoso a, 1, *, esther irawati setiawan a, 2, christian nathaniel purwanto b, 3, fachrul kurniawan c, 4
a department of information technology, institut sains dan teknologi terpadu surabaya, jalan ngagel jaya tengah 73–77, surabaya, indonesia
b electrical engineering and computer science, national yang ming chiao tung university, 1001 university road, hsinchu, taiwan
c department of informatics, maulana malik ibrahim state islamic university, jalan gajayana no. 50, malang, indonesia
1 joan@istts.ac.id*; 2 esther@istts.ac.id; 3 chrisnp.ee08@nycu.edu.tw; 4 fachrulk@ti.uin-malang.ac.id
* corresponding author

i. introduction

sentence segmentation, or tokenization, is a primary text-processing step in natural language processing [1]. before processing each word token, we need to detect whether the tokens belong to the same sentence or not. sentence boundary detection is used to split every sentence in a document, so that this boundary information can be passed to the following process. this task is a crucial one for natural language processing.
the core of detecting a sentence boundary is identifying the end of a sentence [2]. a full stop mark "." usually ends the sentence, but not in all cases; for example, the full stop mark may denote an abbreviation, a decimal value, or even a currency value. other punctuation marks that may end a sentence are the question mark and the exclamation mark, and even an arbitrary word may finish a sentence. it takes many rules to cover all the possibilities, as every writer comes with their own writing style, and many rules mean a lot of effort and time.

several studies use sentence boundary detection for text pre-processing. walker [3] improves the accuracy of machine translation using a sentence splitter. liu [4][5] and roark [6] detect sentence boundaries in conversation. goldstein [7] and erwin [8] use a sentence extractor to summarize a document. rudrapal [9] uses sentence boundary detection for social media text. another study by chang et al. [10] uses sentence position as a feature for question answering.

article info
article history: received 7 february 2021; revised 22 may 2021; accepted 21 june 2021; published online 17 august 2021
keywords: bahasa indonesia; bidirectional lstm; natural language processing; sentence boundary detection; sequence classification

abstract: detecting the sentence boundary is one of the crucial pre-processing steps in natural language processing. it can define the boundary of a sentence, since the border between one sentence and another might be ambiguous. because there are multiple separators and dynamic sentence patterns, using a full stop at the end of a sentence is sometimes inappropriate. this research uses a deep learning approach to split each sentence from an indonesian news document; hence, there is no need to define any handcrafted features or rules. as in part-of-speech tagging and named entity recognition, we use sequence labeling to determine sentence boundaries. two labels are used, namely o as a non-boundary token and e as the marker of the last token in the sentence. to do this, we used the bi-lstm approach, which has been widely used in sequence labeling. we have shown that our approach works for indonesian text using pre-trained indonesian embeddings, as in previous studies. this study achieved an f1-score of 98.49 percent; compared to previous studies, this represents a significant improvement. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

sentence boundary detection can help the pre-processing phase and further improve performance. we choose the deep learning approach to simplify the learning process without crafting any rules by hand, as in the traditional machine learning approach. we decided to use a bidirectional lstm because of its ability to remember long-term sequences in both directions. by using this model, we do not need handcrafted features as in previous research; the model only needs the word tokens. there are several reasons why we conduct this research for bahasa indonesia. the main reason is the limitation of available tools and resources. moreover, there is a need for tokenizing sentences: natural language processing approaches can use the tokenizing task as a basis for further tasks. sentence boundary detection is crucial as a pre-processing phase of many natural language processing tasks.
one use is in simultaneous translation, where sentence boundary detection can detect sentences before the translation process [11]. sentence boundary detection is also needed for chatbots [12], machine translation, named entity recognition, and coreference resolution [13]. previous researchers have applied several machine learning approaches to sentence boundary detection, i.e., unsupervised learning [14], rule-based methods [13][15], maximum entropy [16], hidden markov models [17], conditional random fields [18], support vector machines [19], and confusion networks [20]. we use a deep learning approach to detect the sentence boundary, as in our previous work [21]. sentence boundary detection has been studied in other languages such as english [22], portuguese [23], french [24], vietnamese [25], chinese [26], japanese [27], marathi [28], kannada [29], arabic [30], and urdu [31]; another study is in thai, with a bi-lstm-cnn approach [32]. in indonesian, sentence boundary detection has been presented using maximum entropy [33] and bidirectional lstm [21].

our contribution is aimed directly at text processing in bahasa indonesia. the result of sentence boundary detection might be used for extracting information or, even further, for solving other natural language processing problems. to our knowledge, we are the first to propose sentence boundary detection with deep learning for indonesian. after the tokenization of a document, the decision of whether a punctuation mark ends a sentence is sometimes ambiguous. in this research, a sequential learning method is used to classify each token according to whether it marks the end of a sentence or not. we use deep learning to provide a crucial pre-processing step that detects each sentence in a text document. our sentence boundary detector can be used as a feature extractor for later tasks. furthermore, we also show that the deep learning model is capable of detecting sentence boundaries.
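the o/e labeling scheme described in the abstract can be illustrated with a short sketch. this is an illustration only: `labels_to_sentences` is a hypothetical helper, and the token/label pairs are constructed from examples used later in this paper.

```python
def labels_to_sentences(tokens, labels):
    """rebuild sentences from per-token o/e labels: 'e' marks the
    last token of a sentence, 'o' a non-boundary token."""
    sentences, current = [], []
    for tok, lab in zip(tokens, labels):
        current.append(tok)
        if lab == "e":
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a closing 'e' label
        sentences.append(" ".join(current))
    return sentences

tokens = ["presiden", "ir.", "h.", "joko", "widodo",
          "berkunjung", "ke", "surabaya", ".",
          "dia", "tiba", "pukul", "10.30", "."]
labels = ["o"] * 8 + ["e"] + ["o"] * 4 + ["e"]
sents = labels_to_sentences(tokens, labels)
```

note that the full stops inside "ir.", "h.", and "10.30" are labeled o, so the fourteen tokens are correctly grouped into two sentences; a sequence labeler only has to predict one of two classes per token.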
our approach achieves a higher f1 score than the previous approach, with no need to build any handcrafted rules.

ii. method

this section explains the steps of our research framework. the first step explains how we build our corpus for sentence boundary detection, from obtaining the raw data to processing it into a labeled dataset, followed by further discussion of the proposed architecture. the discussion is divided according to each architecture layer: the input layer, the bidirectional lstm cells, and the output layer. this section also includes an additional explanation of the optimization method used.

a. the problem in sentence boundary detection for bahasa indonesia

this section explains some problems that occur when detecting sentence boundaries for bahasa indonesia [33]. all of them are based on the ambiguity that punctuation marks may not always end a sentence [34]. we have listed each problem with a few examples, along with several points of discussion for each case.

the first problem is the writing of titles and degrees. when writing someone's title, the writer often uses the short version of the title or degree. as seen in the first example, "h" is a title for someone who has performed the pilgrimage, and "ir" is an academic degree for an engineering major; the title "h" stands for "haji" and the title "ir" stands for "insinyur". this example shows the use of a full stop mark to shorten the writing of the title or degree. the full stop mark in a title or degree does not end the sentence. the case is different when the title or degree is placed at the end of the sentence: in the second example, the full stop mark in "kom." ends the sentence because it is the last word.

1. presiden ir. h. joko widodo berkunjung ke surabaya.
   president ir. h. joko widodo visited surabaya.
2. kelas kami diajar oleh joan santoso, m.kom.
   our class is taught by joan santoso, m.sc.
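a handcrafted-rule baseline for this first problem can be sketched as a whitelist of known abbreviations. this sketch is illustrative, not the paper's method: the function and whitelist are hypothetical (seeded with abbreviations listed in this section), and the second input sentence ("dia hebat." / "he is great.") is an invented continuation used to expose the failure case.

```python
# a full stop ends a sentence unless the preceding token is a known
# abbreviation -- the rule-listing approach the text argues against.
ABBREVIATIONS = {"ir", "h", "m.kom", "tgl", "s.d", "jl", "a.n", "d.a", "hlm"}

def rule_based_split(tokens):
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith(".") and tok[:-1].lower() not in ABBREVIATIONS:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

tokens1 = "presiden ir. h. joko widodo berkunjung ke surabaya .".split()
tokens2 = "kelas kami diajar oleh joan santoso , m.kom. dia hebat .".split()
```

the whitelist handles the first example correctly (one sentence), but it cannot handle the second case from the text, where the abbreviation "m.kom." itself ends the sentence: the splitter wrongly merges the two sentences into one. this ambiguity is exactly what motivates learning the boundary instead of listing rules.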
the abbreviation-for-names case comes up when writing a long name. the writer usually makes the name shorter by using the first character of each word and giving a full stop mark after each abbreviation; such abbreviations are written in uppercase letters. it is hard to list all of the abbreviations for names because many names are used in the document collection. a full stop mark in an abbreviated name does not end a sentence; however, it ends the sentence if the abbreviation is placed at the end of the sentence. this case is similar to the first problem, which happens when writing someone's title and degree. as seen in the first example, the writer uses "w.", which stands for "widodo", to shorten the name. in the second example, the full stop mark after "s.", which stands for "santoso", ends the sentence because it is the last word.

1. presiden ir. h. joko w. berkunjung ke surabaya.
   president ir. h. joko w. visited surabaya.
2. kelas kami kedatangan alumni bernama joan s.
   our class is visited by an alumnus named joan s.

the third problem is related to common abbreviations. there are some standard abbreviations used in bahasa indonesia, for example: "a.n." (atas nama / by the name of), "s.d." (sampai dengan / until), "d.a." (dengan alamat / placed in), "jl." (jalan / street), "hlm." (halaman / page), etc. a full stop mark in this kind of abbreviation does not end a sentence; usually, the writer uses these abbreviations in the middle of the sentence. the first example shows that the full stop mark in "tgl." shortens the original word "tanggal". the second example uses the abbreviation "s.d." to shorten "sampai dengan". in the third example, the writer could write the original word "jalan" or just "jl." for short.

1. dia akan pergi pada tgl. 25 agustus 2018.
   he will go on 25 august 2018.
2. dia akan pergi dari senin s.d. minggu.
   he will go from monday to sunday.
3. dia akan pergi ke jl. ngagel jaya.
   he will go to ngagel jaya street.
The time separator is the fourth problem. Times can be separated using punctuation marks, and the full stop used as a time separator does not end the sentence. In the first example, the time expression "10.30" does not end its sentence; the stop mark only separates 10 hours from 30 minutes. The second example likewise uses a full stop to separate hours and minutes. Both examples show the use of the stop mark for time expressions within a sentence.

1. Dia telah tiba di Surabaya pukul 10.30 WIB. (He had arrived in Surabaya at 10.30 WIB.)
2. Pada pagi hari jam 08.05, sang pembunuh menemui korban. (At 08.05 in the morning, the murderer met the victim.)

The next problem is the money separator, which can also be expressed using punctuation marks. In the first example, the full stops in "100.000" do not end the sentence; they separate the amount of money. In Bahasa Indonesia, amounts are usually grouped per three digits to make them easier to read. The second example expresses a money amount together with the currency used: "Rp." is the formal way to write the Indonesian currency. There is another way to express money in Bahasa Indonesia, as shown in the third example; the only difference is the use of ",-" to end the money expression.

1. Buku ini seharga 100.000. (This book costs 100,000.)
2. Tas ini seharga Rp. 100.000. (This bag costs Rp. 100,000.)
3. Meja ini seharga Rp. 100.000,-. (This table costs Rp. 100,000.)

Another problem is the number separator. A full stop is used to group digits per thousand, not only in money but in any number. For example, "1.123" in the first example below contains a full stop that separates the digits when expressing the number of people who died in the earthquake.
The second example shows the use of a full stop to separate the number of smartphones. Almost any numeric expression uses a full stop as a thousands separator; like the money separator, it makes the number easier to read and understand.

1. Gempa pekan lalu menimbulkan korban sekitar 1.123 jiwa. (Last week's quake caused around 1,123 casualties.)
2. Ada 1.500.000 ponsel pintar yang terhubung ke server kami. (There are 1,500,000 smartphones connected to our server.)

Email-formatted text can also be problematic because it may contain more than one full stop. The first example shows a standard email address. The second example shows that the number of full stops in an email address can be arbitrarily large, since users can freely choose a custom name for their email. The third example shows one of many informal ways to write an email address; building rules for every such case is time-consuming. Moreover, as in the fourth example, the writer can use the word "dot" instead of a full stop.

1. Pertanyaan lain dapat dikirimkan ke email christian.np@indocl.stts.edu. (Other questions can be sent to christian.np@indocl.stts.edu.)
2. Email kami yaitu people.hrd.tech.123@main.hrd.indocl.stts.edu. (Our email is people.hrd.tech.123@main.hrd.indocl.stts.edu.)
3. Email saya adalah christian at indocl.stts.edu. (My email is christian at indocl.stts.edu.)
4. Email dia adalah christian.np at indocl dot stts dot edu. (His email is christian.np at indocl dot stts dot edu.)

The eighth problem is username-formatted text. Writers sometimes quote social media posts and include the username, and there is no limit on the number of full stops a username may contain; a full stop in a username does not end the sentence. The first example shows a full stop in an ordinary username, "@christian.np". On the contrary, a username can also contain many full stops, as in the second example.
"@christian.n.p.stts.sby" contains several full stops. This case rarely happens, but it is still possible for a username to have many full stops.

1. Akun @christian.np juga mengatakan hal yang serupa. (Account @christian.np also said the same thing.)
2. @christian.n.p.stts.sby @joan.s. ayo pergi ke Bali bulan depan! (@christian.n.p.stts.sby @joan.s. let's go to Bali next month!)

Sentence emphasis is often used when the author wants to stress some meaning in the text. This kind of writing is often found in drama scripts to express feeling through the writing, and the writer can combine different punctuation marks according to his or her creativity. The same issue arises when handling free-structured text from social media: chats, comments, and posts do not follow fixed writing rules, and users write according to current trends, so the rule for splitting sentences differs from case to case. Sometimes multiple punctuation marks are combined into a single token; usually the last punctuation mark in the token is the one that ends the sentence. A question mark may not finish a sentence when it appears together with another punctuation mark, as in "?!". As seen in Figure 1, the question mark after the word "Surabaya" does not end the sentence; the exclamation mark after it does. Likewise, a single punctuation mark may not end the sentence if it is placed in line with others: in the token "!!!", the last exclamation mark is the one that ends the sentence.

Fig. 1. Sentence emphasis example.
Fig. 2. Non-punctuation token example from detik.com.

The next problem occurs in dialogue text. Some conversations may consist of multiple sentences.
When we try to split them up, we lose the context that determines to whom the sentences belong. The sentences seem to have their own context, but they all share the same one. We therefore adopt the convention that all words spoken by one person at a particular time count as a single sentence, even if there is more than one sentence inside the quotation. This convention may differ from other sentence tokenizer tools, which tokenize based on the end of a sentence rather than the context of the whole text.

1. "Siapa namamu?" tanya Joan. ("What is your name?" asked Joan.)
2. "Hai! Nama saya Christian NP. Saya senang berkenalan denganmu!" ujar Christian. ("Hi! My name is Christian NP. I am glad to know you!" said Christian.)

The first example is a common writing style in which the dialogue contains only one sentence. The second example is more complex: it consists of three different sentences, "Hai!", "Nama saya Christian NP.", and "Saya senang berkenalan denganmu!". We count these three sentences, together with the main sentence, as a single sentence, because the context is the same: they are all spoken by one person at a particular time.

The last problem is the non-punctuation token. Analyzing our dataset, we found that a sentence does not always end with a punctuation mark; a sentence may end with an ordinary word. This usually happens when a list is converted to plain text: each point in a list can end either with a punctuation mark, such as a full stop, or with just a word. Figure 2 [35] shows the output of sentence tokenization applied to a list. The colon ":" ends the first sentence, which describes the list, and the remaining text is split according to the list numbering. As can be seen, the endings of the second and subsequent sentences differ; the last token may be "Widjojanto", "Husein", "Hehamahua", or any other word that ends the sentence.
In another view, the full stop after a list index is combined with the current sentence: such numbers serve as an index and do not end the sentence.

B. Data Preparation

Our corpus is built from Indonesian news documents. All news was crawled from two news sites, Detik and Kompas, from 2011 until 2012. Each article was extracted and parsed to obtain the text; we removed unused information such as ads, pictures, video, and audio because we only need plain text. We then conducted post-processing, which converts all list-typed content to readable text and performs tokenization at the word level; the product is a sequence of tokens, each either a word or a punctuation mark. In the last step, we split each sentence manually for all documents. There are 14,142 sentences in total from all documents.

(Figure 1 content) Input: Akankah Bapak Gunawan berkunjung ke Surabaya?! Kami yakin beliau datang!!! Output: Akankah Bapak Gunawan berkunjung ke Surabaya?! Kami yakin beliau datang!!!

(Figure 2 content) Berikut 8 nama calon pimpinan KPK hasil seleksi pansel yang dikirim presiden ke DPR: 1. Bambang Widjojanto 2. Yunus Husein 3. Abdullah Hehamahua 4. Handoyo Sudradjat 5. Abraham Samad 6. Jaksa Zulkarnain 7. Adnan Pandu Praja 8. Irjen Pol (Purn) Aryanto Sutadi

Fig. 3. Data preparation from the news site Detik.

Figure 3 [36] displays an example of the data preparation process; the rest of the dataset follows the same process. On the left side is the original HTML-formatted text from Detik news, and on the right side is the result, with 20 sentences in total. Each sentence is split based on its context, as discussed in the previous section. The last part is a section with list-typed text, separated point by point; the numbering is essential information for further tasks.

C. Sequence Classification for Sentence Boundary Detection

Long short-term memory (LSTM) networks are established for part-of-speech (POS) tagging, named entity recognition (NER), and noun phrase chunking. Sentence boundary detection can be seen as a sequence classification problem, where we want to label every timestep of the input, or as a collocation identification problem [37]. Every input token is predicted to be the end of a sentence or not, based on the previous tokens. We built an architecture based on the nature of the problem, paying attention to the whole sequence rather than to individual predictions; thus, we use a bidirectional LSTM to capture sequential features from both directions (left-to-right and right-to-left). Figure 4 visualizes our system architecture.

We divide the architecture into three layers: the input layer, the sequence learning layer, and the output layer. The input is a sequence of tokens from a single sentence, and the output is a corresponding sequence of labels. The input layer simply converts each token into a vector using word embedding; the word vectors are then processed by the sequence learning layer, for which we use a bidirectional LSTM. Finally, all predicted results are converted into final predictions in the output layer. The prediction indicates which tokens are identified as sentence endings and which are not. We use the Adam optimizer to help the learning process.

Our input layer is a token embedding layer, converting each input token into a vector. Every token is a string, either a word or a punctuation mark. We use skip-gram word2vec as our embedding model; it provides a semantic representation of a token and captures the similarity of context between different words. However, word2vec has a drawback when handling unknown words.
Word2vec cannot provide a vector representation for a word that was not seen during training. To counter this problem, we use a randomly initialized trained vector to represent every unknown word.

(Figure 3 content: the tokenized Detik article "10 destinasi terbaik Asia 2018 versi Lonely Planet", split into 20 sentences, of which sentences 11 to 20 come from the article's ten-point list of destinations, from "1. Busan, Korea Selatan" through "10. Taman Nasional Komodo, Indonesia".)

Sequence learning is used to predict the outputs from the given inputs. We use a bidirectional LSTM, which uses two different LSTM cells: a forward learner and a backward learner. The forward LSTM reads the input from the first token to the last, and the backward LSTM reads it from the last token to the first; the results from both cells are concatenated. In Figure 4, the gray circle denotes the input to an LSTM cell, and the colored circles denote the gates of the LSTM cell: yellow for the activation gate, green for the input gate, red for the forget gate, and blue for the output gate. The light blue circle is the cell state, which holds long-term memory from previous calculations.

$a_t^{fwd} = \tanh(W_a^{fwd} x_t^{fwd} + U_a^{fwd} h_{t-1}^{fwd} + b_a^{fwd})$ (1)
$i_t^{fwd} = \sigma(W_i^{fwd} x_t^{fwd} + U_i^{fwd} h_{t-1}^{fwd} + b_i^{fwd})$ (2)
$f_t^{fwd} = \sigma(W_f^{fwd} x_t^{fwd} + U_f^{fwd} h_{t-1}^{fwd} + b_f^{fwd})$ (3)
$o_t^{fwd} = \sigma(W_o^{fwd} x_t^{fwd} + U_o^{fwd} h_{t-1}^{fwd} + b_o^{fwd})$ (4)
$c_t^{fwd} = c_{t-1}^{fwd} * f_t^{fwd} + i_t^{fwd} * a_t^{fwd}$ (5)
$h_t^{fwd} = o_t^{fwd} * \tanh(c_t^{fwd})$ (6)

Equations (1) to (6) are the mathematical functions for the forward LSTM cell: (1) is the activation gate, (2) the input gate, (3) the forget gate, (4) the output gate, (5) the cell state, and (6) the prediction from the forward LSTM cell. Equations (7) to (12) are the analogous functions for the backward cell, and (13) produces the final prediction of both LSTM cells by concatenating their two output vectors.

$a_t^{bwd} = \tanh(W_a^{bwd} x_t^{bwd} + U_a^{bwd} h_{t-1}^{bwd} + b_a^{bwd})$ (7)
$i_t^{bwd} = \sigma(W_i^{bwd} x_t^{bwd} + U_i^{bwd} h_{t-1}^{bwd} + b_i^{bwd})$ (8)
$f_t^{bwd} = \sigma(W_f^{bwd} x_t^{bwd} + U_f^{bwd} h_{t-1}^{bwd} + b_f^{bwd})$ (9)
$o_t^{bwd} = \sigma(W_o^{bwd} x_t^{bwd} + U_o^{bwd} h_{t-1}^{bwd} + b_o^{bwd})$ (10)
$c_t^{bwd} = c_{t-1}^{bwd} * f_t^{bwd} + i_t^{bwd} * a_t^{bwd}$ (11)
$h_t^{bwd} = o_t^{bwd} * \tanh(c_t^{bwd})$ (12)
$h_t = \mathrm{concat}(h_t^{fwd}, h_t^{bwd})$ (13)

The output layer converts every vector from the sequence learner into a predicted label using the softmax function, which provides a probability distribution over the labels; the label with the largest probability is output. The output labels are "E" (EndOfSentence) and "O" (Others): "E" means the current token ends a sentence, while "O" means it does not. Because of the sequential nature of the model, every input token receives exactly one output label.

Fig. 4. System architecture.

In this research, we chose the Adam optimizer to obtain an appropriate gradient-based update for each weight in the network. Adam combines an adaptive learning rate with momentum; technically, every weight is updated using the gradient statistics computed by Adam. Algorithm 1 gives the pseudocode of the Adam optimizer; the default value of each hyperparameter follows the original paper [38].

Algorithm 1: Adam optimizer
while not converged do:
    t = t + 1
    g_t = getGradient(θ_{t-1})
    m_t = β1 · m_{t-1} + (1 − β1) · g_t
    v_t = β2 · v_{t-1} + (1 − β2) · g_t²
    m̂_t = m_t / (1 − β1^t)
    v̂_t = v_t / (1 − β2^t)
    θ_t = θ_{t-1} − α · m̂_t / (√v̂_t + ε)
return θ_t

III. Results and Discussions

We conducted several experiments to prove the capability of the proposed architecture. We provide test cases by fine-tuning a few hyperparameters, and we also report a standard (unidirectional) LSTM as a baseline for comparison with our bidirectional LSTM model. We ran different scenarios based on changes to the hyperparameters, each using the same dataset: 70% of the corpus (9,953 sentences) for training and 30% (4,189 sentences) for testing. The random seed was fixed so that only the effect of the hyperparameter settings is observed.
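Before turning to the results, Eqs. (1) to (13) can be made concrete with a minimal NumPy sketch of a forward and backward LSTM pass with per-timestep concatenation. The dimensions, random initialization, and function names here are illustrative only, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_params(d_in, d_hid):
    # One (W, U, b) triple per gate: activation, input, forget, output.
    return ({g: rng.normal(0, 0.1, (d_hid, d_in)) for g in "aifo"},
            {g: rng.normal(0, 0.1, (d_hid, d_hid)) for g in "aifo"},
            {g: np.zeros(d_hid) for g in "aifo"})

def lstm_step(x, h_prev, c_prev, W, U, b):
    a = np.tanh(W["a"] @ x + U["a"] @ h_prev + b["a"])  # activation gate, Eq. (1)
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate, Eq. (2)
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate, Eq. (3)
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate, Eq. (4)
    c = c_prev * f + i * a                              # cell state, Eq. (5)
    h = o * np.tanh(c)                                  # hidden output, Eq. (6)
    return h, c

def bilstm(xs, d_hid):
    fwd = make_params(xs[0].shape[0], d_hid)
    bwd = make_params(xs[0].shape[0], d_hid)
    def run(seq, params):
        h, c, outs = np.zeros(d_hid), np.zeros(d_hid), []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            outs.append(h)
        return outs
    hs_fwd = run(xs, fwd)              # reads first token -> last token
    hs_bwd = run(xs[::-1], bwd)[::-1]  # reads last token -> first token
    # Eq. (13): concatenate forward and backward states per timestep.
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_fwd, hs_bwd)]

tokens = [rng.normal(size=4) for _ in range(5)]  # five dummy token embeddings
states = bilstm(tokens, d_hid=8)
print(len(states), states[0].shape)  # one 16-dim concatenated state per token
```

Each concatenated state would then be fed to the softmax output layer to decide between the "E" and "O" labels.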
There were two broad categories of models. We tested every model while varying the number of hidden units of the LSTM cell, the number of layers, and the number of training iterations. Table 1 contains all experiments with the different methods: each row represents a method, and each column a number of iterations. Every method varies either the LSTM type or the word embedding. Based on the results in Table 1, we found that BiLSTM (bidirectional LSTM) works better than UniLSTM (unidirectional LSTM). Word embedding makes only a small difference in overall accuracy, and additional iterations increase accuracy, but only slightly in later iterations.

Table 1. Experiment results (columns give the number of training iterations)

Method               | 10     | 20     | 50     | 100
UniLSTM              | 96.79% | 96.94% | 97.14% | 96.81%
BiLSTM               | 96.95% | 97.43% | 98.22% | 98.47%
UniLSTM + word2vec*  | 96.91% | 96.41% | 97.44% | 97.48%
BiLSTM + word2vec*   | 97.09% | 98.10% | 98.39% | 98.49%
*We use the skip-gram model for word2vec.

We also conducted another trial to identify the effect of the word embedding dimension, using 50% of the training documents split into 70% of sentences for training and 30% for testing. The results are 98.43% for 50 dimensions, 98.30% for 100, 98.32% for 150, 98.46% for 200, 98.50% for 250, 98.57% for 300, 98.08% for 350, 98.26% for 400, 98.63% for 450, and 98.26% for 500 dimensions. Our final result is 98.49% when using the BiLSTM model with word2vec embedding and 100 iterations.

The second experiment compares the performance of the proposed method with several approaches from previous state-of-the-art research. Since the problem is modeled as sequential tagging over the input token sequence, several sequential tagging methods are used for comparison: maximum entropy, decision tree, and naïve Bayes. In addition to these traditional, non-deep-learning models, the performance of the proposed method is also compared with the previous Bi-LSTM study by Purwanto et al. [21]. The experimental results can be seen in Table 2.

Based on these results, the best performance of the Bi-LSTM proposed in this study is approximately 13% higher than the approaches that do not use deep learning, and approximately 2% higher than the Bi-LSTM proposed in [21]. The reason is that the proposed approach uses two labels, while the approach in [21] uses four; using two labels gives the best results compared to four labels in previous studies, especially in sentence boundary detection research.

IV. Conclusion

We have done several experiments to prove the capability of the bidirectional LSTM as a sequence learner for sentence boundary detection. We view this task as a sequential problem in which every input token is predicted to end a sentence or not. Based on our experiments, we reached a 98.49% F1 score with the bidirectional LSTM as our sequence learner and a trained embedding as the best model. We also compared our approach with other widely used sequence classification methods. We conclude that the bidirectional LSTM clearly outperforms a unidirectional LSTM. In our case, word2vec does not effectively capture sentence boundaries for Indonesian news documents: our last trial gives a similar F1 score whether a low- or high-dimensional embedding is used.

Acknowledgment

The authors thank Institut Sains dan Teknologi Terpadu Surabaya (ISTTS) for supporting this research.
also, we want to thank our laboratory member from natural language processing laboratory from istts for helping us finish this research. declarations author contribution all authors contributed equally as the main contributor of this paper. all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information is available at http://journal2.um.ac.id/index.php/keds. publisher’s note: department of electrical engineering unversitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations. references [1] d. jurafsky and h. james, martin: speech and language processing: an introduction to natural language processing, speech recognition, and computational linguistics. prentice-hall, englewood cliffs, 2008. [2] j. read, r. dridan, s. oepen, and l. j. solberg, “sentence boundary detection: a long solved problem?,” in proceedings of coling 2012: posters, 2012, pp. 985–994. [3] d. j. walker, d. e. clements, m. darwin, and j. w. amtrup, “sentence boundary detection: a comparison of paradigms for improving mt quality,” in proceedings of the mt summit viii, 2001, vol. 58. table 2. experimental results compared with other studies no. previous research performance 1 maximum entropy 87.91% 2 decision tree 82.23% 3 naïve bayes 86.28% 4 bi-lstm by purwanto et al. 
[21] 96.57% 5 our proposed model 98.49% http://journal2.um.ac.id/index.php/keds https://www.researchgate.net/publication/200111340_speech_and_language_processing_an_introduction_to_natural_language_processing_computational_linguistics_and_speech_recognition https://www.researchgate.net/publication/200111340_speech_and_language_processing_an_introduction_to_natural_language_processing_computational_linguistics_and_speech_recognition https://aclanthology.org/c12-2096/ https://aclanthology.org/c12-2096/ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.4928 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.4928 j. santoso et al. / knowledge engineering and data science 2021, 4 (1): 38–48 47 [4] y. liu, a. stolcke, e. shriberg, and m. harper, “comparing and combining generative and posterior probability models: some advances in sentence boundary detection in speech,” in proceedings of the 2004 conference on empirical methods in natural language processing, 2004, pp. 64–71. [5] y. liu, a. stolcke, e. shriberg, and m. harper, “using conditional random fields for sentence boundary detection in speech,” in proceedings of the 43rd annual meeting on association for computational linguistics, 2005, pp. 451–458. [6] b. roark et al., “reranking for sentence boundary detection in conversational speech,” in 2006 ieee international conference on acoustics speech and signal processing proceedings, 2006, vol. 1, pp. i--i. [7] j. goldstein, v. mittal, j. carbonell, and m. kantrowitz, “multi-document summarization by sentence extraction,” in proceedings of the 2000 naacl-anlp workshop on automatic summarization, 2000, pp. 40–48. [8] e. y. hidayat, f. firdausillah, k. hastuti, i. n. dewi, and a. azhari, “automatic text summarization using latent drichlet allocation (lda) for document clustering,” int. j. adv. intell. informatics, vol. 1, no. 3, pp. 132–139, 2015. [9] d. rudrapal, a. jamatia, k. chakma, a. das, and b. 
gambäck, “sentence boundary detection for social media text,” in proceedings of the 12th international conference on natural language processing, 2015, pp. 254–260. [10] x. chang and q. zheng, “offline definition extraction using machine learning for knowledge-oriented question answering,” in international conference on intelligent computing, 2007, pp. 1286–1294. [11] r. zhang and c. zhang, “dynamic sentence boundary detection for simultaneous translation,” proceedings of the first workshop on automatic simultaneous translation, 2020. [12] t. a. le, “sequence labeling approach to the task of sentence boundary detection,” in acm international conference proceeding series, jan. 2020, pp. 144–148, doi: 10.1145/3380688.3380703. [13] n. sadvilkar and m. neumann, “pysbd: pragmatic sentence boundary disambiguation,” oct. 2020, [online]. available: http://arxiv.org/abs/2010.09657. [14] t. kiss and j. strunk, “unsupervised multilingual sentence boundary detection,” comput. linguist., vol. 32, no. 4, pp. 485–525, 2006. [15] j. wang, y. zhu, and y. jin, “a rule-based method for chinese punctuations processing in sentences segmentation,” in 2014 international conference on asian language processing (ialp), 2014, pp. 195–198. [16] j. c. reynar and a. ratnaparkhi, “a maximum entropy approach to identifying sentence boundaries,” in proceedings of the fifth conference on applied natural language processing, 1997, pp. 16–19. [17] b. jurish and k.-m. würzner, “word and sentence tokenization with hidden markov models.,” jlcl, vol. 28, no. 2, pp. 61–83, 2013. [18] k. tomanek, j. wermter, and u. hahn, “sentence and token splitting based on conditional random fields,” in proceedings of the 10th conference of the pacific association for computational linguistics, 2007, vol. 49, p. 57. [19] y. akita, m. saikou, h. nanjo, and t. kawahara, “sentence boundary detection of spontaneous japanese using statistical language model and support vector machines,” 2006. [20] d. hillard, m. 
ostendorf, a. stolcke, y. liu, and e. shriberg, “improving automatic sentence boundary detection with confusion networks,” in proceedings of hlt-naacl 2004: short papers, 2004, pp. 69–72. [21] c. n. purwanto, a. t. hermawan, j. santoso, and gunawan, “distributed training for multilingual combined tokenizer using deep learning model and simple communication protocol,” in 2019 1st international conference on cybernetics and intelligent system (icoris), 2019, vol. 1, pp. 110–113. [22] d. gillick, “sentence boundary detection and the problem with the us,” in proceedings of human language technologies: the 2009 annual conference of the north american chapter of the association for computational linguistics, companion volume: short papers, 2009, pp. 241–244. [23] c. n. silla and c. a. a. kaestner, “an analysis of sentence boundary detection systems for english and portuguese documents,” in international conference on intelligent text processing and computational linguistics, 2004, pp. 135– 141. [24] c.-e. gonzález-gallardo and j.-m. torres-moreno, “sentence boundary detection for french with subword-level information vectors and convolutional neural networks,” arxiv prepr. arxiv1802.04559, 2018. [25] h. p. le and t. v. ho, “a maximum entropy approach to sentence boundary detection of vietnamese texts,” 2008. [26] n. xue and y. yang, “chinese sentence segmentation as comma classification,” in proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, jun. 2011, pp. 631–635, [online]. available: https://www.aclweb.org/anthology/p11-2111. [27] k. shitaoka, k. uchimoto, t. kawahara, and h. isahara, “dependency structure analysis and sentence boundary detection in spontaneous japanese,” in proceedings of the 20th international conference on computational linguistics, 2004, pp. 1107–es, doi: 10.3115/1220355.1220514. [28] n. wanjari, g. m. dhopavkar, and n. b. 
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 4, No 2, December 2021, pp.
85–96, eISSN 2597-4637, https://doi.org/10.17977/um018v4i22021p85-96. ©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a Sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Education, Culture, Research, and Technology.

A Comprehensive Analysis of Reward Function for Adaptive Traffic Signal Control

Abu Rafe Md Jamil 1, Naushin Nower 2,*
Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
1 bsse0722@iit.du.ac.bd; 2 naushin@iit.du.ac.bd*
* corresponding author

I. Introduction

Modern society primarily depends on public road transport for the movement of people and goods, so the transportation system is critical infrastructure for a metropolitan city: it plays a vital role in the inhabitants' daily life and environment. Moreover, a growing population puts an increasing number of vehicles on the road, often exceeding the infrastructure's total capacity and leading to more congestion and longer travel times. Because of severe traffic congestion, three billion gallons of fuel are wasted each year, and travelers spend almost seven billion extra hours in their cars (forty-two hours per traveler) [1]. Traffic jams have therefore become a serious problem in almost every metropolitan area because of population growth, increasing urbanization, inadequate traffic infrastructure, and inefficient traffic signal control. In the era of modern civilization it is practically impossible to stop urbanization, so more public transportation is needed to serve the growing population. Increasing transport capacity through road infrastructure construction could address the issue, but it is costly and time-consuming.
Improving traffic signal control, by contrast, could be an effective solution to this problem: reducing traffic congestion by just 1% with an efficient traffic signal control system can save billions per year [2]. Among the various traffic control systems, the adaptive traffic signal control (ATSC) system is the most appropriate solution, since it uses real-time traffic information from road intersections to make signal control decisions. Unlike fixed signal timing, it adjusts signaling dynamically based on incoming data [3], making it more suitable for reducing traffic jams. ATSC has caught the attention of researchers because it can intelligently regulate the traffic signal to minimize congestion, but the traffic controller must be trained correctly to produce proper signal timing. Recently, deep reinforcement learning (DRL) has been used to train the traffic controller in many studies [4][5][6][7].

Article Info. Article history: submitted 14 October 2021; revised 9 November 2021; accepted 19 November 2021; published online 31 December 2021. Keywords: reward function; adaptive traffic signal control; deep reinforcement learning; multi-objective reward function; signal timing.

Abstract. Adaptive traffic signal control (ATSC) systems can play an essential role in reducing traffic congestion in urban areas. The main challenge for ATSC is to determine proper signal timing. Recently, deep reinforcement learning (DRL) has been used to determine proper signal timing; however, the success of a DRL algorithm depends on an appropriate reward function design. Various reward functions for ATSC exist in prior research. This research presents a comprehensive analysis of the widely used reward functions: the pros and cons of various reward designs are discussed, and experimental analysis shows that a multi-objective reward function enhances the performance of ATSC.

In the traffic signal control problem the proper decision is unknown, and a decision can only be evaluated once its effect is observed. This type of problem can be solved by a deep Q-learning network (DQN), a type of DRL. In DRL, the traffic light controller observes the present traffic condition as a state and has a decision unit that decides which traffic phase to activate in a particular situation, i.e., which traffic light turns green. The action is applied according to the controller's decision; the traffic condition then changes, and based on the new condition a scalar reward value is obtained from the environment, as shown in Figure 1. In DRL, the impact of a decision is evaluated by the reward value: a high reward labels the decision as good, a low reward as bad. The traffic light controller is trained to make decisions that earn high reward values across different traffic conditions, changing traffic phases effectively to improve traffic flow. Thus, the reward value plays an essential role in DRL and acts as the evaluator of the action. The main objective of the DRL agent is to maximize the reward collected from the environment over time. More specifically, the reward function describes the problem the agent is trying to solve and defines the best possible performance; therefore, it is vital to how the agent learns optimal behavior [8].
Thus, selecting the reward function is one of the essential aspects of DRL for ATSC. Different traffic features have been utilized to define the reward function in existing DQN solutions [8]. Based on how many traffic features are used in its design, a reward function can be classified as single-objective or multi-objective. A reward function that optimizes a single traffic feature is a single-objective reward function; for example, queue length [5][9], cumulative delay [7], waiting time [4][10], and travel time parameters are widely used as single-objective rewards, and the work in [11] used the vehicle pressure of the road intersection. A multi-objective reward function, in contrast, optimizes more than one traffic parameter at a time by combining various traffic features in different ways. Since most real-world problems require multiple parameters to be optimized, multi-objective reward functions are more appropriate than single-objective ones, and many existing studies [12][13] experimentally confirm this. However, when conflicting or correlated parameters are combined in a multi-objective reward function, it does not optimize all of them properly. To solve this problem, the authors of [8] proposed the composite reward architecture (CRA), where each reward function is evaluated separately and the decision is made by majority vote. Before that, the hybrid reward architecture [14] decomposed a single reward function into n different reward functions, with the final reward calculated as the sum of the n rewards. Although different types of reward functions have been proposed in the literature, no existing work compares them.
This research evaluates six single-objective reward functions and three recently proposed multi-objective reward functions in different traffic scenarios. The comparative analysis concludes that multi-objective reward functions outperform single-objective ones, and that among the multi-objective functions, CRA [8] and the reward of [6] perform competitively.

II. Methods

Reinforcement learning (RL) is a machine learning technique that learns through experience. RL differs from supervised learning [15], which learns from a training set of labeled samples provided by a knowledgeable external supervisor. RL also differs from unsupervised learning [16], which is concerned with finding structure, such as clusters, in unlabeled data. Finding structure in an agent's experience may be helpful for RL, but unsupervised learning does not address the RL problem of optimizing a reward signal. RL, instead, tries to optimize a reward signal by trial and error: it is the process of learning a mapping from scenarios to actions so as to maximize a numeric reward value. In RL, an agent is deployed in an environment without any prior experience of how to behave. The agent perceives the environment, selects an action based on the environment's condition, and receives a reward based on the outcome of that action. The reward value is thus the environment's feedback on the action taken: if the environment's condition improves, the agent gets a positive reward; otherwise, it receives a negative reward. In this way, the agent distinguishes good actions from bad ones and gains experience iteratively. The expected reward value for each state-action pair is represented as a Q-value and stored in a Q-table, from which the agent can quickly determine which action is desirable in a particular state.
Any RL problem can be modeled as a Markov decision process (MDP), a mathematical framework for describing the environment in RL. An MDP can be represented as a tuple (S, A, T, R), where S is the set of states, A is the set of actions, T is a transition function, and R is the reward function. In any given state s ∈ S, selecting an action a ∈ A moves the environment to a new state s' ∈ S with probability T(s, a, s') ∈ [0, 1], and the environment returns a reward r = R(s, a). Rewarding the agent iteratively in this way generates a policy that maps states to actions (π: S × A → [0, 1]). The goal of an MDP is to find the best policy for the agent. The policy π defines the probability of selecting action a_t in state s_t, and it is learned so as to maximize the expected cumulative discounted reward over time. The discounted future reward R_t at time t is defined as

R_t = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ]   (1)

where γ is the discount factor, which controls the impact of future rewards. In ATSC there is a huge number of state-action pairs, so it is practically impossible to manage them in a Q-table [17]. This problem can be solved with a deep neural network (DNN), where a neural network is used for function approximation instead of a Q-table. This type of function approximator is widely used in studies where the set of state-action pairs is unbounded or unknown: the network consists of many weighted neurons and can generalize over unlimited state-action pairs. The formulation of ATSC with DRL is shown in Figure 1, where the road traffic scenario represents the environment. Traffic conditions such as queue length, waiting time, and halting number are used to represent the state, and the traffic controller acts as the agent. The neural network inside the agent takes the state values from the environment as input and produces Q-values; based on the Q-values, the agent generates the action.
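As a concrete illustration of these definitions, the sketch below (illustrative Python, not the paper's implementation; all names are ours) computes the discounted return of Equation (1) and performs one Q-value update toward the Bellman target that the function approximators discussed here are trained to match.

```python
def discounted_return(rewards, gamma=0.9):
    """Eq. (1): finite-horizon discounted sum of rewards, earliest first."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) one step toward the target r + gamma * max_a' Q(s', a').
    The dict `q` stands in for the neural function approximator."""
    target = reward + gamma * max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)

# A constant reward of 1 discounted by gamma=0.5 over four steps:
print(discounted_return([1, 1, 1, 1], gamma=0.5))  # 1.875

# One update from an empty table: 0 + 0.1 * (5 + 0.9*0 - 0) = 0.5
q = {}
q_update(q, "s0", 1, 5.0, "s1", actions=(0, 1))
print(q[("s0", 1)])
```

In practice the dict is replaced by a DNN precisely because, as noted above, the state-action space of ATSC is too large to enumerate.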
After the action, the condition of the environment changes, and the agent receives a numerical reward value as the evaluation of the action. In this way, the agent learns which actions are good or bad and adjusts the weights of the network accordingly.

The state is the agent's view of the environment at a specific timestep. In the literature, the state is designed in several ways; it is better to design a state with more information so that it represents the current traffic condition well. Some approaches use a single parameter as the state, others use multiple parameters: lane queue lengths, the number of halting vehicles, the waiting time of vehicles, and the average speed of vehicles have all been used to represent the state [6][7][8].

The action is the activity the agent performs in the environment. In the case of ATSC, the action is to change the traffic phase or stay in the current phase. An intersection could have two, four, or six traffic phases. For two-phase traffic, vehicles going east and west across the intersection form one phase, and vehicles approaching from north and south form the other; the traffic light is green for the east-west vehicles in one phase and for the north-south vehicles in the other. As a result, the agent has two distinct actions (A = {0, 1}), where 0 means stay on the current phase and 1 means change to the next phase. With these two actions, any number of traffic phases can be controlled.

Fig. 1. The formulation of ATSC with DRL

In RL, the reward reflects the environment's feedback after the agent has taken an action. The agent uses the reward value to assess the result of the action taken and to update the model for future action choices; the reward is therefore one of the most vital parts of the learning process. Usually, it is defined as a function of some performance metrics of the intersection, such as vehicle delay, queue length, waiting time, travel time, or throughput. There are multiple ways to define reward functions, and as a result various reward architectures have been designed in the literature, but none of this work compares the different reward functions. In this paper, we discuss the pros and cons of various existing reward functions and provide an experimental analysis of widely used rewards in different traffic scenarios. To make a comparative analysis among rewards, we first need to classify them. Based on the number of parameters and how the parameters are processed, the reward functions in the existing literature are designed in two basic ways: 1) single-objective and 2) multi-objective.

A reward function designed to optimize one parameter is known as a single-objective reward function. In this approach, the agent's goal is to optimize a single parameter, and it is rewarded based on how well it optimizes that one parameter. For example, in ATSC, queue length, delay, waiting time, travel time, etc. can each be used separately as the traffic parameter to optimize. Single-objective RL is depicted in Figure 2. The work in [5] used DRL with a DNN to learn the Q-function: a deep stacked autoencoder (SAE) neural network [18] estimates the Q-function and is trained to minimize the loss between the target and predicted Q-values. Its reward function considers only the queue lengths of the lanes:
r_t = | max_{i=1,2}{q_t^{e-w,i}, q_t^{w-e,i}} − max_{i=1,2}{q_t^{s-n,i}, q_t^{n-s,i}} |   (2)

where i indicates the lane number, e-w is the east-to-west direction, w-e west-to-east, n-s north-to-south, s-n south-to-north, and q_t is the queue length at time t. With the reward function in Equation (2), the learning process maximizes the queue-length difference between the WE and NS axes: if the difference is high, the agent gets a high reward. However, this difference does not indicate the smoothness of traffic flow. For example, if the WE queue length is 100 units and SN is 20 units, the reward value is 80; if the WE queue is 40 units and SN is 40 units, the reward is zero. Although the second scenario represents a more balanced traffic flow, it gets a meager reward, so this reward function gives the agent the wrong feedback.

Another ATSC applying DRL is proposed in [7]. This work introduced a dense, information-rich state space with a discrete encoding of traffic features, called discrete traffic state encoding (DTSE). A deep convolutional neural network trains the Q-learning agent with experience replay [19]. The difference of cumulative delays (Equation (3)) is the reward function in this work:

r_t = D_{t−1} − D_t   (3)

where D_{t−1} is the cumulative delay at time t − 1 and D_t at time t. Delay is defined in Equation (4):

D = 1 − (average speed of vehicles in lanes) / (maximum allowed lane speed)   (4)

With this reward function, the agent tends to push the average speed of vehicles toward the maximum allowed speed. The average speed of vehicles is influenced by road infrastructure, road occupancy, and similar factors; a high average speed can be a metric for smooth traffic flow, but it alone cannot guarantee it. Further, another DRL algorithm for ATSC is proposed in [4]. That work used experience replay and a target network to improve the stability of the algorithm.
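To make the single-objective formulas above concrete, here is a small sketch (illustrative Python; the lane lists and values are our own, not tied to SUMO or to the cited implementations) of the rewards in Equations (2) through (4):

```python
def queue_difference_reward(q_we, q_ew, q_ns, q_sn):
    """Eq. (2): |max queue on the WE axis - max queue on the NS axis|.
    Each argument is a list of per-lane queue lengths."""
    return abs(max(q_we + q_ew) - max(q_ns + q_sn))

def delay(avg_speed, max_speed):
    """Eq. (4): D = 1 - average speed / maximum allowed speed."""
    return 1.0 - avg_speed / max_speed

def delay_difference_reward(d_prev, d_now):
    """Eq. (3): r_t = D_{t-1} - D_t, positive when delay decreased."""
    return d_prev - d_now

# The counterexample to Eq. (2) given above: queues of 100 vs 20 score 80,
# while the more balanced 40 vs 40 scores 0.
print(queue_difference_reward([100], [90], [20], [10]))  # 80
print(queue_difference_reward([40], [35], [40], [35]))   # 0
```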
Fig. 2. Single-objective RL

The difference in the vehicles' cumulative waiting time before and after the action, shown in Equation (5), is used as the reward function in [4]; the same reward function is used in another DRL-based approach proposed in [10]:

r_t = W_{t−1} − W_t   (5)

where W_{t−1} is the waiting time of the vehicles at time t − 1 and W_t the waiting time at time t. With this reward function, the agent tries to minimize the vehicles' waiting time. Although waiting time is one of the most dominant indicators of smooth traffic flow, waiting time alone is not an adequate measure of efficient traffic flow. For example, a vehicle moving at a very low speed has a waiting time of zero (waiting time is calculated only for halting vehicles), yet it is clearly not experiencing a smooth journey.

From the above analysis we can conclude that a single parameter alone cannot guarantee good performance. The traffic signal control problem is thus a multi-objective problem in which multiple parameters need to be optimized for better traffic flow; a single-objective reward function, which optimizes just one parameter, is unsuitable, and a multi-objective reward function is preferable for ATSC.

A reward function is called multi-objective when it is designed to optimize multiple parameters at a time. Since most real-world problems are multi-objective, this type of reward function is more suitable than a single-objective one. In this approach, multiple parameters are combined into the reward function, and the learning agent optimizes the set of objectives simultaneously. Multi-objective RL is depicted in Figure 3.
In traffic signal control, queue length, delay, waiting time, travel time, and fuel consumption parameters can be combined with different weights for multi-objective optimization. The general form of the multi-objective reward function is the weighted sum of the traffic features, expressed in Equation (6):

r_t = Σ_{i=1}^{n} w_i × tf_i   (6)

where w_i is the weight of the i-th traffic feature and tf_i is the traffic feature. The work in [19] investigated learning control policies for traffic lights. It introduced a new reward function that considers the number of teleports j, the number of action switches c, the number of emergency stops e, the sum of delays d, and the sum of waiting times w as parameters:

r_t = −0.1c − 0.1 Σ_{i=1}^{N} j_i − 0.2 Σ_{i=1}^{N} e_i − 0.3 Σ_{i=1}^{N} d_i − 0.3 Σ_{i=1}^{N} w_i   (7)

The first three terms of the equation do not affect the DRL process [20], so the feedback returned by this reward function may be misleading. A DRL agent with a multi-objective reward function is proposed in [6]. That research used a new DNN, with a separate branch for each traffic phase, to decide from the measured Q-values whether to change the current traffic phase or keep it: one branch of the network is active in one traffic phase and another branch in the other. The reward function is the weighted sum of the sum of delays D, queue length L, the sum of updated waiting times W, the total number of vehicles passed N, an indicator of light switches C, and the total travel time T of the passing vehicles:

r_t = −0.25 Σ_{i∈l} L_i − 0.25 Σ_{i∈l} D_i − 0.25 Σ_{i∈l} W_i − 5C + N + T   (8)

Fig. 3. Multi-objective RL

In Equation (8), the indicator of light switches C has a very large negative coefficient even though it does not affect the DRL process [20]. Furthermore, the total travel time T is added to the equation.
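The weighted-sum construction of Equation (6) can be sketched as follows (illustrative Python; the feature values are invented, and only the weights are taken from Equation (8)):

```python
def weighted_sum_reward(weights, features):
    """Eq. (6): r_t = sum_i w_i * tf_i, a dot product of weights and features."""
    assert len(weights) == len(features)
    return sum(w * f for w, f in zip(weights, features))

# Eq. (8) weights for (queue length, delay, waiting time, light switch,
# vehicles passed, travel time). Note the criticism above: the +T term
# rewards *longer* travel times.
weights = (-0.25, -0.25, -0.25, -5.0, 1.0, 1.0)
features = (12.0, 3.0, 8.0, 1.0, 6.0, 40.0)  # invented measurements
print(weighted_sum_reward(weights, features))  # 35.25
```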
it indicates that the agent will get a higher reward whenever the travel time increases. however, a higher value of travel time indicates terrible traffic conditions. these two factors make the feedback of this reward function misleading. further, the authors tested the approach on both simulation data and data from the real world. this model achieves state-of-the-art efficiency on most measures. nonetheless, the authors point out that, due to the large volume of data collected for agent training, the study has substantial limitations for real-world application. a new multi-objective reward function for traffic light optimization is proposed in [21]. it used deep q-learning with a policy gradient approach to solve the rl problem. the reward function is designed by the following equation

r_t = D_{t-1} − D_t + Occupancy / (Number of halting vehicles + c) (9)

here, the constant c is used to avoid division by zero. further, other multi-objective reward functions are proposed in [22][23][24]. most reward functions are designed using the weighted sum of traffic features (equation 6). however, if the traffic features conflict with and/or are correlated with each other, a multi-objective reward function calculated by combining traffic features does not provide an optimal solution [12][13]. in order to solve this problem, the cra [8] calculates a reward function for each traffic parameter separately and then combines the decisions of the multiple reward functions by using a majority voting approach. in this multi-objective approach, each objective has its own reward function, producing a decision; the decision chosen by the majority of the reward functions is selected from the multiple decisions. since the reward functions are calculated separately, dependency or conflict among objectives does not hamper the final decision. however, this approach ignores the decisions of minority groups. the cra approach is depicted in figure 4.
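the cra decision step described above can be sketched as follows; this is only a schematic of the majority vote with illustrative per-objective q-values, not the implementation of [8]:

```python
from collections import Counter

def cra_decision(q_per_objective):
    """Majority-vote decision step of the composite reward architecture:
    each objective's reward function yields its own action-value list,
    each objective votes for its best action, and the majority action wins."""
    votes = [max(range(len(q)), key=q.__getitem__) for q in q_per_objective]
    return Counter(votes).most_common(1)[0][0]

# three objectives vote over two actions (0 = keep phase, 1 = switch);
# two of the three prefer switching, so the agent switches
action = cra_decision([[0.2, 0.9], [0.8, 0.1], [0.3, 0.7]])
```

because each objective keeps its own value estimate, a conflicting objective can only cast one outvoted vote rather than corrupt a shared scalar reward.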
hra [14] is a particular type of reward function, a different form from the single and multi-objective reward functions. in hra [14], a task is divided into multiple sub-tasks, and there is a reward function for each sub-task. the final reward is calculated by adding the rewards of all sub-tasks. the hra [14] is depicted in figure 5.

fig. 4. composite reward architecture for rl

fig. 5. hybrid reward architecture for rl

iii. results and discussions

we conduct several experiments in different traffic scenarios and provide a comparative analysis of the performance of different reward functions. a traffic micro-simulator named simulation of urban mobility (sumo) is used with its python api to create the simulation environment. a four-way road intersection is used where each road has two incoming and two outgoing lanes. the highest speed limit is 70 km/h for the incoming lanes and 40 km/h for the outgoing lanes. the road length is set to 300 meters in the simulation. the vehicles are allowed to pass the intersection along four different routes: (1) from west to east (w-e); (2) from north to south (n-s); (3) from east to west (e-w); and (4) from south to north (s-n). a dnn is used as a function approximator in the learning process of the simulation. the network structure proposed in [8] is used as the function approximator for all reward functions except [6]. the network structure of [8] is generic enough to fit any number of states and reward functions. for example, if there are n states and m incoming lanes, there will be n×m nodes in the input layer. the number of nodes in the output layer depends on the number of reward functions (x) and the number of actions (y), giving x×y nodes. the nodes of the hidden layers need to be adjusted according to the number of input and output nodes. for [6], the authors' own network structure is used to implement their reward function.
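as a small illustrative sketch (not code from [8]), the input/output sizing rule above can be expressed as; the example numbers are assumptions, not values from the paper:

```python
def io_layer_sizes(n_states: int, m_lanes: int, x_rewards: int, y_actions: int):
    """Sizing rule of the generic network from [8], as described above:
    n*m input nodes and x*y output nodes; hidden-layer widths are then
    tuned between these two sizes."""
    return n_states * m_lanes, x_rewards * y_actions

# e.g. 4 state features per lane, 8 incoming lanes, 4 reward functions, 2 actions
in_nodes, out_nodes = io_layer_sizes(4, 8, 4, 2)
```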
the parameters used in the learning process are listed in table 1. different traffic scenarios are generated using synthetic data to compare the different reward functions. these traffic scenarios are widely used in the literature [6][8][21]. a total of five configurations are utilized. the first configuration represents a steady traffic flow with a low traffic rate in all directions. the second configuration represents an unstable traffic flow, where the vehicles' arrival rate in the east-west direction is half that of the north-south direction. the third configuration maintains a steady flow with heavy traffic. the fourth configuration combines the first three configurations to represent an actual live traffic condition with low, heavy, and unstable traffic at different times. the fifth configuration represents the traffic flow for a whole day, considering the traffic variation from 6 am to 12 am. generally, the traffic flow starts with low pressure and increases as the day progresses. it peaks at 9–10 am and then gradually decreases around noon. another peak is created in the evening, which lasts longer than the morning peak and gradually decreases toward midnight. these five configurations are listed in table 2. to get a smooth traffic flow, we want a minimum number of vehicles halting on the road, less time spent waiting on the road, and destinations reached in minimum time. the queue length on the road is also expected to be minimal. as a result, halting number, waiting time, queue length, and travel time are used as parameters to evaluate the different reward functions. the pseudocode for the reward functions comparison is as follows.

pseudocode for reward functions comparison
  set n = no. of states, m = no. of lanes, x = no. of reward functions, and y = no. of actions
  implement a dnn with n×m nodes in the input layer and x×y nodes in the output layer
  implement the halting number reward r_HN,t = HN_{t-1} − HN_t
  implement the waiting time reward r_W,t = W_{t-1} − W_t
  implement the travel time reward r_T,t = T_{t-1} − T_t
  implement the delay reward r_D,t = D_{t-1} − D_t
  implement hra using [14]
  implement intellilight [6]: r_t = −0.25 Σ_{i∈l} L_i − 0.25 Σ_{i∈l} D_i − 0.25 Σ_{i∈l} W_i − 5C + N + T
  implement cra using [8]
  implement the avg. queue length reward r_t = |max_{i=1,2}{q_t^{e-w,i}, q_t^{w-e,i}} − max_{i=1,2}{q_t^{s-n,i}, q_t^{n-s,i}}|
  implement metalight [25]
  calculate halting number, waiting time, queue length, and travel time for all the implemented reward functions and compare them

we have compared a total of nine reward functions, six of which are single objective and the remaining three multi-objective. a total of five traffic scenarios are used to compare the reward functions. the results of the comparisons are depicted in fig. 6 to fig. 9; lower values are better for every performance metric in the following figures. the halting number for all configurations is depicted in figure 6. the result shows that cra [8] and intellilight [6] perform competitively in optimizing halting numbers. among the five traffic configurations, cra performs best in configurations 1, 2, and 3, and intellilight [6] performs best in configurations 4 and 5. halting number, hra [14], presslight [11], cra [8], and waiting time take second place for configurations 1, 2, 3, 4, and 5, respectively. the result of the waiting time comparison for all configurations is shown in figure 7. the figure shows that cra [8] performs best in waiting time for all configurations except configuration 5, where presslight [11] takes first place. there is no single second-best winner across all configurations.
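the four single-objective baselines in the pseudocode all share the same difference form, and the average queue length reward compares the busiest opposing approaches; a small illustrative sketch with hypothetical measurement values:

```python
def difference_reward(prev: float, curr: float) -> float:
    """r_t = f_{t-1} - f_t for a traffic feature f (halting number,
    waiting time, travel time, or delay)."""
    return prev - curr

def queue_imbalance_reward(q_ew, q_we, q_sn, q_ns):
    """Avg. queue length reward from the pseudocode:
    |max of the e-w/w-e lane queues - max of the s-n/n-s lane queues|;
    each argument is a list of per-lane queue lengths."""
    return abs(max(q_ew + q_we) - max(q_sn + q_ns))

# hypothetical one-step measurements (previous, current)
rewards = {
    "halting_number": difference_reward(12, 9),
    "waiting_time": difference_reward(340.0, 300.0),
    "travel_time": difference_reward(95.0, 97.0),
    "delay": difference_reward(50.0, 45.0),
}
balance = queue_imbalance_reward([3, 5], [4, 2], [1, 2], [2, 1])
```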
for example, hra [14] performs second after cra [8] in configuration 1, travel time gives the second-best result in configuration 2, and intellilight [6], metalight [25], and cra [8] are the second-place winners for configurations 3, 4, and 5, respectively. the queue length comparison for all configurations is shown in figure 8. like the previous result, there is no single winner for all configurations. intellilight [6] wins for configurations 4 and 5, cra [8] shows the best result in configuration 3, and travel time wins for configuration 3. the comparative analysis for travel time is shown in figure 9. the figure shows that the average travel time for the fifth configuration is higher than the others; the reason is that configuration 5 covers a longer time (18 hours) than the others. for configuration 1, hra [14] gives the best result, although the others provide very competitive results. cra [8] wins in configurations 2, 3, 4, and 5.

table 1. agent's parameters
  minimum time of the actions: 5
  learning rate: 0.045
  memory size: 1200
  sample size: 400
  training epochs: 550
  discount factor γ: 0.9
  ε for exploration: 0.05

table 2. traffic configurations (routes, arrival rate in vehicles/second, start time, end time)
  configuration 1: w-e 0.2, 0 s – 7200 s; n-s 0.2, 0 s – 7200 s
  configuration 2: w-e 0.2, 0 s – 7200 s; n-s 0.4, 0 s – 7200 s
  configuration 3: w-e 0.4, 0 s – 7200 s; n-s 0.4, 0 s – 7200 s
  configuration 4: configuration 1, 0 s – 7200 s; configuration 2, 7201 s – 14401 s; configuration 3, 14401 s – 21600 s
  configuration 5: (w-e, n-s) arrival rates of (0.225, 0.225) for 06:00–08:00 am, (0.225, 0.388) for 08:00–10:00 am, (0.416, 0.416) for 10:00 am–12:00 pm, (0.388, 0.225) for 12:00–02:00 pm, (0.225, 0.225) for 02:00–04:00 pm, (0.225, 0.388) for 04:00–06:00 pm, (0.416, 0.416) for 06:00–10:00 pm, and (0.225, 0.225) for 10:00 pm–12:00 am

thus, the results conclude that no reward function shows the best result for all parameters.
in most cases, either cra [8] or intellilight [6] provides the best performance, and both of them are multi-objective. in some specific cases, a single objective reward function such as presslight [11] or travel time provides the best result. thus, from the comparative study, we can conclude that a multi-objective reward function is preferable to a single objective reward function. in addition, among the multi-objective reward functions, cra [8] and intellilight [6] perform better across all the configurations.

fig. 6. comparison of results in terms of average halting number

fig. 7. comparison of results in terms of average waiting time (s)
iv. conclusion

the success of drl highly depends on the reward function since it is used to evaluate the agent's actions. if the reward function gives wrong feedback to the agent, the agent will not learn properly. thus, we need to analyze which reward functions perform well in atsc before using or designing a reward function. in this paper, we have analyzed widely used reward functions and experimentally shown that a multi-objective reward function is more suitable for atsc and that, among the different multi-objective reward functions, cra and intellilight perform well compared with the others. in the future, we plan to investigate how the network structure impacts the design of reward functions.

fig. 8. comparison of results in terms of average queue length

fig. 9. comparison of results in terms of average travel time

declarations

author contribution
all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest
the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information
reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references

[1] d. schrank, b. eisele, t. lomax, and j. bak, “2015 urban mobility scorecard,” texas a&m transportation institute and inrix, inc., usa, 2015.
[2] t. economist, “the cost of traffic jams.” https://www.economist.com/the-economist-explains/2014/11/03/the-cost-of-traffic-jams, 2014. accessed: 2020-02-17.
[3] m. alam, j. ferreira, and j. fonseca, “introduction to intelligent transportation systems,” in intelligent transportation systems, pp. 1–17, springer, cham, 2016.
[4] j. gao, y. shen, j. liu, m. ito, and n.
shiratori, “adaptive traffic signal control: deep reinforcement learning algorithm with experience replay and target network,” arxiv preprint arxiv:1705.02755, 2017.
[5] l. li, y. lv, and f. y. wang, “traffic signal timing via deep reinforcement learning,” ieee/caa journal of automatica sinica, vol. 3, no. 3, pp. 247–254, jul. 2016.
[6] h. wei, g. zheng, h. yao, and z. li, “intellilight: a reinforcement learning approach for intelligent traffic light control,” proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, pp. 2496–2505, july 2018.
[7] w. genders and s. razavi, “using a deep reinforcement learning agent for traffic signal control,” arxiv preprint arxiv:1611.01142, 2016.
[8] a. r. m. jamil, k. k. ganguly, and n. nower, “adaptive traffic signal control system using composite reward architecture based deep reinforcement learning,” iet intelligent transport systems, vol. 14, no. 14, pp. 2030–2041, dec. 2020.
[9] s. lange and m. riedmiller, “deep auto-encoder neural networks in reinforcement learning,” the 2010 international joint conference on neural networks (ijcnn), pp. 1–8, july 2010.
[10] x. liang, x. du, g. wang, and z. han, “deep reinforcement learning for traffic light control in vehicular networks,” arxiv preprint arxiv:1803.11115, 2018.
[11] c. chen et al., “toward a thousand lights: decentralized deep reinforcement learning for large-scale traffic signal control,” proceedings of the aaai conference on artificial intelligence, vol. 34, no. 04, pp. 3414–3421, apr. 2020.
[12] a. r. m. jamil, k. k. ganguly, and n. nower, “an experimental analysis of reward functions for adaptive traffic signal control system,” proceedings of the international conference on distributed sensing and intelligent system (icdsis), springer, 2020.
[13] p. mannion, j. duggan, and e. howley, “an experimental review of reinforcement learning algorithms for adaptive traffic signal control,” autonomic road transport support systems.
springer, pp. 47–66, 2016.
[14] h. van seijen et al., “hybrid reward architecture for reinforcement learning,” proceedings of the advances in neural information processing systems, pp. 5392–5402, 2017.
[15] s. b. kotsiantis, i. zaharakis, and p. pintelas, “supervised machine learning: a review of classification techniques,” proceedings of the 2007 conference on emerging artificial intelligence applications in computer engineering, vol. 160, pp. 3–24, 2007.
[16] h. b. barlow, “unsupervised learning,” neural computation, vol. 1, no. 3, pp. 295–311, 1989.
[17] s. lange and m. riedmiller, “deep auto-encoder neural networks in reinforcement learning,” in the 2010 international joint conference on neural networks (ijcnn), pp. 1–8, ieee, 2010.
[18] t. schaul, j. quan, i. antonoglou, and d. silver, “prioritized experience replay,” arxiv preprint arxiv:1511.05952, 2015.
[19] e. van der pol and f. a. oliehoek, “coordinated deep reinforcement learners for traffic light control,” proceedings of learning, inference and control of multi-agent systems (nips), 2016.
[20] j. van dijk, “recurrent neural networks for reinforcement learning: an investigation of relevant design choices,” 2017.
[21] m. coskun, a. baggag, and s. chawla, “deep reinforcement learning for traffic light optimization,” in 2018 ieee international conference on data mining workshops (icdmw), pp. 564–571, ieee, 2018.
[22] m. a. khamis, w. gomaa, and h. el-shishiny, “multi-objective traffic light control system based on bayesian probability interpretation,” in 2012 15th international ieee conference on intelligent transportation systems, pp. 995–1000, ieee, 2012.
[23] m. a. khamis and w. gomaa, “adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework,” engineering applications of artificial intelligence, vol. 29, pp. 134–151, 2014.
[24] a. vidali, l. crociani, g. vizzari, and s. bandini, “a deep reinforcement learning approach to adaptive traffic lights management,” in proceedings of the 20th workshop “from objects to agents,” parma, italy, 2019.
[25] x. zang, h. yao, g. zheng, n. xu, k. xu, and z. li, “metalight: value-based meta-reinforcement learning for traffic signal control,” proceedings of the aaai conference on artificial intelligence, vol. 34, no. 01, pp. 1153–1160, apr. 2020.
knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 1, december 2022, pp.
17–26 eissn 2597-4637 https://doi.org/10.17977/um018v5i12022p17–26 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

social distancing monitoring system using deep learning

amelia ritahani ismail*, nur shairah muhd affendy, ahsiah ismail, asmarani ahmad puzi
department of computer science, kulliyyah of ict, international islamic university malaysia, 53100 kuala lumpur, malaysia
amelia@iium.edu.my*
*corresponding author

i. introduction

during the covid-19 pandemic, social distancing is a feasible approach to reducing the spread of the virus [1]. social distancing means keeping a safe space between people to prevent the spread of a contagious disease. during the pandemic, most countries have enforced social distancing as one of the standard operating procedures (sop); hence, social distancing has become a new normal. however, some individuals do not take this measure seriously or are unaware of their surroundings, making it harder for the authorities to enforce the sop [2][3][4]. at the early phase of the movement control order (mco) in malaysia, the national security council reported that although 92% of citizens complied with the sop during the mco, most failed to observe social distancing [3]. furthermore, during the recovery movement control order (rmco), the number of sop violators kept increasing, and some premises were compounded because of social distancing violation offenses [4]. the current sop inspections are made manually by the authorities, which requires a workforce for the observations, involving police officers, the people's volunteer corps (rela), and the city council [4].
with the help of real-time object detection using a deep neural network, the sop can be monitored remotely by the authorities, hence improving the efficiency of the inspections, especially in crowded places. a social distancing detection system requires an object detection system that can detect a person automatically. previous studies have investigated various deep learning algorithms based on convolutional neural networks (cnn) for object detection systems, such as faster r-cnn, ssd, and yolov4.

article info
article history: received 19 october 2021; revised 25 december 2021; accepted 15 august 2022; published online 7 november 2022

abstract
covid-19 was declared a pandemic in 2020. one way to prevent covid-19 disease, as the world health organization (who) suggests, is to keep a distance from other people: it is advised to stay at least 1 meter away from others, even if they do not appear to be sick, since people can also carry the virus without having any symptoms. thus, many countries have enforced the rules of social distancing in their standard operating procedure (sop) to prevent the virus's spread. monitoring social distance is challenging, as it requires authorities to carefully observe the social distancing of every single person in a surrounding, especially in crowded places. real-time object detection can be proposed to improve the efficiency of monitoring the social distance sop inspection. therefore, in this paper, object detection using a deep neural network is proposed to help the authorities monitor social distancing even in crowded places. the proposed system uses the you only look once (yolo) v4 object detection model for the detection. the proposed system is tested on the ms coco image dataset with a total of 330,000 images.
the performance in terms of mean average precision (map) accuracy and frames per second (fps) of the proposed object detection is compared with the faster region-based convolutional neural network (r-cnn) and multibox single shot detector (ssd) models. finally, the results are analyzed among all the models. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: deep learning; object detection; social distancing

a.r. ismail et al. / knowledge engineering and data science 2022, 5 (1): 17–26

many research works have been done to promote social distancing during the pandemic. in object detection applications, person detection is crucial for detecting the social distancing between people. a new network structure, yolo-r, was introduced by lan et al. (2018) to improve the network structure of the yolov2 algorithm in detecting pedestrians [5]. three passthrough layers are added to the yolov2 network to extract shallow-layer pedestrian features, and the shallow-layer features extracted from the route layer of the original algorithm are moved from the 16th layer to the 12th layer, combining shallow-layer features with deep-layer features to extract more fine-grained features. the dataset used for the model is the inria dataset, which consists of 2416 images for training and 1126 images for testing. the comparison between yolov2 and yolo-r shows that yolo-r performs better than the yolov2 model: the precision for yolov2 is 97.37% and yolo-r's is 98.56%, and the two algorithms' recall is 89.33% and 91.21%, respectively. the miss rate of the yolo-r network model (10.05%) is also lower than that of the yolov2 model (11.29%). a study presented the monitoring of covid-19 social distancing with person detection and tracking, using yolov3 for person detection and deepsort for person tracking [6].
the yolo v3 object detection model was used to detect persons, and deepsort was used to track the identified people and assign ids. apart from yolo v3, the faster r-cnn and ssd algorithms were also used to compare the performance of people detection in the real-time video surveillance system. as mentioned, the deepsort technique is used to track custom objects in the video and is an extension of sort (simple online and realtime tracking). for effective tracking, the kalman filter and the hungarian algorithm are used, and the mahalanobis distance is included to calculate the social distance between people. the distance calculation is computed based on a 3d feature space obtained using centroid coordinates and a bounding box. the dataset used for the model is from the open images dataset (oid) repository by the google open-source community, consisting of 800 images divided into an 8:2 ratio for training and testing. the model was then tested on surveillance footage of the oxford town center. among faster r-cnn, ssd, and yolo v3, yolo v3 achieved the best results for object detection, with balanced map and fps scores. faster r-cnn works on region proposals to create boundary boxes indicating objects and has shown better accuracy, but its slow fps makes it unsuitable for real-time detection. the ssd algorithm improves on the fps of faster r-cnn by using multiscale features and default boxes in a single process for real-time processing. the results for the mentioned models are 96.9% map at 3 fps for faster r-cnn, 69.1% map at 10 fps for ssd, and 84.6% map at 23 fps for yolo v3. a further study proposed an ai-based real-time social distancing detection and warning system using a monocular camera and deep learning-based real-time object detectors to measure social distancing during the pandemic [7]. a pre-trained deep convolutional neural network (cnn) is used to detect individuals; the detectors are faster r-cnn and yolov4, trained using the ms coco dataset.
the distance between pedestrians is calculated using the euclidean distance after mapping the image coordinates to real-world coordinates. three experiments were conducted in three different places using the oxford town center dataset (an urban street), the mall dataset (an indoor mall), and the train station dataset (new york city grand central terminal). both detectors, faster r-cnn and yolov4, achieve real-time performance, shown by map scores across the three places of 42.1%–42.7% and 41.2%–43.5%, respectively. this research is proposed to develop a social distancing monitoring system based on a deep neural network, evaluate the model performance, and develop a monitoring system for the authorities to observe social distancing in a specific place. the object detection uses the faster r-cnn, ssd, and yolov4 algorithms to detect the object (person) on the microsoft common objects in context (ms coco) dataset [8]. next, the system calculates the distance between two persons and identifies the number of violations in one place. the expected outcome of this research is to determine which detection algorithm performs better in monitoring social distancing for the authorities. the research develops a social distancing detection and monitoring system based on a deep neural network. furthermore, it evaluates the object detection model performance and compares the detection model performance for a social distancing monitoring system. this research investigates a detection algorithm suitable for social distancing monitoring to assist the authorities in observing social distancing on-premises. the system will detect the social distance between people and show the level of violation on the premises, so the authorities are prepared if any action should be taken.
it also can improve the efficiency of the authorities' inspections and encourage people to abide by the rules. the remainder of this article is organized as follows: the methodology section describes the approach taken to monitor social distancing. the results and discussion section presents the result for each algorithm in detecting acceptable social distancing practice and analyzes the results obtained. the conclusion section draws the conclusions of the present work and outlines future work for improving the present investigation. ii. methods figure 1 illustrates the methodology flowchart for the research. this research is proposed to assist the authorities in observing social distancing during the pandemic. the yolov4, faster r-cnn, and ssd deep neural network algorithms are applied for person detection with the ms coco dataset, and the results are analyzed to find the most suitable algorithm for the social distancing system. a. data preparation the ms coco image dataset [8] is used to evaluate the performance of the proposed system. it is a large-scale object detection, segmentation, and captioning dataset. the ms coco dataset consists of 330,000 images with 80 object categories, including 64,115 images for the person category. for this research, 10,000 images were randomly selected from the coco person images. the images were downloaded using the coco application programming interface (api) with a filtered category (person). the coco api assists in loading, parsing, and visualizing annotations in the dataset. figure 2 shows examples taken from the ms coco dataset for the person category. each image in the dataset has annotations provided that can be extracted with the coco api. for faster r-cnn and ssd, the annotations were converted into pascal voc format, while for the yolov4 model, the annotations need to be converted to yolo format to fit the model. the annotation type used for the model is bounding boxes. in coco format, the bounding box was fig. 1.
methodology flowchart based on machine learning process. displayed as [x, y, width, height], where x and y are the top-left corner of the bounding box, followed by its width and height. the yolo format is displayed as [x, y, width, height] with values normalized to the image size, where x and y are the center of the bounding box, followed by its width and height. figure 3 depicts annotation examples of the bounding boxes for the person images. b. data modeling the main idea of r-cnn is composed of two steps. girshick et al. (2014) proposed using selective search to extract the regions in the image to identify the region of interest (roi) and extract the features from each region for classification [9]. girshick (2015) proposed an improvement of r-cnn called fast r-cnn after identifying some drawbacks of the previous r-cnn. the approach is similar to r-cnn, but instead of feeding the region proposals to the cnn, the input image is fed to the cnn to generate a convolutional feature map from which the region proposals are identified [10]. both r-cnn and fast r-cnn use selective search to find the region proposals [9][10]. therefore, ren et al. (2016) [11] eliminated the selective search process and let the network learn the region proposals. faster r-cnn is the improvement of fast r-cnn comprising two modules. based on figure 4, the first module is a feature extraction network consisting of deep convolutional layers, and the second is a fast r-cnn detector based on the regions proposed in the first module. the second module contains two subnetworks: the region proposal network (rpn) and the classifier. using the rpn in faster r-cnn has improved the efficiency of the detection. it is a fully convolutional network that is trained to predict object boundaries and scores for each detection. in short, the second module generates object proposals, followed by the classifier predicting the actual class of the object [11]. fig. 2.
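the coco-to-yolo annotation conversion described above can be sketched as follows; the function name is illustrative, assuming the usual coco convention of a top-left corner plus width and height in pixels, and the yolo convention of a normalized center point plus width and height:

```python
def coco_to_yolo(box, img_w, img_h):
    """convert a coco bounding box [x, y, width, height] (top-left corner, pixels)
    to yolo format [x_center, y_center, width, height], normalized to [0, 1]."""
    x, y, w, h = box
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

# a 50x100 box whose top-left corner is at (100, 200) in a 640x480 image
yolo_box = coco_to_yolo([100, 200, 50, 100], 640, 480)
```

the same corner-to-center arithmetic, applied in reverse, recovers the coco box from a yolo annotation.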
examples of coco person images dataset fig. 3. examples of bounding boxes annotations. unlike faster r-cnn, which uses two modules, ssd has no dedicated region proposal network; it predicts the classes directly from feature maps using small convolutional filters. ssd is designed for real-time object detection. it applies multiscale features and default boxes to improve accuracy. figure 5 shows that the vgg16 network is used to extract feature maps from the input image, applying 3×3 convolution filters for each cell to make predictions. six additional convolutional layers then follow the vgg16; five of them are used for object detection, and predictions are made from six layers [12]. instead of selecting parts of an image for prediction, yolo predicts classes and bounding boxes for the whole image in one run of the algorithm and is mainly used for real-time object detection. yolo predicts the object based on bounding boxes and class probabilities for the boxes that define whether an object is present or not. the general yolo system consists of three steps. first, get the input image and divide it into grids. second, run the convolutional network on the image to predict the bounding boxes and their class probabilities. finally, it applies non-max suppression, which cleans up multiple detections by keeping the highest probability [13]. yolov2 was introduced to improve the initial yolo detection by altering the layers in yolo [14]. yolov3 is built on yolov2 with several improvements. among them, it makes detections at three scales, obtained by downsampling the input dimensions by 32, 16, and 8. in yolov3, the detection is done by applying 1×1 detection kernels, generated by the convolutional network, on feature maps of three different sizes at three different places in the network [15].
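the non-max suppression step mentioned in the yolo pipeline can be illustrated with a minimal greedy sketch; the detection tuples and the 0.5 threshold are illustrative assumptions (real yolo implementations apply this per class):

```python
def nms(detections, iou_threshold=0.5):
    """greedy non-max suppression: keep the highest-scoring box, drop boxes
    that overlap it too much, repeat. detections are (score, (x1, y1, x2, y2))."""
    def iou(a, b):
        # intersection-over-union of two corner-format boxes
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    kept = []
    for score, box in sorted(detections, reverse=True):
        # keep this box only if it does not heavily overlap an already-kept box
        if all(iou(box, k) < iou_threshold for _, k in kept):
            kept.append((score, box))
    return kept

# two near-duplicate detections of one person plus one distinct detection
dets = [(0.9, (0, 0, 10, 10)), (0.6, (1, 1, 11, 11)), (0.8, (50, 50, 60, 60))]
kept = nms(dets)
```

the 0.6 detection is suppressed because it overlaps the 0.9 box above the threshold, leaving two detections.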
the latest yolo, yolov4, as shown in figure 6, is an improved architecture of the previous yolo versions consisting of four blocks: backbone, neck, dense prediction, and sparse prediction. the backbone block refers to the feature extraction architecture, and the neck adds extra layers between blocks. the head comprises dense prediction and sparse prediction to locate bounding boxes and classify what is inside each box [16].
fig. 4. faster r-cnn model architecture [11]
fig. 5. ssd model architecture [12]
a comparison of the speed and accuracy of object detectors on the ms coco dataset is shown in table 1, based on bochkovskiy et al. [16]. faster r-cnn shows a good map value but the lowest speed compared to the other detectors. ssd has the fastest speed for 300×300 image resolutions, though with a lower map value than faster r-cnn. the emergence of yolov4, with the highest map value and balanced speed, makes it a better detector than the others. the algorithm used for person detection is yolov4, where the architecture of the network is imported from darknet for model training. the platform for the training is google colab, which has the following specifications: • cpu: intel(r) xeon(r) cpu @ 2.20ghz • gpu: tesla t4 • ram: 12gb the models for faster r-cnn and ssd were taken from the tensorflow object detection api for transfer learning. faster r-cnn was trained using inceptionv2 as the backbone. inceptionv2 is the improvement of inceptionv1, wherein two 3×3 convolutions replace each 5×5 convolution. this decreases computational time and thus increases computational speed, because a 5×5 convolution is 2.78 times more expensive than a 3×3 convolution. to sum up, using two 3×3 layers instead of one 5×5 layer improves the performance of the architecture [17]. meanwhile, ssd was trained using mobilenetv2 as the backbone for the algorithm.
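the 2.78 figure quoted above can be checked with a quick per-output-pixel multiply count (a single input/output channel at stride 1 is assumed for simplicity):

```python
# per-output-pixel multiply counts for one input/output channel
cost_5x5 = 5 * 5  # 25 weights per output position
cost_3x3 = 3 * 3  # 9 weights per output position

# one 5x5 kernel vs one 3x3 kernel: the ratio quoted in the text
ratio = cost_5x5 / cost_3x3            # 25 / 9 ≈ 2.78

# two stacked 3x3 layers (same receptive field as one 5x5) vs one 5x5
saving = 1 - (2 * cost_3x3) / cost_5x5  # 1 - 18/25 = 0.28, i.e. 28% fewer multiplies
```

so the factorization trades one expensive kernel for two cheap ones and still comes out about 28% cheaper per output position.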
mobilenet is a streamlined architecture that uses depthwise separable convolutions to construct lightweight deep convolutional neural networks and provides an efficient model for mobile and embedded vision applications. as a lightweight deep neural network, mobilenet has fewer parameters and higher classification accuracy [18]. for yolov4, cspdarknet53 serves as the backbone for this model [16]. it is a cnn foundation for object detection that employs darknet-53. it divides the feature map of the base layer into two parts using the cross-stage partial connections network (cspnet) technique [19] and then combines them using a cross-stage hierarchy. the split-and-merge method provides more gradient flow over the network. hyperparameters were tuned based on the model, machine memory, and capability, as shown in table 2.
fig. 6. yolov4 model architecture [16]
table 1. the map and fps for different object detection models [16]
model         map     fps  gpu
faster r-cnn  59.2%   9.4  pascal
ssd300        43.1%   43   maxwell
ssd500        48.5%   22   maxwell
yolov2        48.1%   40   maxwell
yolov3        57.9%   20   maxwell
yolov4        65.7%   23   maxwell
yolov4        65.7%   33   pascal
after the person detection training, the distance between two persons is calculated using the euclidean distance formula (equation 1) to determine whether the minimum distance in the sop guidelines has been followed. the points are taken from the center of each bounding box that detects a person.
d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² ) (1)
where d is the distance, x and y represent two points in euclidean n-space, x_i and y_i are the euclidean vectors starting from the origin of the space (initial point), and n defines the n-space. the system will detect whether there is more than one person in the frame and calculate the distance between them. the social distance threshold is set at 40.0 pixels, equivalent to approximately 1 meter, assuming the relative scale ratio is 1:4000.
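equation (1) with n = 2, applied to bounding-box centers against the 40-pixel threshold described above, can be sketched as follows; the (x, y, width, height) box layout and the function names are illustrative:

```python
import math

def centroid(box):
    """center point of a bounding box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def is_violation(box_a, box_b, threshold=40.0):
    """true when two detected persons are closer than the pixel threshold
    (40 px, taken in the text to approximate 1 meter)."""
    (xa, ya), (xb, yb) = centroid(box_a), centroid(box_b)
    d = math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2)  # equation (1) with n = 2
    return d < threshold

# two boxes whose centers are 50 px apart: above the 40 px threshold, so safe
safe = not is_violation((0, 0, 20, 20), (30, 40, 20, 20))
```

in a full frame, the check would run over every pair of detected persons, coloring each box by whether it participates in any violation.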
the risk percentage is shown at the bottom left of the frame using equation 2.
risk percentage = (number of red boxes / total number of detected boxes) × 100% (2)
iii. results and discussion
the results of the models have been analyzed with an intersection over union (iou) threshold of 0.5, following the standard requirement set by the ms coco benchmark challenge [8]. based on figure 7, iou is calculated by dividing the intersection area by the union area between the ground truth and predicted bounding boxes. for object detection, the precision and recall are calculated using iou. if iou is greater than or equal to 0.5, the object is classified as a true positive (tp). if iou is lower than 0.5, it is considered a false positive (fp). a false negative (fn) is counted when the ground truth is present but the model fails to detect the object [20]. the model is evaluated by calculating the precision (3), recall (4), f1-score (5), and map for accuracy, and fps for model performance. the general definition of ap is the area under the precision-recall curve, which can be calculated using (6). the map score is calculated by taking the average of ap over all classes for an iou threshold of 0.5. since this research contains only one class (person), the ap and map will be the same.
precision = tp / (tp + fp) (3)
recall = tp / (tp + fn) (4)
f1-score = 2 × (precision × recall) / (precision + recall) (5)
ap = Σ_n (r_{n+1} − r_n) p_interp(r_{n+1}), where p_interp(r_{n+1}) = max_{r' ≥ r_{n+1}} p(r') (6)
table 2. hyperparameter tuning
model          yolov4        faster r-cnn  ssd
backbone       cspdarknet53  inceptionv2   mobilenetv2
batch          64            1             24
learning rate  0.001         0.0002        0.04
iterations     6000          200000        160000
momentum       0.949         0.9           0.9
size           416×416       600×600       300×300
fig. 7. calculation of iou
in model testing, yolov4 achieved a map score of 82.47% on 2,000 testing images, while faster r-cnn achieved 66.10% and ssd 41.34%. the performance is tested on video: yolov4 can detect at around 14~17 fps, while faster r-cnn runs at 7~8 fps and ssd at 49~54 fps. table 3 shows the model performance for person detection using different deep learning algorithms on the person coco dataset.
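the iou test and equations (3)–(5) can be expressed directly in code; the (x1, y1, x2, y2) corner layout and the helper names are assumptions for illustration:

```python
def iou(box_a, box_b):
    """intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    """equations (3)-(5) computed from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# a prediction overlapping the ground truth by 25 of 175 units of area:
# iou ≈ 0.14 < 0.5, so it would count as a false positive
score = iou((0, 0, 10, 10), (5, 5, 15, 15))
```

with only one class, averaging ap over classes is a no-op, which is why the text notes that ap and map coincide here.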
a good detector for object detection should give the best balance of speed and accuracy needed for the application [21]. based on the results, faster r-cnn has the lowest fps but a better map score than ssd, while ssd has the best speed compared to the other models. to conclude, yolov4 has proven to be the best detection model, as it shows a balance of accuracy and speed for detection, and it has been applied to the monitoring system. the model is then tested on the test video from the oxford town centre dataset [22]. the sample was taken from the video for 21 seconds and 531 frames. the distance is calculated using equation 1, and the color of the bounding box is determined by whether the conditions of social distancing are satisfied. the red box marks a person at risk, at less than 1 meter, and the green box indicates that the distance between detections is more than one meter. the risk percentage is shown at the bottom left of the frame using equation 2.
fig. 8. social distancing detection for the test video
fig. 9. social distancing detection for the test video
based on figure 8, 10 green boxes and 13 red boxes were detected, resulting in a risk percentage of 56%. in figure 9, 4 green and 22 red boxes were detected, giving a risk percentage of 84%. the percentage was captured during the testing and displayed in the graph shown in figure 10 to show the trend of the level of compliance by the citizens. this data can be taken into consideration by the authorities to improve future inspection efficiency. iv. conclusions in conclusion, the research has investigated the reliability of the detection algorithms for social distancing inspection. three deep learning models are studied to determine the best social distancing algorithms.
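equation (2) applied to the frame counts above reproduces the reported 56% and 84%; truncating the displayed percentage to an integer is an assumption made here to match the reported values:

```python
def risk_percentage(red_boxes, green_boxes):
    """share of detections violating the distance rule, per frame (equation 2).
    assumption: the displayed value is truncated to an integer percentage."""
    total = red_boxes + green_boxes
    return int(100 * red_boxes / total)

frame_1 = risk_percentage(13, 10)  # figure 8: 13 red, 10 green boxes
frame_2 = risk_percentage(22, 4)   # figure 9: 22 red, 4 green boxes
```

logging this value per frame produces exactly the compliance-trend curve plotted in figure 10.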
experimental results showed that yolov4 achieved the highest map performance compared to the other detection models, with a balanced speed. despite the highest performance, the calculation of social distancing detection did not use proper camera calibration, and the distance is based on assumptions, which may lead to inaccuracy in the social distance. therefore, future work can be extended to include proper camera calibration and an alert system to improve the social distance monitoring system. declarations author contribution all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
table 3. performance metrics for person detection
model      yolov4  faster r-cnn  ssd
precision  0.77    0.72          0.49
recall     0.79    0.75          0.89
f1-score   0.78    0.73          0.31
map        82.47%  66.10%        41.34%
fps        14~17   7~8           49~54
fig. 10. risk percentage vs frame for the test video
funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations. references [1] world health organization: coronavirus disease (covid-19): how is it transmitted? https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19-how-is-it-transmitted (2020). accessed 5 apr 2021. [2] i. leong, pkp: kadar pematuhan 92 peratus tapi penjarakan sosial gagal dipatuhi - ismail sabri. astro awani.
https://www.astroawani.com/berita-malaysia/pkp-kadar-pematuhan-92-peratus-tapi-penjarakan-sosial-gagal-dipatuhi-ismail-sabri-234881 (2020). accessed 5 nov 2020. [3] a. povera: significant increase in the number of sop flouters. new straits times. https://www.nst.com.my/news/nation/2020/07/611744/significant-increase-number-sop-flouters (2020). accessed 5 nov 2020. [4] i. hilmy: eight slapped with rm1k compound each for breaching recovery mco in penang. the star. https://www.thestar.com.my/news/nation/2020/07/25/eight-slapped-with-rm1k-compound-each-for-breaching-recovery-mco-in-penang (2020). accessed 5 nov 2020. [5] w. lan, j. dang, y. wang and s. wang, "pedestrian detection based on yolo network model," 2018 ieee international conference on mechatronics and automation (icma), 2018, pp. 1547-1551. [6] n. s. punn, s. k. sonbhadra, s. agarwal, and g. rai, "monitoring covid-19 social distancing with person detection and tracking via fine-tuned yolo v3 and deepsort techniques," may 2020. [7] d. yang, e. yurtsever, v. renganathan, k. a. redmill, and ü. özgüner, "a vision-based social distancing and critical density detection system for covid-19," jul. 2020. [8] t.y. lin, m. maire, s. belongie, j. hays, p. perona, d. ramanan, p. dollár, and c.l. zitnick. microsoft coco: common objects in context. in: fleet d., pajdla t., schiele b., tuytelaars t. (eds) computer vision – eccv 2014, lecture notes in computer science, vol. 8693, pp. 740–755. springer, cham (2014). [9] r. girshick, j. donahue, t. darrell and j. malik, "rich feature hierarchies for accurate object detection and semantic segmentation," 2014 ieee conference on computer vision and pattern recognition, 2014, pp. 580-587. [10] r. girshick, "fast r-cnn," 2015 ieee international conference on computer vision (iccv), 2015, pp. 1440-1448. [11] s. ren, k. he, r. girshick and j.
sun, "faster r-cnn: towards real-time object detection with region proposal networks," in ieee transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137-1149, 1 june 2017. [12] w. liu et al., "ssd: single shot multibox detector," 2016, pp. 21–37. [13] j. redmon, s. divvala, r. girshick, and a. farhadi, "you only look once: unified, real-time object detection," in 2016 ieee conference on computer vision and pattern recognition (cvpr), jun. 2016, pp. 779–788. [14] j. redmon and a. farhadi, "yolo9000: better, faster, stronger," 2017 ieee conference on computer vision and pattern recognition (cvpr), 2017, pp. 6517-6525. [15] j. redmon and a. farhadi, "yolov3: an incremental improvement," apr. 2018. [16] a. bochkovskiy, c.-y. wang, and h.-y. m. liao, "yolov4: optimal speed and accuracy of object detection," apr. 2020. [17] c. szegedy, v. vanhoucke, s. ioffe, j. shlens and z. wojna, "rethinking the inception architecture for computer vision," 2016 ieee conference on computer vision and pattern recognition (cvpr), 2016, pp. 2818-2826. [18] a. g. howard et al., "mobilenets: efficient convolutional neural networks for mobile vision applications," apr. 2017. [19] c.-y. wang, h.-y. mark liao, y.-h. wu, p.-y. chen, j.-w. hsieh and i.-h. yeh, "cspnet: a new backbone that can enhance learning capability of cnn," 2020 ieee/cvf conference on computer vision and pattern recognition workshops (cvprw), 2020, pp. 1571-1580. [20] k. e. koech: confusion matrix for object detection. towards data science. https://towardsdatascience.com/confusion-matrix-and-object-detection-f0cbcb634157 (2020). accessed 29 apr 2021. [21] n.-d. nguyen, t. do, t. d. ngo, and d.-d. le, "an evaluation of deep learning methods for small object detection," j. electr. comput. eng., vol. 2020, pp. 1–18, apr. 2020. [22] b. benfold and i. reid, "stable multi-target tracking in real-time surveillance video," cvpr 2011, 2011, pp. 3457-3464.
knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp.
97–104 eissn 2597-4637 https://doi.org/10.17977/um018v4i22021p97-104 ©2021 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) keds is sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by indonesian ministry of education, culture, research, and technology melanoma classification based on simulated annealing optimization neural network edi jaya kusuma a, 1, *, ika pantiawati a, 2, sri handayani b, 3 a department of medical record and health information, faculty of health science, universitas dian nuswantoro, semarang 50131, indonesia b department of public health, faculty of health science, universitas dian nuswantoro, semarang 50131, indonesia 1 edi.jaya.kusuma@dsn.dinus.ac.id*; 2 ikapantia13@dsn.dinus.ac.id; 3 sri.handayani@dsn.dinus.ac.id * corresponding author i. introduction cancer appears because of the uncontrollable growth of abnormal cells in the human body. cancer can occur in many parts of the human body, depending on one's life habits. the global cancer observatory (globocan) reported 18.1 million cancer cases around the world in 2018, with more than 9.1 million categorized as mortality cases [1]. moreover, from the same report provider, in 2020 the total number of cancer cases increased to 19.3 million, of which about 10 million were classified as death cases [2]. from these reports, it can be concluded that cancer cases are increasing and spreading globally every year. one cancer that commonly arises in countries with a high uv (ultraviolet) index is skin cancer [2]. skin cancer is often caused by high exposure of the skin to ultraviolet light, such as direct sunlight. based on the invasion time and the level of damage to the body, skin cancer can be divided into two types: melanoma and non-melanoma.
melanoma is categorized as a malignant skin cancer that can threaten human life [3]. the invasion state can be identified from the appearance of pigment cells called melanocytes in the form of dark skin lesions. however, the skin lesion color can differ in several cases, depending on the number of changed pigment cells [4]. unfortunately, benign skin cancer has an appearance similar to malignant skin cancer. thus, it is crucial to identify early whether a skin lesion is melanoma or benign. furthermore, misclassification of skin cancer can lead to severe clinical outcomes. many researchers have conducted studies based on advanced technology in the image processing and artificial intelligence fields to identify cancer occurrence in its early stages. the earlier the cancer is identified, the higher the patient's chance of recovery. gautam and raman [5] proposed article info a b s t r a c t article history: submitted 15 december 2021 revised 22 december 2021 accepted 28 december 2021 published online 31 december 2021 technology development in image processing and artificial intelligence leads to high demand for smart systems, especially in the health sector. cancer is one of the diseases with the highest mortality worldwide. melanoma is one of the cancers commonly caused by high exposure to uv light. the earlier the melanoma is identified, the higher the patient's chance of recovering. therefore, this study proposes melanoma detection based on a bpnn optimized by a simulated annealing algorithm. this research utilizes ph2 dermoscopic image data containing 200 color digital images in bmp format. the data is processed using color feature extraction techniques to identify the characteristics of each image according to the target data. the color space extraction includes mean rgb, hsv, cie lab, ycbcr, and xyz.
the evaluation result showed that the bpnn-sa increased the performance accuracy in classifying skin cancer compared to the original bpnn, with an overall average accuracy of 84.03%. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). keywords: cancer melanoma neural network optimization simulated annealing http://u.lipi.go.id/1502081730 http://u.lipi.go.id/1502081046 https://doi.org/10.17977/um018v4i22021p97-104 http://journal2.um.ac.id/index.php/keds mailto:keds.journal@um.ac.id https://creativecommons.org/licenses/by-sa/4.0/ https://sinta.kemdikbud.go.id/journals/detail?id=6662 https://creativecommons.org/licenses/by-sa/4.0/ 98 e.j. kusuma et al. / knowledge engineering and data science 2021, 4 (2): 97–104 melanoma classification based on the local binary pattern (lbp) and their variant. the evaluation was performed through several machine learning methods such as decision tree (dt), k-nearest neighbor (knn), support vector machine (svm), and random forest (rf). the result shows that the rf method achieved the best performance with 80.3% accuracy. moreover, other research proposed by zghal and derbel [6] utilized ph2 dermoscopic image dataset as the source for developing computer-aided diagnosis (cad). the proposed method used asymmetry, border, color, and diameter (abcd) rules as features extraction from the dataset. the classification of skin cancer based on the ph2 dataset has proposed the combination of color and texture extraction [7]. the features consist of five color spaces: rgb, hsv, lab, xyz, and ycbcr. the features were used as input for three different classifiers: k-nearest neighbor, support vector machine, and neural network. furthermore, some research applies the optimization method to improve the recognition performance. 
Using the same data collection, [8] proposed skin cancer segmentation based on fuzzy c-means clustering and skin cancer detection using an integrated ANN and differential evolution (DE) algorithm as the training optimization method. The proposed method used multi-feature extraction such as red-green-blue (RGB), local binary pattern (LBP), and gray level co-occurrence matrix (GLCM) features. The evaluation of this method reached 97.4% accuracy, which indicates that optimizing an ANN with the DE algorithm detects skin cancer effectively.

This study proposes skin cancer detection based on color feature extraction and a neural network optimized with the simulated annealing (SA) algorithm. SA can find a global solution using a randomized approach. Moreover, an SA-optimized adaptive neuro-fuzzy inference system (ANFIS) has outperformed other optimization methods such as hyper-box (HB), backpropagation (BP), and genetic algorithm (GA), with 96.28% accuracy [9]. The proposed color feature extraction consists of several color spaces: RGB, HSV, CIE Lab, YCbCr, and XYZ. The proposed classifier is a backpropagation neural network (BPNN) in which the weight of each synapse is optimized using simulated annealing (SA). All of the proposed methods are evaluated on the PH2 dataset. This paper is presented in four sections: the second section describes the proposed methodology and theoretical foundation, the third section explains the results of the proposed method, and the last section concludes the work.

II. Methods

This study proposes skin cancer detection based on color feature extraction and a neural network optimized with the simulated annealing (SA) algorithm, BPNN-SA. The research utilizes the PH2 dermoscopic image dataset provided by the Dermatology Service of Hospital Pedro Hispano in Matosinhos, Portugal, together with the Universidade do Porto and Técnico Lisboa [10].
This dataset contains 200 images of skin lesions, separated into 160 benign images (common nevi and atypical nevi) and 40 melanoma images. The images are saved in BMP format at 768×560 pixels. Figure 1 shows samples from the PH2 dataset, and Figure 2 shows the research design of the proposed method, BPNN-SA. From Figure 2, it can be seen that the first step is to extract each component of the color spaces; this study uses RGB, HSV, CIE Lab, YCbCr, and XYZ.

Fig. 1. Sample images of the PH2 dataset: (a) common nevi; (b) atypical nevi; and (c) melanoma

RGB is a color space used in many digital devices such as smartphones, cameras, televisions, and computers [11]. It consists of three layers representing red, green, and blue, and uses an 8-bit system for color determination: the pixel level in each layer is distributed between 0 and 255, where a lower value means a darker color. Besides representing color in digital devices, the RGB color space is often used as a color feature in studies such as fruit classification [12], image segmentation [13], and object detection [14].

HSV stands for hue, saturation, and value. Unlike RGB, in which each layer represents a color, each term in HSV has a specific function in image representation: hue determines the color temperature of the image, saturation represents the color domination in the image, and value represents the brightness level. HSV is also often used as a color feature [15][16].

YCbCr can be broken down into three components: Y, Cb, and Cr. Each component has a different function: Y is the luma, or color brightness, Cb is the blue-difference chroma component, and Cr is the red-difference chroma component.
Commonly, this color space is used in digital video processing [17]. The Commission Internationale de l'Éclairage (CIE) proposed the XYZ and Lab color spaces. XYZ was proposed in 1931 and is still used as a graphics standard today [18]. Meanwhile, the Lab color space was proposed by the CIE in 1976; it was designed to approximate human color vision [19]. Lab values range from negative to positive, with different ranges representing different colors.

Afterward, the extracted features are divided using k-fold cross-validation to generate training and testing data. Cross-validation is used to evaluate the capability and robustness of the proposed model on unseen data. The training data serve as the reference for training the backpropagation neural network (BPNN).

Fig. 2. Research design

A backpropagation neural network, or BPNN, is a multi-layer perceptron-based artificial neural network. BPNN has a structure similar to the MLP, with an input, a hidden, and an output layer, as shown in Figure 3. The difference is the learning process, in which BPNN propagates both forward and backward [21]; it is also known as backward propagation of errors, and it uses a gradient function to identify the class of the data [22]. The result of BPNN training is a trained network model, from which the weight of each synapse is extracted. The extracted weights are then optimized using the simulated annealing (SA) algorithm. SA is a metaheuristic search method that resolves a problem iteratively based on a specific objective function; the algorithm was inspired by the annealing process in the metallurgy industry [23].
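Before turning to the optimizer, the feature step described above can be sketched: each feature is the mean of one channel in one color space, five spaces giving fifteen features per image. A minimal stdlib illustration (showing only RGB and HSV; a real pipeline would add Lab, YCbCr, and XYZ through an image library such as OpenCV, and the toy patch below is invented for the example):

```python
import colorsys

def mean_color_features(pixels):
    """pixels: list of (r, g, b) tuples with values in 0..255.

    Returns the channel means in RGB and HSV (6 of the paper's 15 features).
    """
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_g = sum(p[1] for p in pixels) / n
    mean_b = sum(p[2] for p in pixels) / n

    # colorsys works on 0..1 values; HSV components also come back in 0..1
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    mean_h = sum(p[0] for p in hsv) / n
    mean_s = sum(p[1] for p in hsv) / n
    mean_v = sum(p[2] for p in hsv) / n
    return [mean_r, mean_g, mean_b, mean_h, mean_s, mean_v]

# toy 2x2 "lesion patch" (hypothetical pixel values, not from the PH2 data)
patch = [(120, 80, 60), (130, 85, 70), (60, 40, 30), (200, 180, 170)]
features = mean_color_features(patch)
print(len(features))  # 6 here; 15 in the full five-color-space version
```

With Lab, YCbCr, and XYZ means appended in the same way, each image collapses to one fifteen-dimensional feature vector for the classifier.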
SA imitates the crystallization process of liquid metal: the process begins by heating the metal until it reaches the desired temperature, then proceeds with a gradual, controlled cooling until the metal keeps its optimal form [24]. The extracted BPNN weights are taken as the initial input of simulated annealing. The search process generates random interference that moves the "particle" while the initial temperature cools down, and a movement is accepted if the "particle" position has a lower energy state. The search lasts until the process exceeds the defined number of iterations or the error-rate boundary is satisfied. The final output is an optimized weight set, which is assigned back to the network model to replace its original weights. Finally, the final model is evaluated with the test data to determine the performance of the proposed model.

III. Experimental Results and Discussion

Skin lesion identification based on BPNN and simulated annealing optimization has been conducted, with the PH2 dataset used as the reference in model development. The classification features were extracted from the PH2 images using the RGB, HSV, YCbCr, XYZ, and Lab color spaces; the extraction process produces fifteen features for each skin lesion image. From these features, cross-validation was applied to generate the training and testing sets, with the number of folds ranging from 2 to 10; cross-validation evaluates the model's capability to handle unlabeled data. Table 1 shows the initial configuration of the BPNN.

Fig. 3. BPNN structure with one hidden layer [20]

After the BPNN was configured, the training set obtained from cross-validation was fed into the neural network model as the training process. The neural network produced by the training process was then extracted to obtain the weight of each synapse.
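The annealing loop that these extracted weights feed into can be sketched as follows. This is an illustration, not the authors' code: the Gaussian move size and the geometric cooling schedule are assumptions, and the "energy" here is the MSE of a single tangent-sigmoid neuron on toy data, standing in for the BPNN fitness function.

```python
import math
import random

random.seed(42)

# toy (inputs, target) pairs standing in for the real training set
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

def mse(w):
    """Mean square error of a one-neuron network with weights w = [w1, w2, bias]."""
    total = 0.0
    for (x1, x2), target in DATA:
        out = math.tanh(w[0] * x1 + w[1] * x2 + w[2])  # tangent-sigmoid unit
        total += (out - target) ** 2
    return total / len(DATA)

def simulated_annealing(w, temp=100.0, cooling=0.9, iters=50):
    energy = mse(w)
    best_w, best_e = list(w), energy
    for _ in range(iters):
        cand = [wi + random.gauss(0.0, 0.5) for wi in w]  # random interference
        delta = mse(cand) - energy
        # always accept improvements; accept worse moves with Boltzmann probability
        if delta < 0 or random.random() < math.exp(-delta / temp):
            w, energy = cand, energy + delta
        if energy < best_e:
            best_w, best_e = list(w), energy
        temp *= cooling  # gradual, controlled cooling
    return best_w, best_e

w_opt, final_mse = simulated_annealing([0.1, -0.2, 0.05])
```

The high starting temperature makes early worse moves likely to be accepted, which is what lets the search escape the local minima that plain gradient descent can get trapped in.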
These weight sets were used as the initial state of the optimization model based on the simulated annealing (SA) algorithm; Table 2 shows its specification. The optimization aims to obtain optimized weights for the BPNN model. After the set of optimized weights was found, it was assigned to the BPNN to replace the original weights. The evaluation was conducted by feeding the testing set into the optimized model, and the proposed model was compared with the original BPNN to describe the difference created by the optimization algorithm. Table 3 shows the accuracy of the original BPNN for each fold of cross-validation. The ninth fold achieved the highest accuracy, 83.83%, while the tenth fold achieved the lowest, 68%. Overall, the original BPNN could identify skin cancer with a total average accuracy of 79.51%. The evaluation result of the proposed optimized model can be seen in Table 4: every fold number outperformed the original BPNN. In BPNN-SA, the sixth fold achieved the highest accuracy, 88.38%, while the lowest was the third fold with 81.81%; overall, BPNN-SA reached 84.03% average accuracy. This achievement can be reached thanks to the capability of simulated annealing to search for the global minimum, that is, the minimum value of the fitness function. Moreover, simulated annealing uses a randomized approach to generate the global solution, which theoretically has a broader probability of finding the best solution [9].

Table 1. BPNN configuration
Parameter              Setting or specification
Hidden layer           1
Nodes                  5
Epochs                 100
Learning rate          0.001
Activation function    tangent sigmoid activation
Performance function   mean square error (MSE)

Table 2.
The specification of the simulated annealing
Parameter            Setting or specification
Maximum iteration    50
Initial temperature  100
Fitness function     mean square error (MSE)

Table 3. The accuracy of the original BPNN (%)
Fold no.   k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10
1          80.00   84.84   88.00   87.50   78.78   85.71   80.00   86.36   80.00
2          76.00   77.27   82.00   80.00   63.63   89.28   72.00   81.81   90.00
3                  81.81   74.00   65.00   84.84   89.28   92.00   77.27   75.00
4                          78.00   85.00   90.90   78.57   96.00   77.27   75.00
5                                  87.50   90.90   82.14   80.00   86.36   20.00
6                                          75.75   64.28   68.00   77.27   15.00
7                                                  78.57   80.00   77.27   80.00
8                                                          80.00   95.45   90.00
9                                                                  95.45   75.00
10                                                                         80.00
Average    78.00   81.31   80.50   81.00   80.80   81.12   81.00   83.83   68.00

The comparison between the original BPNN and BPNN-SA can be seen in Figure 4. In almost every fold, BPNN-SA outperforms the original BPNN, especially in the sixth and tenth folds, which differ significantly from the original BPNN accuracy. This happens because BPNN-SA uses simulated annealing as its search function and relies on a randomized approach, which has a broader chance of finding the best solution than the original BPNN, whose gradient-based search tends to get trapped in local minima [25]. This result indicates that the simulated annealing algorithm is capable of improving the performance of BPNN in classifying skin cancer.

IV. Conclusion

The proposed improvement of skin cancer classification using BPNN and simulated annealing has been carried out. This research utilizes the PH2 dermoscopic image data containing 200 color digital images in BMP format. The data are processed using color feature extraction to identify the characteristics of each image according to the target data; the extracted color spaces are mean RGB, HSV, CIE Lab, YCbCr, and XYZ.
The experiment was conducted using the cross-fold validation method to evaluate the model's robustness to unknown data. The BPNN model was first trained on the training set; the trained weights were then taken as the initial weights for simulated annealing, which searched for the optimal weights of the BPNN model. The evaluation showed that the BPNN-SA method increased the accuracy of skin cancer classification compared to the original BPNN, with an overall average accuracy of 84.03%.

Table 4. The accuracy of the BPNN-SA (%)
Fold no.   k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10
1          83.00   80.30   88.00   87.50   84.84   85.71   84.00   90.90   80.00
2          85.00   80.30   84.00   75.00   69.69   89.28   88.00   90.90   95.00
3                  84.84   78.00   70.00   87.87   85.71   88.00   77.27   75.00
4                          78.00   90.00   93.93   85.71   92.00   77.27   80.00
5                                  90.00   100.00  82.14   88.00   90.90   70.00
6                                          93.93   82.14   72.00   77.27   90.00
7                                                  78.57   88.00   81.81   85.00
8                                                          80.00   90.90   95.00
9                                                                  81.81   85.00
10                                                                         85.00
Average    84.00   81.81   82.00   82.50   88.38   84.18   85.00   84.34   84.00

Fig. 4. Accuracy comparison of BPNN and BPNN-SA

Acknowledgment

We gratefully thank the Lembaga Penelitian dan Pengabdian kepada Masyarakat (LPPM) (Grant ID: 056/A38-04/UDN-09/VI/2021) and the Faculty of Health Science of Universitas Dian Nuswantoro for supporting this research.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.

Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] Kementerian Kesehatan Republik Indonesia, "Kementerian Kesehatan Republik Indonesia," Kementerian Kesehatan RI, p. 1, 2019.
[2] H. Sung et al., "Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries," CA Cancer J. Clin., vol. 71, no. 3, pp. 209–249, May 2021.
[3] R. B. Oliveira, J. P. Papa, A. S. Pereira, and J. M. R. S. Tavares, "Computational methods for pigmented skin lesion classification in images: review and future trends," Neural Comput. Appl., vol. 29, no. 3, pp. 613–636, Feb. 2018.
[4] N. K. Mishra and M. E. Celebi, "An overview of melanoma detection in dermoscopy images using image processing and machine learning," pp. 1–15, Jan. 2016.
[5] A. Gautam and B. Raman, "Skin cancer classification from dermoscopic images using feature extraction methods," in 2020 IEEE Region 10 Conference (TENCON), Nov. 2020, pp. 958–963.
[6] N. S. Zghal and N. Derbel, "Melanoma skin cancer detection based on image processing," Curr. Med. Imaging, vol. 16, no. 1, pp. 50–58, 2018.
[7] S. Oukil, R. Kasmi, K. Mokrani, and B. García-Zapirain, "Automatic segmentation and melanoma detection based on color and texture features in dermoscopic images," Skin Res. Technol., Nov. 2021.
[8] M. Kumar, M. Alshehri, R. AlGhamdi, P. Sharma, and V. Deep, "A DE-ANN inspired skin cancer detection approach using fuzzy c-means clustering," Mob. Networks Appl., vol. 25, no. 4, pp. 1319–1329, 2020.
[9] B. Haznedar, M. T. Arslan, and A. Kalinli, "Optimizing ANFIS using simulated annealing algorithm for classification of microarray gene expression cancer data," Med. Biol. Eng. Comput., vol. 59, no. 3, pp. 497–509, Mar. 2021.
[10] E. M. Senan and M. E. Jadhav, "Analysis of dermoscopy images by using ABCD rule for early detection of skin cancer," Glob. Transitions Proc., vol. 2, no. 1, pp. 1–7, 2021.
[11] R. A. Asmara, F. Rahutomo, Q. Hasanah, and C. Rahmad, "Chicken meat freshness identification using the histogram color feature," in Proc. 2017 Int. Conf. Sustain. Inf. Eng. Technol. (SIET), 2017, pp. 57–61.
[12] S. Tu, Y. Xue, C. Zheng, Y. Qi, H. Wan, and L. Mao, "Detection of passion fruits and maturity classification using red-green-blue depth images," Biosyst. Eng., vol. 175, pp. 156–167, Nov. 2018.
[13] G. F. Shidik, F. N. Adnan, C. Supriyanto, R. A. Pramunendar, and P. N. Andono, "Multi color feature, background subtraction and time frame selection for fire detection," in Proc. 2013 Int. Conf. Robot. Biomimetics, Intell. Comput. Syst. (ROBIONETICS), 2013, pp. 115–120.
[14] M. K. Alsmadi, "Content-based image retrieval using color, shape and texture descriptors and features," Arab. J. Sci. Eng., vol. 45, no. 4, pp. 3317–3330, 2020.
[15] O. R. Indriani, E. J. Kusuma, C. A. Sari, E. H. Rachmawanto, and D. R. I. M. Setiadi, "Tomatoes classification using K-NN based on GLCM and HSV color space," in 2017 Int. Conf. Innovative and Creative Information Technology (ICITech), Nov. 2017, pp. 1–6.
[16] D. Wu, C. Zhang, L. Ji, R. Ran, H. Wu, and Y. Xu, "Forest fire recognition based on feature extraction from multi-view images," Trait. du Signal, vol. 38, no. 3, pp. 775–783, Jun. 2021.
[17] D. Chai and A. Bouzerdoum, "A Bayesian approach to skin color classification in YCbCr color space," in 2000 TENCON Proceedings, 2000, vol. 2, pp. 421–424.
[18] J. Schanda, Colorimetry: Understanding the CIE System. John Wiley & Sons, 2007.
[19] E. E. Lavindi, E. J. Kusuma, G. F. Shidik, R. A. Pramunendar, A. Z. Fanani, and Pujiono, "Neural network based on GLCM, and CIE L*a*b* color space to classify tomatoes maturity," in Proc. 2019 Int. Semin. Appl. Technol. Inf. Commun. (iSemantic), 2019, pp. 45–50.
[20] E. Kusuma, G. Shidik, and R. Pramunendar, "Optimization of neural network using Nelder Mead in breast cancer classification," Int. J. Intell. Eng. Syst., vol. 13, no. 6, pp. 330–337, Dec. 2020.
[21] T. S and M. N, "Detection, segmentation and recognition of face and its features using neural network," J. Biosens. Bioelectron., vol. 7, no. 2, 2016.
[22] S. Tikoo and N. Malik, "Detection of face using Viola Jones and recognition using back propagation neural network," Int. J. Comput. Sci. Mob. Comput., vol. 5, no. 5, pp. 288–295, 2016.
[23] M. Riza, P. D. Sentia, A. Andriansyah, and A. Muslim, "Simulated annealing-based optimization of biodegradable plastic synthesis," Int. Rev. Model. Simulations, vol. 12, no. 1, p. 24, Feb. 2019.
[24] G. F. Shidik, E. J. Kusuma, S. Nuraisha, and P. N. Andono, "Heuristic vs metaheuristic method: improvement of spoofed fingerprint identification in IoT devices," Int. Rev. Model. Simulations, vol. 12, no. 3, pp. 168–175, 2019.
[25] S. S. Behera and S. Chattopadhyay, "A comparative study of back propagation and simulated annealing algorithms for neural net classifier optimization," Procedia Eng., vol. 38, pp. 448–455, 2012.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 5, No 1, December 2022, pp. 101–108. https://doi.org/10.17977/um018v5i12022p101-108
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Sentiment Analysis of Amazon Product Reviews Using Supervised Machine Learning Techniques

Naveed Sultan *
Department of Information Technology, Khwaja Fareed University of Engineering and Information Technology, Abu Dhabi Rd, Rahim Yar Khan, Punjab, Pakistan
naveedsultan587@gmail.com *
* corresponding author

I. Introduction

People buy goods from various e-commerce websites, as the world's commerce is now practically online [1]. Online shopping also offers the privilege of checking products before purchase: consumers are more likely to buy a product because of its reviews, and internet retailers and distributors invite clients to express their thoughts on their merchandise.
Millions of pieces of feedback on products, facilities, and places are produced online daily [2], which makes the internet the primary source of knowledge about a product or service. Reviews therefore offer valuable feedback on a business, including its venue, pricing, and advice, allowing customers to consider every part of the business [3]. This is positive for consumers and helps marketers understand shoppers and the preferences that shape their products. As the number of available comments on a company rises, it gets more challenging for a potential consumer to decide whether or not to purchase [4]. In this age of artificial intelligence, reading thousands of reviews to polarize a sample into distinct categories and to judge a brand's attractiveness among customers worldwide takes considerable time [5][6]. Today, studying data from actual customer reviews is an important field.

The authors in [7] worked on film reviews. Since vast repositories of online reviews are readily accessible, this domain is easy to work on, and because reviewers usually summarize their overall sentiment with a machine-extractable rating metric such as the number of stars, the data did not have to be hand-labeled for supervised learning and assessment. The Internet Movie Database (IMDb) was their data source, where the database includes only numeric values or scores. Ratings were collected randomly and grouped into three categories, positive, negative, or neutral, and the focus was only on finding whether the tendency of the emotion is positive or negative. Three machine learning algorithms were used: naïve Bayes, maximum entropy classification, and support vector machines (SVM).

Article Info
Article history: received 3 June 2022; revised 10 July 2022; accepted 14 August 2022; published online 7 November 2022.
Abstract: Today, everything is sold online, and many individuals can post reviews about different products to show feedback.
This serves as feedback for businesses regarding buyer reviews, performance, product quality, and seller service. The project focuses on buyer opinions based on mobile phone reviews. Sentiment analysis is the task of analyzing all these data and obtaining opinions about these products and services, classifying them as positive, negative, or neutral. This insight can help companies improve their products and help potential buyers make the right decisions. Before classification against a trained dataset, the reviews must be preprocessed to remove unwanted data such as stop words, verbs, POS tags, punctuation, and attachments. Many techniques exist to perform such tasks; in this article, we use a model that applies different supervised machine learning techniques. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).
Keywords: supervised machine learning; random forest classification; decision tree; support vector machine; k-nearest neighbor classification

The study in [8] analyzes Flipkart feedback using the naïve Bayes and decision tree algorithms. Using a single dataset of product ratings and reviews from Flipkart sellers, it classifies each review's subjectivity and objectivity and whether the buyer's meaning is negative or positive. These assessments were, to a certain degree, useful both for buyers and for providers; it is an observational study analyzing the efficacy of semantic meaning in product-evaluation categorization. In [9], feedback from numerous e-shopping websites is evaluated.
Analyzing ratings for online shopping sites is the primary goal of that framework. The ratings are categorized as positive, negative, and neutral; such findings help in picking a specific e-shopping website based on the most favorable reviews and scores. First, a data collection of e-shopping websites providing ratings relevant to the services of individual websites is gathered. Then, specific preprocessing methods are applied to the datasets to delete unwanted items and organize the details correctly. After that, a POS tagger assigns tags according to the position of each phrase, and the SentiWordNet dictionary is used to find the score of each word. Sentiments are then graded as positive, negative, or neutral, and the comparison of providers based on positive and negative feedback can be shown in graphical style.

This paper aims to distinguish customers' positive and negative feedback on various products and to develop a supervised learning model to polarize large quantities of reviews. Our dataset consists of feedback and ratings from consumers, obtained from user reviews of Amazon products. Based on that, we extracted features from our dataset and established several supervised models. Such models use supervised machine learning algorithms such as naïve Bayes, logistic regression, support vector machines, ensemble classification, decision tree, and k-nearest neighbor. Finally, we compare all the models and check each model's accuracy with the ROC curve, recall, and precision.

II. Methods

A. Data Preprocessing

We take the dataset from reviews of Amazon products [3]. Our dataset has 483,148 reviews in total, covering the product name, brand, price, rating, review text, and review votes. We work mainly with the review column, as it is the most critical aspect of this project. We separate positive and negative reviews below.
Figure 1 illustrates the preprocessing that separates positive from negative reviews.

Fig. 1. Data preprocessing

Besides the brief overview of the dataset, we plotted the distribution of ratings against the number of reviews and calculated the total number of reviews with ratings 5, 4, 3, 2, and 1. Our dataset has five classes, with ratings from 1 to 5 stars, and the division among the classes is imbalanced: classes 2 and 3 hold a small amount of data, while class 5 has more than 175,000 reviews. Here is an example from our dataset, a review text: "I am using this phone, this is amazing," with rating '5'. The rating distribution of Amazon reviews can be seen in Figure 2.

Fig. 2. Rating distribution of Amazon reviews

For the research purposes of this project, we filtered the dataset down to 16,000 reviews and then separated it again based on the review's rating.

B. Features

We tried two types of features in our project. The first is CountVectorizer [10]. To use textual data for predictive modeling, the text must be analyzed to remove certain terms, a procedure called tokenization. These words must then be encoded as integers or floating-point values to serve as inputs for machine learning algorithms; this procedure is known as feature extraction (or vectorization). We use the scikit-learn CountVectorizer to convert a text collection into a vector of term counts, which makes text representation more flexible:

count_vector = CountVectorizer(stop_words="english")

The other method is TF-IDF [11]. It is a statistical metric that assesses the significance of a word with respect to a document in a collection of documents. It is the product of two components: the number of times the term appears in a document, and the inverse of the frequency of documents that contain the term.
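Conceptually, both vectorizers map tokenized, stop-word-filtered text to numeric vectors. A toy stdlib illustration of what they compute (the stop-word list and example documents here are invented for the sketch; scikit-learn's real classes add vocabulary pruning, smoothing options, and sparse storage):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "this", "i", "am"}  # tiny assumed list

docs = ["this phone is amazing", "the battery is bad", "amazing battery"]

def tokenize(text):
    """Lowercase whitespace tokenization with stop-word removal."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

vocab = sorted({t for d in docs for t in tokenize(d)})

def count_vector(doc):
    """Raw term counts over the shared vocabulary (CountVectorizer idea)."""
    counts = Counter(tokenize(doc))
    return [counts[t] for t in vocab]

def tfidf_vector(doc):
    """Term counts reweighted by inverse document frequency (TF-IDF idea)."""
    counts = Counter(tokenize(doc))
    vec = []
    for t in vocab:
        df = sum(1 for d in docs if t in tokenize(d))  # document frequency
        idf = math.log(len(docs) / df) + 1.0           # one common idf variant
        vec.append(counts[t] * idf)
    return vec

print(vocab)                  # ['amazing', 'bad', 'battery', 'phone']
print(count_vector(docs[0]))  # [1, 0, 0, 1]
```

The TF-IDF weighting downweights words that appear in most documents, so terms distinctive to a single review dominate its vector.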
tfidf_vector = TfidfVectorizer(stop_words="english") tfidf_vector.fit(x_train_data) c. classification this research used six classification methods. the first is naïve bayes. the naïve bayes classification algorithm applies bayes' theorem to forecast the tag of a text based on prior knowledge of the words it contains [12]. it evaluates the probability of every tag for a given text and then forecasts the most likely one. classification problems are among the most frequent tasks for supervised learning methods. in this approach, the features 𝑥 are assumed to be conditionally independent given 𝑦, termed the naïve bayes assumption. the calculation of naïve bayes as in (1). P(x1, ..., xk | y) = ∏_{i=1}^{k} p(xi | y) (1) second, we utilized logistic regression, a classification technique that addresses the binary classification problem by using a weighted combination of the inputs [13]. the sigmoid function transforms a real number into a number between 0 and 1. we train a logistic regression classifier on countvectorizer and tfidf features to compare their accuracy. the default parameters were used; the resulting accuracy will be shown in the results section. logistic regression works with a sigmoid function, which maps the outcome values to the range from 0 to 1, or true/false. the visualization of logistic regression can be seen in figure 3. fig. 3. logistic regression third, the k-nearest neighbor (knn) is a non-parametric classification procedure that has been frequently utilized in recent years. this approach creates a forecast for an input point from its 𝐾 nearest neighbors in the training data; the majority class among those neighbors is then returned. the euclidean distance between the input and each neighbor is used as a measure of the similarity between the data points [14].
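the naïve bayes classifier of equation (1) combined with count features can be sketched end to end; the tiny training corpus and labels below are invented for illustration, not the paper's data.

```python
# Sketch: multinomial naive Bayes over token counts, per equation (1).
# The four training reviews and their labels are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "amazing phone love it",
    "great battery great screen",
    "terrible phone broke fast",
    "awful battery died",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Vectorize, then model P(x1..xk | y) as a product of per-word probabilities.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["amazing battery"])
```

"amazing" occurs only in positive training reviews, so with laplace smoothing the product of conditional probabilities favors the positive tag.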
the knn estimate can be written as in (2). f(x) = (1/K) ∑_{xi ∈ Nk(x)} yi (2) fourth, the support vector machine (svm) is a classification technique that makes the best use of a small quantity of data [15]. it separates the vectors belonging to a particular group or category from those not belonging to the group. suppose, for example, that two tags are available, expensive and cheap, and that the data contains two characteristics, 𝑥 and 𝑦. for each coordinate pair (𝑥, 𝑦), the classifier must decide whether the point is expensive or cheap. to accomplish this, the svm finds a separating line between the two sets of points, the so-called decision boundary, with the expensive group on one side and the cheap group on the other. fifth, ensemble methods create more than one model and then combine them to achieve better results [16]. ensemble approaches are generally more precise than a single model [17]. this is also the case in several machine learning competitions, such as the popular netflix prize, where the winning solution used a complex ensemble approach to implement a collaborative filtering algorithm. here is the related code for this ensemble. ess_model = RandomForestClassifier() # train model ess_model.fit(x_train_data_new, y_train_data) # test model predictions["essembleclasification"] = ess_model.predict(x_test_data_new) the last is the decision tree. decision tree is a supervised algorithm of the machine learning family. it may be utilized for both classification and regression problems [18]. the objective of the approach is to develop a model that predicts the value of a target variable [19]. the decision tree utilizes a tree representation in which each leaf matches a class label and the characteristics are represented in the interior nodes of the tree. the related code of decision tree as follows.
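equation (2) averages the labels of the K nearest neighbors. a small pure-python sketch of that idea, with invented two-dimensional points and euclidean distance; the data and K are illustrative only.

```python
# Sketch of equation (2): average the labels of the K nearest training points,
# ranked by Euclidean distance. Points and labels below are invented.
import math

def knn_predict(train, query, k=3):
    # train: list of (point, label) pairs; returns the mean label of the
    # k points closest to the query (equation (2) with f(x) = (1/K) * sum y_i).
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return sum(label for _, label in nearest) / k

points = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.9, 1.0), 1), ((1.0, 0.9), 1)]
score = knn_predict(points, (0.05, 0.1), k=3)   # 2 of 3 neighbors have label 0
```

for classification, the averaged score is thresholded (here score < 0.5 means the majority of neighbors carry label 0).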
from sklearn import tree tree_model = tree.DecisionTreeClassifier() d. evaluation parameter the metrics we use to evaluate our project are accuracy, precision, recall, and f1-score [20]. precision is the percentage of predicted positive reviews that are truly positive: the true positives divided by the true positives plus false positives, as defined in (3). PR = tp / (tp + fp) (3) where tp is known as true positive and fp as false positive. the recall measures the truly positive reviews divided by the total number of true positive and false negative reviews, as in (4). RC = tp / (tp + fn) (4) where tp stands for true positive and fn for false negative. the f1 score is the harmonic mean of precision and recall, as in (5). F1-score = 2 * PR * RC / (PR + RC) (5) accuracy measures the overall performance of the system: the true positive and true negative reviews divided by the total number of true positive, true negative, false positive, and false negative reviews, as in (6). ACC = (tp + tn) / (tp + tn + fp + fn) (6) iii. results and discussion we divide the dataset of 483148 reviews into 80% for the training set and 20% for the testing set. after successfully training the machine learning models, we used the test data set to evaluate each model's accuracy. completing the project was a significant activity that enabled us to reach our goal and gave us much confidence. we have designed a machine learning model that helps predict user review sentiments. this system can report the accuracy of the different models, which is quite valuable. the accuracy results are given in table 1. the receiver operating characteristic (roc) curve is a probability curve that characterizes our binary classification based on the true-positive and false-positive rates. the area under the curve (auc) is a metric between 0 and 1; it is the region underneath the roc curve. the roc curve of ensemble classification using tfidf can be seen in figure 4. table 1.
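equations (3) through (6) can be computed directly from the four confusion-matrix counts. a short sketch with invented counts:

```python
# Sketch: the four evaluation metrics from equations (3)-(6), computed from
# confusion-matrix counts. The counts below are invented for illustration.
def precision(tp, fp):
    return tp / (tp + fp)                      # equation (3)

def recall(tp, fn):
    return tp / (tp + fn)                      # equation (4)

def f1_score(pr, rc):
    return 2 * pr * rc / (pr + rc)             # equation (5), harmonic mean

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)     # equation (6)

tp, tn, fp, fn = 80, 90, 10, 20
pr = precision(tp, fp)                         # 80 / 90
rc = recall(tp, fn)                            # 80 / 100 = 0.8
f1 = f1_score(pr, rc)
acc = accuracy(tp, tn, fp, fn)                 # 170 / 200 = 0.85
```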
the accuracy of count vectorizer and tfidf
model                       count vectorizer    tfidf
multinomial naïve bayes     0.924750            0.934750
bernoulli naïve bayes       0.819750            0.811625
logistic regression         0.952625            0.944750
knn                         0.898875            0.813750
svm                         0.749125            0.491625
ensemble classification     0.956750            0.960500
decision tree               0.938125            0.945500
fig. 4. roc curve of ensemble classification using tfidf the above curve is only for ensemble classification using the tfidf technique; we perform the same task for every model, with both tfidf and count vectorizer features. the result of the evaluation can be seen in table 2. from table 2, our model is quite successful as it produces accuracy of roughly 90% or more on the test data set with different models and techniques, but this does not mean it can consistently produce such highly accurate results. there is a possibility that it produces false results to some extent, and in some exceptional cases it can produce completely false results. predictions for positive reviews should lie in the range from 0.5 up to 1, and predictions for negative reviews in the range from 0 to 0.5, but as the figure shows, some positive reviews were predicted as negative, and some negative reviews were predicted as positive. so there are some deficiencies which need to be resolved in future works. the actual and predicted output can be seen in figure 5. fig. 5. actual and predicted output we live in a world of technology where artificial intelligence is a part of every system, making it more autonomous and efficient.
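the roc/auc evaluation described above can be reproduced on toy scores with scikit-learn's metrics; the labels and predicted scores below are invented.

```python
# Sketch: ROC curve and AUC for a binary classifier, as used to evaluate the
# models above. The true labels and predicted scores are invented examples.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities of the positive class

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score) # points along the curve
```

plotting tpr against fpr gives the curve shown in figure 4; a perfect classifier reaches auc = 1, random guessing sits near 0.5.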
nowadays, large ad networks and social or e-commerce businesses operate at a vast scale; they perform targeted marketing and store user data in a targeted manner by classifying user reviews as positive or negative with a system much like the one we have developed using machine learning models. we also found that combined or ensemble machine learning models can produce more accurate and reasonable results than a single machine learning model. at last, we compare all the models to check which model has the highest accuracy. our system is based on a gui model, which performs the tasks in the following manner. the gui model can be seen in figure 6. the comparison results of the classification of all models in the system can be seen in figure 7. table 2. the result evaluation
features        precision   recall   f1-score
count vector    0.93        0.92     0.92
tfidf           0.96        0.96     0.96
fig. 6. system overview fig. 7. models comparison iv. conclusion in conclusion, we used two feature methods, tfidf and count vector, with all the algorithms mentioned in the model part, including naïve bayes, svm, knn, decision tree, logistic regression, and ensemble classification. as can be seen from the results, we obtained the best accuracy on the test set with multinomial naïve bayes, ensemble classification, and logistic regression on both types of features. the same approach may be expanded to many more classification methods, and neural networks could be utilized to decide on the best classifier for opinion mining and sentiment analysis. one of the main features of this project, which remains a problem, is extracting specific issues from reviews. if this work is done in the future, it will benefit the suppliers or the company.
declarations author contribution all authors contributed equally as the primary contributors of this paper. all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations. references [1] g. taher, "e-commerce: advantages and limitations," int. j. acad. res. accounting, financ. manag. sci., vol. 11, no. 1, feb. 2021. [2] a. datta, "the digital turn in postcolonial urbanism: smart citizenship in the making of india's 100 smart cities," trans. inst. br. geogr., vol. 43, no. 3, pp. 405–419, sep. 2018. [3] a. s. rathor, a. agarwal, and p. dimri, "comparative study of machine learning approaches for amazon reviews," procedia comput. sci., vol. 132, pp. 1552–1561, 2018. [4] s. n. ahmad and m. laroche, "analyzing electronic word of mouth: a social commerce construct," int. j. inf. manage., vol. 37, no. 3, pp. 202–213, jun. 2017. [5] z. xiang, q. du, y. ma, and w. fan, "a comparative analysis of major online review platforms: implications for social media analytics in hospitality and tourism," tour. manag., vol. 58, pp. 51–65, feb. 2017. [6] j. wang, m. d. molina, and s. s. sundar, "when expert recommendation contradicts peer opinion: relative social influence of valence, group identity and artificial intelligence," comput. human behav., vol. 107, p. 106278, jun. 2020.
[7] zhu zhang, “weighing stars: aggregating online product reviews for intelligent e-commerce applications,” ieee intell. syst., vol. 23, no. 5, pp. 42–49, sep. 2008. [8] g. kaur and a. singla, “sentimental analysis of flipkart reviews using naïve bayes and decision tree algorithm,” int. j. adv. res. comput. eng. technol., vol. 5, no. 1, pp. 148–153, 2016. [9] u. r. babu and n. reddy, “sentiment analysis of reviews for e-shopping websites,” int. j. eng. comput. sci, vol. 6, no. 1, p. 19966, 2017. [10] s. khomsah and agus sasmito aribowo, “text-preprocessing model youtube comments in indonesian,” j. resti (rekayasa sist. dan teknol. informasi), vol. 4, no. 4, pp. 648–654, aug. 2020. [11] a. i. kadhim, “an evaluation of preprocessing techniques for text classification,” int. j. comput. sci. inf. secur., vol. 16, no. 6, pp. 22–32, 2018. [12] m. castelli, l. vanneschi, and á. r. largo, “supervised learning: classification,” encycl. bioinforma. comput. biol. abc bioinforma., vol. 1–3, no. 2, pp. 342–349, 2018. [13] m. nabipour, p. nayyeri, h. jabani, s. s., and a. mosavi, “predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data; a comparative analysis,” ieee access, vol. 8, pp. 150199–150212, 2020. [14] s. hota and s. pathak, “knn classifier based approach for multi-class sentiment analysis of twitter data,” int. j. eng. technol, vol. 7, no. 3, pp. 1372–1375, 2018. [15] d. a. ragab, m. sharkas, s. marshall, and j. ren, “breast cancer detection using deep convolutional neural networks and support vector machines,” peerj, vol. 7, p. e6201, jan. 2019. [16] o. sagi and l. rokach, “ensemble learning: a survey,” wires data min. knowl. discov., vol. 8, no. 4, jul. 2018. [17] y. xiao, j. wu, z. lin, and x. zhao, “a deep learning-based multi-model ensemble method for cancer prediction,” comput. methods programs biomed., vol. 153, pp. 1–9, jan. 2018. [18] b. choubin, e. moradi, m. golshan, j. adamowski, f. 
sajedi-hosseini, and a. mosavi, "an ensemble prediction of flood susceptibility using multivariate discriminant analysis, classification and regression trees, and support vector machines," sci. total environ., vol. 651, pp. 2087–2096, feb. 2019. [19] t. shaikhina, d. lowe, s. daga, d. briggs, r. higgins, and n. khovanova, "decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation," biomed. signal process. control, vol. 52, pp. 456–462, jul. 2019. [20] a. tripathy, a. agrawal, and s. k. rath, "classification of sentimental reviews using machine learning techniques," procedia comput. sci., vol. 57, pp. 821–829, 2015.
knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp. 105–116 eissn 2597-4637 https://doi.org/10.17977/um018v4i22021p105-116 ©2021 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) keds is sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by indonesian ministry of education, culture, research, and technology similarity identification of large-scale biomedical documents using cosine similarity and parallel computing merlinda wibowo a, 1, *, christoph quix b, 2, nur syahela hussien c, 3, herman yuliansyah d, 4, faisal dharma adhinata a, 5 a faculty of informatics, institut teknologi telkom purwokerto, jl. di panjaitan no.128, karangreja, purwokerto, indonesia b information systems & data science, hochschule niederrhein, adlerstraße 35, 47798 krefeld, germany c universiti kuala lumpur malaysian institute of information technology (unikl miit) 1016, jln sultan ismail, bandar wawasan, 50250 kuala lumpur, malaysia d informatics department, universitas ahmad dahlan jl.
kapas no.9, semaki, umbulharjo, yogyakarta, indonesia 1 merlinda@iitelkom-pwt.ac.id*; 2 christoph.quix@hs-niederrhein.de; 3 syahela@unikl.edu.my; 4 herman.yuliansyah@tif.uad.ac.id; 5 faisal@ittelkom-pwt.ac.id * corresponding author article info article history: submitted 7 december 2021 revised 25 december 2021 accepted 29 december 2021 published online 31 december 2021 keywords: biomedical documents; cosine similarity; keyword extraction; large scale; parallel computing; similarity identification a b s t r a c t document similarity computation is an important research topic in information retrieval, and it is a crucial issue for automatic document categorization. the similarity value is between 0 and 1; the closer the value is to 1, the more relevant the two documents are considered, and vice versa. however, the large scale of textual information has created the problem of finding the relevance level between documents. the relevance between mesh heading text in the pubmed documents is higher than the relevance of the abstract text in the pubmed documents. furthermore, parallel computing is implemented to speed up the large-scale document similarity identification process that is automatically calculated in the pubmed application. the execution time for mesh heading is 15.447 seconds, and the execution time for abstract is 74.191 seconds. the execution time for abstract is higher than for mesh heading because the abstract contains more words than the mesh heading. this study has successfully identified the similarity between large-scale biomedical documents of the pubmed corpus by implementing a cosine similarity algorithm. the result has shown that the cosine similarity of the mesh heading texts is higher than that of the abstract texts, presented in the form of a graph and table in the pubmed application. the cosine similarity is useful to measure the similarity between documents based on the tf*idf calculation result. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). i. introduction the number of articles added to the literature databases is proliferating. large amounts of textual data could be collected as a part of research, such as scientific literature, transcripts in the marketing and economic sectors, speeches in political discourse such as presidential campaigns and inauguration speeches, and meeting transcripts [1]. the pubmed dataset of medline has also grown enormously [2]. this large amount of textual information has created the problem of finding the relevance level between documents. besides, it has become challenging to manage and exploit them. this difficulty is closely related to the semantic aspect of these documents. a large amount of data brings about new opportunities for discovering new values, helps to gain an in-depth understanding of hidden values, and incurs new challenges, such as how to effectively organize and recognize the character of the data [3][4]. there are two main parts used for identifying pubmed documents to overcome the challenges: the abstract and the medical subject heading (mesh) heading. the mesh heading is the thesaurus for indexing, cataloging, and searching biomedical and health-related information. the relevance between mesh heading text in the pubmed documents is higher than the relevance of the abstract text in the pubmed documents. besides, the national library of medicine provides the mesh heading.
text mining in big data analytics is emerging as a powerful tool for harnessing the power of unstructured textual data by analyzing it to extract new knowledge and to identify significant patterns and correlations hidden in the data [1][5]. furthermore, quickly detecting similar documents becomes a fundamental problem as time goes on [6]. this difficulty is closely related to the semantic aspect of these documents. indeed, manual operation is possible and gives good results; however, a manual procedure is not feasible with a large corpus. therefore, document similarity computation is an important research topic in information retrieval, and it is a crucial issue for automatic document categorization. moreover, parallel computing (for big data) reduces the processing time and quickly detects similar documents [7][8]. thus, the parallelization of big data is emerging as an essential framework for large-scale parallel data applications. some research determines the similarity between texts using extracted keywords generated based on term frequency-inverse document frequency (tf*idf) [9][10][11][12]. this research focuses on detecting the similarity of documents. the method for calculating similarity is cosine similarity, and the results demonstrate that cosine similarity can measure the difference between text documents. keyword extraction is a vital algorithm to extract appropriate keywords so that one can easily choose which document to read and learn the relationship between documents, in the form of document retrieval, web page retrieval, document clustering, summarization, text mining, and others. it automatically identifies the terms that best describe the keywords of a document [2][9][13]. then, to obtain a suitable text relevance algorithm that demonstrates relevance calculation between two documents, many studies have implemented the cosine similarity [9][14][15].
the cosine similarity is useful to measure the similarity between documents based on the result of the keyword extraction. however, large-scale documents need extra execution time. therefore, parallel computing is implemented to enhance the computing speed by running several different tasks simultaneously on the same data [7][8]. parallel computing refers to breaking a larger problem into smaller, independent parts that can often be executed concurrently by multiple processors communicating via shared memory; the results are then combined upon completion as part of the overall algorithm. the main purpose of parallel computing is to increase the available computing power for faster application processing and troubleshooting. this research aims to develop a text mining application that adapts a text similarity algorithm for the biomedical domain to identify the relationship and relevance between large-scale documents. the implemented algorithms are run on a set of published articles from the biomedical domain, for which keyword annotations by experts exist, to compare with keywords automatically extracted by a parallel computing engine. ii. methods in this study, the similarity identification framework provided a guideline to conduct and organize the research properly. the framework illustrated in figure 1 shows the workflow divided into several research phases that describe the action plan step by step as a guide to complete this study. each phase produces an output to ensure that the research goals are achieved successfully. a. master data pubmed is an open-access search engine launched in january 1996 and made freely available online a year and a half later. it has become one of the most commonly used search tools for retrieving scientific data. an almost continuous increase in the searches performed has been observed in the biomedical and life sciences [2][16][17][18].
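the idea of splitting pairwise document comparisons across concurrent workers, as described above, can be sketched in miniature. this toy uses a python thread pool and a placeholder jaccard score rather than the paper's node.js engine and cosine over tf*idf weights; the documents are invented.

```python
# Hedged sketch: score all document pairs concurrently. A stand-in for the
# paper's parallel engine; the similarity function here is Jaccard overlap,
# a simpler placeholder than the paper's cosine-over-tf*idf.
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def jaccard(a, b):
    # Word-set overlap divided by word-set union (placeholder similarity).
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

docs = ["gene expression analysis", "gene mutation analysis", "protein folding"]
pairs = list(combinations(range(len(docs)), 2))   # every unordered pair

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda p: jaccard(docs[p[0]], docs[p[1]]), pairs))
```

each pair is independent, so the work divides cleanly across workers; a process pool would give the same result with true parallelism for cpu-bound scoring.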
pubmed is a search tool provided by the united states national library of medicine (nlm). medline, a central bibliographic database maintained by the nlm, is the most commonly used electronic database in applied, systematic reviews of biomedical research. it covers articles published from 1946 to the present, primarily in scholarly journals. this database of 24 million records is freely accessible via the pubmed website. the sample of pubmed documents is depicted in figure 2. figure 2(a) depicts a sample image of a pubmed document, and figure 2(b) shows the dataset represented in xml format. each xml file consists of different publication articles; there are more than three thousand articles in every xml file. the dataset will be stored in mongodb to support the parallel computing process for document similarity identification. mongodb is the most popular nosql database system [19]. mongodb is a cross-platform document-oriented database system. as a nosql database, mongodb avoids traditional table-based relational database structures in favor of json documents with dynamic schemas, making data integration easier and faster in some application types. data is stored in documents consisting of keys and values with variable type and size (not set beforehand). figure 3 illustrates the sample of the pubmed documents stored in mongodb. the data successfully inserted in mongodb will be used for the following process. this dataset will be in json format inside the mongodb collection with the same tags as the data in xml format. these tags can be used for reading the data in the following process. mongodb does not use queries to read data in the way a sql database does. fig. 1. the similarity identification framework fig. 2. (a) sample image of pubmed document, and (b) the image of a data set represented in xml format
b. documents similarity engine machine learning is a type of artificial intelligence that can learn from data without explicit instructions beyond what is programmed [4]. machine learning will assist in finding a solution with optimized performance by using sample data or previous experience to gain new insights, reveal new patterns, and produce more accurate results. this research implements machine learning in the documents similarity engine to identify the similarity between the large-scale documents known as master data by automatically extracting keywords using node.js. javascript is a programming language that originally runs on the client or browser side only; node.js exists to complete the javascript role, so it can also be applied as a programming language running on the server side, like php, ruby, or perl. with parallel computing, the process will reduce the processing time and quickly detect the relationship and relevance between large-scale documents. 1) preprocessing at this stage, the results obtained from the master data will automatically go through preprocessing. the tags used in this study are mesh heading and abstract. both of the tags can represent the entire contents of the published article used as testing data. this preprocessing will reduce the number of words by removing stopwords and changing words into their basic form (stemming) [9][20]. stopwords are words that are not a feature or unique word of a document, like conjunctions. taking stopwords into account in text transformation makes the whole text mining system depend on the language factor, which is a weakness of the stopword removal process. however, the stopword removal process is still used because it significantly reduces the system workload. by removing the stopwords of a text, the system will only consider the words regarded as important.
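the stopword-removal and stemming steps described above can be sketched as follows; the tiny stopword list and the crude suffix rules are illustrative stand-ins for a full stopword list and the porter stemmer used in the paper.

```python
# Hedged sketch of the preprocessing stage: lowercase, drop stopwords, and
# crudely stem. The stopword set and suffix rules below are simplified
# assumptions, not the full list / Porter stemmer the paper uses.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "in", "to"}

def crude_stem(word):
    # Strip a few common suffixes; a real system would use the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    return [crude_stem(w) for w in text.lower().split() if w not in STOPWORDS]

tokens = preprocess("The indexing of biomedical documents")
```

on this input, "the" and "of" are discarded and "indexing" / "documents" are reduced toward their stems, leaving a shorter token list for weighting.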
stemming reduces derived words to their word stem, base, or basic form. one of the most widely used stemming algorithms is the porter stemmer [9][20]. the process of treating words with the same stem as synonyms, e.g., in query expansion for search engines, is called conflation. the stem need not be identical to the morphological root of the word since, for purposes of conflation, it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. an example of the preprocessing is depicted in figure 4. 2) representative algorithm: tf*idf this phase uses the representative algorithm tf*idf. the tf*idf statistic, short for term frequency times inverse document frequency, can extract keywords from a document by considering both a single document and all documents from the corpus [2][21]. in terms of tf*idf, a word is a promising candidate for a keyword in a specific document if it shows up relatively often within the document and rarely in the rest of the corpus. fig. 3. sample of pubmed documents stored in mongodb the term frequency is given by the ratio of the number of term occurrences in the document to the number of occurrences of the most frequent word in that document. the formula of tf*idf is shown in equation (1). TF*IDF = (freq(P, D) / size(D)) · log2(N / df(P)) (1) where freq(P, D) is the number of times P occurs in document D, size(D) is the number of words in document D, df(P) is the number of documents containing P in the global corpus, and N is the size of the global corpus. 3) cosine similarity cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them [9][14][15].
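equation (1) can be evaluated directly from the four counts it names. a small sketch with invented counts:

```python
# Sketch of equation (1): tf*idf = freq(P, D) / size(D) * log2(N / df(P)).
# The counts below are invented for illustration.
import math

def tf_idf(freq_pd, size_d, df_p, n_docs):
    # freq_pd: occurrences of term P in document D
    # size_d:  number of words in document D
    # df_p:    number of corpus documents containing P
    # n_docs:  total number of documents in the corpus
    return (freq_pd / size_d) * math.log2(n_docs / df_p)

# a term occurring 3 times in a 100-word document, in 4 of 64 corpus documents
weight = tf_idf(freq_pd=3, size_d=100, df_p=4, n_docs=64)
```

the rarer the term is across the corpus (small df(P)), the larger the log2 factor and hence the keyword weight.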
cosine similarity measures the similarity between two vectors in a dimensional space, obtained from the cosine of the angle between the two vectors being compared, because the cosine of 0° is 1 and less than 1 for other angle values. the two vectors are most similar when the value of cosine similarity is 1. cosine similarity is used in positive space, where the result is limited to values between 0 and 1. if the value is 1, the documents are similar; if the result is 0, they are said to be dissimilar [9][14][15]. this limit applies in any number of dimensions; therefore, cosine similarity is most often used in high-dimensional positive spaces. for example, in information retrieval, each term is assumed to be a different dimension, and a document is marked with a vector where each dimension corresponds to a term and how many times it appears. equation (2) depicts the formula of cosine similarity. similarity = cos(θ) = (A · B) / (||A|| ||B||) = (Σ_{i=1}^{n} Ai Bi) / (√(Σ_{i=1}^{n} Ai²) √(Σ_{i=1}^{n} Bi²)) (2) where Ai and Bi are the components of vectors A and B: A holds the weight of each feature in vector A, and B the weight of each feature in vector B. if this is associated with information retrieval, then A is the weight of each term in document A, and B is the weight of each term in document B. in this study, cosine similarity is used because large-scale pubmed documents are high-dimensional data: in large-scale pubmed documents that contain many published articles, each document consists of many different tags. measurement of similarity can be done by comparing document 1 with document 2, and then the system will calculate the similarity value. the numerator sums the products of each pair of term weights Ai and Bi; in the denominator, all term weights of document A are squared and summed, and likewise for document B, before taking the square roots. c.
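equation (2) can be implemented in a few lines; a sketch on invented weight vectors:

```python
# Sketch of equation (2): cosine similarity between two weight vectors,
# cos(theta) = (A . B) / (||A|| * ||B||). The vectors below are invented.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))            # numerator: A . B
    norm_a = math.sqrt(sum(x * x for x in a))         # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))         # ||B||
    return dot / (norm_a * norm_b)

sim_same = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
sim_orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
```

parallel term-weight vectors score 1 (maximally similar) and orthogonal ones score 0, matching the interpretation in the text.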
similarity identification result in this stage, the identified document similarities are represented in a graph, a statistical table, and a web application. visualizing the data with a graph and a statistical table makes the results easier to present and understand [4][22]. meanwhile, web application development can enhance the end-user experience, support real-time data collection, and provide custom content [22]. this study shows the graph and statistical table in the web application after the document similarity engine has finished. figure 5 depicts the pubmed web application interface. documents are uploaded to the application, which automatically calculates the similarity between biomedical documents with parallel computing, reducing the processing time and quickly detecting the relationship and relevance between large-scale documents.
fig. 4. preprocessing
the results are then presented as a graph and a table that make the calculation results easy to read. iii. results and discussions the pubmed application was developed as a document similarity identification engine, an intelligent application that automatically calculates the similarity between biomedical documents and then visualizes the identification result as a graph and a table. the calculation process uses parallel computing, which reduces the processing time and quickly detects the relationship and relevance between large-scale documents. the first process stores the master data in mongodb. then punctuation is removed, the text is converted to lower case, stop words are removed, and the base words are extracted using the porter stemming algorithm. two tags were used in this study, abstract and mesh heading; these tags are used to read the data for the next process.
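the preprocessing steps just described (punctuation and digit removal, lowercasing, stop-word removal, stemming) can be sketched in a self-contained form. the stop-word list and the suffix-stripping "stemmer" below are toy stand-ins for the conjunction list and the porter stemmer (from the natural library) that the application actually uses:

```javascript
// toy stand-in for the conjunction/stop-word list used by the application
const stopWords = ['the', 'and', 'of', 'in', 'a', 'is'];

// crude suffix stripper standing in for the porter stemmer
function toyStem(word) {
  return word.replace(/(ing|ed|s)$/, '');
}

function preprocess(text) {
  return text
    .replace(/[^a-zA-Z\s]/g, ' ') // strip punctuation and digits
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w !== '' && stopWords.indexOf(w) < 0)
    .map(toyStem);
}

console.log(preprocess('The genes, tested in 2021, encoded proteins.'));
// → [ 'gene', 'test', 'encod', 'protein' ]
```

the real pipeline keeps digit-containing tokens unstemmed (to preserve chemical formulations and medicine names); that special case is omitted here for brevity.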
figure 6 depicts a sample abstract dataset from pubmed publications captured from mongodb. the captured dataset is then transformed into base words. the base words are biomedical words, including chemical formulations, medicine names, and others; therefore, this needs to be considered.
fig. 5. pubmed application
fig. 6. sample captured abstract dataset
the program listing that extracts the keywords can be seen in the preprocessing program below. the input of the preprocessing program is all abstract data, and the output is the string of each word from the abstract. the first step of preprocessing removes all conjunctions and punctuation in the abstract and transforms the letters into lowercase. the next step stems the words into their roots.

preprocessing program
input: abs_all
output: all_string
initialization: var abs_all, all_string, removed_conjunction, reg, rm_punctuation
removed_conjunction ← abs_all.replace(regex_rm_conjunction, " ")
all_string ← removed_conjunction
    .replace(/(\s)?\d\s+/g, ' ')
    .replace(/\n+/g, ' ')
    .split(" ")
    .filter((d) => d !== '' && conjunction_list.indexOf(d.toLowerCase()) < 0)
    .map((d) => {
        reg ← new RegExp(/\d/, 'gi')
        rm_punctuation ← d.replace(regex_rm_punctuation, '')
        return reg.test(d) ? d : stemmer.stem(rm_punctuation)
    })

the sample of extracted keyword results is depicted in figure 7.
fig. 7. sample of extracted keyword results
afterward, keyword weighting is carried out to calculate the frequency of occurrence of each word of the testing document in each document in the dataset. this phase applies the tf*idf algorithm, which can extract keywords from a document by considering both a single document and all documents in the corpus.
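the tf*idf weighting of equation (1) can be computed directly. a minimal sketch; the tiny corpus below is invented for illustration and is not from the pubmed data:

```javascript
// tf*idf per equation (1): tf*idf = (freq(p, d) / size(d)) * log2(n / df(p))
function tfIdf(term, doc, corpus) {
  const freq = doc.filter((w) => w === term).length;          // freq(p, d)
  const size = doc.length;                                    // size(d)
  const df = corpus.filter((d) => d.includes(term)).length;   // df(p)
  if (freq === 0 || df === 0) return 0;
  return (freq / size) * Math.log2(corpus.length / df);       // n = corpus.length
}

// invented toy corpus of pre-tokenized "documents"
const corpus = [
  ['gene', 'expression', 'analysis'],
  ['protein', 'structure'],
  ['gene', 'therapy', 'trial'],
  ['clinical', 'trial', 'data'],
];

console.log(tfIdf('gene', corpus[0], corpus)); // (1/3) * log2(4/2) ≈ 0.333
```

a term that occurs in every document gets weight 0 (log2(n/n) = 0), which is exactly the "often here, rare elsewhere" behavior described above.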
finally, the tf*idf calculation result is used to calculate the similarity between the testing documents and the pubmed documents using the cosine similarity algorithm. the program listing that obtains the term frequency values can be seen in the tfidf program below.

tfidf program
input: all_string
output: tf
initialization: var all_string, TfIdf, tfidf, tf
TfIdf ← natural.TfIdf
tfidf ← new TfIdf()
abs_all.forEach((data) => { tfidf.addDocument(data) })
all_string.forEach((as) => {
    tfidf.tfidfs(as, function(i, measure) { tf.push(measure) })
})

the sample of tf*idf results stored in mongodb is captured in figure 8. cosine similarity is used in positive space, where the outcome is bounded between 0 and 1. the closer the value is to 1, the more related the two documents are, and vice versa.
fig. 8. sample of captured tf*idf results
from the similarity process, cosine similarity produces similarity values between one document and the other documents. the document comparison focuses on the abstract and mesh heading tags of the pubmed publication documents as the testing data. the listing code that measures the cosine similarity between documents can be seen in the cosine_similarity program below.

cosine_similarity program
input: tf
output: cos_sim
initialization: var tf, l1, l2, tf1, tf2, sum, a, b, cos_sim, len_avg, tf_sum
l1 ← tf[item.first].length
l2 ← tf[item.second].length
tf1 ← tf[item.first]
tf2 ← tf[item.second]
if (l1 > l2) {
    len_avg ← l1 - l2
    for (var j = 0; j < len_avg; j++) { tf2.push({ term: '-', tfdif: 0 }) }  // pad the shorter vector (step reconstructed from context)
}
tf1.forEach((item) => {
    a ← tf2.filter((d) => item.term == d.term && item.term != '-' && d.term != '-')
    if (a.length > 0) {
        b ← item.tfdif * a[0].tfdif
        tf_sum.push(b)
    }
})
sum ← tf_sum.length > 0 ? tf_sum.reduce((acc, cur) => acc + cur) : 0
a ← tf1.map((data) => Math.pow(data.tfdif, 2)).reduce((acc, cur) => acc + cur)
b ← tf2.map((data) => Math.pow(data.tfdif, 2)).reduce((acc, cur) => acc + cur)
cos_sim ← sum / (Math.sqrt(a) * Math.sqrt(b))

the cosine similarity results shown in figure 9 illustrate sample cosine similarities between an abstract text and the abstracts of other publications, and between a mesh heading text and the mesh headings of other publications. for example, the cosine similarity between the mesh headings of document 2 and document 1 of the published articles in the pubmed documents is 0.0045, i.e., 0.45%. figure 10 illustrates the cosine similarity measurements between documents, in this case using the abstract and mesh heading text in each pubmed document. the graph of the cosine similarity results shows that the cosine similarity of the mesh heading texts is higher than that of the abstract texts. the results show that the relevance between mesh heading texts in the pubmed documents is higher than the relevance between abstract texts. hence, the relationship and correlation between published articles in the pubmed documents can be derived from the mesh heading text. the number of words and terms in the abstract can affect the text similarity results. besides, the mesh heading tag can be used for subsequent data processing, such as classifying or clustering the pubmed documents.
fig. 9. cosine similarity results between biomedical documents
fig. 10. visualization of comparison of cosine similarity results between documents
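the core of the cosine_similarity program, equation (2) over two aligned weight vectors, can be sketched minimally as follows (example vectors are invented):

```javascript
// cosine similarity per equation (2):
// cos(theta) = sum(ai*bi) / (sqrt(sum(ai^2)) * sqrt(sum(bi^2)))
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];     // accumulate ai*bi
    normA += a[i] * a[i];   // accumulate ai^2
    normB += b[i] * b[i];   // accumulate bi^2
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 2, 0], [1, 2, 0])); // identical vectors → 1
console.log(cosineSimilarity([1, 0], [0, 1]));       // orthogonal vectors → 0
```

the listing above additionally has to align two term lists of different lengths before this computation; the sketch assumes the vectors are already aligned over the same vocabulary.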
both visualizations of the similarity calculation results, depicted in figure 9 and figure 10 and known as similarity identification results, make the comparison easier to present and understand. these similarity identification results are shown in the pubmed application. in addition, the results are produced by the parallel computing engine in the pubmed application, which reduces the processing time and quickly detects the relationship and relevance between large-scale biomedical documents. meanwhile, figure 11 shows the execution time of the similarity engine application. the execution time for the mesh heading is 15.447 seconds, and the execution time for the abstract is 74.191 seconds; the abstract takes longer than the mesh heading because the abstract contains more words. iv. conclusion the document similarity identification application has successfully identified the similarity between large-scale pubmed documents, known as biomedical documents. this study implemented cosine similarity and parallel computing as the document similarity engine, which processes the documents faster.
the execution time for the mesh heading is 15.447 seconds, and the execution time for the abstract is 74.191 seconds. the abstract runtime is higher than the mesh heading runtime because the abstract contains more words than the mesh heading. therefore, using the abstract and mesh heading tags can represent the similarity between documents. the results show that the cosine similarity of the mesh heading texts is higher than that of the abstract texts, and that the relevance between mesh heading texts in the pubmed documents is higher than the relevance between abstract texts. on the other hand, the number of words and terms in the abstract can affect the percentage of text similarity. in the future, the mesh heading and abstract tags can be used for further data processing, such as classification or clustering of the datasets. declarations author contribution all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher’s note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.
fig. 11. execution time of document similarity application
references [1] h. hassani, c. beneki, s. unger, and m. t. mazinani, “text mining in big data analytics,” big data cogn. comput., vol. 4, pp. 1–34, 2020. [2] r.
islamaj et al., “pubmed text similarity model and its application to curation efforts in the conserved domain database,” database, vol. 1, pp. 1–13, 2019. [3] s. f. wamba, a. gunasekaran, s. akter, s. j. ren, r. dubey, and s. j. childe, “big data analytics and firm performance: effects of dynamic capabilities,” j. bus. res., vol. 70, pp. 356–365, 2016. [4] m. wibowo, f. noviyanto, s. sulaiman, and s. m. shamsuddin, “machine learning technique for enhancing classification performance in data summarization using rough set and genetic algorithm,” int. j. sci. technol. res., vol. 8, no. 10, pp. 1108–1119, 2019. [5] r. m. packiam and v. s. j. prakash, “an empirical study on text analytics in big data,” 2016. [6] m. erritali, a. beni-hssane, m. birjali, and y. madani, “an approach of semantic similarity measure between documents based on big data,” int. j. electr. comput. eng., vol. 6, no. 5, pp. 2454–2461, 2016. [7] l. a. rahim, k. mohan, k. id, and s. bahattacharjee, “framework for parallelisation on big data,” plos one, vol. 14, no. 5, pp. 1–19, 2019. [8] b. parhami, “parallel processing with big data,” pp. 1–7, 2018. [9] r. darmawan and r. s. wahono, “hybrid keyword extraction algorithm and cosine similarity for improving sentences cohesion in text summarization,” j. intell. syst., vol. 1, no. 2, pp. 109–114, 2015. [10] s. w. iriananda, m. a. muslim, and h. s. dachlan, “identifikasi kemiripan teks menggunakan class indexing based dan cosine similarity untuk klasifikasi dokumen pengaduan,” matics, vol. 10, no. 2, p. 30, 2019. [11] d. a. r. ariantini, a. s. m. lumenta, and a. jacobus, “pengukuran kemiripan dokumen teks bahasa indonesia menggunakan metode cosine similarity,” j. tek. inform., vol. 9, no. 1, pp. 1–8, 2016. [12] m. z. naf’an, a. burhanuddin, and a. riyani, “penerapan cosine similarity dan pembobotan tf-idf untuk mendeteksi kemiripan dokumen,” j. linguist. komputasional, vol. 2, no. 1, pp. 23–27, 2019. [13] j. wang and y.
dong, “measurement of text similarity: a survey,” inf., vol. 11, no. 9, pp. 1–17, 2020. [14] d. kurniadi, s. f. c. haviana, and a. novianto, “implementasi algoritma cosine similarity pada sistem arsip dokumen di universitas islam sultan agung,” j. transform., vol. 17, no. 2, p. 124, 2020. [15] d. gunawan, c. a. sembiring, and m. a. budiman, “the implementation of cosine similarity to calculate text relevance between two documents,” j. phys. conf. ser., vol. 978, no. 1, 2018. [16] j. bian, m. amin, s. jonnalagadda, g. luo, and g. del, “automatic identification of high impact articles in pubmed to support clinical decision making,” j. biomed. inform., vol. 73, pp. 95–103, 2017. [17] c. w. halladay, t. a. trikalinos, i. t. schmid, c. h. schmid, and i. j. dahabreh, “using data sources beyond pubmed has a modest impact on the results of systematic reviews of therapeutic interventions,” j. clin. epidemiol., vol. 68, no. 9, pp. 1076–1084, 2015. [18] k. z. vardakas, g. tsopanakis, a. poulopoulou, and m. e. falagas, “an analysis of factors contributing to pubmed’s growth,” j. informetr., vol. 9, no. 3, pp. 592–617, 2015. [19] mongodb, “mongodb,” 2017, https://www.mongodb.com/. [20] p. dwi nurfadila, a. p. wibawa, i. a. e. zaeni, and a. nafalski, “journal classification using cosine similarity method on title and abstract with frequency-based stopword removal,” int. j. artif. intell. res., vol. 3, no. 2, 2019. [21] n. ghasemi and s. momtazi, “neural text similarity of user reviews for improving collaborative filtering recommender systems,” electron. commer. res. appl., vol. 45, p. 101019, 2021. [22] m. wibowo, s. sulaiman, s. mariyam, and h. hashim, “mobile analytics database summarization using rough set,” int. j. innov. comput., vol. 7, no. 2, pp. 6–12, 2017.
knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 1, december 2022, pp. 78–86 eissn 2597-4637 https://doi.org/10.17977/um018v5i12022p78-86 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) human facial expressions identification using convolutional neural network with vgg16 architecture luther alexander latumakulita a,1,*, sandy laurentius lumintang a,2, deiby tineke salaki a,3, steven r. sentinuwo a,4, alwin melkie sambul a,5, noorul islam b,6 a sam ratulangi university, jalan kampus, manado 95115, indonesia b kanpur institute of technology, a-1, upsidc, rooma industrial area, kanpur, 201008, india 1 latumakulitala@unsrat.ac.id; 2 sandy.laurentius@gmail.com; 3 deibyts.mat@unsrat.ac.id, 4 steven@unsrat.ac.id, 5 asambul@unsrat.ac.id, 6 noorul.islam3101@gmail.com * latumakulitala@unsrat.ac.id i. introduction humans can produce many different facial expressions [1], but some distinctive facial configurations are associated with specific emotions [2], regardless of gender [3], age [4], cultural background [5], and socialization history [6]. facial expressions account for 55% of message delivery, while language and voice account for 7% and 38%, respectively [7]. universally, six basic expressions have been put forward in ekman and friesen's research, namely anger, disgust, fear, happiness, sadness, and surprise.
article info article history: received 30 may 2022 revised 30 june 2022 accepted 14 october 2022 published online 7 november 2022 a b s t r a c t the human facial expression identification system is essential in developing human interaction and technology. the development of artificial intelligence for monitoring human emotions can be helpful in the workplace. commonly, there are six basic human expressions, namely anger, disgust, fear, happiness, sadness, and surprise, that the system can identify. this study aims to create a facial expression identification system based on basic human expressions using the convolutional neural network (cnn) with a 16-layer vgg architecture. two thousand one hundred thirty-seven facial expression images were selected from the fer2013, jaffe, and mug datasets. by implementing image augmentation, setting the network parameters to an epoch of 100 and a learning rate of 0.0001, and applying 5-fold cross validation, this system shows performance with an average accuracy of 84%. results show that the model is suitable for identifying the basic facial expressions of humans. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). keywords: cnn; deep learning; facial expressions identification; vgg16
along with the development of technology, the interaction between humans and technology plays a vital role in daily activities. artificial intelligence can facilitate work and help humans make decisions based on the results of its analysis. one example application of this technology is identifying human facial expressions. some services currently use scoring systems in which a rating is selected manually on a computer display, but these systems are considered inadequate for showing expressions of customer satisfaction [8]. in addition, facial expression identification systems can be developed and applied in various fields, such as psychological patient emotion detection, lie detection, security systems with face recognition, entertainment recommendations according to emotions (movies, music, tourist attractions, shopping products), robot development, monitoring systems for an employee's facial expressions when interacting with customers, and so on. the development of facial expression identification technology can use deep learning methods that make a computer learn from the depths of an image and identify it. one of the deep learning methods currently most significant in image recognition is the convolutional neural network (cnn) with a 16-layer visual geometry group (vgg) architecture. sang et al. [9] showed an average test accuracy of 71.9% using the deep cnn bkvgg12. gultom et al. [10] showed an average accuracy of 89±7% using vgg16 transfer learning for batik classification. porcu et al. [11] found that applying image augmentation can significantly improve test accuracy compared to previous studies. caroppo et al. [12] found that the vgg16 deep learning architecture showed the highest accuracy for facial expression identification compared to other architectures. cnn is a deep feedforward artificial neural network (ann) widely applied to computer vision, also known as convnet, with an architecture built from nodes or neurons connected in layers [13]. in general, the layers of a cnn are divided into feature extraction layers and classification layers. in this study, the cnn-based deep learning method is used to identify the six basic expressions of the human face with the vgg16 architecture to obtain good accuracy.
image augmentation will be applied to the image data, and then the data will be trained using the k-fold cross validation method, which produces a confusion matrix as its evaluation. in the future, the intelligent model proposed in this research can be implemented in a control system that needs human expression recognition, such as an automated gate or a surveillance system. ii. methods in this study, the identification of basic expressions of the human face consisted of several stages, as seen in figure 1.
fig. 1. research flowchart
the selecting dataset stage is the process of collecting the dataset. the collected secondary data were screened one by one, checking the accuracy of each expression against the physical descriptions of basic human facial expressions and the image clarity (no watermarks or other objects hindering facial clarity). human facial expression image data were selected from open-source datasets (fer2013, jaffe, and mug). the facial expressions to be trained and identified are the six basic human expressions, based on the following physical descriptions or criteria [14]. anger: brows wrinkled, eyes wide, lips tightened and pressed together. disgust: eyebrows fall, eyes narrow, nose wrinkles, lips split, jaw drops. fear: eyebrows raised and pulled together, upper eyelids raised, lower eyelids tense, lips parted and stretched. happiness: eyes narrowed with wrinkles around them, cheeks raised, lips pulled back, showing teeth in a smile. sadness: eyebrows knitted, eyes slightly closed, corners of the lips depressed, and lower lip raised. surprise: eyebrows raised, upper eyelids raised, lips parted, jaw dropped. the result for each label is shown in table 1.
in table 1, the selected human facial expression image data amount to 2137 images: 325 anger, 216 disgust, 197 fear, 758 happiness, 260 sadness, and 381 surprise images. the data input stage retrieves the data from directories, labels each image accordingly, and inputs the data at size 224 × 224 with channel 3 (rgb). before the training stage, the data are first divided using k-fold cross validation with k = 5, resulting in training and testing data with a ratio of 80:20, which means the model trains on the training dataset 5 times with different training data for each fold. at the preprocessing stage, the training data are normalized and image augmentation is applied to each image. data normalization is a linear scaling technique that changes the pixel scale of an image to between 0 and 1; each input image is divided by 255 (the rgb range is 0-255). paper [15] suggests that image augmentation increases the size and diversity of existing training pools without manually collecting new data. this process generates additional training data from existing examples by applying random transformations that produce plausible-looking images. image augmentation helps the computer learn more about the trained data from various points of view and multiplies the data available for training. image augmentation is applied as a series of random preprocessing transformations of the existing data, such as flipping horizontally and vertically, tilting, cropping, zooming in and out, and rotating. the image augmentation applied is shown in table 2.
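the 5-fold split described above can be sketched as follows. this is an illustrative sketch only: it uses contiguous index slices (in practice the data would be shuffled first), and its fold sizes (428 test / 1709 train for fold 1 over 2137 images) differ slightly from the per-fold counts the study reports:

```javascript
// split n sample indices into k folds; each fold holds out ~1/k of the
// indices for testing and trains on the rest (~80:20 for k = 5).
function kFoldSplits(n, k) {
  const indices = Array.from({ length: n }, (_, i) => i);
  const foldSize = Math.ceil(n / k);
  const folds = [];
  for (let f = 0; f < k; f++) {
    const test = indices.slice(f * foldSize, (f + 1) * foldSize);
    const train = indices.filter((i) => i < f * foldSize || i >= (f + 1) * foldSize);
    folds.push({ train, test });
  }
  return folds;
}

const splits = kFoldSplits(2137, 5); // 2137 selected images, k = 5
console.log(splits.length);          // 5 folds
console.log(splits[0].test.length);  // 428 test indices in fold 1
console.log(splits[0].train.length); // 1709 training indices in fold 1
```

every index appears in exactly one test set across the five folds, so each image is used for evaluation exactly once.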
the preprocessed training dataset is entered into the vgg16 architecture model and evaluated using the testing dataset divided beforehand for each fold. vgg16 is one of the cnn architectures, put forward by simonyan and zisserman in the imagenet large scale visual recognition challenge, where it achieved a top-5 test accuracy of 92.7% [16]. vgg16 improves over alexnet by replacing large kernel-sized filters (11 and 5 in the first and second convolution layers, respectively) with multiple 3x3 kernel-sized filters one after another [17]. the vgg16 architecture can be seen in figure 2.

table 1. image data of facial expressions
no  facial expressions  data
1   anger               325
2   disgust             216
3   fear                197
4   happiness           758
5   sadness             260
6   surprise            381

table 2. the image augmentation applied
parameter           value    detail
rotation range      40       the image is rotated at an angle of 0.2 degrees
width shift range   0.25     the image's width is shifted with an angle of 0.25 degrees
height shift range  0.25     the height of the image is shifted at an angle of 0.25 degrees
shear range         0.20     the image is shifted clockwise by 0.2 degrees
zoom range          0.2      the image is enlarged by 1 + 0.2 of the area of the image
horizontal flip     true     the image is rotated horizontally
fill mode           nearest  pixels lost during changes are filled with the nearest pixel value to maintain image quality

fig. 2. vgg16 architecture
figure 2 shows 16 layers: 13 convolution layers and 3 fully connected layers. the 13 convolution layers use a filter of 64 for the first two layers, a filter of 128 for the following two layers, a filter of 256 for the subsequent three layers, a filter of 512 for the following three layers, and a filter of 512 for the last three layers, with a max-pooling layer at each filter change, five in total.
the image is inputted at a size of 224 x 224 x 3, which indicates that the model reads the image at a size of 224 x 224 with channel 3, namely rgb (red, green, blue). the max-pooling layer uses a size of 2 x 2 and stride 2, so it reduces the image size, originally 224 x 224, producing feature maps of 112 x 112 (filter 128), then 56 x 56 (filter 256), then 28 x 28 (filter 512), then 14 x 14 (filter 512), and finally 7 x 7. the pooling layer is a way to reduce the matrix size to speed up computation and easily control overfitting. one way to use this pooling layer is to apply max-pooling, a function that selects the maximum value of a window region, which is then represented as a new pixel [18]. the three fully connected layers consist of two dense layers of 4096 units (with two dropout (0.5) layers) and one softmax layer. dropout temporarily eliminates neurons in the network's hidden or visible layers [19]. the obtained model is then assessed to determine whether the accuracy is sufficient. if the accuracy is still lacking, changes are made to the network parameters (epoch, learning rate, and batch size) until the accuracy obtained is satisfactory. at the identification stage, the primary data obtained are tested using the model with the best accuracy. the entered data are normalized and resized to 224x224 with channel 3 (rgb). the last stage is the prediction resulting from the identification stage, evaluated using the confusion matrix. a confusion matrix is a table frequently used to evaluate the effectiveness of a classification model [20]. in order to assess the accuracy of a model's predictions, the confusion matrix compares the predicted labels against the actual labels. with the four resulting counts (true positives, false positives, false negatives, and true negatives), we can construct various model performance measures, including accuracy, precision, recall, and f1-score [21].
these measures give a more detailed picture of the model's performance than the model's overall accuracy rate alone. overall, a confusion matrix is a valuable tool for evaluating the performance of a classification model and finding potential areas for improvement. iii. results and discussion testing is done by applying 5-fold cross validation, where the data is divided into five different sets, which are then carried out the testing process five times. the testing process is done by setting the epoch value = 100, learning rate = 0.0001, and batch size = 32. each fold of the model obtained performed performance testing by applying a confusion matrix to data testing. here are the results of testing accuracy in each fold. 82 l.a. latumakulita et al. / knowledge engineering and data science 2022, 5 (1): 78–86 fig. 3. plot training fold 1 figure 3 shows the results of plot accuracy and loss on fold 1 with a total of 1794 data trains, where the accuracy is increasing close to the number 1.0, with the highest value reaching 92%. the loss is getting closer to 0, with the lowest value of 0.224. based on table 3, the red labeled numbers are the data predicted to be correct based on the test results of the fold one training model. the confusion matrix from a total of 343 data testing has an accuracy of 87.7%, a precision of 88%, and an f1 score of 87.6%. fig 4. plot training fold 2 figure 4 shows the results of plot accuracy and loss on fold 2 with a total of 1795 data trains, where the accuracy is increasing close to the number 1.0, with the highest value reaching 94%. the loss is getting closer to 0, with the lowest value of 0.160. table 3. confusion matrix fold 1 fold 1 anger disgust far happiness sadness surprise anger 49 0 0 0 4 0 disgust 3 31 0 3 0 0 fear 2 0 17 2 5 1 happiness 0 1 1 111 0 0 sadness 9 0 0 4 35 0 surprise 0 0 5 2 0 58 accuracy 87.7% precision 88% f1 score 87.6% table 4. 
Confusion matrix, fold 2 (rows = actual, columns = predicted)

             anger  disgust  fear  happiness  sadness  surprise
  anger         42        0     2          0        2         0
  disgust        0       33     0          1        2         0
  fear           0        3    25          1        6         2
  happiness      0        0     0        133        0         0
  sadness        2        0     0          0       33         0
  surprise       1        0     5          0        0        49

  Accuracy 92.1%   Precision 92.2%   F1 score 92.0%

Based on Table 4, the diagonal entries are the correctly predicted data from the fold 2 model. From a total of 342 test samples, the confusion matrix gives an accuracy of 92.1%, a precision of 92.2%, and an F1 score of 92.0%.

Fig. 5. Plot of training, fold 3

Figure 5 shows the accuracy and loss plots for fold 3 with a total of 1795 training samples: the accuracy approaches 1.0, with the highest value reaching only 61%, and the loss approaches 0, with a lowest value of 0.892. Based on Table 5, the diagonal entries are the correctly predicted data from the fold 3 model. From a total of 342 test samples, the confusion matrix gives an accuracy of 59.6%, a precision of 69.3%, and an F1 score of 56.9%.

Fig. 6. Plot of training, fold 4

Figure 6 shows the accuracy and loss plots for fold 4 with a total of 1795 training samples: the accuracy approaches 1.0, with the highest value reaching 92%, and the loss approaches 0, with a lowest value of 0.202.

Table 5. Confusion matrix, fold 3 (rows = actual, columns = predicted)

             anger  disgust  fear  happiness  sadness  surprise
  anger         25        1     0         26        2         0
  disgust        6        7     1         19        0         0
  fear           3        1    14         18        0         1
  happiness      4        0     0        116        0         0
  sadness        5        0     2         22       13         0
  surprise       5        0     0         22        0        29

  Accuracy 59.6%   Precision 69.3%   F1 score 56.9%

Based on Table 6, the diagonal entries are the correctly predicted data from the fold 4 model.
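As a cross-check of the per-fold figures reported in this section, accuracy can be recomputed directly from any of the confusion matrices. A minimal sketch using the fold 2 matrix from Table 4, assuming rows are actual labels and columns are predictions:

```python
# Fold-2 confusion matrix from Table 4 (rows = actual, columns = predicted).
cm = [
    [42,  0,  2,   0,  2,  0],   # anger
    [ 0, 33,  0,   1,  2,  0],   # disgust
    [ 0,  3, 25,   1,  6,  2],   # fear
    [ 0,  0,  0, 133,  0,  0],   # happiness
    [ 2,  0,  0,   0, 33,  0],   # sadness
    [ 1,  0,  5,   0,  0, 49],   # surprise
]
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
total = sum(sum(row) for row in cm)              # all test samples
accuracy = 100 * correct / total
print(correct, total, round(accuracy, 1))  # 315 342 92.1
```

The diagonal sums to 315 of 342 test samples, reproducing the reported fold 2 accuracy of 92.1%.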
From a total of 342 test samples, the fold 4 confusion matrix gives an accuracy of 90.6%, a precision of 91.6%, and an F1 score of 90.9%.

Fig. 7. Plot of training, fold 5

Figure 7 shows the accuracy and loss plots for fold 5 with a total of 1795 training samples: the accuracy approaches 1.0, with the highest value reaching 93%, and the loss approaches 0, with a lowest value of 0.187. Based on Table 7, the diagonal entries are the correctly predicted data from the fold 5 model. From a total of 342 test samples, the confusion matrix gives an accuracy of 92.1%, a precision of 92.4%, and an F1 score of 92.1%.

The accuracy of each fold is summarized in Table 8. Across folds 1 to 5 the average accuracy is 84.4%; the highest accuracy, 92.1%, is reached in the second and fifth folds, and the lowest, 59.6%, in the third fold. The best model, to be used for the identification stage, is therefore the fold 2 model. Figure 8 plots the training and testing accuracies for all folds.

Table 6. Confusion matrix, fold 4 (rows = actual, columns = predicted)

             anger  disgust  fear  happiness  sadness  surprise
  anger         50        0     2          0        8         0
  disgust        1       36     1          0        2         0
  fear           0        1    20          2        0         0
  happiness      1        2     2        118        1         0
  sadness        3        0     0          1       30         0
  surprise       0        0     4          1        0        56

  Accuracy 90.6%   Precision 91.6%   F1 score 90.9%

Table 7. Confusion matrix, fold 5 (rows = actual, columns = predicted)

             anger  disgust  fear  happiness  sadness  surprise
  anger         42        1     0          0        4         0
  disgust        1       24     0          0        2         0
  fear           0        0    25          1        5         3
  happiness      0        2     0        113        1         1
  sadness        2        0     0          1       46         0
  surprise       0        0     3          0        0        65

  Accuracy 92.1%   Precision 92.4%   F1 score 92.1%

Table 8. Training accuracy per fold

  Fold      1      2      3      4      5
  Accuracy  87.7%  92.1%  59.6%  90.6%  92.1%
  Average   84.4%

Fig. 8.
Accuracies comparison between training and testing processes

From Table 9, identification was carried out on the primary data: of 36 samples, 31 were predicted correctly and five incorrectly, giving a confusion-matrix accuracy of 86.1%.

Table 9. Identification results

  Indicator  anger  disgust  fear  happiness  sadness  surprise
  True           6        5     4          6        6         4
  False          0        1     2          0        0         2

  Accuracy 86.1%

IV. Conclusion

CNN with the VGG16 architecture managed to identify the primary expressions of the human face with an epoch count of 100, a learning rate of 0.0001, and a batch size of 32, resulting in an average accuracy of 84.4% across the first to fifth folds and an average test-data accuracy of 86.1%, so it can be said that the model built is stable and good enough to use. In the future, we will develop a system to control automatic gates at Sam Ratulangi University: the gate will open automatically after receiving the best smile from people who want to enter the university area.

Declarations

Author contribution: All authors contributed equally as the main contributor of this paper. All authors read and approved the final paper.

Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest: The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information: Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] d. l. z. astuti, s. samsuryadi, and d. p.
rini, “real-time classification of facial expressions using a principal component analysis and convolutional neural network,” sinergi, vol. 23, no. 3, p. 239, oct. 2019. [2] l. f. barrett, r. adolphs, s. marsella, a. m. martinez, and s. d. pollak, “emotional expressions reconsidered: challenges to inferring emotion from human facial movements,” psychol. sci. public interes., vol. 20, no. 1, pp. 1–68, jul. 2019. [3] y. park and m. garcia, “pedestrian safety perception and urban street settings,” int. j. sustain. transp., vol. 14, no. 11, pp. 860–871, sep. 2020. [4] s. simpson, l. richardson, g. pietrabissa, g. castelnuovo, and c. reid, “videotherapy and therapeutic alliance in the age of covid‐19,” clin. psychol. psychother., vol. 28, no. 2, pp. 409–421, mar. 2021. [5] d. t. cordaro, r. sun, d. keltner, s. kamble, n. huddar, and g. mcneil, “universals and cultural variations in 22 emotional expressions across five cultures,” emotion, vol. 18, no. 1, pp. 75–93, feb. 2018. [6] a. j. umaña‐taylor and n. e. hill, “ethnic–racial socialization in the family: a decade’s advance on precursors and outcomes,” j. marriage fam., vol. 82, no. 1, pp. 244–271, feb. 2020. [7] s. m. saleem abdullah and a. m. abdulazeez, “facial expression recognition based on deep learning convolution neural network: a review,” j. soft comput. data min., vol. 02, no. 01, apr. 2021. [8] l. zahara, p. musa, e. prasetyo wibowo, i. karim, and s. bahri musa, “the facial emotion recognition (fer-2013) dataset for prediction system of micro-expressions face using the convolutional neural network (cnn) algorithm based raspberry pi,” in 2020 fifth international conference on informatics and computing (icic), nov. 2020, pp. 1–9. [9] d. v. sang, n. van dat, and d. p. thuan, “facial expression recognition using deep convolutional neural networks,” in 2017 9th international conference on knowledge and systems engineering (kse), oct. 2017, pp. 130–135. [10] y. gultom, a. m. arymurthy, and r. j.
masikome, “batik classification using deep convolutional network transfer learning,” j. ilmu komput. dan inf., vol. 11, no. 2, p. 59, jun. 2018. [11] s. porcu, a. floris, and l. atzori, “evaluation of data augmentation techniques for facial expression recognition systems,” electronics, vol. 9, no. 11, p. 1892, nov. 2020. [12] a. caroppo, a. leone, and p. siciliano, “comparison between deep learning models and traditional machine learning approaches for facial expression recognition in ageing adults,” j. comput. sci. technol., vol. 35, no. 5, pp. 1127–1146, oct. 2020. [13] c. modarres, n. astorga, e. l. droguett, and v. meruane, “convolutional neural networks for automated damage recognition and damage type identification,” struct. control heal. monit., vol. 25, no. 10, p. e2230, oct. 2018. [14] d. keltner, d. sauter, j. tracy, and a. cowen, “emotional expression: advances in basic emotion theory,” j. nonverbal behav., vol. 43, no. 2, pp. 133–160, jun. 2019. [15] g. ramirez-gargallo, m. garcia-gasulla, and f. mantovani, “tensorflow on state-of-the-art hpc clusters: a machine learning use case,” in 2019 19th ieee/acm international symposium on cluster, cloud and grid computing (ccgrid), may 2019, pp. 526–533. [16] s. a. asmai*, m. n. d. mohamad zukhairin, a. s. m. jaya, a. f. n. abdul rahman, and z. b. abal abas, “mosquito larvae detection using deep learning,” int. j. innov. technol. explor. eng., vol. 8, no. 12, pp. 804–809, oct. 2019. [17] r. jain, p. nagrath, g. kataria, v. sirish kaushik, and d. jude hemanth, “pneumonia detection in chest x-ray images using convolutional neural networks and transfer learning,” measurement, vol. 165, p. 108046, dec. 2020. [18] y.-j. cha, w. choi, g. suh, s. mahmoudkhani, and o. büyüköztürk, “autonomous structural visual inspection using region-based deep learning for detecting multiple damage types,” comput. civ. infrastruct. eng., vol. 33, no. 9, pp. 731–747, sep. 2018. [19] m. elleuch, r. maalej, and m. 
kherallah, “a new design based-svm of the cnn classifier architecture with dropout for offline arabic handwritten recognition,” procedia comput. sci., vol. 80, pp. 1712–1723, 2016. [20] a. p. wibawa, s. a. kurniawan, and i. a. e. zaeni, “determining journal rank by applying particle swarm optimization-naive bayes classifier,” j. inf. technol. manag., vol. 13, no. 4, 2021. [21] t. b. alakus and i. turkoglu, “comparison of deep learning approaches to predict covid-19 infection,” chaos, solitons & fractals, vol. 140, p. 110120, nov. 2020.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 5, No 2, December 2022, pp. 137–142, https://doi.org/10.17977/um018v5i22022p137-142. ©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Performance of Ensemble Classification for Agricultural and Biological Science Journals with Scopus Index

Nastiti Susetyo Fanany Putri a,1, Aji Prasetya Wibawa a,2,*, Harits Ar Rosyid a,3, Agung Bella Putra Utama a,4, Wako Uriu b,5

a Department of Electrical Engineering, Faculty of Engineering, Universitas Negeri Malang, Jl. Semarang 5, Malang, East Java 65145, Indonesia
b Department of English, Chikushi Jogakuen University, 2-chōme-12-1 Ishizaka, Dazaifu, Fukuoka 818-0118, Japan
1 nastiti.susetyo.2005348@students.um.ac.id; 2 aji.prasetya.ft@um.ac.id*; 3 harits.ar.ft@um.ac.id; 4 agungbpu02@gmail.com; 5 ue2017119@chikushi-u.ac.jp
* corresponding author

I. Introduction

The agricultural sector is one of the research areas that expands every year. Its growth is demonstrated by a considerable increase in the total number of papers published in this discipline each year at Scimago.
Figure 1 illustrates the increase in the number of journals in this field over the last 20 years. This growth greatly expands the pool of literature sources for further research. Scimago itself ranks the journals into four classes: Q1, Q2, Q3, and Q4. However, when quartiles are assigned, the same journal can receive different values in different fields. Data processing methods such as classification are therefore needed. Classification is the technique for finding models or functions that explain and differentiate ideas or classes of data [1]; it can predict the class label of an object whose label is unknown [2]. In this paper we therefore apply a classification technique built on the idea of an ensemble, where bagging and boosting comprise the ensemble. This research aims to evaluate the ensemble classification mechanism's performance using quartile data from agricultural and biological science periodicals.

Article history: Received 3 October 2022; Revised 29 October 2022; Accepted 30 November 2022; Published online 30 December 2022.

Abstract: The ensemble method is considered an advanced method for both prediction and classification. Its application is expected to give more optimal output than earlier classification methods. This article aims to determine the ensemble's performance in classifying journal quartiles. The subject of agriculture was chosen because Indonesia is an agricultural country, and researchers' interest in this field shows a positive response. The data were downloaded through the Scimago Journal and Country Rank with the accumulation in 2020. Labels have four classes: Q1, Q2, Q3, and Q4. The ensembles applied are boosting and bagging with decision tree (DT) and Gaussian naïve Bayes (GNB) algorithms, compiled from 2144 instances. The boosting meta-ensembles used are AdaBoost and XGBoost.
In this study, the bagging decision tree has the highest accuracy score at 71.36, followed by the XGBoost decision tree with 69.51. Third is XGBoost Gaussian naïve Bayes with 68.82, then the AdaBoost decision tree with 60.42, AdaBoost Gaussian naïve Bayes with 58.2, and bagging Gaussian naïve Bayes with 56.12. This paper shows that the bagging decision tree is the ensemble method that works optimally in this subject classification. The result also suggests that the ensemble method can still fail to produce an ideal outcome that approaches the SJR system. This is an open-access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: quartile journals; ensemble; classification; bagging; boosting

N.S.F. Putri / Knowledge Engineering and Data Science 2022, 5 (2): 137–142

Fig. 1. The growth of agricultural and biological sciences journals

The ensemble model is a further development of the usual classification method. Its working principle is to combine multiple copies of the same algorithm in a specific pattern [3] and decide the final result by a voting system [4]. The fundamental objective of using an ensemble is to achieve outcomes superior to a conventional single classifier, owing to the method's ability to combat overfitting [5] and noisy data [6]. The purpose of this study is to assess the effectiveness of ensemble classification using bagging and boosting. Agricultural and biological science journal quartiles, specifically the data accumulated for 2020, are the data source. The research questions cover these points: out of all the strategies used, which ensemble mechanism performs best?
Are the publications in the domains of agriculture and biology ranked differently, and can the chosen ensemble address this issue?

II. Method

This research is divided into four stages. The first is acquiring the dataset. Next comes data preprocessing, which aims to provide clean data suited for classification. The third is the classification stage, for which ensemble bagging and boosting are the techniques employed. The confusion-matrix evaluation stage is the final step. The research procedure is displayed in Figure 2.

Fig. 2. Method process

A. Data Collecting

The first process carried out in this research is data collection. Secondary data were collected from the Scimago Journal and Country Rank page, on the subject of agricultural and biological science in 2020, and were compiled in February 2022. The dataset consists of 2164 instances, with details listed in Table 1. Twenty attributes are present, but just nine were used, because these nine attributes are visible on the Scimago home page, suggesting that they are the ones that determine the journal quartiles [7][8]. The label is SJR Best Quartile; this study is a multi-class classification problem because it includes the four classes Q1, Q2, Q3, and Q4. The attributes used include H index, Total Docs. (2020), Total Docs. (3 years), Total Refs., Total Cites (3 years), Citable Docs. (3 years), Cites/Doc. (2 years), and Refs./Doc. These features serve as independent variables for predicting the label class, the journal quartile.

B. Pre-processing

The data must be prepared in such a way as to produce accurate predictions. The data preparation stage that suits the needs of this process is called preprocessing [9]. Preprocessing can raise a classification method's predictive value [10]. Data cleaning, integration, transformation, reduction, feature selection, and resampling are a few examples of preprocessing [11][12]. However, not all types of preprocessing are used here.
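The cleaning step applied in this study amounts to dropping instances whose quartile label is missing. A minimal sketch with made-up rows and a hypothetical field name (the real data come from the Scimago export):

```python
# Toy journal records; "sjr_best_quartile" is a hypothetical field name.
rows = [
    {"title": "journal a", "sjr_best_quartile": "q1"},
    {"title": "journal b", "sjr_best_quartile": None},  # no class label: removed
    {"title": "journal c", "sjr_best_quartile": "q3"},
]
valid = {"q1", "q2", "q3", "q4"}
clean = [r for r in rows if r["sjr_best_quartile"] in valid]
print(len(rows), len(clean))  # 3 2
```

Applied to the full dataset, this is the filter that reduces the 2164 collected instances to the 2144 labeled ones used for classification.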
The technique used in this article is data cleaning, which eliminates extraneous data such as missing values or noise [13]. Several instances in the agricultural and biological sciences data lack class labels, so those instances are removed to prevent incorrect classification. After this process, 2144 instances remain in the dataset. Table 2 summarizes the number of instances in each label class after preprocessing.

C. Classification

The third stage is the classification process, which uses two ensemble mechanisms. The first is boosting, with the AdaBoost and XGBoost meta-ensembles; the second is the bagging ensemble. Both use decision tree (DT) and Gaussian naïve Bayes (GNB) algorithms as base learners. The experimental scenario is shown in Figure 3.

Fig. 3. Research scenario

Table 1. List of dataset attributes

  Attribute                 Data type  Range
  Rank                      integer    1–2164
  Sourceid                  real       12016–21101020133
  Title                     nominal    Annual Review of Plant Biology, Ecology Letters, ISME Journal, etc.
  Type                      nominal    journal
  ISSN                      nominal    995444, 00015342, etc.
  SJR                       real       0.1–11695
  SJR Best Quartile         nominal    Q1, Q2, Q3, Q4, NQ
  H index                   integer    0–342
  Total Docs. (2020)        integer    0–3921
  Total Docs. (3 years)     integer    0–6917
  Total Refs.               integer    0–251461
  Total Cites (3 years)     integer    0–42304
  Citable Docs. (3 years)   integer    0–6322
  Cites/Doc. (2 years)      real       0–25.28
  Refs./Doc.                real       0–326.27
  Country                   nominal    Indonesia, Hungary, Poland, etc.
  Region                    nominal    Northern America, Western Europe, the Asiatic Region, etc.
  Publisher                 nominal    Sejani Ltd, CSIC, EM International, etc.
  Coverage                  nominal    1988–2020, 1978–2020, 1977, 1996–2020, etc.
  Categories                nominal    Agricultural and Biological Sciences; Ecology, Evolution, Behavior and Systematics; Cell Biology, etc.

Table 2.
Label class summary

  Class label       Before cleaning  After cleaning
  Q1                603              603
  Q2                551              551
  Q3                519              519
  Q4                471              471
  NQ (no quartile)  20               0
  Sum               2164             2144

Stage one breaks the dataset into training and testing data using a split-test-training command, with 20% for testing and 80% for training. This ratio was chosen because it produced sound output in several similar studies [14][15] and is commonly used [16]. The ensemble method's quartile classification of the agricultural journals comes next. For both DT and GNB, the base-learner repetition setting is 100, and the DT maximum depth is set to 50; these values were chosen arbitrarily, with the understanding that they would be sufficient for this investigation.

D. Evaluation

The evaluation procedure used is the confusion matrix [17], which records the predicted classifications against the actual values of the classification system [18]. Classification performance evaluation comprises six aspects: accuracy, precision, recall, specificity, F-score, and error rate [19][20]. However, not all of them are applied in this study; only accuracy, precision, and recall are used here.

III. Results and Discussion

During the classification phase the method was run in several configurations: AdaBoost DT, AdaBoost GNB, XGBoost DT, XGBoost GNB, bagging DT, and bagging GNB. Table 3 lists the classification outcomes, and Figure 4 shows the results graphically.
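The 80:20 split described above can be sketched as follows. This is an illustrative stand-in for the split-test-training command actually used (the function name and seed are assumptions):

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle, then hold out test_ratio of the data for testing (80% train : 20% test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train, test = train_test_split(range(2144))  # the 2144 cleaned instances
print(len(train), len(test))  # 1716 428
```

With the 2144 cleaned instances, this yields 1716 training and 428 testing instances; every instance lands in exactly one of the two sets.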
Fig. 4. Classification performance (recall, precision, and accuracy of the six scenarios)

Table 3 and Figure 4 show that the ensemble mechanism that works optimally in this case is bagging DT, with an accuracy score of 71.59%. The second-best value is the XGBoost meta-ensemble with base learner DT, with an accuracy of 69.97%. Sorted from optimal to least optimal, the scenarios rank: bagging DT, XGBoost DT, XGBoost GNB, AdaBoost DT, AdaBoost GNB, and finally bagging GNB.

Table 3. Classification results

  Ensemble  Meta-ensemble  Algorithm  Accuracy (%)  Precision (%)  Recall (%)
  Boosting  AdaBoost       DT         60.54         60.34          60.79
  Boosting  AdaBoost       GNB        59.58         46.76          47.96
  Boosting  XGBoost        DT         69.97         76.96          63.31
  Boosting  XGBoost        GNB        69.75         76.93          62.75
  Bagging   -              DT         71.59         76.43          67.21
  Bagging   -              GNB        56.12         47.78          46.29

It is also seen that XGBoost has the smallest accuracy difference between the two base learners, only 0.22 percentage points, as opposed to the bagging approach, where there is a significant 15.47-point difference; for the AdaBoost meta-ensemble the base-learner accuracy difference is 0.96 points. Precision is the ratio of correct positive predictions to the total number of predicted positives [21]; ordered from highest to lowest, the precision values are XGBoost DT, XGBoost GNB, bagging DT, AdaBoost DT, bagging GNB, and AdaBoost GNB. Bagging DT has the highest recall score at 67.21%, while bagging GNB has the lowest of the six cases, per Table 3. Recall quantifies the ratio of correctly predicted positives to actual positives [22]. This study produces prediction accuracy values averaging above 60%; since all results exceed 50%, every scenario can be used to assess the journal quartiles. Bagging can work better because it extracts additional data for training from the dataset [23].
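The bagging principle cited above, bootstrap resampling plus voting, can be sketched in a few lines. This is an illustrative toy, not the experiment's actual implementation:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement; every point has an equal chance at each draw."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Bagging combines base-learner outputs by voting for the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(len(sample))                                    # 10: same size as the original set
print(majority_vote(["q1", "q2", "q1", "q3", "q1"]))  # q1
```

Each base learner is trained on its own bootstrap sample, and the final class is the label most base learners agree on.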
Each data component has the same chance of being selected, and the resampled sets are used to train models in parallel. The more training data obtained, the better the algorithm's knowledge for classifying [24], and the variance of the classification process is reduced [25]. A DT branches on the independent variables, where each node has its own conditions on the features [26]; a node determines which node to go to in the following state, and the proper sequence of nodes can produce the best output. DT makes no assumptions about the distribution of the data [27], handles collinearity efficiently [28], and does not require data preprocessing [29]. However, this method can overfit if it uses too many branches; in this article not too many branches are used, so the model can work optimally. Naïve Bayes, on the other hand, often works by chance, in which case the accuracy of the prediction cannot be relied on; it is also weak at selecting the attributes that affect accuracy [30]. The data used are only the quartile data for agricultural and biological science journals in the 2020 accumulation, and this study uses only simple preprocessing settings; both choices affect the performance of the classifiers.

IV. Conclusion

In conclusion, classification using ensemble models is applicable here. According to the research findings, the bagging decision tree is a method with reasonable accuracy, precision, and recall, so it can be inferred that this approach may be used to resolve problems of a similar nature. Among the boosting mechanisms, the XGBoost meta-ensemble performs better; XGBoost can indirectly minimize variance by reducing overfitting. The outcomes can nevertheless be improved, so it is essential to investigate other ensemble approaches, such as stacking, in future research. Using meta-ensembles with other base learners is strongly advised to achieve a better prediction score.
declarations author contribution all authors contributed equally as the main contributor of this paper. all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher’s note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations. n.s.f putri / knowledge engineering and data science 2022, 5 (2): 137–142 142 references [1] d. zhang and s. lou, “the application research of neural network and bp algorithm in stock price pattern classification and prediction,” futur. gener. comput. syst., vol. 115, pp. 872–879, feb. 2021. [2] y. zhang, y. wang, x.-y. liu, s. mi, and m.-l. zhang, “large-scale multi-label classification using unknown streaming images,” pattern recognit., vol. 99, p. 107100, mar. 2020. [3] j. lin, h. chen, s. li, y. liu, x. li, and b. yu, “accurate prediction of potential druggable proteins based on genetic algorithm and bagging-svm ensemble classifier,” artif. intell. med., vol. 98, pp. 35–47, jul. 2019. [4] g. kaur, “a comparison of two hybrid ensemble techniques for network anomaly detection in spark distributed environment,” j. inf. secur. appl., vol. 55, p. 102601, dec. 2020. [5] f. liu, m. cai, l. wang, and y. lu, “an ensemble model based on adaptive noise reducer and over-fitting prevention lstm for multivariate time series forecasting,” ieee access, vol. 7, pp. 26102–26115, 2019. [6] a. b. shaik and s. srinivasan, “a brief survey on random forest ensembles in classification model,” 2019, pp. 253–260. [7] a. p. 
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 5, No 2, December 2022, pp.
143–149 eISSN 2597-4637 https://doi.org/10.17977/um018v5i22022p143-149
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Social Media Mining with Fuzzy Text Matching: A Knowledge Extraction on Tourism after the COVID-19 Pandemic

Ida Bagus Putra Manuaba a,1,*, I Wayan Budi Sentana b,2, I Nyoman Gede Arya Astawa a,3, I Wayan Suasnawa a,4, I Putu Bagus Arya Pradnyana a,5
a Electrical Engineering Department, Politeknik Negeri Bali, Kampus Jimbaran, Badung, Bali, 80361 Indonesia
b School of Computing, Macquarie University, 4 Research Park Dr, Macquarie Park NSW 2113, Australia
1 manuabaputra@pnb.ac.id*; 2 i-wayan-budi.sentana@students.mq.edu.au, i-wayan-budi.sentana@hdr.mq.edu.au; 3 arya_kmg@pnb.ac.id; 4 suasnawa@pnb.ac.id; 5 bagusarya12@pnb.ac.id
* corresponding author

I. Introduction

The COVID-19 pandemic forced governments worldwide to restrict the movement of populations, bringing economic activity to a near-total standstill [1]. Governmental policies restricting mobility, such as travel bans, lockdowns, and social distancing, have obstructed the tourism industry [2]. According to the United Nations World Tourism Organization, global international tourist arrivals fell by 74%, resulting in a loss of international tourism receipts of about US$1.3 trillion [3]. While social media is a leading source of instant data representing human expression, methods for analyzing social media data about COVID-19 recovery from a tourism perspective are limited. The primary objectives of this study were (i) to extract knowledge about COVID-19 recovery from a tourism perspective, as reported on social media, and (ii) to design a text mining and fuzzy matching approach for collecting data on this topic.
Article Info
Article history: Received 26 October 2022; Revised 3 November 2022; Accepted 4 December 2022; Published online 30 December 2022

Abstract: Social media mining is an emerging technique for analyzing data to extract valuable knowledge related to various domains. However, traditional text matching techniques, such as exact matching, are not always suitable for social media data, which can contain spelling mistakes, abbreviations, and variations in the use of words. Fuzzy matching is a text matching technique that can handle such variations and identify similarities between two texts even when they differ in spelling or phrasing. The gap in existing research is the limited use of fuzzy matching in social media mining for tourism recovery analysis. By applying fuzzy matching to social media data related to COVID-19 and tourism recovery, this research seeks to bridge this gap and extract valuable insights into the impact of the pandemic on tourism recovery. We manually retrieved 19,462 Twitter records and differentiated the data sources using four driver parameters indicating data related to the impact of COVID-19 on the tourism industry, such as the economy, restrictions, government policies, and vaccination. We conducted text mining analysis on the collected 7,352 words and identified 25 highly recommended words indicating COVID-19 recovery from a tourism perspective. We separated the four words representing the tourism perspective to serve as a dataset for fuzzy matching. We then ran the fuzzy matching process between this dataset and the 7,352-word data collected from the text mining process. The matching process resulted in 18 words representing COVID-19 recovery from a tourism perspective. This is an open-access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: social media; text mining; fuzzy matching; COVID-19; tourism

Twitter was used as the social media platform for data collection due to its public availability and the ease of collecting massive amounts of data as a dataset [4]. This study mined tweets using four different parameters related to COVID-19 recovery from the tourism perspective: vaccine, travel, restriction, and work. Text mining was used to analyze the collected dataset. This method has previously been used to investigate diseases and chemicals related to COVID-19 [5], the impact of the COVID-19 pandemic on business [6], and public attention to COVID-19 on social media [7]. The contribution of this research is to provide a better understanding of how COVID-19 recovery affects the tourism industry and to demonstrate the potential of text mining and fuzzy matching techniques for extracting insights from social media data. The research also presents a methodology for data cleaning, tokenization, filtering, and n-gram generation that can be applied to other social media mining studies. Additionally, the fuzzy matching process presented here can help identify similar words and phrases that may not be captured by single-word data collection, providing a more comprehensive understanding of the studied topic. Overall, the research aims to contribute to knowledge of COVID-19 recovery in the tourism industry and to provide a practical methodology for extracting insights from social media data.

II. Method

A. Data Collecting

The primary goal of this study was to retrieve data from Twitter to create a dataset related to COVID-19 recovery from a tourism perspective.
The study used a keyword-based approach to retrieve data from Twitter based on parameters related to the tourism perspective, namely vaccine, travel, restriction, and work. Data retrieval across all parameters yielded 19,462 records. Retrieval for each parameter was limited to a maximum of 5,000 records to optimize the performance of the text-mining analysis process. The details of the data retrieved at the beginning of 2021 for each parameter are presented in Table 1. After retrieval, the data collected for each parameter were combined into one dataset. Before text mining analysis, the dataset was cleaned using several methods to produce a dataset related to COVID-19 recovery in the tourism industry. Removing HTML links from the records was part of the cleansing process, and duplicate records were removed to prevent redundant data in the dataset. At the end of the dataset-cleaning process, 7,352 records remained.

B. Social Media Text Mining

In this study, social media text mining refers to text mining on datasets retrieved from social media. Twitter was chosen as the data source due to its public availability and the ease of collecting massive amounts of data as a dataset. Text mining is generally used on social media to identify public perception [8], mine information [9], or investigate problems on social media [10]. Several text mining processes were applied to extract knowledge from the data retrieved from Twitter: tokenization, case transformation, stop word filtering, token filtering, and n-gram generation. The tokenization process divides the text into specific parts or tokens; case transformation then converts all characters to lowercase. The stop word filtering process selects important words using a stop-list algorithm.
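The cleaning steps described above, stripping links and dropping duplicate records, can be sketched as follows; the sample records and the link pattern are illustrative assumptions, not the authors' actual cleaning script:

```python
import re

# Hypothetical raw tweet records; real data came from the Twitter retrieval step.
records = [
    "Travel restrictions easing soon https://t.co/abc123",
    "Travel restrictions easing soon https://t.co/abc123",  # duplicate record
    "Vaccine rollout continues in Bali",
]

def clean(record: str) -> str:
    # Remove HTML/Twitter links, as in the dataset cleansing process.
    return re.sub(r"https?://\S+", "", record).strip()

cleaned = [clean(r) for r in records]
# Remove duplicates while preserving order, to avoid redundant data.
deduped = list(dict.fromkeys(cleaned))
print(deduped)  # two unique, link-free records remain
```

On the real dataset, this kind of pass reduces the 19,462 retrieved records toward the 7,352 cleaned records reported later.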
An n-gram is a contiguous sequence of items within a text; n-grams are most frequently extracted from text or speech corpora. N-grams can reveal patterns of word occurrences related to COVID-19, which helps classify tweets as positive or negative for COVID-19 [11].

Table 1. Data parameters on Twitter
Parameter name    Number of records retrieved on Twitter
Vaccine           4,607
Work              5,000
Restriction       4,877
Tourism           4,978

C. Fuzzy Matching

In the mining process, fuzzy logic is generally used to enrich data and treat uncertainty [12], to set up association rule mining for generating classifiers [13], and to bridge the gap between the ambiguities of different understandings [14]. A fuzzy matching approach has been found slightly better than a weighted approach [15], and fuzzy logic is a promising approach that can significantly improve accuracy [16]. Fuzzy matching matches entities against entities contained in a provided entity dictionary; one study shows that such a method achieves an entity recognition accuracy of 86.69% and an entity disambiguation accuracy of 88.69% [17]. Fuzzy matching can also improve large-scale data integration efficiently and accurately [18]. Fuzzy string matching algorithms are applied in various applications, including information retrieval, data cleaning, and natural language processing; common techniques include Levenshtein distance, Jaro-Winkler distance, and n-grams [19]. Fuzzy string matching can also be a valuable tool for automating the assessment of listener transcripts in speech intelligibility studies, reducing the time and effort required for manual assessment and improving its accuracy [20]. Based on the description above, this research uses fuzzy matching with the Jaccard similarity approach to improve data quality and increase the accuracy of information extraction.
The steps of fuzzy matching using Jaccard similarity are:
1. Tokenization: convert the text into tokens (words and phrases).
2. Preprocessing: apply techniques such as removing stop words, converting text to lowercase, and removing special characters.
3. Creating n-grams: build n-grams (sequences of n tokens) from the tokens.
4. Calculating Jaccard similarity: compute the Jaccard similarity between the n-grams of two strings. The Jaccard similarity is the size of the intersection divided by the size of the union of the two sets.
5. Fuzzy matching: compare the Jaccard similarity scores between the n-grams of two strings and decide whether they match based on a predefined threshold.

The sample dataset for matching was limited to 25 records, and during the matching process, at most 10 similar items were matched for each sample dataset.

III. Result and Discussion

The social media text mining process with fuzzy matching involves four main steps: retrieving data from the data source, text mining, creating a dataset sample for fuzzy matching, and performing the fuzzy matching. The raw data source is an Excel file combining data from the four parameters related to COVID-19 and tourism perspectives. The data source is cleaned and transformed into a usable dataset during data retrieval. Before text mining, the dataset is converted from nominal to text. Text mining involves several steps: tokenization, case transformation, stop word filtering, token filtering, and n-gram generation. The output of the text mining process is then used to create a dataset sample for fuzzy matching. Simple operations such as data sorting, attribute selection, and example filtering (with and without range) are used to collect the dataset samples, which relate to the parameters for COVID-19 recovery from a tourism perspective.
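Steps 1–3 above (tokenization, preprocessing, n-gram creation) can be sketched in Python; the stop-word list and the sample tweet are illustrative assumptions:

```python
import re

STOP_WORDS = {"the", "in", "are", "to", "and", "of"}  # assumed stop list

def tokenize(text: str) -> list[str]:
    # Preprocessing: lowercase the text and strip special characters,
    # then split into tokens and drop stop words.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]

def ngrams(tokens: list[str], n: int = 2) -> set[str]:
    # Contiguous sequences of n tokens, joined with "_" as in the result tables.
    return {"_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

tokens = tokenize("Travel restrictions in Bali are easing")
print(tokens)          # ['travel', 'restrictions', 'bali', 'easing']
print(ngrams(tokens))  # {'travel_restrictions', 'restrictions_bali', 'bali_easing'}
```

Joining token pairs with an underscore mirrors multi-word results such as travel_restrictions and month_lockdown that appear later in the analysis.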
The dataset samples for matching are limited to 25 records, and the number of matches is limited to 10 similar items for each dataset sample. The primary process is illustrated in Figure 1. The dataset sample for fuzzy matching is compared with the dataset from the text mining process using fuzzy matching. The Jaccard similarity threshold is the minimum similarity value between two tokens for them to be considered a match; a threshold of 0.6 means that strings with a Jaccard similarity of at least 0.6 are considered a match. The pseudocode for the fuzzy matching process is:

# Fuzzy matching using Jaccard similarity
Input:
  string s1
  string s2
  integer k (the size of the k-grams)
Output:
  Jaccard similarity score (a value between 0 and 1)
Algorithm:
  1. Create sets of k-grams for both s1 and s2:
     split s1 into k-grams and store them in set A
     split s2 into k-grams and store them in set B
  2. Calculate the intersection of A and B (the number of k-grams that appear in both sets).
  3. Calculate the union of A and B (the number of k-grams that appear in either set).
  4. Calculate the Jaccard similarity score as the ratio of the intersection to the union:
     J(A, B) = |A ∩ B| / |A ∪ B|
  5. Return the Jaccard similarity score.

In this pseudocode, |A| and |B| denote the sizes of sets A and B, respectively. The parameter k determines the size of the k-grams, which are contiguous sequences of k characters from the input strings. The Jaccard similarity score lies between 0 and 1, where 1 indicates a perfect match between the two strings and 0 indicates no similarity.

Fig. 1. Social media text mining analysis with fuzzy matching

Before cleaning, the data source consisted of 19,462 records; after the cleaning process, it was reduced to 7,352 records.
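The pseudocode above can be realized as a short runnable sketch; character bigrams (k = 2) and the example strings are assumptions for illustration:

```python
def kgrams(s: str, k: int) -> set[str]:
    # Step 1: contiguous sequences of k characters from the input string.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(s1: str, s2: str, k: int = 2) -> float:
    # Steps 2-4: |A ∩ B| / |A ∪ B| over the two k-gram sets.
    a, b = kgrams(s1.lower(), k), kgrams(s2.lower(), k)
    return len(a & b) / len(a | b) if a | b else 0.0

THRESHOLD = 0.6  # step 5: scores of at least the threshold count as a match
print(jaccard("vaccine", "vaccines"))             # 6/7 ≈ 0.857, a match
print(jaccard("travel", "testing") >= THRESHOLD)  # False: no shared bigrams
```

Note that the reported similarity values in the tables below are on a 0–100 scale; multiplying this score by 100 gives the same form.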
Various techniques were used to clean the data source, including removing HTML links, removing Twitter tags, removing words similar to "covid-19", and eliminating duplicate records. The text mining process extracted 10,173 words with varying frequency levels from the data source. Several processes are involved in extracting knowledge from Twitter data using text mining: tokenization, case transformation, stop word filtering, token filtering, and n-gram generation. The first step, tokenization, divides a text into specific parts or tokens. The next step is case transformation, where all characters are converted to lowercase. In token filtering, additional rules are applied, such as a minimum token length of 4 characters and a maximum of 25 characters. N-grams are then used to sequence contiguous items. Persistent words are extracted using a word extraction process to analyze the data. The resulting data are shown in Figure 2.

Fig. 2. Result of the text mining process

Figure 2 shows that the most frequent words are "travel", "vaccine", "shanghai", "restrictions", "month", "pandemic", "people", "china", and "health". These words were then collected into a dataset sample for fuzzy matching. Additional rules were added to the fuzzy matching process, such as a maximum of 10 word results for each sample dataset and a minimum similarity value of 50. The results of the fuzzy matching process are shown in Table 2. The data from Table 2 are visually represented in a word cloud, where more frequent words appear more prominently. The words with a similarity value greater than 75 are travel_restrictions, china_largest, ending_month, month_lockdown, vaccines, vaccinated, month, and travelers; among these, travel_restrictions appears twice with a similarity of 100. The word cloud visualization is shown in Figure 3.
Table 2. Result of fuzzy matching
Sample dataset word   Similar word           Similarity
travel                take                   60.0
travel                state                  55.0
travel                travel_restrictions    100.0
travel                travellers             75.0
vaccine               chinese                57.0
vaccine               vaccines               93.0
vaccine               vaccinated             82.0
vaccine               vaccination            67.0
vaccine               airlines               53.0
restrictions          authorities            52.0
restrictions          getting                53.0
restrictions          residents              57.0
restrictions          return                 56.0
restrictions          testing                53.0
restrictions          travel_restrictions    100.0
month                 months                 91.0
month                 ending_month           100.0
month                 month_lockdown         100.0
month                 north                  60.0
people                reopening              53.0
china                 vaccinated             53.0
china                 coming                 55.0
china                 chinese                67.0
china                 children               62.0
china                 china_largest          100.0
china                 canada                 55.0
china                 india                  60.0
health                deaths                 67.0

Fig. 3. Word cloud of text mining with fuzzy matching

Table 3 compares the results obtained without the fuzzy matching process and those obtained with it. Without fuzzy matching, only single-word data were collected; fuzzy matching made a deeper exploration of the data possible. The fuzzy matching process revealed more than single-word data, including multi-word results such as travel_restrictions, month_lockdown, and ending_months.
Fuzzy matching also identified similar words among the most frequent terms. For example, the word "travel" is similar to "travelers" and "travel_restriction", while "vaccine" is similar to "vaccines", "vaccinated", and "vaccination".

IV. Conclusion

The study highlights the importance of analyzing social media data to gain insights into the impact of the pandemic on the tourism industry. It uses text mining and fuzzy matching techniques to extract relevant data from Twitter and analyze it. The results show that the recovery of the tourism industry is a topic of discussion on Twitter and that certain words are frequently mentioned in this context, such as travel restrictions, vaccines, and lockdowns. This study contributes to the growing literature on social media mining and its applications in the tourism industry. It provides insights that can help policymakers and businesses make informed decisions regarding the recovery of the tourism industry in the context of the ongoing pandemic. The study can be extended to other social media platforms, such as Instagram or Facebook, to gain a more comprehensive understanding of how COVID-19 recovery is affecting the tourism industry. The use of more advanced natural language processing techniques, such as sentiment analysis or topic modeling, could provide more nuanced insights into the attitudes and opinions of social media users toward COVID-19 recovery in the tourism industry.

Table 3.
Text mining with and without fuzzy matching
Without fuzzy matching         With fuzzy matching
travel        china            travel_restriction    authorities
vaccine       health           travelers             month_lockdown
shanghai      month            chinese               reopening
restrictions  canada           vaccines              getting
month         wednesday        vaccinated            ending_months
pandemic      cases            vaccination           resident
people        world            airlines              testing

Declarations

Author contribution: All authors contributed equally as the main contributors to this paper. All authors read and approved the final paper.

Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest: The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information: Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] M. Nicola et al., "The socio-economic implications of the coronavirus pandemic (COVID-19): A review," Int. J. Surg., vol. 78, pp. 185–193, Jun. 2020.
[2] M. Sigala, "Tourism and COVID-19: Impacts and implications for advancing and resetting industry and research," J. Bus. Res., vol. 117, pp. 312–321, Sep. 2020.
[3] UNWTO, "2020: A year in review," World Tourism Organization, 2020. (Accessed 29 October 2022)
[4] J. X. Koh and T. M. Liew, "How loneliness is talked about in social media during COVID-19 pandemic: Text mining of 4,492 Twitter feeds," J. Psychiatr. Res., vol. 145, pp. 317–324, Jan. 2022.
[5] A. Karami, B. Bookstaver, M. Nolan, and P. Bozorgi, "Investigating diseases and chemicals in COVID-19 literature with text mining," Int. J. Inf. Manag. Data Insights, vol. 1, no. 2, p.
100016, Nov. 2021.
[6] P. Carracedo, R. Puertas, and L. Marti, "Research lines on the impact of the COVID-19 pandemic on business: A text mining analysis," J. Bus. Res., vol. 132, pp. 586–593, Aug. 2021.
[7] K. Hou, T. Hou, and L. Cai, "Public attention about COVID-19 on social media: An investigation based on data mining and text analysis," Pers. Individ. Dif., vol. 175, p. 110701, Jun. 2021.
[8] J. Y. Park, E. Mistur, D. Kim, Y. Mo, and R. Hoefer, "Toward human-centric urban infrastructure: Text mining for social media data to identify the public perception of COVID-19 policy in transportation hubs," Sustain. Cities Soc., vol. 76, p. 103524, Jan. 2022.
[9] A. Kang et al., "Environmental management strategy in response to COVID-19 in China: Based on text mining of government open information," Sci. Total Environ., vol. 769, p. 145158, May 2021.
[10] S. Luo and S. Y. He, "Understanding gender difference in perceptions toward transit services across space and time: A social media mining approach," Transp. Policy, vol. 111, pp. 63–73, Sep. 2021.
[11] N. Nasser, L. Karim, A. El Ouadrhiri, A. Ali, and N. Khan, "N-gram based language processing using Twitter dataset to identify COVID-19 patients," Sustain. Cities Soc., vol. 72, p. 103048, Sep. 2021.
[12] C. Fernandez-Basso, K. Gutiérrez-Batista, R. Morcillo-Jiménez, M.-A. Vila, and M. J. Martin-Bautista, "A fuzzy-based medical system for pattern mining in a distributed environment: Application to diagnostic and co-morbidity," Appl. Soft Comput., vol. 122, p. 108870, Jun. 2022.
[13] D. Rohidin, N. A. Samsudin, and M. M. Deris, "Association rules of fuzzy soft set based classification for text classification problem," J. King Saud Univ. Comput. Inf. Sci., vol. 34, no. 3, pp. 801–812, Mar. 2022.
[14] S. Rameem Zahra, M. Ahsan Chishti, A. Iqbal Baba, and F. Wu, "Detecting COVID-19 chaos driven phishing/malicious URL attacks by a fuzzy logic and data mining based intelligence system," Egypt. Informatics J., vol. 23, no.
2, pp. 197–214, Jul. 2022.
[15] C. Peng, P. Goswami, and G. Bai, "Fuzzy matching of OpenAPI described REST services," Procedia Comput. Sci., vol. 126, pp. 1313–1322, 2018.
[16] I. B. P. Manuaba, K. A. Triana Indah, M. Fahmi, and I. N. Salsabila, "An improvement object detection method FindContour with fuzzy logic for detect Balinese script object," APTISI Trans. Technopreneursh., vol. 4, no. 3, pp. 257–262, Oct. 2022.
[17] M. Singh, M. Kumar, and J. Malhotra, "Energy efficient cognitive body area network (CBAN) using lookup table and energy harvesting," J. Intell. Fuzzy Syst., vol. 35, no. 2, pp. 1253–1265, Aug. 2018.
[18] L. Guan-Feng and M. Zong-Min, "An efficient matching algorithm for fuzzy RDF graph," J. Inf. Sci. Eng., vol. 34, no. 2, pp. 519–534, 2018.
[19] M. Pikies and J. Ali, "Analysis and safety engineering of fuzzy string matching algorithms," ISA Trans., vol. 113, pp. 1–8, Jul. 2021.
[20] H. R. Bosker, "Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies," Behav. Res. Methods, vol. 53, no. 5, pp. 1945–1953, Oct. 2021.
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 5, No 2, December 2022, pp. 122–128 eISSN 2597-4637 https://doi.org/10.17977/um018v5i22022p122-128
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Adaptive Neuro-Fuzzy Inference System for Waste Prediction

Haviluddin a,1,*, Herman Santoso Pakpahan a,2, Novianti Puspitasari a,3, Gubtha Mahendra Putra a,4, Rima Yustika Hasnida a,5, Rayner Alfred b,6
a Informatics Engineering, Faculty of Engineering, Universitas Mulawarman, Jalan Sambaliung No. 9, Gunung Kelua Campus, Samarinda, 75119, Indonesia
b Knowledge Technology Research Unit, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, Sabah 88400, Malaysia
1 haviluddin@unmul.ac.id; 2 pakpahan.herman891@gmail.com; 3 novianti_miechan@yahoo.com; 4 gubthamp@fkti.unmul.ac.id; 5 ryustica@yahoo.com; 6 ralfred@ums.edu.my
* corresponding author

I. Introduction

The ever-increasing volume of waste produced by human activity poses a massive threat to both the environment and public health. Waste management has emerged as one of the most pressing concerns for governments and communities worldwide due to the ongoing rise in the global population and the acceleration of urbanization [1]. In recent years, there has been increasing awareness of the detrimental effects of waste on the environment, such as contamination of air and water [2], deterioration of land [3], and emissions of greenhouse gases [4]. Ineffective waste management procedures have also been linked to various health issues, such as respiratory disorders, infectious diseases, and malignancies [5].
as a direct consequence of this, waste management has risen to the top of the agenda for decision-makers in government, those in charge of waste management, and other stakeholders [6]. they are actively looking for ways to manage waste sustainably, cut down on the amount of waste generated, and lessen the adverse effects of waste on the environment and public health. the difficulty of accurately predicting the amount of waste produced in a particular location is one of the obstacles faced in waste management. prediction is a process that involves certain behaviors or phenomena that will occur in the future. predictions can be made quantitatively or qualitatively. the quantitative measurement uses statistical methods, while the qualitative measurement is based on the opinion (judgment) of those who make predictions. based on the time horizon, predictions can be grouped into three parts: long-term, medium-term, and short-term [7]. predictions can be qualitative (not in the form of numbers) or quantitative (in the form of numbers). qualitative predictions are difficult to do to obtain good results because the variables are very relative. quantitative prediction is divided into two, namely: single prediction (point prediction) and interval prediction (interval prediction). a single prediction consists of one value, while an interval prediction consists of several values in the form of an interval (interval) bounded by lower limit values (lower limit prediction) and upper limit (high prediction) [8]. prediction of the volume of landfill waste is carried out to assist related parties in making policies on the growth article info a b s t r a c t article history: received 3 october 2022 revised 29 october 2022 accepted 30 october 2022 published online 30 december 2022 the volume of landfills that are increasingly piled up and not handled properly will have a negative impact, such as a decrease in public health. 
therefore, predicting the volume of landfills with a high degree of accuracy is needed as a reference for government agencies and the community in making future policies. this study aims to analyze the accuracy of the adaptive neuro-fuzzy inference system (anfis) method. the accuracy of the prediction results is measured by the mean absolute percentage error (mape). the final results of this study were obtained from the best mape test results. the best predictive result for the anfis method was a mape of 3.36% with a data ratio of 6:1 in the north samarinda district. the study results show that the anfis algorithm can be used as an alternative forecasting method. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: waste prediction; anfis; samarinda

of the volume of waste, which increases daily. making predictions makes it possible to prepare earlier and to anticipate increases in the volume of waste piles, which can cause environmental pollution and make residents uncomfortable. various analyses of waste handling management and municipal solid waste (msw) continue to be carried out by researchers using predictive approaches based on artificial intelligence. a study in brisbane, australia, found that the long short-term memory (lstm) model outperformed traditional statistical models in forecasting msw generation, particularly in capturing long-term trends [9]. the lstm, arima, and traditional artificial neural network (ann) models reached accuracies of 0.92, 0.10, and 0.74, respectively.
the study also found that incorporating demographic data and economic indicators into the lstm model improved the accuracy of the forecasts. the use of ann and support vector machines (svm) in forecasting msw quantity can potentially improve the efficiency and effectiveness of waste management systems in johannesburg, south africa, as well as other urban areas worldwide [10]. among the ann models, the ten-neuron structure (ann10) performed best with a coefficient of determination (r2) of 99.9%, while among the svm models, the linear model performed best with an r2 of 98.6%. from the results of the ann10 model, the total amount of msw generated per year in the city of johannesburg is projected to reach 1.95 × 10^6 tonnes in 2050, with an average annual waste of 1.78 × 10^6 tonnes. another study analyzes various scenarios and uses a fuzzy technique for order of preference by similarity to ideal solution (topsis) to forecast msw generation and evaluate the effectiveness of different waste management strategies, improving municipal msw planning and forecasting in the canary archipelago [11]. a modeling study forecasts msw generation in china using multiple regression analysis; the result predicts that msw generation in china will continue to increase, with an annual growth rate of approximately 3% [12]. another study uses a system dynamics model to simulate plastic waste generation in india over a 50-year period [13]. the model incorporates population growth, economic development, plastic consumption, and waste management practices. the results suggest that without intervention, the amount of plastic waste generated in india will continue to increase rapidly, leading to environmental hazards such as pollution and public health risks. a further study in india uses four ai models, namely ann, decision trees (dt), random forests (rf), and support vector regression (svr), to develop a forecast model for msw generation [14].
the study evaluates the performance of each model and compares their accuracy in forecasting msw generation. its results suggest that the ann model is the most accurate, followed by the svr, rf, and dt models. waste is one of the problems faced by many cities around the world, including in indonesia. in indonesia, according to law number 18 of 2008 concerning waste management, waste is the residue of daily human activities and natural processes in solid form [15]. an increasing population drives increased activity in urban areas, and with it the waste produced. this makes it difficult for the department of the environment (dlh) to manage waste, so a predictive study is needed for waste to be appropriately managed. this is crucial for the proper planning and management of waste disposal systems. one method that has shown promise in waste prediction is the adaptive neuro-fuzzy inference system (anfis). anfis is a hybrid artificial intelligence technique that combines the advantages of fuzzy logic and neural networks. anfis has been successfully used in various applications, including waste prediction, because the method is adaptive: when a parameter value changes, the change propagates through the existing neurons, supporting accurate prediction results [16]. in this context, anfis can be used to develop accurate models for waste prediction by analyzing the various factors influencing waste generation, such as population growth, economic development, and waste management practices. the use of anfis in waste prediction can help waste management authorities make informed decisions regarding the design and operation of waste disposal systems, thereby minimizing the impact of waste on the environment and public health.
this paper reviews the application of anfis in waste prediction and its potential for improving waste management practices, to assist dlh samarinda city, east kalimantan, in adequately managing the volume of landfill waste. this article presents the motivation for the work in the first part. second, it describes the working model of the anfis method. third, it analyzes the experimental results. the conclusion of the research is at the end.

ii. method

a. data collection

the data used in this study are data on the volume of waste dumps in samarinda city. samarinda city is the capital of east kalimantan, with an area of 718 km2. the waste collection dataset covers january 2012 to december 2018, with a total of 840 data points. the dataset consists of 10 samarinda city sub-districts, as seen in table 1. meanwhile, population growth continues to increase every year: according to the central statistics agency for samarinda city, in 2018 the population reached 858,931 people. from 2012 to 2019, the volume of waste piles in samarinda city fluctuated along with population and industry growth. two data usage ratio scenarios are used, 5:2 and 6:1.

b. data normalization

based on the principles of intelligent systems, the data on landfill volume are first normalized [17]. data normalization changes the scale of the data to a specific range so that the data have a more balanced distribution and can be processed more effectively by machine learning algorithms. normalization helps fix scale issues, where some features/variables have a far wider range of values than others, which would otherwise affect a machine learning model's analysis and prediction results. data normalization can also speed up the model training process, reducing the number of iterations required and increasing prediction accuracy.
the normalization used in this study is min-max normalization, with a value range of [0, 1], as in (1). after the predicted results are obtained, the data are denormalized to return to the initial values [18].

$X_{norm} = \frac{x' - \min(x)}{\max(x) - \min(x)}$ (1)

where $X_{norm}$ is the result of normalization, $x'$ is the data point to be normalized, and $\min(x)$ and $\max(x)$ are the minimum and maximum values over all data.

c. adaptive neuro-fuzzy inference system (anfis)

anfis is an adaptive network based on the fuzzy inference system (fis), using a hybrid learning algorithm. anfis can build an input-output mapping based on human knowledge (if-then fuzzy rules) with appropriate membership functions. in principle, anfis parameters can be separated into premise and consequent parameters, which are adapted through hybrid training. hybrid learning is carried out in two steps, a forward step and a backward step [19]. the following is the pseudocode for the steps in the anfis method.

inputs: monthly input data
output: monthly dataset predictions
1  // begin
2  initialize the antecedent parameters and consequent parameters
3  define the input data and output data
4  determine the number of membership functions and their parameters for each input variable

table 1. 10 samarinda city sub-districts

subdistrict           area (km2)
palaran               182.53
samarinda seberang    12.49
samarinda ulu         22.12
samarinda ilir        17.18
north samarinda       229.52
sungai kunjang        69.23
sambutan              100.95
sungai pinang         34.16
samarinda city        23.69
loa janan ilir        26.13
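equation (1) and its inverse (denormalization) can be sketched in python with numpy; this is a minimal illustration with made-up sample values, not the authors' implementation:

```python
import numpy as np

def minmax_normalize(x):
    """scale data to the range [0, 1] using min-max normalization, as in (1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalize(x_norm, x_min, x_max):
    """invert the min-max normalization to recover the original scale."""
    return np.asarray(x_norm) * (x_max - x_min) + x_min

# hypothetical monthly waste volumes (m3), not values from the dataset
volumes = [5198, 5131, 5092, 5068]
norm, lo, hi = minmax_normalize(volumes)
restored = denormalize(norm, lo, hi)
```

the stored minimum and maximum must be kept so that the predictions produced on the normalized scale can be mapped back to cubic metres.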
5  calculate the degree of membership for each input variable and membership function
6  calculate the firing strength of each rule based on the degrees of membership of the input variables
7  normalize the firing strength of each rule
8  calculate the weighted average of the consequent parameters for each rule
9  calculate the output of the system by summing the weighted averages of the consequent parameters over all rules
10 use a training algorithm to adjust the antecedent and consequent parameters to minimize the error between the predicted output and the actual output
11 repeat steps 4-9 for each input-output pair in the training data
12 test the trained model on new input-output pairs and evaluate its performance
13 // end

the network to be implemented uses anfis with the sugeno model. in the sugeno model, each fuzzy rule predicts the output as a linear function of the input variables. the test parameters used in this experiment can be seen in table 2.

d. predictive accuracy

mape was chosen to measure forecasting accuracy. mape is an error measure that calculates the percentage deviation between actual data and forecast data [20]. a forecasting model is considered excellent if it produces a mape value of less than 10% and bad if the value is above 50%. this study's mape score criteria were: <10% very good, 10-20% good, 20-50% good enough, and >50% bad [21]. the formula for calculating mape can be seen in (2).

$MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{X_t - F_t}{X_t} \right| \times 100\%$ (2)

where $n$ is the amount of data, $X_t$ is the actual value for period $t$, and $F_t$ is the predicted value for period $t$.

iii. results and discussion

in this part, the experimental findings of the anfis approach employed to forecast the landfill volume data are described.
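the forward step of the pseudocode (steps 5-9) for a first-order sugeno model can be sketched numerically as follows; this is a minimal two-input illustration with assumed gaussian membership functions and made-up premise/consequent parameter values, not the trained network from this study:

```python
import numpy as np

def gauss_mf(x, c, sigma):
    """gaussian membership degree of x for a fuzzy set centered at c (step 5)."""
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def anfis_forward(x1, x2, centers1, centers2, sigma, consequents):
    """one forward pass of a first-order sugeno anfis with 2 mfs per input."""
    # layer 1: membership degrees of each input for each fuzzy set
    mu1 = [gauss_mf(x1, c, sigma) for c in centers1]
    mu2 = [gauss_mf(x2, c, sigma) for c in centers2]
    # layer 2: firing strength of each of the 4 rules (product t-norm, step 6)
    w = np.array([mu1[i] * mu2[j] for i in range(2) for j in range(2)])
    # layer 3: normalized firing strengths (step 7)
    w_norm = w / w.sum()
    # layer 4: first-order consequent f = p*x1 + q*x2 + r per rule (step 8)
    f = np.array([p * x1 + q * x2 + r for p, q, r in consequents])
    # layer 5: weighted sum gives the crisp output (step 9)
    return float(np.dot(w_norm, f)), w_norm

# hypothetical premise and consequent parameters for illustration only
output, w_norm = anfis_forward(
    0.4, 0.7,
    centers1=[0.0, 1.0], centers2=[0.0, 1.0], sigma=0.5,
    consequents=[(0.2, 0.1, 0.0), (0.3, 0.2, 0.1),
                 (0.1, 0.4, 0.2), (0.5, 0.3, 0.1)],
)
```

the backward step (step 10), in which the premise and consequent parameters are tuned by hybrid learning, is omitted here for brevity.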
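equation (2), together with the score criteria from [21], can be sketched as a small python helper; the sample actual and predicted volumes below are hypothetical:

```python
def mape(actual, predicted):
    """mean absolute percentage error, as in (2)."""
    n = len(actual)
    return sum(abs((x - f) / x) for x, f in zip(actual, predicted)) / n * 100

def mape_grade(value):
    """score criteria from [21]: <10% very good, 10-20% good,
    20-50% good enough, >50% bad."""
    if value < 10:
        return "very good"
    if value <= 20:
        return "good"
    if value <= 50:
        return "good enough"
    return "bad"

# hypothetical actual vs. predicted monthly volumes (m3)
actual = [10993, 11115, 11242, 11375]
predicted = [10700, 11000, 11600, 11300]
score = mape(actual, predicted)
```

note that mape is undefined when an actual value is zero, which is not an issue here since monthly waste volumes are strictly positive.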
to obtain accurate forecasts across the experimental phases, setting the test parameters is crucial. the testing procedure was then carried out with data ratios of 5:2 and 6:1 so that the prediction outcomes could be compared, as shown in table 3. table 3 demonstrates that virtually all of the mape values produced for the sub-districts were acceptable, being below 50%, with the exception of samarinda ulu (k3), which reached 56.15% at the 6:1 ratio. each sub-district has a distinct mape under each of the two ratios. at the 5:2 ratio, the lowest mape is 4.01% in the sambutan district (k7), while the highest is 31.79% in the sungai kunjang district (k6). at the 6:1 ratio, the mape in north samarinda (k5) is the lowest at 3.36%, while the mape of k3, at 56.15%, is the highest. overall, the 5:2 ratio gives a better average mape (14.61%) than the 6:1 ratio (22.36%). predictions of the waste volume for 2019 were then established based on the most accurate available ratio; these forecasts are provided in table 4.

table 2. test parameter values

parameter                  value
number of mfs              2, 3
mf type                    trapmf, trimf, gbellmf
learning rate              0.2; 0.4; 0.6
epoch                      100
error rate                 0.01
step size decrease rate    0.9
step size increase rate    1.1

according to table 4, there is monthly variation in the total volume of garbage heaps across the subdistricts. the sungai kunjang subdistrict (k6) and the loa janan ilir subdistrict (k10) each have a constant predicted monthly volume throughout the year, at 10,269 m3 and 3,047 m3 per month, respectively.
the largest predicted garbage volume after one year, totaling 140,182 m3, is in the north samarinda subdistrict (k5). meanwhile, the loa janan ilir district (k10) has the smallest, at 36,564 m3. from the anfis prediction results for the volume of landfills in the 10 sub-districts of samarinda city in 2019 in table 4, several suggestions can be made to deal with waste problems: (1) evaluate the waste management system in each sub-district to reduce the volume of waste heaps produced; this can be done by improving the waste collection, sorting, and processing systems to make them more efficient. (2) campaign to reduce waste at the source at the individual and community levels; this can be done through education about good and correct waste management and by promoting the use of recycled products. (3) improve waste management infrastructure in each sub-district, such as constructing appropriate final disposal sites and more effective and efficient waste processing facilities. thus, it is expected that the volume of landfill waste generated, and the negative impact of improperly managed waste disposal, can be reduced.

iv. conclusion

predictions using anfis for the volume of landfill waste in 10 sub-districts of samarinda city, east kalimantan province, generally achieve acceptable mape values. the average mape for 2018 is 14.61% at a data ratio of 5:2 and 22.36% at a data ratio of 6:1. the 5:2 data ratio was then used to predict the volume of waste in 2019. from the prediction results, each sub-district experiences an increase in the volume of landfills in 2019 because population and industry have increased. the predicted volumes of waste piles across the sub-districts do not differ greatly.
some suggestions that can be submitted to deal with waste problems in each sub-district are evaluating the waste management system, carrying out waste reduction campaigns, and improving waste management infrastructure. the study's results show that the anfis method is good enough to be used as a predictive method. future research will compare and optimize the anfis method to improve prediction accuracy.

table 3. mape testing for 10 districts in samarinda city

subdistrict           code   ratio 5:2   ratio 6:1
palaran               k1     10.35%      9.95%
samarinda seberang    k2     9.61%       9.23%
samarinda ulu         k3     19.37%      56.15%
samarinda ilir        k4     6.62%       3.37%
north samarinda       k5     6.62%       3.36%
sungai kunjang        k6     31.79%      45.24%
sambutan              k7     4.01%       4.00%
sungai pinang         k8     24.88%      44.25%
samarinda city        k9     15.22%      4.62%
loa janan ilir        k10    17.60%      43.41%

table 4. prediction results for the volume of landfills (m3) for ten districts in samarinda city in 2019

month      k1     k2     k3      k4     k5      k6      k7     k8      k9     k10
january    5198   7157   10445   6176   10993   10269   5723   8928    6449   3047
february   5131   7505   10345   6124   11115   10269   5943   8874    6472   3047
march      5092   7824   10292   6090   11242   10269   6138   8851    6492   3047
april      5068   8092   10265   6067   11375   10269   6316   8842    6507   3047
may        5055   8305   10252   6053   11511   10269   6398   8838    6520   3047
june       5048   8468   10244   6044   11647   10269   6436   8836    6529   3047
july       5044   8589   10241   6038   11779   10269   6453   8835    6537   3047
august     5041   8678   10239   6034   11905   10269   6461   8835    6544   3047
september  5040   8743   10238   6032   12019   10269   6464   8835    6549   3047
october    5039   8789   10238   6030   12119   10269   6466   8835    6553   3047
november   5039   8822   10237   6029   12204   10269   6467   8835    6556   3047
december   5039   8846   10237   6029   12273   10269   6467   8835    6559   3047
total      60834  99818  123273  72746  140182  123228  75732  106179  78267  36564

declarations

author contribution: all authors contributed equally as the main contributors of this paper.
all authors read and approved the final paper.

funding statement: this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest: the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information: reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references

[1] s. dos santos et al., "urban growth and water access in sub-saharan africa: progress, challenges, and emerging research directions," sci. total environ., vol. 607–608, pp. 497–508, dec. 2017. [2] p. o. ukaogo, u. ewuzie, and c. v. onwuka, "environmental pollution: causes, effects, and the remedies," in microorganisms for sustainable environment and health, elsevier, 2020, pp. 419–429. [3] j. jambeck et al., "challenges and emerging solutions to the land-based plastic waste issue in africa," mar. policy, vol. 96, pp. 256–263, oct. 2018. [4] k. o. yoro and m. o. daramola, "co2 emission sources, greenhouse gases, and the global warming effect," in advances in carbon capture, elsevier, 2020, pp. 3–28. [5] s. m. simkovich et al., "the health and social implications of household air pollution and respiratory diseases," npj prim. care respir. med., vol. 29, no. 1, p. 12, apr. 2019. [6] j. m. chisholm et al., "sustainable waste management of medical waste in african developing countries: a narrative review," waste manag. res., vol. 39, no. 9, pp. 1149–1163, 2021. [7] k. g. boroojeni, m. h. amini, s. bahrami, s. s. iyengar, a. i. sarwat, and o. karabasoglu, "a novel multi-timescale modeling for electric power demand forecasting: from short-term to medium-term horizon," electr. power syst. res., vol. 142, pp. 58–73, jan.
2017. [8] h. haviluddin and r. alfred, “performance of modeling time series using nonlinear autoregressive with exogenous input (narx) in the network traffic forecasting,” proceeding ieee, pp. 164–168, 2016. [9] d. niu, f. wu, s. dai, s. he, and b. wu, “detection of long-term effect in forecasting municipal solid waste using a long short-term memory neural network,” j. clean. prod., vol. 290, p. 125187, mar. 2021. [10] o. o. ayeleru, l. i. fajimi, b. o. oboirien, and p. a. olubambi, “forecasting municipal solid waste quantity using artificial neural network and supported vector machine techniques: a case study of johannesburg, south africa,” j. clean. prod., vol. 289, p. 125671, mar. 2021. [11] c. estay-ossandon, a. mena-nieto, and n. harsch, “using a fuzzy topsis-based scenario analysis to improve municipal solid waste planning and forecasting: a case study of canary archipelago (1999–2030),” j. clean. prod., vol. 176, pp. 1198–1212, mar. 2018. [12] l. chhay, m. a. h. reyad, r. suy, m. r. islam, and m. m. mian, “municipal solid waste generation in china: influencing factor analysis and multi-model forecasting,” j. mater. cycles waste manag., vol. 20, no. 3, pp. 1761– 1770, jul. 2018. [13] y. van fan et al., “forecasting plastic waste generation and interventions for environmental hazard mitigation,” j. hazard. mater., vol. 424, p. 127330, feb. 2022. [14] u. soni, a. roy, a. verma, and v. jain, “forecasting municipal solid waste generation using artificial intelligence models—a case study in india,” sn appl. sci., vol. 1, no. 2, p. 162, feb. 2019. [15] l. a. h. purba and a. erliyana, “legal framework of waste management in indonesia,” 2020. [16] h. haviluddin and a. jawahir, “comparing of arima and rbfnn for short-term forecasting,” int. j. adv. intell. informatics, vol. 1, no. 1, pp. 15–22, 2015. [17] haviluddin and r. alfred, “performance of modeling time series using nonlinear autoregressive with exogenous input (narx) in the network traffic forecasting,” 2016. 
[18] mislan, h. haviluddin, s. hardwinarto, and m. aipassa, "rainfall monthly prediction based on artificial neural network: a case study in tenggarong station, east kalimantan indonesia," procedia comput. sci., vol. 59, pp. 142–151, 2015.
[19] s. kusumadewi and i. guswaludin, "fuzzy multi-criteria decision making," media inform., 2017. [20] a. pranolo, y. mao, a. p. wibawa, a. b. p. utama, and f. a. dwiyanto, "robust lstm with tuned-pso and bifold-attention mechanism for analyzing multivariate time-series," ieee access, vol. 10, pp. 78423–78434, 2022. [21] a. p. wibawa, z. n. izdihar, a. b. p. utama, l. hernandez, and haviluddin, "min-max backpropagation neural network to forecast e-journal visitors," in 2021 international conference on artificial intelligence in information and communication (icaiic), apr. 2021, pp. 052–058.

knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp.
128–137 eissn 2597-4637 https://doi.org/10.17977/um018v4i22021p128-137 ©2021 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) keds is a sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by the indonesian ministry of education, culture, research, and technology

a comparative study of machine learning-based approach for network traffic classification

kien trang a, b, 1, an hoang nguyen a, b, 2, *
a school of electrical engineering, international university, quarter 6, linh trung ward, thu duc city, ho chi minh city 700000, vietnam
b vietnam national university, ho chi minh city, linh trung ward, thu duc city, ho chi minh city 700000, vietnam
1 tkien@hcmiu.edu.vn; 2 nhan@hcmiu.edu.vn*
* corresponding author

i. introduction

the accelerated development of the internet has led humanity into a new era over the last decades. nowadays, internet applications are applied widely in different fields, including education and the working environment. over a million learners have been affected and needed to switch to distance learning due to the outbreak of covid-19 [1]. according to the survey in [2], approximately 37% of us residents worked remotely full-time in the first quarter of 2020, driving internet data usage to a new record height. the emergence of the internet of things (iot) has brought about a major shift in the growing number and variety of connected devices and the different applications supported by network service providers. thus, network traffic classification can solve complex network management problems for internet service providers (isps). the goal of network traffic classification is to identify the various types of network protocols and applications existing in a network to facilitate network management.
the packets are classified to determine the appropriate service policy for the routers. qos, network planning, monitoring, traffic trend analysis, and firewall configuration all benefit from traffic classification. moreover, internet traffic classification may be an important component of automated intrusion detection systems, for automatically identifying denial-of-service attacks and allocating network resources to priority

article info

article history: submitted 7 december 2021 revised 24 december 2021 accepted 29 december 2021 published online 31 december 2021

abstract: internet usage has increased rapidly and become an essential part of human life, corresponding to the rapid development of network infrastructure in recent years. thus, protecting users' confidential information when joining the global network becomes one of the most significant considerations. even though multiple encryption algorithms and techniques have been applied by different parties, including internet providers and web hosts, this situation also allows hackers to attack the network system anonymously. therefore, the significance of classifying network data streams to improve network system quality and security is attracting increasing research interest. this work introduces a machine learning-based approach to find the most suitable training model for network traffic classification tasks. data pre-processing is first applied to normalize each feature type in the dataset. different machine learning techniques, including k-nearest neighbors (knn), artificial neural network (ann), and random forest (rf), are then applied to the normalized features in the classification phase. the open-access dataset iscxvpn2016 is used for this research, which includes two types of encryption (vpn and non-vpn) and seven traffic categories.
experimental results on the open dataset have shown that the proposed models reach a high classification rate, over 85% in some cases, with the rf model obtaining the best results among the three techniques. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: artificial neural network; k-nearest neighbors; machine learning; network traffic classification; random forest

customers [3]. the isps can also increase the quality of service by accelerating the incident management process based on internet traffic classification. in network traffic classification, traditional methods have certain limitations. firstly, packet marking has been suggested to distinguish traffic based on its qos class. some common fields are used, such as type of service (tos), differentiated services code point (dscp), and explicit congestion notification (ecn). several protocols have then been proposed for traffic classification, including differentiated services (diffserv), integrated services (intserv), and multi-protocol label switching (mpls). due to system compatibility problems, these protocols are not widely deployed in practice. besides these, port-based and payload-based inspection are the commonly applied traditional techniques. for the port-based method, each packet carries a port number registered with the internet assigned numbers authority (iana), and classification can be performed based on the registered port number.
for instance, port 25 (smtp) and port 110 (pop3) are used to send and receive mail, respectively. however, due to the growth of internet applications, dynamic port numbers and tunneling are used to hide the port number, limiting this method [4]. for the payload-based method, the data packet's content is examined against the characteristics of network applications in internet traffic. this technique is especially recommended for peer-to-peer (p2p) applications. however, it also has certain limitations, due to the high hardware demands of detecting features in data packets and the inability to handle encrypted traffic [5][6]. in general, these traditional approaches have drawbacks in terms of classification accuracy and resources. over the last few years, in artificial intelligence (ai) research, machine learning (ml) has achieved remarkable success, allowing automatic identification and classification without human intervention in some cases. some recent research is gradually switching towards machine learning applications in network traffic classification. yuan et al. [7] introduced an advanced version of the decision tree called hadoop c4.5 to classify network traffic. the applied dataset contains eight classes with 248 properties per class. the results show an improvement in classification speed and accuracy compared to the original method, reaching over 80%. the study in [8] used the netmate tool to select 23 core features before training for classification. different algorithms are applied for comparison, including c4.5, support vector machine (svm), bayesnet, and naive bayes. among these experiments, c4.5 gives the highest accuracy, 78.9%, while the lowest, 68.1%, belongs to bayesnet. similarly, y. ma et al. also applied the c4.5 decision tree to classify internet traffic, reaching 88% average accuracy.
svm and k-means are employed on realistic internet traces in the research of z. fan et al. [9]. they apply feature selection before the training stage. different training and test set ratios are evaluated, and the overall results are about 98% for both classifiers. according to a study of these classification outcomes, classification models based on supervised learning algorithms have greater precision than those based on unsupervised learning methods. four distinct feature selection methods are also discussed as a pre-processing step in [10] to improve computational efficiency and limit classification error. the authors also conduct experiments on different classifiers, including k-nearest neighbors (knn), random forest (rf), and gradient boosting. the accuracy of the feature selection methods and classifiers is approximately 85% in general. the naive bayes classifier is also applied in works [11], [12], and [13], reaching over 90%, 93%, and around 55%, respectively. thanks to hardware support, deep learning has become one of the most helpful assistants in classification tasks. the convolutional neural network (cnn) is one of the most powerful methods for complicated image-based classification on huge datasets. the end-to-end architecture of a cnn can take input data directly, without feature extraction or pre-processing, and output predicted probabilities or predicted classes. although many proposed models were established for graphical classification, inspired by previous studies, many researchers try to adjust these models to fit network traffic classification. f. zhang et al. [14] proposed an improved version of the capsule neural network (capsnet) to identify network traffic. a conversion step and normalization are conducted to turn the features into a two-dimensional array before feeding them into the networks.
Three versions of the CNN are compared in the experiments, with an average accuracy of over 95%. Besides, the study [15] introduced using the pre-trained model ResNet and a self-developed CNN. The result from ResNet outperforms the self-developed CNN, reaching nearly 97% and 95.5%, respectively. The authors explain that ResNet has pre-trained weights and a more complex architecture than the other model.
K. Trang and A. H. Nguyen / Knowledge Engineering and Data Science 2021, 4 (2): 128–137
However, deep learning is not a universal method that applies to every case; indeed, the dependence on the dataset is one of the big challenges. Three distinct methods are conducted in the study [16]: random forest (RF), linear discriminant analysis (LDA), and a deep neural network (DNN). Regarding accuracy, the two traditional machine learning methods, RF and LDA, obtain higher results than the DNN in scenario A, while the DNN improves over RF in scenario B. L. Zhipeng et al. [17] discussed using two famous pre-trained CNN models: ResNet50 and GoogLeNet. Since these two models are designed for images, one-hot encoding transforms the symbolic features into binary features stored as vectors. Afterward, the binary vectors are converted to grayscale images. The results are about 81% for the two given models. To deal with a dataset of limited samples, the work in [18] proposed using a deep convolutional generative adversarial network (DCGAN) to generate more samples before the training process. This network can perform semi-supervised learning with the existing samples and create new data to enrich the dataset. By this method, the learning for classification has more data for training and testing, which can improve generalization and prevent overfitting. With a baseline CNN, this study achieves 89% and 78% on the self-collected and ISCX datasets, respectively.
The research in [19] proposed feature extraction based on a convolutional recurrent autoencoder neural network. The proposed approach is built on the autoencoder architecture, consisting of the encoder, the latent space, and the decoder. Different DNNs are applied to verify the performance, including a CNN, a sparse autoencoder (SAE), and long short-term memory (LSTM). Ultimately, the stacked CNN–LSTM architecture reaches the highest performance in almost all metrics. Table 1 summarizes the related studies in network traffic classification.

Table 1. Comparative studies

Research  Method                       Result                           Number of classes
[7]       C4.5 decision tree           over 80%                         8
[8]       C4.5 decision tree           78.9%                            5
          SVM                          74%
          BayesNet                     68.1%
          Naive Bayes                  71.8%
[9]       SVM, k-means                 ~98% for both methods            6
[10]      KNN                          ~85% for all cases               not mentioned
          Random forest
          Gradient boosting
[11]      Naive Bayes                  over 90%                         7
[12]      Naive Bayes                  93%                              not mentioned
[13]      Naive Bayes                  54–55% for all cases             3
[14]      Improved CapsNet             over 95%                         12
[15]      ResNet                       ~97%                             8
          Self-developed CNN           ~95.5%
[16]      Random forest                95%, 42% for scenarios A, B      3
          LDA                          98%, 76% for scenarios A, B
          DNN                          69%, 74% for scenarios A, B
[17]      ResNet50                     81.5%                            5
          GoogLeNet                    81.8%
[18]      DCGAN + baseline CNN         89% for self-collected dataset   not mentioned
                                       78% for ISCX dataset
[19]      CNN-SAE-CNN                  > 95% for all cases              4
          LSTM-SAE-NN
          CNN-LSTM-SAE-NN
          StackedCNN-LSTM-SAE-NN

Although various parties, including internet service providers and web hosting companies, have adopted different encryption methods and approaches, this circumstance also allows a hacker to attack the network system anonymously. As a result, the importance of classifying network data streams in order to improve the quality and security of network systems is drawing an increasing amount of research interest.
This work introduces a machine learning-based approach for determining the most appropriate training model for network traffic classification tasks, described in detail in the following sections.

II. Approach

Figure 1 depicts the processing chart of the proposed approach. The dataset applied in this work is taken from [20]. Before feeding into the machine learning models, pre-processing is applied to meet some basic requirements, including normalization and data transformation. Then, the dataset is divided into two subsets: a training set and a test set. Finally, different traditional machine learning models are applied to test different scenarios. From the comparative studies in the previous section, the traditional models mostly give better performance than the advanced models; this can be explained by the fact that different datasets vary in size and latent properties, so deep learning techniques may not perform well on small datasets. Thus, k-nearest neighbors (KNN), artificial neural network (ANN), and random forest (RF) are chosen for this study.

A. Data Pre-processing

Since the given dataset contains different types of features with various ranges, a pre-processing step is applied before classification. Normalization is necessary to convert the numerical values to a similar scale without affecting the differences in value range. Min-max normalization is applied as in (1):

$d_i' = \frac{d_i - \min(d)}{\max(d) - \min(d)}$ (1)

where $d$ is the feature vector, $d_i$ is each element in the feature vector, and $d_i'$ is the corresponding normalized element. After this process, each feature is in the range of 0 to 1. Besides, each class's label name, such as vpn-mail and vpn-voip, needs to be converted into a numeric value. Missing values, which can be caused by data corruption or a failure to record data, also influence classification performance.
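The pre-processing steps described here (min-max scaling as in Eq. (1), numeric label encoding, and removal of records with missing values) can be sketched as below. This is a minimal illustration assuming NumPy is available; the function names and toy data are ours, not the authors' code.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column to [0, 1] per Eq. (1)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def encode_labels(labels):
    """Map class names such as 'vpn-mail' to integer codes."""
    classes = sorted(set(labels))
    mapping = {c: i for i, c in enumerate(classes)}
    return np.array([mapping[l] for l in labels]), mapping

def drop_missing(X, y):
    """Remove rows containing NaN, since some models cannot handle them."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

# toy data: one record has a missing feature value
X = np.array([[10.0, 0.5], [20.0, 1.5], [np.nan, 2.5], [30.0, 3.5]])
y = np.array(["vpn-mail", "vpn-voip", "vpn-mail", "vpn-voip"])

X, y = drop_missing(X, y)
y_num, mapping = encode_labels(y)
X_norm = min_max_normalize(X)
```

After these steps every feature lies in [0, 1] and every label is an integer, which is what distance-based models such as KNN require.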
Since some machine learning algorithms are not able to work with missing values, the corresponding data records are removed so that they do not affect the training process.

Fig. 1. The processing chart of the proposed algorithms

B. Machine Learning Models

In general, machine learning is the process of seeking and describing structural patterns in a given data set. The output of a machine learning model is a description of the learned knowledge, which can be used for classification or regression.

1) K-Nearest Neighbors (KNN)

KNN is one of the most fundamental and simplest supervised machine learning algorithms; it operates by grouping samples of the dataset that have similar characteristics [21]. Instead of learning from the training data, KNN simply memorizes all of it. All computation is then conducted in the test phase: every time a sample of the test dataset is input for classification, the algorithm computes the distance between the test data point and the training points. The predicted label depends on the labels of the nearest data points, i.e., those with the minimum distance [22]. In addition, a voting process may be conducted when the nearest data points carry different labels. Let $x = \{x_1, x_2, \ldots, x_n\}$ be a sample, where $x_1, x_2, \ldots, x_n$ are the features of the sample. The majority rule specifies the classification procedure based on the $k$ nearest reference vectors to the projection of the sample $x$. All samples in the data set are assumed to correspond to points in an n-dimensional space denoted by $\mathbb{R}^n$. Distance metrics define the distance between points in this space. The formula for calculating the distance between samples $x_i$ and $x_j$ is defined in (2).
$d(x_i, x_j) = \left( \sum_{f=1}^{n} \left| x_{i_f} - x_{j_f} \right|^p \right)^{1/p}$ (2)

where $x_{i_f}$ and $x_{j_f}$ are the values of feature $f$ of the data samples $x_i$ and $x_j$, respectively. Next, the algorithm selects the $k$ samples in the training set with the closest distance to the input sample. The label of the sample $x$ is then decided from the classes of those $k$ samples according to the rule of majority voting.

2) Artificial Neural Network (ANN)

The ANN is a machine learning algorithm that simulates the biological neural activity of humans. This method consists of three main layer types: input, hidden, and output. Each layer consists of many neurons, which are connected to process information. Each neuron receives and processes its inputs to produce an output; the output of a neuron can in turn be used as an input for other neurons. Independent values in the input are passed through the network's nodes to produce dependent values in the output, where those output values must correspond to the input data group as independent variables. Each input value $x_i$ is attached to a corresponding weight $w_i$ and bias $b_i$, representing the importance of that input value at the neuron node compared to the other input values. The neuron computes the summation of all input values with their weights and biases. These weights are set randomly at initialization; during training, updated weights are computed through the optimization process. Then, an activation function maps the input values of a neuron node to its output. The mathematical representation is defined in (3):

$m = \sum_{i=1}^{k} (w_i x_i + b_i)$ (3)

where $k$ is the number of input values passing through a neuron. After applying the activation function, (3) is adjusted to (4):

$y = f(m) = f\left( \sum_{i=1}^{k} (w_i x_i + b_i) \right)$ (4)
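The KNN procedure just described, the Minkowski distance of Eq. (2) followed by a majority vote over the $k$ nearest training samples, can be sketched as follows. This is a toy sketch assuming NumPy; the names `minkowski` and `knn_predict` are ours, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def minkowski(a, b, p=2):
    """Distance of Eq. (2); p=2 gives the Euclidean distance."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, x, k=3, p=2):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.array([minkowski(xt, x, p) for xt in X_train])
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority rule

# toy training set: two clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.05, 0.1])))  # → 0
```

Note that all work happens at prediction time, which is why KNN test-phase cost grows with the size of the training set.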
3) Random Forest (RF)

Random forest, developed in the study [23], is a combination of multiple decision trees referred to as the bagging method. The typical decision tree model classifies the data samples in the training dataset based on their features. The training process starts from the root with the complete dataset, which is split into smaller subsets at intermediate or terminal nodes based on the values of specific metrics, such as entropy or the Gini index, of one respective feature. The entropy indicates the randomness with respect to the analyzed feature and decides how the model splits the data into subsets based on that feature. Then, based on the entropy values, the model calculates the information gain, which determines how well the data were split. The decision tree mostly tries to maximize the information gain while keeping the entropy value minimal. The formulas for calculating the entropy and information gain are given in (5) and (6):

$E = -\sum_{i=1}^{c} p_i \log_2(p_i)$ (5)

$IG = E(T) - \sum_{j=1}^{k} E(j, T)$ (6)

where $c$ is the number of classes, $p_i$ is the proportion of samples belonging to class $i$, and $k$ is the maximum number of subsets produced by the split. The random forest utilizes different inputs with different features for each decision tree. Multiple prediction outcomes are made to classify the data samples, and the final classification of the random forest model is based on the majority rule over the outcomes of those decision trees. Therefore, increasing the number of decision trees during RF model creation helps to increase the classification accuracy while avoiding heavy computation in the hyperparameter tuning process.

III. Results Analysis

This section presents the dataset and scenario descriptions, the evaluation metrics, and a discussion of the results obtained from the three machine learning models: RF, KNN, and ANN.

A. Dataset Description

In the scope of this research, the dataset VPN – non-VPN (ISCXVPN2016) [20] is used for the training and testing phases. It was created in an experiment at the University of New Brunswick, Canada, in which the dataset generators created two user accounts to participate in different internet services such as Facebook, uTorrent, and Skype. Each class inside the dataset is also divided into two categories: non-VPN and VPN encrypted traffic. Therefore, the total number of labels for classification is up to 14 classes. Following the nature of the dataset, the training and testing scheme is divided into two steps. The first step is to classify the two general classes: non-VPN and VPN encrypted traffic flow. Afterward, seven distinct traffic flows are classified within each class. The detailed classification process is described in Figure 2. Besides the division by type of internet traffic, the data also includes a time-based division: for each step of the classification process, the data is divided into four categories of 15, 30, 60, and 120 s.

B. Evaluation Metrics

In this study, the experiments are conducted in the Colab Pro environment with 26 GB of RAM and an NVIDIA Tesla P100 GPU. The applied dataset is separated into two subsets, a training set and a test set, following an 80/20 ratio.

Fig. 2. Dataset classification scenarios

Besides, cross-validation is not applied in this case because of the large dataset. In machine learning and artificial intelligence, one of the most common evaluation means is the confusion matrix.
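The evaluation protocol just described, an 80/20 train/test split and metrics derived from confusion-matrix counts (TP, FP, TN, FN), can be sketched as below. This is a minimal illustration assuming NumPy, with toy labels and function names of our own choosing, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(42)

def train_test_split(X, y, test_ratio=0.2):
    """Shuffle and split following the 80/20 ratio used in this study."""
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_ratio))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from TP/FP/TN/FN counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# toy split: 10 samples -> 8 for training, 2 for testing
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y)

# toy predictions against ground truth (e.g., VPN = 1, non-VPN = 0)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
```

In practice per-class (one-vs-rest) counts would be accumulated for the multi-class scenarios, but the same four formulas apply.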
The confusion matrix is often implemented to evaluate the performance of a supervised learning model and the level of confusion between classes. It consists of four main parameters: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). From these four numbers, the learning models can be examined through the frequently used evaluation metrics: accuracy, precision, recall, and F1-score. Among the four metrics, the most familiar indicator is the accuracy level, the fraction of samples classified correctly into their respective labels within the whole dataset. The formula for the accuracy value is given in (7):

$A = \frac{TP + TN}{TP + TN + FP + FN}$ (7)

Even though the accuracy level is frequently used to obtain a basic understanding of the learning models, the number of samples falsely classified into incorrect labels should not be neglected. Therefore, to perform a complete assessment of the given learning models, a combination of other metrics is necessary. Precision is the proportion of samples classified into a class that actually belong to that class. Recall, in turn, is the number of samples accurately classified into a class over the total number of samples that truly belong to that class. Finally, the F1-score is the combined metric of the precision and recall values; it is high only when both of those metrics are high. Through the analysis of the F1-score, the assessment process acquires a thorough evaluation of the efficiency of the learning model. The formulas for the above values are given in (8), (9), and (10), respectively.
$\text{Precision} = \frac{TP}{TP + FP}$ (8)

$\text{Recall} = \frac{TP}{TP + FN}$ (9)

$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (10)

1) Scenario A1

In scenario A1, the primary purpose of classification is to distinguish the internet traffic flow into two categories: non-VPN and VPN encrypted traffic. The dataset is divided into four subsets of data samples with different recording durations: 15, 30, 60, and 120 s. The evaluation metrics of the three machine learning models on all four subsets are recorded in Figure 3.

Fig. 3. Recorded evaluation metrics for KNN, ANN, and RF models – scenario A1

At a glance, the results recorded for the RF model are the highest, followed by the KNN, with the lowest results from the ANN model. Across the evaluation metrics accuracy, precision, recall, and F1-score, the RF model consistently produces values in the 88–94% range. The recall value of the RF model on the 60 s subset is the only exception, at 85%, which is still higher than the other models. The ANN model is the least effective in classification, with its values staying at approximately 77% on average. However, compared to the other two models, the ANN is the most balanced, since all four metrics are almost the same across the different time-based subsets; in other words, the time feature does not affect the performance of the ANN model. On the other hand, the KNN model provides relatively high results, with most metrics at approximately 80–86%. The 60 s subset of data is the worst time-based subset for this model, with 82%, 82.1%, 76.3%, and 79.13% recorded for accuracy, precision, recall, and F1-score, respectively.

2) Scenario A2 – Non-VPN

In contrast with the total domination of the RF model in scenario A1, the best evaluation metrics on the non-VPN subsets are divided alternately between the ANN and RF models.
To be more specific, the RF model mostly scores highest in the accuracy and precision aspects, whereas the recall and F1-score are greater for the ANN model than the other two, as indicated in Figure 4. In the accuracy metric, all of the time-based subsets produce results above 90% for the RF model, with the only exception of the 120 s dataset, where the KNN and RF share the same 92.8% value. For the precision aspect, the upper range recorded for the RF model is around 86–88%. Another point to note is that, even though the ANN is not always the best model, its results are all greater than 80%. On the other hand, the recall and F1-score metrics mark a big drop in the performance of the RF and KNN models. All KNN results fall below 70%, with the lowest value being the recall on the 60 s dataset, at only 61%. The drop in evaluation metrics also appears in the RF model as the duration of the time-based subsets increases. The highest values are on the 15 s dataset, with a recall of 81.5% and an F1-score of 84.8%; on the 120 s dataset, these values fall to 66.1% and 73.5%, respectively. In contrast, the ANN model shows the most stable values, mostly greater than 80%. The only exception is the 60 s dataset, in which the values are around 2% lower than those of the RF model.

3) Scenario A2 – VPN

In the case of the VPN encrypted subset, the performance of the learning models records a significant drop in precision, recall, and F1-score, except for the ANN model. The results are illustrated in Figure 5.

Fig. 4. Recorded evaluation metrics for KNN, ANN, and RF models – scenario A2 – non-VPN

The RF model still provides three of the four highest values in the accuracy metric, all of which are larger than 86%. However, in the 120 s subset the peak value belongs to the KNN model, with 87.8% classification accuracy.
On the contrary, the ANN model displays total domination in the three remaining metrics. Most of the values on the 15, 30, and 60 s datasets are recorded in the range of approximately 80.5%–84%. The trend only decreases on the 120 s dataset, with the precision, recall, and F1-score being 78.4%, 74.4%, and 76.3%, respectively.

IV. Conclusions

In this research, different machine learning models are evaluated to classify the multiple internet traffic flows included in the open-access VPN – non-VPN (ISCXVPN2016) dataset. The learning models include the random forest, the k-nearest neighbors, and the artificial neural network. The models are trained and then perform the classification task in two steps: the non-VPN and VPN classification in scenario A1, after which the models classify each subset into seven different internet traffic classes. Based on the obtained results, the random forest is the most suitable training model for this dataset, even though the classification results indicate that it is less accurate on long-duration data samples, such as those in the 120 s subset. In future research, different datasets with more complex internet traffic classification schemes and more effective yet suitable training models, such as reinforcement learning models, could be considered for further analysis. The ISCXVPN2016 dataset is well established, with different categories and sub-scenarios. However, new internet and communication protocols and applications are emerging daily, corresponding to the rapidly increasing internet usage rate all over the world. Encryption protocols are also being developed to protect users' personal information and secure internet connections. Therefore, appropriate training models fitting the purpose of internet flow classification, suitable for practical application and development, remain the main target for research in this field.
Declarations

Author contribution
All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest
The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 5. Recorded evaluation metrics for KNN, ANN, and RF models – scenario A2 – VPN

Additional information
Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] G. R. El Said, "How did the COVID-19 pandemic affect higher education learning experience? An empirical investigation of learners' academic performance at a university in a developing country," Advances in Human-Computer Interaction, vol. 2021, pp. 1–10, Feb. 2021.
[2] L. Yang, D. Holtz, S. Jaffe, S. Suri, S. Sinha, J. Weston, C. Joyce, N. Shah, K. Sherman, B. Hecht, and J. Teevan, "The effects of remote work on collaboration among information workers," Nature Human Behaviour, Sep. 2021.
[3] L. Stewart, G. Armitage, P. Branch, and S. Zander, "An architecture for automated network control of QoS over consumer broadband links," TENCON 2005 IEEE Region 10 Conference, pp. 1–6, Nov. 2005.
[4] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, "Transport layer identification of P2P traffic," Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04), New York, pp. 121–134, Sep. 2004.
[5] P. B. Park, Y. Won, J. Chung, M. Kim, and J. W.-K. Hong, "Fine-grained traffic classification based on functional separation," International Journal of Network Management, vol. 23, no. 5, pp. 350–381, Aug. 2013.
[6] G. Aceto, A. Dainotti, W. de Donato, and A. Pescapé, "PortLoad: taking the best of two worlds in traffic classification," 2010 INFOCOM IEEE Conference on Computer Communications Workshops, pp. 1–5, Mar. 2010.
[7] Z. Yuan and C. Wang, "An improved network traffic classification algorithm based on Hadoop decision tree," 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), pp. 53–56, May 2016.
[8] M. Shafiq, X. Yu, A. A. Laghari, L. Yao, N. K. Karn, and F. Abdessamia, "Network traffic classification techniques and comparative analysis using machine learning algorithms," 2016 2nd IEEE International Conference on Computer and Communications (ICCC), pp. 2451–2455, Oct. 2016.
[9] Z. Fan and R. Liu, "Investigation of machine learning based network traffic classification," 2017 International Symposium on Wireless Communication Systems (ISWCS), pp. 1–6, Aug. 2017.
[10] A. Pasyuk, E. Semenov, and D. Tyuhtyaev, "Feature selection in the classification of network traffic flows," 2019 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon), pp. 1–5, Oct. 2019.
[11] Y. Wang, Y. Xiang, and S. Yu, "Internet traffic classification using machine learning: a token-based approach," 2011 14th IEEE International Conference on Computational Science and Engineering, pp. 285–289, Aug. 2011.
[12] S. Dong and R. Jain, "Flow online identification method for the encrypted Skype," Journal of Network and Computer Applications, vol. 132, pp. 75–85.
[13] M. Dixit, R. Sharma, S. Shaikh, and K. Muley, "Internet traffic detection using naïve Bayes and k-nearest neighbors (KNN) algorithm," 2019 International Conference on Intelligent Computing and Control Systems (ICCS), pp. 1153–1157, May 2019.
[14] F. Zhang, Y. Wang, and M. Ye, "Network traffic classification method based on improved capsule neural network," 2018 14th International Conference on Computational Intelligence and Security (CIS), pp. 174–178, Nov. 2018.
[15] H. Lim, J. Kim, J. Heo, K. Kim, Y. Hong, and Y. Han, "Packet-based network traffic classification using deep learning," 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 046–05, Feb. 2019.
[16] J. Kwon, D. Jung, and H. Park, "Traffic data classification using machine learning algorithms in SDN networks," 2020 International Conference on Information and Communication Technology Convergence (ICTC), pp. 1031–1033, Oct. 2020.
[17] Z. Li, Z. Qin, K. Huang, X. Yang, and S. Ye, "Intrusion detection using convolutional neural networks for representation learning," Lecture Notes in Computer Science, pp. 858–866, 2017.
[18] A. S. Iliyasu and H. Deng, "Semi-supervised encrypted traffic classification with deep convolutional generative adversarial networks," IEEE Access, vol. 8, pp. 118–126, 2020.
[19] G. D'Angelo and F. Palmieri, "Network traffic classification using deep convolutional recurrent autoencoder neural networks for spatial–temporal features extraction," Journal of Network and Computer Applications, vol. 173, p. 102890, 2021.
[20] G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of encrypted and VPN traffic using time-related features," Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pp. 407–414, Feb. 2016.
[21] H. A. H. Ibrahim, O. R. Aqeel Al Zuobi, M. A. Al-Namari, G. Mohamed Ali, and A. A. A. Abdalla, "Internet traffic classification using machine learning approach: datasets validation issues," 2016 Conference of Basic Sciences and Engineering Studies (SGCAC), pp. 158–166, Feb. 2016.
[22] A. Moldagulova and R. B. Sulaiman, "Using KNN algorithm for classification of textual documents," 2017 8th International Conference on Information Technology (ICIT), pp. 665–671, May 2017.
[23] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, Mar. 1986.
knowledge engineering and data science (keds) pissn 2597-4602 vol 1, no 2, september 2018, pp. 46–54 eissn 2597-4637 https://doi.org/10.17977/um018v1i22018p46-54 ©2018 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) digit classification of majapahit relic inscription using glcm-svm tri septianto a, 1, *, endang setyati a, 2, joan santoso a, 3 a department of information technology, sekolah tinggi teknik surabaya, ngagel jaya tengah 73-77, surabaya-60284, indonesia 1 septianto3@gmail.com; 2 endang@stts.edu*; 3 joan@stts.edu * corresponding author i. introduction currently, many studies discuss the classification of images with various objects and methods. image classification is commonly used to build object recognition applications, for example text and character recognition on metal sheets [1], deep learning for handwritten javanese character recognition [2], content-based image retrieval for multi-object fruit recognition using k-means and k-nearest neighbor [3], license plate automatic recognition based on edge detection [4], and arabic handwriting recognition using sequential minimal optimization [5]. in this study, the authors used images of the digit of the year on inscriptions left by the majapahit kingdom in java, indonesia. the classification method used is a support vector machine (svm), while the gray-level co-occurrence matrix (glcm) is the method used for feature extraction.
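the glcm texture features relied on here (contrast, correlation, and homogeneity) can be sketched in numpy as follows. this is a minimal illustrative sketch, not the authors' implementation: the function name and offset convention are assumptions, and a production pipeline would more likely use scikit-image's graycomatrix and graycoprops.

```python
import numpy as np

def glcm_features(img, levels, dr=0, dc=1):
    """Build a normalized gray-level co-occurrence matrix for the pixel
    offset (dr, dc) and return (contrast, correlation, homogeneity).
    Minimal sketch; names and offset handling are illustrative."""
    img = np.asarray(img)
    C = np.zeros((levels, levels), dtype=float)
    rows, cols = img.shape
    # Count co-occurring gray-level pairs at the given offset.
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                C[img[r, c], img[nr, nc]] += 1
    C /= C.sum()  # normalize by the total number of pixel pairs
    i, j = np.indices(C.shape)
    contrast = np.sum((i - j) ** 2 * C)
    homogeneity = np.sum(C / (1.0 + (i - j) ** 2))
    mu_i, mu_j = np.sum(i * C), np.sum(j * C)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * C))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * C))
    # Guard against degenerate images, where correlation is undefined.
    if sd_i * sd_j == 0:
        correlation = 0.0
    else:
        correlation = np.sum((i - mu_i) * (j - mu_j) * C) / (sd_i * sd_j)
    return contrast, correlation, homogeneity
```

for a vertical-stripe image such as [[0, 1], [0, 1]] with a horizontal offset, every pixel pair is (0, 1), so the contrast is 1.0 and the homogeneity is 0.5.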
the purpose of this study was to classify images of the digit of the year, whose forms have their own uniqueness. this classification is used as a research object to help the public recognize the year written on inscriptions stored in the trowulan museum, mojokerto, east java. additionally, it can be used to document images of the digit of the year digitally. the images are classified into classes 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0. with this study, the authors expect the results to contribute directly to the preservation of majapahit kingdom relics in indonesia, since this work participates in preserving the history and culture of indonesia. so far, there has been no research which employs images of the digit of the year on relic inscriptions of the majapahit kingdom. therefore, the authors were interested in building an application to recognize such images using recent technology. the authors also expect that this research could inspire other researchers in the future in utilizing other relic inscriptions of the majapahit kingdom, so that indonesian culture and history can be preserved continuously.

article info — article history: received 23 march 2018; revised 17 april 2018; accepted 15 june 2018; published online 31 august 2018. keywords: classification; features extraction; digit of year; support vector machine; gray-level co-occurrence matrix.

abstract — a higher level of image processing usually contains some kind of classification or recognition. digit classification is an important subfield of handwritten recognition. handwritten digits are characterized by large variations, so template matching is in general inefficient and low in accuracy. in this paper, we propose the classification of the digit of the year of relic inscriptions of the majapahit kingdom using a support vector machine (svm). this method is able to cope with very large feature dimensions without reducing the existing feature extraction. the method used for feature extraction is the gray-level co-occurrence matrix (glcm), which is specialized for texture analysis. the experiment is divided into 10 classification classes, namely classes 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0. each class is tested with 10 data, so the whole testing set contains 100 digit-of-year images. the use of the glcm and svm methods obtained an average classification result of about 77 %. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

in this research, the svm method was chosen as the classification method. svm is a classifier which utilizes the farthest margin distance from the hyperplane [6][7]. svm is also commonly known as a method which is able to process high-dimensional data without reducing the dimension of the data [8][9]. svm frequently provides an identical model and solution, and the resulting model can be reused in the testing process. svm can separate data whose distribution is linearly or non-linearly separable. meanwhile, this research employed glcm for feature extraction. glcm is a simple feature since it is obtained from the gray levels of an image [10][11][12][13]. there are several similar studies which employed the svm method for image classification and glcm for feature extraction, such as the research conducted by [7], [10], and [13]; however, they utilized different image objects. in the research conducted by [7], the authors utilized images of tumors, and the detection trials obtained a precision score of 93.3 %.
in the research conducted by [10], the authors succeeded in classifying hazy and non-hazy images, obtaining an accuracy of 97.16 % for the synthetic database and 85 % for the genuine database. meanwhile, in the research conducted by [13], the authors obtained an accuracy of 98.32 % on sub-image a, 84.49 % on sub-image b, and 78.96 % on sub-image c from segmentation of oil palm. in line with these successful previous studies, the authors decided to employ svm and glcm as the methods of this research. to facilitate understanding of the research process, the article is divided into three parts: research methodology, experiment results, and conclusion. the research methodology part provides three sub-parts: data collection, method, and data analysis. the data collection sub-part explains how the authors obtained the dataset used in this study. the method sub-part explains how svm and glcm work. meanwhile, the data analysis sub-part discusses how the dataset is processed by the chosen methods. the experiment results part presents the results of the trials of the svm method, comparing image classification whose feature extraction employed glcm against image classification without glcm. the conclusion part summarizes the main thread of the methods used and compares the trial results. ii. methods a. data collection the data used in this research are images of the digit of the year from relic inscriptions of the majapahit kingdom. the images were taken from sampling and image quantification, from which 1000 images were acquired.
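partitioning such a dataset into per-class training and testing subsets (the paper uses 90 training and 10 testing images per class) could be sketched as below; the function and variable names are hypothetical, not from the original work.

```python
import random

def per_class_split(images_by_class, n_train, n_test, seed=0):
    """Split a {class_label: [image, ...]} mapping into per-class
    training and testing lists of (image, label) pairs.
    Illustrative sketch of a stratified 90/10-per-class split."""
    rng = random.Random(seed)
    train, test = [], []
    for label, images in images_by_class.items():
        images = list(images)
        rng.shuffle(images)  # shuffle before cutting off the two subsets
        train += [(img, label) for img in images[:n_train]]
        test += [(img, label) for img in images[n_train:n_train + n_test]]
    return train, test
```

with 10 classes of 100 images each, this yields 900 training and 100 testing samples, matching the split described in this paper.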
in this research, 900 digit-of-year images were used as training data and 100 digit-of-year images were used as testing data. each class contains 90 training images and 10 testing images (an example is shown in fig. 1). fig. 1. example of images of digit of year 1, 2, 3, 4, 5, 6, 7, 8, 9, 0. b. feature extraction the feature extraction stage is intended to extract the characteristics or information of the object in the image so that it can be identified or distinguished from other objects. the extracted characteristics or features are then used as parameters or input values to distinguish objects from one another at the identification or classification stage. numerous methods can be used for feature extraction; in this study, the authors chose the glcm method for the digit-of-year images, for the reasons noted in the previous section. glcm is a tabulation of how often different combinations of pixel brightness values (gray levels) occur in an image; it is a matrix whose dimensions are based on the gray levels of the image. glcm has 22 features, but the most important are 3 elements, namely contrast (con), correlation (corr), and homogeneity (hom) [10]. contrast represents the amount of local gray-level variation in an image. correlation is a linear measure of the gray level between neighboring pixels. homogeneity measures the homogeneity of gray-level variation in an image; usually, when the contrast value is small, the homogeneity value is large. the glcm, c(i, j), is calculated based on a distance d and a direction θ [13], as written in the following equation (1).
      1 0 1 21, }),(&),({ m x n oy ji jdydxiiyxipc  (1) where ci,j is the intensity of the co-occurrence matrix; i and j are a couple of pixels with intensity values i(x, y) = i and i(x ± d1,y ± d2) = j; x = 0, 1, ..., m-1 and y = 0, 1, ..., n-1; while m and n is the number of rows and columns of the matrix. after the intensity of the co-occurrence matrix is formed, then each element of the matrix p{•} needs to be normalized by dividing each element with a number which is the sum total of the pixel pair. the result of the normalization of p{•}, can be seen in the following equation (2).     false isargument theif,0 trueisargument theif,1 }{p (2) the contrasting parameter represents gray level variations in an image file, usually used as a parameter contrast linear dependence on the value of neighboring pixels of gray level. contrast can also be referred to as the variance of the sum of squares (sum of squares variance). the formula for calculating the contrast (con), can be seen in the following equation (3).   i j jicjicon , 2)( (3) the correlation parameters showed a linear dependence of the degree of gray pixels neighboring each other in a gray image. equation correlation (corr), can be seen in the following equation (4), where  is the deviation standard of x and y.    i j yx jiyx cujui corr  ,))(( (4) homogeneity is the parameter in glcm that indicates homogeneity intensity variations in an image. this homogeneity equation is said to represent the roughness in the image field. calculation of homogeneity (hom) can be seen in the following equation (5).    i j jic ji hom ,2)(1 1 (5) c. classification data mining is a process of inferring knowledge from huge data. classification is a major technique in data mining and widely used in various fields. classification is data mining (machine learning) technique used to predict group membership for data instances. by simple definition, the classification t. septianto et al. 
/ knowledge engineering and data science 2018, 1 (2): 46–54 49 analyzes a set of data and generate a set of grouping rules which can be used to classify future data. for the classification, we focus on support vector machine (svm) [14]. svm is a powerful algorithm with strong theoretical foundations. svm has strong regularization properties, which refers to the generalization of the model to new data [14][15]. svm is a supervised machine learning algorithm. supervised learning method processed through two steps: training and testing [6]. svm is based on the concept of decision making [7]. a decision-making is based on the separation of feature members of different classes. svm is chosen, because it is able to cope with very large feature dimensions and without reducing existing features. an svm performs classification by constructing an n-dimensional hyperplane [15]. the purpose of svm is to find the largest margin of hyperplane [10]. an svm is a mathematical entity, an algorithm for maximizing a particular mathematical function with respect to a given collection of data. svm can handle data that is linear and non-linear data. simple svm is usually linear in dividing features into both classes. the svm linear kernel function is commonly described as equation (6). yxyxk .),(  (6) where, k(x, y) is inner product of x and y. svm is grouped into two type linear and non-linear classification. the linear svm classifier is worthwhile for the non-linear classifier to map the input pattern into higher dimensional feature space. in here, we used non-linear svm, which is divided into several classes. therefore, svm requires a kernel. gaussian kernel or commonly known as radial basis function (rbf) can be written as equation (7). the rbf kernel, is applied to two samples x and y, which indicate as feature vectors in some input space and it can be defined as,           2 2 2 exp),(  yx yxk (7) d. 
data analysis in this study, the trial was conducted using two scenarios. the first experiment performed data classification without passing through the glcm feature extraction process; the workflow without glcm features can be seen in fig. 2. the second trial passed through the glcm feature extraction process, whose workflow can be seen in fig. 3. in the workflow of fig. 2, after pre-processing, the image is converted into a matrix of pixels and then into a vector; the vectorized images are then passed to the svm method to perform the classification process. in the workflow of fig. 3, the resulting glcm features are used for classification. the workflows of the system without glcm features and with glcm features are therefore different, so they can generate different predictions. classification on the training set, using the digit-of-year images in fig. 4, produces a model, and this model is then used to make predictions on the testing set.

fig. 2. workflow without glcm process
fig. 3. workflow employing glcm process

examples of the digit-of-year images from ancient relic inscriptions used for the training set and testing set can be seen in fig. 4 and fig. 5, each of which represents classes 0 to 9. at the pre-processing stage, the digit-of-year image is resized to 32x32 pixels and the original rgb image is converted into grayscale. fig. 6 is the grayscale result of digit-of-year image sample 1 in fig. 4. once the image is converted, the workflow in fig. 3 proceeds to obtain the glcm features, which include contrast, correlation, and homogeneity.

fig. 4. examples of images of digit of year in the training set
fig. 5. examples of images of digit of year in the testing set
fig. 6. conversion of digit of year image sample 1 from rgb to grayscale

as an example, the glcm features obtained for the digit-of-year images in fig. 4 can be seen in table 1. the visualization of the glcm features con, corr, and hom in table 1, for digit-of-year image sample 1 only, can be seen in fig. 7, while fig. 8 is the matrix formed from fig. 7. in fig. 8, the grayscale matrix still has dimensions of 32x32, while after the glcm process each feature is reduced to a 1x1 value. fig. 9 is a visualization of the data distribution of the model resulting from the training process in fig. 3; the con, corr, and hom values are distributed into 10 classes. from fig. 9, it can be seen that the distributions of values per class are almost the same as those of the other classes.

iii. experiment results

the experiment conducted in this research employed the dataset described in the data collection section. the dataset of 1000 digit-of-year images was divided into two parts: 900 images for the training set and 100 images for the testing set, representing each of classes 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0. the training set consists of 90 images per class and the testing set of 10 images per class. the experiment results of the test without glcm features (fig. 2) are presented in table 2. table 1. glcm feature examples of the images in fig. 4
image sample   con              corr             hom
1              0.611895161290   0.523533267280   0.763004032258
2              0.908266129032   0.726528935617   0.725681925996
3              0.662298387097   0.838165769294   0.813381591068
4              0.803427419355   0.740220693320   0.737114563567
5              0.936491935484   0.704063655696   0.695985531309
6              0.367943548387   0.825705645161   0.535556417361
7              0.731854838710   0.715120967742   0.604353099653
8              0.735887096774   0.456205609554   0.732459677419
9              0.559475806452   0.766229838710   0.683406298090
0              0.126008064516   0.585684217570   0.939415322581

fig. 7. visualization of con, corr and hom of digit of year image sample 1
fig. 8. grayscale matrix of fig. 7

the experimental results in table 2 show that, using 10 testing images per class, class 9 and class 0 obtained the highest percentage, 100 %. the second highest percentage, 90 %, was obtained only by class 6. class 3 obtained a percentage of 80 %, while classes 1, 2, 5, 7, and 8 obtained a percentage of 70 %. the lowest prediction was for class 4, with a percentage of 50 %. the reason for the low predictions in class 4 remains unknown, as it is still under examination and further research; the accuracy improvement will be continued until it reaches the intended target. meanwhile, the results of the test on the system workflow in fig. 3, which employed glcm features, can be seen in table 3. the resulting prediction percentages are very low, with an average of 36 %.

fig. 9. visualization of the data distribution in fig. 3

table 2. test results employing the workflow of fig. 2

class   total images   successful predictions   percentage
1       10             7                        70 %
2       10             7                        70 %
3       10             8                        80 %
4       10             5                        50 %
5       10             7                        70 %
6       10             9                        90 %
7       10             7                        70 %
8       10             7                        70 %
9       10             10                       100 %
0       10             10                       100 %
total   100            77                       average 77 %

the experimental results in table 3 show that, using 100 testing data divided into 10 classes of 10 images each, the prediction results are as follows: class 0 obtained a percentage of 100 %; classes 2, 5, and 6 obtained 50 %; class 3 obtained 40 %; class 7 obtained 30 %; classes 8 and 9 obtained 20 %; while classes 1 and 4 did not manage to identify any testing data. the reasons for the very low predictions in almost all classes remain unknown, as they are still under examination and further research, and the accuracy improvement will be continued until it reaches the intended target. nonetheless, the authors suspect that the cause is the insufficiently specific results of the feature extraction, in this case also due to the distribution of features, which strongly influences the formation of the decision boundary. therefore, in future research, the results of the feature extraction process need to be improved by enlarging the dataset to 3000 images. fig. 10 is a graph of the percentage of predicted results for classification classes 0 to 9, using glcm features and without glcm features.

table 3. test results employing the workflow of fig. 3

class   total images   successful predictions   percentage
1       10             0                        0 %
2       10             5                        50 %
3       10             4                        40 %
4       10             0                        0 %
5       10             5                        50 %
6       10             5                        50 %
7       10             3                        30 %
8       10             2                        20 %
9       10             2                        20 %
0       10             10                       100 %
total   100            36                       average 36 %

fig. 10. comparison results between system tests using glcm and not using glcm

iv.
conclusions classification performed by svm aims to create the decision boundary. the decision boundary is derived from the model resulting from fitting the training set with the svm; the model of this training set is then used to examine the testing set in order to predict new records. the decision boundary will be good if the data have specific features. the test results show that the svm results are better than those of glcm-svm; this is because the distribution of the features strongly influences the formation of the decision boundary. references [1] j. kronenberger, d. malysiak and u. handman, "text and character recognition on metal-sheets," 2017 ieee international conference on information and automation (icia), macau, 2017, pp. 392-397. [2] r. khadijah and a. nurhadiyatna, "deep learning for handwritten javanese character recognition," 2017 1st international conference on informatics and computational sciences (icicos), semarang, 2017, pp. 59-64. [3] erwin, m. fachrurrozi, a. fiqih, b. r. saputra, r. algani and a. primanita, "content based image retrieval for multi-objects fruits recognition using k-means and k-nearest neighbor," 2017 international conference on data and software engineering (icodse), palembang, 2017, pp. 1-6. [4] p. s. ha and m. shakeri, "license plate automatic recognition based on edge detection," 2016 artificial intelligence and robotics (iranopen), qazvin, 2016, pp. 170-174. [5] h. hassen and s. al-maadeed, "arabic handwriting recognition using sequential minimal optimization," 2017 1st international workshop on arabic script analysis and recognition (asar), nancy, 2017, pp. 79-84. [6] k. machhale, h. b. nandpuru, v. kapur and l. kosta, "mri brain cancer classification using hybrid classifier (svm-knn)," 2015 international conference on industrial instrumentation and control (icic), pune, 2015, pp. 60-65. [7] m. mohamed fathima, d. manimegalai and s.
thaiyalnayaki, "automatic detection of tumor subtype in mammograms based on glcm and dwt feature using svm," 2013 international conference on information communication and embedded systems (icices), chennai, 2013, pp. 809-813. [8] a. patle and t. v. kalyani, "support vector machine with inverse fringe as feature for mnist dataset," 2016 ieee 6th international conference on advanced computing (iacc), bhimavaram, 2016, pp. 123-126. [9] v. wasule and p. sonar, "classification of brain mri using svm and knn classifier," 2017 third international conference on sensing, signal processing and security (icsss), chennai, 2017, pp. 218-223. [10] r. asery, r. k. sunkaria, l. d. sharma and a. kumar, "fog detection using glcm based features and svm," 2016 conference on advances in signal processing (casp), pune, 2016, pp. 72-76. [11] m. imani and g. a. montazer, "glcm features and fuzzy nearest neighbor classifier for emotion recognition from face," 2017 7th international conference on computer and knowledge engineering (iccke), mashhad, 2017, pp. 813. [12] y. l. lei, x. m. zhao and w. d. guo, "cirrhosis recognition of liver ultrasound images based on svm and uniform lbp feature," 2015 ieee advanced information technology, electronic and automation control conference (iaeac), chongqing, 2015, pp. 382-387. [13] s. daliman, s. a. rahman, s. a. bakar and i. busu, "segmentation of oil palm area based on glcm-svm and ndvi," 2014 ieee region 10 symposium, kuala lumpur, 2014, pp. 645-650. [14] g. kesavaraj and s. sukumaran, "a study on classification techniques in data mining," 2013 ieee fourth international conference on computing, communications and networking technologies (icccnt), tiruchengode, india, 2013, pp. 1-7. [15] arti patle and deepak singh chouhan, “svm kernel functions for classification”, 2013 international conference on advances in technology and engineering (icate), mumbai, 23-25 january 2013, paper identification number-102. 
knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 2, december 2022, pp. 160–167 eissn 2597-4637 https://doi.org/10.17977/um018v5i22022p160-167 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) indonesian language term extraction using multi-task neural network joan santoso a,1,*, esther irawati setiawan a,2, fransiskus xaverius ferdinandus a,3, gunawan a,4, leonel hernandez collantes b,5 a institut sains dan teknologi terpadu surabaya, surabaya, indonesia b institucion universitaria de barranquilla iub, colombia 1 joan@istts.ac.id*; 2 esther@istts.ac.id; 3 ferdi@stts.edu; 4 gunawan@stts.edu; 5 lhernandezc@unibarranquilla.edu.co * corresponding author i. introduction the rapid growth of internet data, mainly text documents, has created a significant opportunity to acquire and store information as computer-based knowledge in our systems. the internet plays a significant role in human life today; daily, all kinds of information are obtained from the internet using a computer or mobile device. part of knowledge representation is designed to represent data from domain-specific topics. a popular representation for storing information as computer-based knowledge is the ontology. an ontology employs concepts and the relationships between concepts to represent knowledge. this computer-based knowledge can be utilized in various natural language processing studies, including question answering and dialogue systems. the task of term and relation extraction is one approach to addressing this opportunity for ontology development. ontologies have been used in query answering in [1], chatbots in [2], and many other natural language processing research areas.
most ontology construction is conducted manually, as mentioned in [3], and the research in [4] describes the costly nature of the ontology construction process. several studies [3][6] have been conducted to automate the ontology construction process in response to these motivations. term extraction is the process of identifying essential terms within a document; relation extraction is the process of identifying semantic relationships between terms that appear in documents. several methods have been developed for relation extraction in specific domains, such as the newswire domain in [7] and the biomedical domain in [8]. most of the research has focused on the relation extraction domain; meanwhile, our research focuses on term extraction using phrase extraction, especially noun phrases. numerous machine learning algorithms are currently employed for phrase extraction from documents. ramshaw et al. [9] pioneered the noun phrase extraction technique; maximum entropy [10], svm [11], and memory-based learning [12] are just a few of the methods that have been used to extract english phrases. in addition, some research has been conducted on extracting phrases from other languages, such as indonesian [13] and chinese [14][15]. we attempt to propose a machine learning model for noun phrase chunking in the indonesian language based on these previous machine learning results.

article info — article history: received 29 november 2022; revised 10 december 2022; accepted 19 december 2022; published online 30 december 2022. keywords: term extraction; multi-task neural network; indonesian language.

abstract — the rapidly expanding size of data makes it difficult to extract information and store it as computerized knowledge. relation extraction and term extraction play a crucial role in resolving this issue. automatically finding concealed relationships between terms that appear in text can help people build computer-based knowledge more quickly. term extraction is required as one of the components because identifying the terms that play a significant role in the text is the essential step before determining their relationships. we propose an end-to-end system capable of extracting terms from text to address this indonesian language issue. our method combines two multilayer perceptron neural networks to perform part-of-speech (pos) labeling and noun phrase chunking; our models were trained as a joint model to solve this problem. our proposed method, with an f-score of 86.80%, can be considered a state-of-the-art algorithm for term extraction in the indonesian language using noun phrase chunking. this is an open-access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

in recent years, joint models have become the algorithm used in many works. several techniques for extracting named entities and relations, such as the research in [16], are implemented using the joint model. with the increasing use of joint models in specific tasks, we also propose combining two neural network models to extract noun phrases from documents. our model also incorporates a neural language model as its input representation. numerous neural language models have been created, including word2vec [17] and glove [18]. in addition, several current approaches use a neural language model as an input for their system, such as for named entity recognition [19], sentiment classification [20], and end-to-end relation extraction [16].
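feeding word2vec-style vectors to a multilayer perceptron is often done by concatenating the vectors of a token and its neighbors into one input vector. the sketch below is a hypothetical illustration of that idea: the function name, the zero-vector fallback for out-of-vocabulary tokens, and the window layout are assumptions, as the paper does not specify its feature layout at this level of detail.

```python
import numpy as np

def window_features(tokens, index, embeddings, dim, window=1):
    """Concatenate word2vec-style vectors for a token and its
    neighbors into one MLP input vector. `embeddings` maps
    token -> np.ndarray; OOV tokens and padding fall back to zeros.
    Illustrative sketch, not the paper's actual implementation."""
    parts = []
    for pos in range(index - window, index + window + 1):
        if 0 <= pos < len(tokens):
            parts.append(embeddings.get(tokens[pos], np.zeros(dim)))
        else:
            parts.append(np.zeros(dim))  # pad beyond sentence boundaries
    return np.concatenate(parts)
```

with a window of 1 and 3-dimensional vectors, each token is represented by a 9-dimensional input built from its left neighbor, itself, and its right neighbor.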
We use the word2vec model as our representation because it is one of the most frequently used neural language models in natural language processing research. Using the joint model as our machine learning model and word2vec as our features, we believe our proposed method is a novel approach to Indonesian language noun phrase chunking. We advance our previous preliminary research by training these models jointly and achieving a superior outcome. We concentrate on term extraction because we believe it plays a significant role in this field of study; numerous relation extraction tasks, such as [21] and [16], include a term extraction process in their procedures. Our research utilizes noun phrase chunking to extract terms because, as defined by Chen in [22], the entity or term in a document is typically described by noun phrases. Research on noun phrase chunking in Indonesian has been conducted previously [23], including our own preliminary research [13]. Extending that work, we develop an end-to-end system that extracts noun phrases using two jointly trained multilayer perceptrons. Furthermore, our proposed system integrates POS tagging into the system as a joint model, as opposed to the conventional approach, which uses POS tagging as a preprocessing step before the primary process.

The remainder of this paper is organized as follows. Section 2 describes our proposed methodology. Section 3 discusses our experimental scenarios and results. Finally, Section 4 presents our conclusion.

II. Method

This section discusses the proposed methodology used in this study. It consists of two parts: the first discusses noun phrases in the Indonesian language, and the second discusses the multi-task neural network models for noun phrase chunking. The architecture of the system is shown in Figure 1.
The process is divided into three parts: data annotation, model training, and a testing phase with evaluation against previous state-of-the-art research. The data annotation process divides the data into two parts: the first, containing 70% of the data, is used as the training corpus; the second, containing 30%, is used as the testing corpus. Several preprocessing tasks are applied to each corpus before feature extraction in the training and testing phases. Sentence extraction separates each sentence from a news paragraph; in this research, the sentence extraction process uses rules derived from the characteristics of Indonesian sentences. Tokenization then identifies each token in the extracted sentences, using a set of regular expressions. The training process feeds the training corpus to the joint neural network model, producing models that are then tested in the testing phase on the testing dataset. Finally, we evaluate the model with the standard evaluation metrics used in CoNLL-2000.

Fig. 1. The system architecture

A. Noun Phrases in the Indonesian Language

A phrase is a unit consisting of one word or a combination of two or more words that together create a new meaning. A noun phrase is a phrase with a noun as its headword. Indonesian noun phrases have the same function as English noun phrases: they usually describe a subject or an object in a sentence. The difference between English and Indonesian phrases lies in the grammatical structure of each language. Several examples of Indonesian noun phrases are shown in Table 1.
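The sentence extraction and tokenization steps described above can be sketched in Python. The actual rules used in the study are not published, so the regular expressions and function names below are illustrative assumptions only:

```python
import re

def extract_sentences(paragraph):
    # Illustrative rule: split on sentence-final punctuation followed by
    # whitespace and a capital letter. The paper's actual Indonesian
    # sentence rules are not published.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', paragraph)
    return [s.strip() for s in parts if s.strip()]

def tokenize(sentence):
    # Illustrative regex tokenizer: words (with optional internal hyphens)
    # and single punctuation marks become separate tokens.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

paragraph = ("Saya memakan nasi padang. "
             "Kompas digunakan untuk menentukan arah mata angin.")
for sent in extract_sentences(paragraph):
    print(tokenize(sent))
```

Running this splits the paragraph into two sentences and yields token lists such as `['Saya', 'memakan', 'nasi', 'padang', '.']`.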
We present our dataset as a sequential classification problem. Each noun phrase is represented in the IOB tagset proposed in [24]. An illustration of the IOB tagging used in this research is given in Figure 2. The IOB tagset we use consists of three labels: B-NP denotes the first word of a noun phrase, I-NP denotes a non-initial word in a noun phrase, and O denotes a word outside of a noun phrase.

Fig. 2. Dataset and annotation example

Table 1. Examples of Indonesian noun phrases

No. | Indonesian sentence | English sentence
1. | [Saya] memakan [nasi padang] | [I] eat [nasi padang]
2. | [Kompas] digunakan untuk menentukan [arah mata angin]. | [A compass] is used for determining the [points of the compass]
3. | [Surabaya] adalah ibu kota provinsi [Jawa Timur] | [Surabaya] is the capital of [East Java] province
4. | [Jakarta] adalah ibu kota [Indonesia] | [Jakarta] is the capital of [Indonesia]
5. | [Apel] dimakan oleh [Andi] | [The apple] was eaten by [Andi]

The example in Figure 2 consists of three words and one noun phrase. Each word is labeled with the POS tag corresponding to its part of speech in the sentence. In addition, each word in a phrase is labeled with the IOB tagset discussed above. These labels are used as the model outputs to identify each noun phrase in this study.

B. Noun Phrase Chunker Model

The model consists of two neural networks. We use a neural language model to represent the word input to the models; the neural language model used as word embedding was proposed by Mikolov [17]. Word2vec consists of two models, skip-gram and CBOW. We use the skip-gram model because [25] shows that skip-gram with negative-sampling optimization gives better results than CBOW.
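The IOB labeling scheme described above can be illustrated with a small helper that converts noun phrase spans into B-NP/I-NP/O labels. This is a toy sketch with a hypothetical function name, not the paper's annotation tool; phrase spans are assumed to be given as token indices:

```python
def to_iob(tokens, phrases):
    """Label each token with B-NP / I-NP / O, given noun phrase spans
    as (start, end) token indices with end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in phrases:
        labels[start] = "B-NP"              # first word of the phrase
        for i in range(start + 1, end):
            labels[i] = "I-NP"              # non-initial words of the phrase
    return labels

# "[Saya] memakan [nasi padang]" has two noun phrases.
tokens = ["Saya", "memakan", "nasi", "padang"]
print(to_iob(tokens, [(0, 1), (2, 4)]))  # ['B-NP', 'O', 'B-NP', 'I-NP']
```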
For our word embedding layer, we trained word2vec on an Indonesian Wikipedia corpus with the default word2vec parameters and a dimension size of 200. To represent the part-of-speech (POS) tags in the second network's embedding layer, we did not use a pre-trained embedding; instead, it is trained together with the model during the learning process. An illustration of our model is shown in Figure 3.

Fig. 3. Neural network architecture example

The output of these models consists of two parts: the part-of-speech tags and the phrase labels. For the phrase label representation, we use the three target classes from IOB tagging, as discussed above. In addition, we use the Indonesian part-of-speech tagset proposed in [26]. We divide our proposed method into two parts, POS tagging and noun phrase extraction; both tasks are trained together as a joint multi-task neural network model.

The first part of the model is the part-of-speech neural network, which we use as a POS tagger. The model's features use a contextual window with a window size of 2, as in [13]. The input to this model is the word2vec representation of each word; we concatenate the vectors in the window into one large vector before passing it through the network. The input representation is given in (1):

X = [we_{i-2}, we_{i-1}, we_i, we_{i+1}, we_{i+2}]   (1)

where X is the input of the neural network used in the POS tagging model, and we_i is the word embedding looked up from the word2vec model for each input word. The model takes an input X with a vector size of (2 × window size + 1) × word vector dimension.
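The contextual window input of (1) can be sketched as follows, using small random vectors in place of the 200-dimensional word2vec embeddings. The function name and the zero-padding at sentence boundaries are my own assumptions; the paper does not state its padding strategy:

```python
import numpy as np

def window_features(embeddings, i, window=2):
    """Concatenate the word embeddings in a contextual window around
    position i, as in eq. (1). Positions outside the sentence are padded
    with zero vectors (an assumption; the paper does not specify this)."""
    dim = embeddings.shape[1]
    parts = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(embeddings):
            parts.append(embeddings[j])
        else:
            parts.append(np.zeros(dim))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
sent = rng.normal(size=(4, 200))   # 4 tokens, 200-dim embeddings
x = window_features(sent, 1)
print(x.shape)                     # (1000,) = (2*2 + 1) * 200
```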
The POS tagging model is defined in (2) and (3):

h(x)^pos = tanh(W_h^pos X + b_h^pos)   (2)

y^pos = softmax(W_out^pos h(x)^pos + b_out^pos)   (3)

where X is the input of word embeddings with contextual features; h(x)^pos is the output of the hidden layer activation function, which uses tanh; W_h^pos is the weight of the hidden layer of the POS tagging model and b_h^pos is its bias; and W_out^pos and b_out^pos are the weight and bias of the output layer of the POS tagging model. The output of this layer is passed through a POS embedding layer in the noun phrase neural network and concatenated with each word's embedding to form the input of the noun phrase network. How each word is represented as an input to the noun phrase network is defined in (4) and (5):

X_i^NP = [we_i, Pe_i]   (4)

X^NP = [X_{i-2}^NP, X_{i-1}^NP, X_i^NP, X_{i+1}^NP, X_{i+2}^NP]   (5)

The noun phrase neural network is the second model, which predicts the correct phrase labels. Its input is the concatenation of the word embedding and the POS embedding. The POS embedding layer generates a new vector of dimension d_pos, set to 15 in this study. We combine all the word embedding vectors in the contextual feature window with the POS embedding vectors, as defined in (5). The noun phrase network consists of three layers: an input layer, a hidden layer, and an output layer. The input layer receives X^NP, a vector of length d_NP.
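Equations (2) through (4) amount to a single-hidden-layer MLP whose predicted tag indexes a trainable POS embedding table. A minimal NumPy sketch of the forward pass follows; the weights, the tagset size, and the function names are illustrative assumptions (only d_pos = 15, the 200-dim embeddings, and the tanh/softmax structure come from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numeric stability
    return e / e.sum()

def pos_forward(x, W_h, b_h, W_out, b_out):
    h = np.tanh(W_h @ x + b_h)           # eq. (2)
    return softmax(W_out @ h + b_out)    # eq. (3)

rng = np.random.default_rng(1)
d_in, d_hid, n_tags, d_pos = 1000, 500, 23, 15  # n_tags is illustrative; d_pos = 15 as in the paper
x = rng.normal(size=d_in)                        # windowed word embeddings, eq. (1)
y_pos = pos_forward(
    x,
    rng.normal(scale=0.01, size=(d_hid, d_in)), np.zeros(d_hid),
    rng.normal(scale=0.01, size=(n_tags, d_hid)), np.zeros(n_tags))

# The predicted tag indexes a trainable POS embedding table; concatenating
# it with the word embedding gives X_i^NP as in eq. (4).
pos_table = rng.normal(size=(n_tags, d_pos))
we_i = rng.normal(size=200)                      # 200-dim word2vec vector
x_np_i = np.concatenate([we_i, pos_table[y_pos.argmax()]])
print(y_pos.shape, x_np_i.shape)                 # (23,) (215,)
```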
The length d_NP is computed as in (6):

d_NP = (2 × window size + 1) × (word embedding dimension + POS embedding dimension)   (6)

The hidden layer of the noun phrase network uses a tanh activation function, and the output layer uses a softmax function to obtain the correct phrase label. The noun phrase network is therefore computed as in (7) and (8):

h(x)^NP = tanh(W_h^NP X^NP + b_h^NP)   (7)

y^NP = softmax(W_out^NP h(x)^NP + b_out^NP)   (8)

where X^NP is the input of the model taken from (5), consisting of the concatenated word and POS embeddings; h(x)^NP is the hidden layer of the noun phrase network with the tanh activation function; W_h^NP is the weight of the hidden layer and b_h^NP is its bias; y^NP is the output layer, where the softmax function normalizes the output and assigns the highest probability to the correct label; and W_out^NP and b_out^NP are the weight and bias of the output layer.

We train both models jointly using the Adam optimizer with a cross-entropy cost function. We also apply dropout, introduced in [27], to the output of the hidden layer before it is passed to the output layer in both models, with a dropout probability of 0.5. The cost functions used to train these models are given in (9), (10), and (11):

J^pos = − Σ t^pos log(y^pos)   (9)

J^NP = − Σ t^NP log(y^NP)   (10)

J = J^pos + J^NP   (11)

where t^pos and t^NP are the one-hot vector representations of the correct POS and phrase labels, and y^pos and y^NP are the outputs of the POS and noun phrase models. We use the cost function J from (11) with the Adam optimizer to train the joint model.
III. Results and Discussion

We used data from Indonesian online news websites, taken from our previous research [13]. The data include news from Detik, Vivanews, Surya, and Kompas. We crawled this dataset and manually annotated it using two annotators. Statistics of the data are given in Table 2.

Table 2. Corpus statistics

No. | Corpus | Dataset | Total news | Total tokens
1. | Detik | training | 208 | 57374
   |       | testing  | 104 | 26081
2. | Kompas | training | 191 | 51322
   |        | testing  |  83 | 25489
3. | Surya | training | 211 | 50244
   |       | testing  |  91 | 22123
4. | Vivanews | training | 152 | 66991
   |          | testing  |  66 | 21131

To measure how good the model is, we conducted several experiments on each dataset, scoring the results with the CoNLL-2000 scoring system and using the F1-score to show how robust our proposed system is. The F1-score used in this study is calculated as in (12), (13), and (14):

Precision = (number of correct chunks given by the system) / (total number of chunks given by the system)   (12)

Recall = (number of correct chunks given by the system) / (total number of actual chunks in the text)   (13)

F1-score = (2 × Recall × Precision) / (Precision + Recall)   (14)

The size of each hidden layer was set to half of the input size for each model. The first experiment performed the chunking task on each corpus; the second compared the best performance our models can achieve with previous research using C4.5 [13] and SVM [11]. The results of the first experiment are shown in Table 3, and those of the second in Table 4.

From Table 3, the first experiment shows that the best performance is achieved on the Detik corpus, with the highest F1-score of about 86.80%.
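The chunk-level metrics in (12) through (14) can be sketched as follows. This is a simplified illustration of CoNLL-style exact-match chunk scoring, not the official conlleval script, and the function name is my own:

```python
def chunk_f1(predicted, gold):
    """predicted, gold: sets of (start, end, type) chunk spans.
    A chunk counts as correct only on an exact span match.
    Returns (precision, recall, f1) per equations (12)-(14)."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 1, "NP"), (2, 4, "NP")}
pred = {(0, 1, "NP"), (2, 3, "NP")}   # one exact match, one boundary error
print(chunk_f1(pred, gold))           # (0.5, 0.5, 0.5)
```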
The Vivanews corpus shows the lowest performance, with an F1-score of about 84.72%. We also evaluate the accuracy of the POS tagger as one of the outputs of our joint model. The model's highest accuracy was achieved on the Kompas corpus, at 87.44%, and the lowest on Vivanews, with a POS tagging accuracy of about 86.91%.

The second experiment compares our model with previous state-of-the-art models; its results are shown in Table 4. Table 4 shows that our model improves on our previous experiment using C4.5 [13]. However, the state-of-the-art model in phrase chunking proposed in [11] shows better performance, with a difference of only about 1.18%. Although our model has a lower F1-score than the state-of-the-art model in [11], it has the advantage of eliminating the need for external tools to acquire POS features: our end-to-end model labels the POS automatically without external preprocessing tools.

IV. Conclusion

Our proposed method improves on our preliminary result, by about 2.17% over our previous research. However, compared with the state-of-the-art SVM model, our proposed model has a lower F1-score, with a difference of about 1.18%. Although our model has a lower F1-score than the state-of-the-art model, it has the advantage of eliminating the need for external tools to acquire POS features. This research offers a new approach to noun phrase extraction in the Indonesian language. In future research, we will extend this model with transformer-based approaches and large pre-trained models to help the noun phrase extraction process. We also plan to integrate these models into a relation extraction system to detect semantic relations between the extracted terms, in order to construct computer knowledge based on an ontology.
Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

Table 3. Indonesian language experiment results

No. | Corpus | NP chunking F-score | POS tagging accuracy
1. | Detik | 86.80% | 87.35%
2. | Kompas | 85.59% | 87.44%
3. | Surya | 85.22% | 87.17%
4. | Vivanews | 84.72% | 86.91%

Table 4. Comparison with previous models

No. | Model | F1-score
1. | C4.5 [13] | 84.63%
2. | SVM [11] | 87.98%
3. | Our model | 86.80%

References

[1] D. S. Wang, "A domain-specific question answering system based on ontology and question templates," in 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2010, pp. 151–156.
[2] H. Al-Zubaide and A. A. Issa, "OntBot: Ontology based chatbot," in Fourth International Symposium on Innovation in Information & Communication Technology (ISIICT), 2011, pp. 7–12.
[3] A. D. S. Jayatilaka and G. Wimalarathne, "Knowledge extraction for semantic web using web mining," in International Conference on Advances in ICT for Emerging Regions (ICTer), 2011, pp. 89–94.
[4] B. Abdelbasset, K. Okba, and M. Sofiane, "Agent-based approach for building ontology from text," in International Conference on Computer Medical Applications (ICCMA), 2013, pp. 1–6.
[5] H. Yang and J. Callan, "Metric-based ontology learning," in Proceedings of the 2nd International Workshop on Ontologies and Information Systems for the Semantic Web, 2008, pp. 1–8.
[6] R. Snow, D. Jurafsky, and A. Y. Ng, "Semantic taxonomy induction from heterogenous evidence," in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006, pp. 801–808.
[7] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction from the web," in IJCAI, 2007, pp. 2670–2676.
[8] C. Giuliano, A. Lavelli, and L. Romano, "Exploiting shallow linguistic information for relation extraction from biomedical literature," in EACL, 2006, pp. 401–408.
[9] L. A. Ramshaw and M. P. Marcus, "Text chunking using transformation-based learning," in Natural Language Processing Using Very Large Corpora, Springer, 1999, pp. 157–176.
[10] W. Skut and T. Brants, "A maximum-entropy partial parser for unrestricted text," arXiv preprint cmp-lg/9807006, 1998.
[11] T. Kudoh and Y. Matsumoto, "Use of support vector learning for chunk identification," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, 2000, pp. 142–144.
[12] E. F. Sang, "Memory-based shallow parsing," Journal of Machine Learning Research, vol. 2, pp. 559–594, 2002.
[13] J. Santoso, H. V. Gani, E. M. Yuniarno, M. Hariadi, M. H. Purnomo, et al., "Noun phrases extraction using shallow parsing with C4.5 decision tree algorithm for Indonesian language ontology building," in 15th International Symposium on Communications and Information Technologies (ISCIT), 2015, pp. 149–152.
[14] H. Li, J. J. Webster, C. Kit, and T. Yao, "Transductive HMM based Chinese text chunking," in 2003 International Conference on Natural Language Processing and Knowledge Engineering, 2003, pp. 257–262.
[15] G.-H. Fu, R.-F. Xu, K.-K. Luke, and Q. Lu, "Chinese text chunking using lexicalized HMMs," in Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, 2005, pp. 7–12.
[16] M. Miwa and M. Bansal, "End-to-end relation extraction using LSTMs on sequences and tree structures," arXiv preprint arXiv:1601.00770, 2016.
[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[18] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014, pp. 1532–1543.
[19] S. K. Sienčnik, "Adapting word2vec to named entity recognition," in Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), Vilnius, Lithuania, 2015, pp. 239–243.
[20] B. Xue, C. Fu, and Z. Shaobin, "A study on sentiment computing and classification of Sina Weibo with word2vec," in 2014 IEEE International Congress on Big Data, 2014, pp. 358–363.
[21] P. Pantel and M. Pennacchiotti, "Espresso: Leveraging generic patterns for automatically harvesting semantic relations," in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006, pp. 113–120.
[22] K. Chen and H.-H. Chen, "Extracting noun phrases from large-scale texts: A hybrid approach and its automatic evaluation," in Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, 1994, pp. 234–241.
[23] A. A. Arman, A. Purwarianti, et al., "Syntactic phrase chunking for Indonesian language," Procedia Technology, vol. 11, pp. 635–640, 2013.
[24] E. F. Sang and J. Veenstra, "Representing text chunks," in Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics, 1999, pp. 173–179.
[25] Y. Goldberg and O. Levy, "word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method," arXiv preprint arXiv:1402.3722, 2014.
[26] A. F. Wicaksono and A. Purwarianti, "HMM based part-of-speech tagger for Bahasa Indonesia," in Fourth International MALINDO Workshop, Jakarta, 2010.
[27] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 5, No 2, December 2022, pp. 150–159. https://doi.org/10.17977/um018v5i22022p150-159

©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Can Multinomial Logistic Regression Predicts Research Group Using Text Input?

Harits Ar Rosyid a,1,*, Aulia Yahya Harindra Putra a,2, Muhammad Iqbal Akbar a,3, Felix Andika Dwiyanto b,4

a Department of Electrical Engineering, Universitas Negeri Malang, Malang 65145, Indonesia
b Faculty of Computer Science, Electronics, and Telecommunications, AGH University of Science and Technology, 30-059 Kraków, Poland
1 harits.ar.ft@um.ac.id*; 2 yahya.harindraputra.1905356@students.um.ac.id; 3 iqbal.akbar.ft@um.ac.id; 4 dwiyanto@agh.edu.pl
* corresponding author
I. Introduction

The Department of Electrical Engineering and Informatics (DEEI), Universitas Negeri Malang, has a thesis and final project management site, SISINTA UM. Every student submitting a thesis title must match the title and abstract of the thesis to a research group. A short survey of 25 students who had submitted titles and abstracts to SISINTA UM showed that students feel confused and have difficulty fitting the proposed thesis title and abstract to a group. Most lecturers from the target research group respond only briefly to any mismatch between the proposal and the research group, and this subjective response can lead to more confusion for the students. The traditional solution would be to consult the topic with lecturers or academic supervisors, but this approach is cumbersome and not straightforward: arranging a time and place between students and lecturers is too dynamic. The system should instead recommend the best research group based on the information in a thesis or final project. This approach is adapted from [1], which displays the Lexile level of an article posted on a website; such straightforward information helps readers find preferable articles.

We propose a text classification technique to construct a research group recommendation based on text input: the title and/or abstract. The main idea is driven by the abundant text information stored in the SISINTA database. Once this text data is retrieved, we apply a text mining process, initialized by text preprocessing to clean and restructure the text. Then, the term weighting stage converts the text into a computable form: numbers. Subsequently, resampling is essential to tackle the imbalanced distribution of classes. In the next stage, we apply the logistic regression (LR) algorithm [2], which learns to distinguish research groups based on the title and/or abstract.
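The pipeline ends with a multinomial logistic regression classifier, whose prediction step assigns each document a probability per research group via a softmax over class scores. A minimal NumPy sketch of that prediction step follows; the weights, dimensions, and function name are illustrative assumptions (only the 13 research groups come from the paper):

```python
import numpy as np

def predict_proba(X, W, b):
    """Multinomial logistic regression: softmax over per-class scores."""
    z = X @ W + b                        # (n_samples, n_classes) scores
    z = z - z.max(axis=1, keepdims=True) # shift for numeric stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_features, n_groups = 300, 13           # 13 research groups, as in the paper
X = rng.normal(size=(5, n_features))     # 5 dummy TF-IDF document vectors
W = rng.normal(scale=0.1, size=(n_features, n_groups))
proba = predict_proba(X, W, np.zeros(n_groups))
best_group = proba.argmax(axis=1)        # index of the recommended group
print(proba.shape)                       # (5, 13); each row sums to 1
```

In practice the weights W and b would be fitted on the labeled title/abstract vectors rather than drawn at random.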
LR is a classification algorithm that predicts the probability of a target variable [3]. This algorithm is useful in text classification, such as sentiment analysis [4]. Finally, we evaluate how well LR predicts the research group based on the text input.

Article history: received 11 November 2022; revised 29 November 2022; accepted 9 December 2022; published online 30 December 2022.

Abstract: While submitting proposals in SISINTA, students often confuse or falsely submit their proposals to a less relevant or incorrect research group. There are 13 research groups for the students to choose from. We propose a text classification method to help students find the best research group based on the title and/or abstract. The stages in this study include data collection, data preprocessing, classification using logistic regression, and evaluation of the results. Three scenarios of research group classification are based on (1) title only, (2) abstract only, and (3) title and abstract. Based on the experiments, research group classification using title-only input is the best overall. This scenario obtains the most optimal results, with accuracy, precision, recall, and F1-score of 63.68%, 64.91%, 63.68%, and 63.46%, respectively. This result is sufficient to help students find the best research group based on the title text. In addition, lecturers can comment more elaborately, since the proposals are relevant to the research group's scope. This is an open-access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: classification; logistic regression; title; abstract; research group; thesis

H. A. Rosyid / Knowledge Engineering and Data Science 2022, 5(2): 150–159
method
in this research, the stages of the research methodology are described in figure 1. at the data collection stage, we collected raw data from deei's sisinta database by dumping the sql data into a microsoft excel file. no personal information such as students, supervisors, grades, or logs was included during the data export. the main content we retrieved was text information relevant to the theses and final projects. the data, obtained from 16 april 2016 to 4 october 2022, contained 2164 samples, and the sisinta administrator confirmed that these data are accurate. each sample has the variables: the title, the abstract, and the research group class. the thirteen research groups and their class distributions are shown in table 1. from this table, we can see an imbalanced distribution of research groups, a challenge to be tackled by the resampling technique in our proposed method.

fig 1. research methodology

table 1. number of rows in each research group of the data studied
research group | total
pengembangan aplikasi dan media pembelajaran teknologi dan kejuruan | 463
strategi pembelajaran teknologi dan kejuruan | 395
kurikulum pendidikan teknologi dan kejuruan | 200
rekayasa pengetahuan dan ilmu data (knowledge engineering and data science) | 174
evaluasi dan pengelolaan pendidikan kejuruan | 155
ketenagakerjaan teknologi dan kejuruan | 142
teknologi digital cerdas (ubiquitous computing technique) | 132
intelligent power and advanced energy system (ipaes) | 121
intelligent power electronics and smart grid (ipesg) | 104
game technology and machine learning applications | 90
telematics iot system and devices | 88
biomedic and intelligent assistive technology (tat) | 55
sistem dinamis, kendali, dan robotika (dynamic systems, control, and robotics) | 45

text preprocessing is carried out to ensure the text data is 'clean' and the algorithm can learn from it [5]. text preprocessing involves stages to make text information more structured [6]: text cleaning, removing missing values, removing duplicate rows, tokenization, stopword removal, and stemming.

text cleaning consists of four steps. first, tag removal aims to remove html tags contained in the document [7]. much of the text data contains html tags; this often happens when students copy-paste text from a document processor into the sisinta input form. we use regular expression filtering (a.k.a. regex) to remove html tags and keep the informative text. say inputtext = "<p> hello </p>". by applying regex = re.compile(r'<[^>]+>'), the call regex.sub('', inputtext) will output 'hello'. second, case folding aims to convert capital letters to lowercase. it is helpful to prevent the computer from interpreting the same word with different meanings [8]. for instance, python's "Case".casefold() will output "case". the third step, trim text, aims to remove white space at the beginning and end of the text [9]; in python, this is achieved with the strip() function, which removes spaces from both ends. the last step removes punctuation, special characters, double white space, and numbers [10]. we apply regex for this purpose as well, adding more characters to be removed.

the second stage of text preprocessing is to remove missing values. this step handles missing data by removing columns or rows whose data is not available or nan (not a number). the purpose of this deletion is to reduce data bias [11]. the third stage of text preprocessing in this study is to remove duplicate or redundant samples [12].
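the four cleaning steps above can be sketched with python's standard re module alone. this is a minimal illustration under our own naming (clean_text is not a function from the study's code):

```python
import re

def clean_text(raw: str) -> str:
    """Apply the four cleaning steps: tag removal, case folding,
    trimming, and removal of punctuation/special characters/numbers."""
    text = re.sub(r'<[^>]+>', '', raw)        # 1. tag removal
    text = text.casefold()                    # 2. case folding
    text = text.strip()                       # 3. trim leading/trailing white space
    text = re.sub(r'[^a-z\s]', ' ', text)     # 4. drop punctuation, specials, numbers
    text = re.sub(r'\s+', ' ', text).strip()  #    collapse double white space
    return text
```

for instance, clean_text("<p>  Hello, World 123!  </p>") returns "hello world".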
this will minimize the overfitting effect due to duplicates [13]. the fourth stage is tokenization; we use the natural language toolkit (nltk) for this step, specifically the nltk.tokenize package. the goal is to break sentences down into words or tokens [14]. in this study, tokenization splits the title and abstract into word fragments to identify words and their separators; hence, tokenization helps extract meaning from the text. the fifth stage of text preprocessing in this study is stopword removal, or text filtering. we use nltk.corpus stopwords to filter out stop words such as 'diperlukan', 'hendaknya', and 'tapi'. the final text preprocessing stage is stemming [15]. stemming is used to cut prefixes, suffixes, infixes, and combinations of prefixes and suffixes, removing affixes [16]. besides that, it can also reduce an inflected word to its basic form. the stemming process can be done using a dedicated indonesian-language stemmer library, sastrawi. this process aims to keep the computer from interpreting words constructed from the same root word as having different meanings [17]. for instance, when stemming is applied, the word "kecepatan" will produce "cepat".

once the text data is clean and ready, term weighting converts the data into numeric form [18]. we apply the term frequency-inverse document frequency (tf-idf) method in this study. tf-idf assigns a weight to each word to quantitatively measure how strong the relationship between the word and the document is [19]. when a word appears more frequently in a document, its weight increases proportionally; in contrast, the weight decreases if the word appears regularly in many documents [20]. we apply the scikit-learn library, sklearn.feature_extraction.text.TfidfVectorizer, for this purpose. up to the resampling stage, the dataset was distributed unevenly between research groups. although there are significant sample drops within each research group, the distribution is still not balanced, as seen in figure 2.
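the tf-idf idea can be illustrated without scikit-learn. the toy function below is our own simplification (raw term counts and an unsmoothed idf, unlike TfidfVectorizer's defaults); it shows how a word's weight rises with in-document frequency and falls as the word spreads across documents:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy TF-IDF: weight = tf(word, doc) * log(N / df(word))."""
    n = len(docs)
    tokenized = [d.split() for d in docs]   # whitespace tokenization stand-in
    # document frequency: in how many documents each word appears
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = ["sistem informasi skripsi", "sistem kendali robot", "klasifikasi teks skripsi"]
weights = tf_idf(docs)
```

here "kendali" (one document) outweighs "sistem" (two documents) within the second document, matching the intuition described above.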
the imbalanced dataset can cause bias, where the classifier tends to perform well only when predicting the dominant classes [21]. therefore, we applied a resampling method, the synthetic minority oversampling technique (smote). smote iteratively generates artificial samples based on neighboring original samples. this phase continues until all classes have the same number of samples, 194 samples each.

fig 2. class distribution on the raw dataset

this study used multinomial logistic regression (mlr) due to the 13 research group classes. before modeling, we separated the dataset into 70% training and 30% test sets. the training set was then used to train and optimize the mlr via the grid search cross validation (gscv) method. this tuning method aims to find the combination of model parameters that produces the most optimal and effective predictions [22]. the gscv method exhaustively constructs and evaluates the mlr model using all parameter value combinations in table 2 in a cross-validated environment (we use 10-fold). the gscv method produces insights into how different parameter combinations affect classification performance. then, we refitted the mlr using the parameters that produced the highest classification performance. since there are two types of input relevant to the research group, title and abstract, we ran three scenarios of mlr prediction based on: 1) the title, 2) the abstract, and 3) a combination of the title and abstract. the goal is to identify which classifier performs best. the gscv method is applied within each scenario, producing 12 model candidates per scenario; in total, there are 36 candidates for the research group prediction model. in the evaluation stage, the best model from each scenario was tested using the 30% test data. the metrics used were accuracy, precision, recall, and f1-score.
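the study uses imblearn's smote; its core mechanism, interpolating between a minority sample and one of its k nearest neighbours, can be sketched in plain python. the function below is a toy stand-in under our own naming, not the library implementation:

```python
import random

def smote_like(samples, k, target_count, seed=0):
    """Grow a minority class to target_count by placing synthetic points
    on the segment between a sample and one of its k nearest neighbours."""
    rng = random.Random(seed)

    def dist(a, b):  # squared euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    result = list(samples)
    while len(result) < target_count:
        base = rng.choice(samples)
        neighbours = sorted((s for s in samples if s is not base),
                            key=lambda s: dist(base, s))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        result.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return result
```

every synthetic point lies between two original points, so the oversampled class keeps its original region of the feature space.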
the goal was to test how effective the mlr was based on its classification performance, or correctness level [23]. from there, we can choose which mlr is best applied for sisinta.

iii. results and discussion
the retrieved 2164 rows of data were raw text structured into columns: title, abstract, and research group. figure 3 shows the rawness of the dataset.

fig 3. example of data collection results

the processes of tag removal, case folding, trim text, and removal of punctuation marks, special characters, double spaces, and numbers are carried out at the next cleaning stage. the processing results of this stage can be seen in figure 4.

fig 4. example of text cleaning results

table 2. mlr parameters for grid search cv
parameter | specification
multi_class | multinomial
solver | saga
penalty | ['l1', 'l2', 'none']
c | [0.1, 1.0, 5, 10]

the next step is to remove the missing values. there are four rows of missing values in the title column and 896 rows of missing values in the abstract column; the number of missing values in each dataset column can be seen in figure 5.

fig 5. number of missing values in each dataset column

furthermore, we identified one duplicated row in the title column but none in the abstract. as a result of text preprocessing, the size of the dataset drops, but the distribution of research group classes remains imbalanced, see figure 2. the tokenization stage is carried out to separate text into tokens or words [24]. figure 6 and figure 7 show examples of the tokenization results in the title and abstract columns.

fig 6. tokenization results in the title column
fig 7. tokenization results in the abstract column

the stopwords removal stage is carried out to remove words or tokens that appear frequently but carry no critical meaning in the text [25].
the results of the stopwords removal process in the title and abstract columns can be seen in figure 8 and figure 9.

fig 8. stopwords removal results in the title column
fig 9. stopwords removal results in the abstract column

the stemming stage is carried out to remove all affixes in words, such as suffixes, infixes, prefixes, and combinations of prefixes and suffixes [26]. the results of the stemming process in the title and abstract columns can be seen in figure 10 and figure 11.

fig 10. stemming results in the title column
fig 11. stemming results in the abstract column

tf-idf produced a matrix for the training set of the title scenario in the form of a vector of 884 samples x 2300 columns, while the test-set matrix of the title scenario forms a vector of 380 samples x 2300 columns. the second and third scenarios produced nearly quadrupled column counts: 8218 and 8485 columns, respectively. an example view of term weighting using tf-idf can be seen in figure 12.

fig 12. term weighting examples using tf-idf: (a) title scenario, (b) abstract scenario, and (c) combination of title and abstract

we applied the default configuration of smote in generating synthetic samples (5 nearest neighbors). there are 194 rows in each research group after the resampling process using smote; in total, there are 2522 samples ready for model training.

in the title scenario, the highest-scoring configuration from the grid search cross-validation (gscv) method used c=0.1 with the 'none' penalty. figure 13 depicts the comparison between the candidates' performances (in dots) across the regularization parameter values (x-axis) and penalty types (colored lines). this graph shows that the mlr performs best when the c value is high, regardless of the penalty type.
however, the mlr on the green line is suspected of overfitting, because the other mlrs (orange and blue lines) underperform when c is lowest. this means that regularization is essential for the mlr to generalize. from figure 13, l2-type regularization (orange line) should be the best, since it performs better than the l1-type even at a low c value, and as c grows the mlr using the l2 penalty stays on top of the mlr with the l1 penalty. therefore, the mlr in this scenario was refitted using penalty=l2 with c=5 as the most optimal configuration.

fig 13. grid search cv results on the title scenario

in the abstract scenario, the results of the most optimal combination of parameters can be seen in figure 14. our analysis in this second scenario is similar to the first one; the difference appears only slightly in the resulting scores. from this graph, the mlr using the abstract as input is refitted with penalty=l2 and c=5.

fig 14. grid search cv results on the abstract scenario

gscv results for the third scenario can be seen in figure 15. our analysis in this third scenario is similar to the former two; the difference appears only slightly in the resulting scores. from this graph, the mlr using the combined title and abstract as input is refitted with penalty=l2 and c=5.

fig 15. grid search cv results on the title and abstract scenario

from the three scenarios using gscv, there were no significant differences in the effect of the input used; the performances were relatively identical. however, to delve deeper into how the three mlr models perform, we tested each using the test data. we measured each scenario's performance metrics; the results can be seen in table 3. the evaluation results show that the title scenario is the best and most optimal scenario.
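the four reported metrics can be reproduced with a short stdlib function; this is our sketch of what scikit-learn computes with average='weighted'. note that support-weighted recall is mathematically identical to accuracy, which is why those two columns match in table 3:

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    prec = rec = f1 = 0.0
    for c in labels:
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        pred_c = sum(p == c for p in y_pred)       # predicted as class c
        p_c = tp / pred_c if pred_c else 0.0       # per-class precision
        r_c = tp / support[c]                      # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        w = support[c] / n                         # class support weight
        prec += w * p_c
        rec += w * r_c
        f1 += w * f_c
    return acc, prec, rec, f1
```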
although its margin over the other two scenarios is not significant, it is more efficient, since the input size for the mlr is much smaller when using the title only. as such, it reduces the curse of dimensionality in research group classification; hence, less computational power is needed. in addition, there is little chance of repeated words in titles (except stopwords) compared to abstracts. hence, we argue that using the title is the more concise choice for classification performance.

we also point out that the overall metrics are below 70%. we identified the causes: typographical errors (typos) within the title or abstract, coupled words, and the lack of a validation process to check for these errors. examples of errors contained in the dataset can be seen in figure 16. the words highlighted are only a few found in a brief observation. however, these words are not core or root words that highly correlate with the research group. the classification model will lose some accuracy if a word that contributes to a particular research group is mistyped. a solution is to apply a policy in sisinta that any typo entered in the title or abstract bars the proposal from being forwarded to a research group for comments; either manual or automatic checking is feasible. alternatively, additional text preprocessing could identify these typos and decide whether to correct or remove them.

fig 16. writing errors in the dataset

in addition, topics overlap greatly between research group classes, for instance, the research groups "game technology and machine learning" and "knowledge engineering and data science". both contain research with the keywords "machine learning", "data mining", "classification", etc. too many terms are shared between these two research groups; only a few keywords separate them, for instance, "game" and "text".
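the keyword overlap between two research groups can be quantified quickly, for example with a jaccard score over their token vocabularies. this is a small diagnostic sketch of our own, not part of the study's pipeline:

```python
def vocabulary_overlap(docs_a, docs_b):
    """Jaccard overlap between the token vocabularies of two classes:
    |A ∩ B| / |A ∪ B|, where A and B are sets of distinct tokens."""
    va = {w for d in docs_a for w in d.split()}
    vb = {w for d in docs_b for w in d.split()}
    return len(va & vb) / len(va | vb)

game_tech = ["game technology machine learning", "game klasifikasi"]
keds = ["machine learning data mining", "klasifikasi teks"]
```

a score near 1 means two groups share almost all their vocabulary, which is exactly the ambiguity described above.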
to overcome the problem of shared words, we can look at linked words using n-grams, which decompose text into chunks of n consecutive characters or words so that linked terms can be captured. however, using the n-gram feature significantly enlarges the dimension; hence, more complex algorithms such as deep learning may be a better fit for the task.

finally, our proposed method is applicable in other departments as long as the digital storage of the students' research is organized by research group (a web-based information system and database). based on our findings, a future implementation may only need to structure the data into a title column and a research group column. additional text preprocessing to identify and replace typos in the content is also essential to ensure the dataset's quality for the learning algorithm. other learning algorithms are available depending on the target classes and the size of the dataset provided. parameter tuning should be performed using gscv with more combinations, since the dataset's target case will differ from our research. the remaining stages of research group recommendation are repeatable as is.

table 3. performance comparison
no | input type | accuracy | precision | recall | f1-score
1 | title | 63.68% | 64.91% | 63.68% | 63.46%
2 | abstract | 61.05% | 61.16% | 61.05% | 60.73%
3 | title+abstract | 62.89% | 63.17% | 62.89% | 62.57%

when sisinta implements a recommendation of a research group based on user input, the initial procedure of the thesis or final project proposal can be done in seconds. this can also help lecturers in the research group provide more elaborate and comprehensive comments, within their scope of knowledge, regarding the proposals. any revisions required for a proposal will then be relevant and constructive, steering the research in the right direction. overall, this automatic instruction can make sisinta an intelligent information system for educational purposes. not only is it applicable in deei, but this approach should also be applicable in other departments as long as there are good platforms and data.

iv. conclusion
this research showed that we successfully applied the multinomial logistic regression (mlr) algorithm to predict the research group based on text input, either the title or the thesis abstract. the stages we followed in the text mining technique were straightforward, and mlr performed adequately well in classifying 13 research groups. the best scenario in this study was the mlr with the title as the input variable. using title data as the model training scenario is considered adequate, optimal, and efficient, because repeated words (except stopwords) are rare within a thesis title. with performances just above 63% in overall metrics, we argue that this mlr model with title text input is optimal due to its small dimensionality. however, the relatively low performances, below the 70% threshold, were limited by research groups sharing similar keywords and by typos inside the dataset. these typos can become noise or must be mapped back to the core word; therefore, additional text preprocessing should consider them.

declarations
author contribution: all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement: this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest: the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information: reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references
[1] h. a. rosyid, u.
pujianto, and m. r. yudhistira, "classification of lexile level reading load using the k-means clustering and random forest method," kinet. game technol. inf. syst. comput. network, comput. electron. control, pp. 139–146, may 2020.
[2] m. taddy, "multinomial inverse regression for text analysis," j. am. stat. assoc., vol. 108, no. 503, pp. 755–770, 2013.
[3] h. chai, y. liang, s. wang, and h. shen, "a novel logistic regression model combining semi-supervised learning and active learning for disease classification," sci. rep., vol. 8, no. 1, p. 13009, aug. 2018.
[4] w. p. ramadhan, a. novianty, and c. setianingsih, "sentiment analysis using multinomial logistic regression," in 2017 international conference on control, electronics, renewable energy and communications (iccrec), sep. 2017, pp. 46–49.
[5] s. a. salloum, m. al-emran, a. a. monem, and k. shaalan, "using text mining techniques for extracting information from research articles," in studies in computational intelligence, 2018, pp. 373–397.
[6] v. dogra, a. singh, s. verma, kavita, n. z. jhanjhi, and m. n. talib, "understanding of data preprocessing for dimensionality reduction using feature selection techniques in text classification," in intelligent computing and innovation on data science, 2021, pp. 455–464.
[7] y. hacohen-kerner, d. miller, and y. yigal, "the influence of preprocessing on text classification using a bag-of-words representation," plos one, vol. 15, no. 5, p. e0232525, may 2020.
[8] p. f. muhammad, r. kusumaningrum, and a. wibowo, "sentiment analysis using word2vec and long short-term memory (lstm) for indonesian hotel reviews," procedia comput. sci., vol. 179, pp. 728–735, 2021.
[9] j. lever et al., "pgxmine: text mining for curation of pharmgkb," pac symp biocomput, no. 25, pp. 611–622, 2020.
[10] s. vijayaraghavan et al., "fake news detection with different models," arxiv, 2020.
[11] "relearn: a robust machine learning framework in presence of missing data for multimodal stress detection from physiological signals," in 2021 43rd annual international conference of the ieee engineering in medicine & biology society (embc), nov. 2021, pp. 535–541.
[12] p. r. vishnu, p. vinod, and s. y. yerima, "a deep learning approach for classifying vulnerability descriptions using self attention based neural network," j. netw. syst. manag., vol. 30, no. 1, p. 9, jan. 2022.
[13] h. inoue, "multi-sample dropout for accelerated training and better generalization," arxiv, 2019.
[14] g. n. r. prasad, "identification of bloom's taxonomy level for the given question paper using nlp tokenization technique," turkish j. comput. math. educ., vol. 12, no. 13, pp. 1872–1875, 2021.
[15] y. a. alhaj, j. xiang, d. zhao, m. a. a. al-qaness, m. abd elaziz, and a. dahou, "a study of the effects of stemming strategies on arabic document classification," ieee access, vol. 7, pp. 32664–32671, 2019.
[16] m. adriani, j. asian, b. nazief, s. m. m. tahaghoghi, and h. e. williams, "stemming indonesian," acm trans. asian lang. inf. process., vol. 6, no. 4, pp. 1–33, dec. 2007.
[17] m. a. rosid, a. s. fitrani, i. r. i. astutik, n. i. mulloh, and h. a. gozali, "improving text preprocessing for student complaint document classification using sastrawi," iop conf. ser. mater. sci. eng., vol. 874, no.
1, p. 012017, jun. 2020.
[18] j. m.-t. wu, g. srivastava, j. c.-w. lin, and q. teng, "a multi-threshold ant colony system-based sanitization model in shared medical environments," acm trans. internet technol., vol. 21, no. 2, pp. 1–26, jun. 2021.
[19] s. qaiser and r. ali, "text mining: use of tf-idf to examine the relevance of words to documents," int. j. comput. appl., vol. 181, no. 1, pp. 25–29, jul. 2018.
[20] n. s. mohd nafis and s. awang, "an enhanced hybrid feature selection technique using term frequency-inverse document frequency and support vector machine-recursive feature elimination for sentiment classification," ieee access, vol. 9, pp. 52177–52192, 2021.
[21] m. umer et al., "scientific papers citation analysis using textual features and smote resampling techniques," pattern recognit. lett., vol. 150, pp. 250–257, oct. 2021.
[22] g. s. k. ranjan, a. kumar verma, and s. radhika, "k-nearest neighbors and grid search cv based real time fault monitoring system for industries," in 2019 ieee 5th international conference for convergence in technology (i2ct), mar. 2019, pp. 1–5.
[23] b. h. shekar and g. dagnew, "grid search-based hyperparameter tuning and classification of microarray cancer data," in 2019 second international conference on advanced computational and communication paradigms (icaccp), feb. 2019, pp. 1–8.
[24] m. p. geetha and d. karthika renuka, "improving the performance of aspect based sentiment analysis using fine-tuned bert base uncased model," int. j. intell. networks, vol. 2, pp. 64–69, 2021.
[25] a. w. pradana and m. hayaty, "the effect of stemming and removal of stopwords on the accuracy of sentiment analysis on indonesian-language texts," kinetik: game technology, information system, computer network, computing, electronics, and control, pp. 375–380, oct. 2019, doi: 10.22219/kinetik.v4i4.912.
[26] j. jumadi, d. s. maylawati, l. d. pratiwi, and m. a. ramdhani, "comparison of nazief-adriani and paice-husk algorithm for indonesian text stemming process," iop conf. ser. mater. sci. eng., vol. 1098, no. 3, p. 032044, mar. 2021.

knowledge engineering and data science (keds) pissn 2597-4602
vol 5, no 2, december 2022, pp. 168–178 eissn 2597-4637
https://doi.org/10.17977/um018v5i22022p168-178
©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

traffic density prediction using iot-based double exponential smoothing
rosa andrie asmara a,1,*, noprianto a,2, muhammad ainur ilmy a,3, kohei arai b,4
a informatics engineering study program, information technology department, state polytechnic of malang, jl. soekarno hatta no.9 malang 65141, indonesia
b information science department, saga university, 1 honjou 840-0027, saga, japan
1 rosa_andrie@polinema.ac.id*; 2 noprianto@polinema.ac.id; 3 polinema1641720019@gmail.com; 4 arai@cc.saga-u.ac.jp
* corresponding author

i. introduction
the total number of automobiles on the road keeps growing at a high rate year after year, and the tremendous population increase results in high traffic density [1]. many cars exceed the capacity of a road segment, lowering the amount of space that is free of traffic [2] and lengthening vehicle queues [3], which may slow or stop the mobility of vehicles.
an increase in vehicle flow, which often occurs in response to increased demand for transportation during a specific period, may be used to detect the presence of traffic density. congestion on the roads results in substantial losses, the most significant being the increase in time spent in traffic [4], which entails considerable societal costs [5], including operating expenses [6], wasted time [7], air pollution [8], accident rates [9], noise [10], and pedestrian discomfort [11]. it is essential to have technology that can estimate the number of cars currently in traffic. such technology can serve motorcyclists, police, the government, and other connected parties as a source of information and assessment data. to solve these issues, a system is required that can identify and forecast the number of cars present within a particular time window. forecasting aims to make an educated guess as to what will take place in the future by using pertinent information from previous periods. forecasting systems are built with various methodologies, including moving averages [12], trend projection [13], and exponential smoothing [14]. exponential smoothing is a time series forecasting approach for univariate data that may be expanded to accommodate data with a systematic trend or seasonal component [15]. the method was initially developed for univariate data but has since been adapted for multivariate data [16]. it is an effective way of forecasting that may be used as an alternative to the widely used box-jenkins arima family of approaches. projections made with this method over longer periods are often highly inaccurate, which is why exponential smoothing is typically reserved for shorter-term projections.
article info: article history: received 1 november 2022; revised 7 december 2022; accepted 11 december 2022; published online 30 december 2022. keywords: traffic density; haar cascade; double exponential smoothing; mean absolute percentage error; website application.

abstract: the number of vehicles and traffic flows that tend to increase cause traffic density. a system is proposed to calculate the number of vehicles and predict real-time traffic density. this research uses the haar cascade to detect the number of cars and motorcycles and double exponential smoothing (des) to forecast the number of vehicles on the road. mape describes the forecasting accuracy and serves as the basis for selecting the best smoothing constant (alpha). the best test results from june 13 to 20, 2020, are cars on june 14, 2020 (alpha 0.5, mape 0%) and motorcycles on june 18, 2020 (alpha 0.5, mape 0.1134%). the most significant mape result for cars was on june 15, 2020, with alpha 0.5 and mape 2.1073%. over 3 minutes, the haar cascade detects 72.58% of cars and 81.90% of motorcycles. this is an open-access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

when smoothing time series data using an exponential function, the weights given to observations become progressively lower as they move from the most recent toward the oldest [17]. the more time has passed since the data was collected, the less importance (weight) it is assigned [18]. more recent data is given greater weight since it is considered more relevant [19]. the smoothing settings determine the weights assigned to observations [20].
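this weighting scheme is visible in the single exponential smoothing recursion s_t = alpha*x_t + (1 - alpha)*s_{t-1}: expanding it shows the k-th newest observation carrying a weight of alpha*(1 - alpha)**k. a minimal sketch (the function name ses is ours):

```python
def ses(series, alpha):
    """Single exponential smoothing: the smoothed value is a geometrically
    weighted average; each step multiplies older weights by (1 - alpha)."""
    s = series[0]                     # initialise with the first observation
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
    return s
```

for example, ses([1, 2, 3], 0.5) gives 2.25: the newest point (3) carries weight 0.5, the next (2) weight 0.25, and the oldest weight 0.25.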
based on previous findings, when dealing with data that exhibits trends, the single smoothing technique is often less trustworthy than the double procedure. therefore, this study promotes double exponential smoothing (des) as an iot-based prediction system. this strategy is beneficial for short-term and medium-term prediction, particularly when many outcomes are required, and it is well supported by data that follows a linear trend. a system for predicting traffic density may be designed using the des approach; the forecast can then indicate whether the traffic density will increase or decrease. using this forecasting technique, overcoming these challenges and roadblocks becomes possible.

ii. method
the stages of research in designing a traffic density prediction system using the iot-based double exponential smoothing method can be seen in figure 1.

fig. 1. the stages of research

a. data collection
data collection was carried out according to the source and type required. data collection in this study used quantitative variables/data instruments; that is, data were obtained regularly until the end of the study. counts of the number of vehicles per 3 seconds were taken by observing the raspberry pi test in the field as a data parameter. this 3-second timeframe was chosen to accommodate the computation load on the raspberry pi. in making the haar cascade training data, we recorded the traffic flow through the webcam according to the test hours, then took the objects of motorcycles, bikes, and cars from the video frames. data training was carried out on as many as 7x cars and 3x motorcycles. in this study, an application is made to detect and count the number of vehicles per 3 seconds and display the data as forecasting charts on a website.
the data used as training and test data were obtained from application testing during the traffic determination process and observations at the research site. the input variables used are webcam coordinates, roi coordinates, trigger line coordinates, and alpha values. the training data collection aims to obtain the types and patterns of vehicle objects at the test site with a ratio of 100:1000, namely 100 positive and 1000 negative objects. the haar cascade training data results are in the form of xml files for motorcycles and cars. the stages of object training in this research are:
• take positive images through the video frames
• gather negative images
• minimize the pixel sizes of the positive and negative images
• convert the positive and negative images to greyscale
• conduct training using the cascade trainer gui application
some positive and negative training samples can be seen in figure 2 to figure 4. figure 2 shows positive training images of cars, figure 3 shows motorcycle samples, and figure 4 shows negative samples (scenes without cars or motorcycles).
fig. 2. car samples
fig. 3. motorcycle samples
fig. 4. negative samples
b. haar cascade classifier
the haar-like feature, also known as the haar cascade classifier, is a rectangular (square) feature that gives a specific indication of an image [21]. the haar cascade classifier comes from the idea of paul viola and michael jones, hence the name viola–jones method [22]. the idea of the haar-like feature is to recognize objects based on simple feature values rather than the pixel values of the object's image [23].
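the haar-like feature evaluates sums over rectangles rather than individual pixels; with an integral image, any rectangle sum costs only four lookups. a minimal pure-python sketch (the two-rectangle edge feature is one illustrative choice, not the exact feature set used in the paper):

```python
# build an integral image so any rectangle sum needs only 4 lookups,
# then evaluate a simple two-rectangle haar-like feature (left minus right).

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """sum of pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def edge_feature(ii, x, y, w, h):
    """two-rectangle haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = [[10, 10, 200, 200],
       [10, 10, 200, 200]]
ii = integral_image(img)
```

the strong negative response of `edge_feature` on this image reflects the dark-to-bright vertical edge, which is exactly the kind of pattern the cascade's weak classifiers threshold on.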
this method has the advantage of high-speed computation because it depends only on the number of pixels in a square, not on every pixel value of an image [24]. this method uses a statistical model (classifier). the approach to detecting objects in images combines four primary keys, namely the haar-like feature, the integral image, adaboost learning, and the cascade classifier, as can be seen in figure 5.
fig. 5. haar cascade classifier
the haar cascade process can be seen in figure 6. the haar method requires two types of object images in the training process to detect an object. positive samples contain the image of the object to be detected; for example, positive samples contain knife images if one wants to detect a knife. negative samples contain images other than the object to be recognized; they are generally background images such as walls and scenery. negative samples are recommended to have the same resolution as the camera. the haar method training uses those two types of samples [25]. the information from the training is then converted into statistical model parameters. a cascade classifier is a chain of stage classifiers, where each stage classifier is used to detect an object of interest in an image sub-window.
fig. 6. haar cascade process
c. double exponential smoothing
double exponential smoothing (des) is an improvement of the exponential smoothing (es) method. es (makridakis, 1999) is a continuous procedure that exponentially decreases the weight given to older observations [26]. the des method is a linear model invented by brown. this method performs the smoothing process twice. the des method is usually used to predict data with trends, especially data patterns that are likely to rise. the des method can model trends and levels from a time series more efficiently than other methods.
des requires less data and uses only one parameter, which keeps it simple. however, des requires parameter optimization: finding the most optimal α (alpha) takes time [27]. the steps in calculating using the des method are as follows. determine the first smoothing value s′t as in (1).

s′t = α·xt + (1 − α)·s′t−1 (1)

determine the second smoothing value s″t as in (2).

s″t = α·s′t + (1 − α)·s″t−1 (2)

where s′t is the first smoothing value for period t, α is the exponential weighting constant, xt is the actual value of period t, s′t−1 is the first smoothing value for period t − 1, s″t is the double smoothing value of period t, and s″t−1 is the double smoothing value of period t − 1. specify the constant value at as in (3).

at = 2·s′t − s″t (3)

specify the slope value bt as in (4).

bt = (α / (1 − α)) · (s′t − s″t) (4)

determine the forecasting value as in (5).

ft+m = at + bt·m (5)

where ft+m is the forecasting value and m is the number of future periods to be predicted.
d. system design of the internet of things (iot)
the internet of things (iot) is a computing concept describing everyday physical objects connected to the internet and able to identify themselves to other devices [28]. iot is significant because a digitally represented object becomes more extensive than the object by itself [29]. the object is no longer related only to its user; it is now connected to the surrounding objects and database data. in this research, the system has three stages: monitoring, process automation, and controlling. in the monitoring process, the system recognizes vehicle objects using haar cascade detection via a webcam installed on the raspberry pi to determine the number of vehicles per 3 seconds. the data are then transferred to the database on the mysql server via the network. at the automation stage, the data in the mysql server are taken to carry out the forecasting process using the double exponential smoothing method to produce forecasting output values on the website.
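the des steps (1)–(5) above can be sketched directly in python. the initialization s′₁ = s″₁ = x₁ is a common convention, not something the paper specifies:

```python
# brown's double exponential smoothing, following (1)-(5):
# s1: first smoothing, s2: second smoothing, a: level, b: slope.

def des_forecast(xs, alpha, m=1):
    """forecast m periods past the end of the series xs."""
    s1 = s2 = xs[0]                          # common initialization choice
    for x in xs[1:]:
        s1 = alpha * x + (1 - alpha) * s1    # (1) first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2   # (2) second smoothing
    a = 2 * s1 - s2                          # (3) level
    b = alpha / (1 - alpha) * (s1 - s2)      # (4) slope
    return a + b * m                         # (5) forecast

trend = [2 * t for t in range(60)]           # perfectly linear series
pred = des_forecast(trend, alpha=0.5, m=1)
```

on a noiseless linear series the level and slope converge to the true values, so the one-step forecast matches the next point, which is exactly why des suits trending data.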
the final stage is controlling: a control facility that the user can manage to configure the system as expected and to override the automation if the traffic does not behave as the user desires. the software and hardware used in this system can be seen in table 1 and table 2, respectively.
table 1. software specifications
1. text editor: to make it easier to write programs and develop applications on windows and raspbian
2. local server: serves as a server consisting of the mysql database and a php language translator
3. remote desktop protocol: windows application to perform remote desktop on a device with a different operating system
4. cascade trainer gui: image data trainer application for vehicle detection using the haar cascade
5. raspbian os: operating system that manages all computer activities on the raspberry pi
6. python ide: application for writing program code in the python programming language
7. opencv library: software library aimed at real-time dynamic image processing
table 2. hardware specifications
1. raspberry pi 3 model b+: performs the computational process
2. usb cable: connects power to the raspberry pi
3. raspberry pi heatsink: spreads the heat of the raspberry pi processor so that the temperature does not get too high
4. raspberry pi cooling fan type c: speeds up air circulation over the raspberry pi processor
5. power bank 10000 mah: power source for the raspberry pi
6. logitech c290 webcam: carries out the vehicle detection process in real time
7. gorilla tripod: support that makes finding an image capture angle for the webcam easier
8. lan cable: used as a local network during data retrieval for eight days
e. evaluation metric
the evaluation metric in this research is mape, which measures the accuracy of a prediction. mape is used to evaluate forecasting accuracy in percentage terms.
the interpretation of the mape value is as follows [30]:
a. < 10% = very accurate forecast
b. 10%–20% = accurate forecast
c. 20%–50% = forecast is quite accurate
d. > 50% = forecast is not accurate
calculating mape shows the forecast's accuracy in percentage form by determining the pe (percentage error). pe determines the percentage error of the forecast. the formulas for calculating pe and mape are given in (6) and (7).

pet = ((xt − ft) / xt) × 100 (6)

mape = (Σt=1…n |pet|) / n (7)

where n is the number of periods, xt is the true value in period t, and ft is the forecast value in period t.
iii. results and discussion
this system testing is carried out to provide an accurate system in automation and quick and easy settings for users of these systems and applications. the system has been tested at the place where it is implemented, namely at the crossing bridge on jl. basuki rahmat, klojen, malang city, from 10:00 to 10:30 am. the system can detect cars and motorcycles and report the data every 3 seconds on the website in real time. monitoring the number of vehicles can also be done on the website application in real time, as seen in figure 7.
fig. 7. test system
the detection test on the haar cascade is carried out by comparing the application calculation and manual calculations over eight days of experiments with a period of 3 minutes to obtain a comparison and an estimate of the calculation accuracy. table 3 is a detection test analysis.
table 3.
haar cascade detection accuracy percentage
date | condition | haar cascade (car / motorbike) | manual calculation (car / motorbike)
sat, june 13 2020 | sunny | 26 / 72 | 30 / 61
sun, june 14 2020 | cloudy | 10 / 43 | 26 / 39
mon, june 15 2020 | cloudy | 11 / 12 | 14 / 22
tue, june 16 2020 | cloudy | 13 / 26 | 24 / 72
wed, june 17 2020 | sunny | 32 / 60 | 30 / 71
thu, june 18 2020 | sunny | 19 / 59 | 23 / 45
fri, june 19 2020 | cloudy | 11 / 39 | 17 / 60
sat, june 20 2020 | cloudy | 13 / 60 | 22 / 83
total | | 135 / 371 | 186 / 453
table 3 shows that the manual calculation and the haar cascade counts over the eight days from june 13 to 20, 2020, have similar patterns. it can be concluded that car detections are outnumbered by motorcycle detections. the total of detected cars is 135, while the total of manually calculated cars is 186. similarly, the number of motorcycles is 371 for the detection and 453 for the manual calculation. data communication testing is done by comparing the sending time on the raspberry pi with the website application to determine the accuracy of the data transmission times per 3 seconds. the results can be seen in table 4.
table 4. data communication
date | sending time | receiving time
2020-06-14 | 10:31:04 | 10:31:04
2020-06-14 | 10:31:01 | 10:31:01
2020-06-14 | 10:30:57 | 10:30:57
2020-06-14 | 10:30:54 | 10:30:54
table 5 shows the results of the fps test on the raspberry pi. the result determines the computing capabilities of the raspberry pi and the webcam with a resolution of 427 × 240. the fps test in table 5 is obtained every 3 seconds. the highest fps recorded is 18, while the lowest is 12.
table 5. fps test
time | car data | motorcycle data | fps
10:10:02 | 0 | 0 | 14
10:10:05 | 2 | 1 | 16
10:10:09 | 0 | 0 | 15
10:10:12 | 1 | 1 | 18
10:10:15 | 0 | 1 | 14
10:10:19 | 0 | 0 | 12
10:10:24 | 0 | 0 | 15
10:10:27 | 0 | 2 | 17
10:10:31 | 0 | 3 | 13
this experiment is carried out by comparing the classification results for each vehicle, as can be seen in table 6.
table 6.
results of the forecasting application
alpha | class | category
0.1 | 137 | car
0.2 | 137 | car
0.3 | 427 | motorcycle
0.4 | 427 | motorcycle
0.5 | 137 | car
0.6 | 137 | car
0.7 | 427 | motorcycle
0.8 | 427 | motorcycle
0.9 | 137 | car
table 7 determines the suitability and accuracy of forecasting cars and motorbikes based on mape.
table 7. results of application mape
alpha | car (mape, pe) | motorcycle (mape, pe)
0.3 | 0, 2.4444 | 0, 3.6655
| 2.5, 6.0377358490566 | 4.833333333, 4.9586776859505
0.6 | 0, 1.75 | 0, 3.403
| 1.25, 3.0188679245283 | 4.0000000000001, 7.1900826446281
0.9 | 0, 4.9444 | 0, 7.4962
| 5, 12.075471698113 | 9.833333333, 10.413223140496
in discussing and testing the best alpha recommendations, the field test result data are compared with the best alpha recommendations in the application to obtain a small percentage of errors in real time for cars and motorbikes, as can be seen in table 8.
table 8. best alpha recommendations
date | car alpha | car pe | motorcycle alpha | motorcycle pe
sat, june 13 2020 | 0.4 | 1.0302 | 0.5 | 1.3158
sun, june 14 2020 | 0.5 | 0 | 0.4 | 8.5205
mon, june 15 2020 | 0.5 | 2.1073 | 0.5 | 3.2258
tue, june 16 2020 | 0.5 | 0.3846 | 0.6 | 3.1089
wed, june 17 2020 | 0.5 | 0.463 | 0.4 | 1.0338
thu, june 18 2020 | 0.4 | 1.7875 | 0.5 | 0.1788
fri, june 19 2020 | 0.5 | 1.2712 | 0.5 | 0.1134
sat, june 20 2020 | 0.5 | 1.5306 | 0.5 | 1.0791
in table 8, the application recommends the best alpha to get the smallest real-time pe value. the smallest pe for cars was on sunday, june 14, 2020, at alpha 0.5 with a pe value of 0; for motorcycles, on friday, june 19, 2020, at alpha 0.5 with a pe value of 0.1134. after conducting tests and trials to compare the calculation results of the system with manual calculations, the des method has been successfully applied to the real-time traffic density detection system. it can be used as information for service planning in overcoming traffic congestion by other related parties.
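the best-alpha selection described above, together with the pe and mape definitions in (6) and (7), can be sketched as a simple grid search. the candidate alphas and the des initialization below are illustrative assumptions, not the application's exact code:

```python
# grid-search the smoothing constant: produce one-step des forecasts for
# each candidate alpha and keep the alpha with the smallest mape.

def des_one_step_forecasts(xs, alpha):
    """one-step-ahead des forecasts for xs[1:], per equations (1)-(5)."""
    s1 = s2 = xs[0]
    forecasts = []
    for x in xs[1:]:
        a = 2 * s1 - s2
        b = alpha / (1 - alpha) * (s1 - s2)
        forecasts.append(a + b)                  # forecast made before seeing x
        s1 = alpha * x + (1 - alpha) * s1
        s2 = alpha * s1 + (1 - alpha) * s2
    return forecasts

def mape(actual, forecast):
    """mean absolute percentage error, equations (6)-(7)."""
    pes = [abs((x - f) / x) * 100 for x, f in zip(actual, forecast)]
    return sum(pes) / len(pes)

def best_alpha(xs, alphas=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    return min(alphas,
               key=lambda a: mape(xs[1:], des_one_step_forecasts(xs, a)))

series = [10 + 2 * t for t in range(30)]
alpha = best_alpha(series)
```

on real traffic counts the winning alpha depends on how noisy the series is, which is why the application re-evaluates the recommendation per day and per vehicle class.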
prediction results for the number of vehicles detected per 3 seconds, aggregated per 10 minutes, with the categories of cars and motorbikes from june 13 to 20, 2020, in the direction of alun-alun malang, give the best results for cars on june 14, 2020, with alpha 0.5 and mape 0% (a perfect prediction), and for motorcycles on june 18, 2020, with alpha 0.5 and mape 0.1134% (a highly accurate prediction). the most significant mape results were, for cars, on june 15, 2020, with alpha 0.5 and mape 2.1073%, and for motorbikes, on june 14, 2020, with alpha 0.4 and mape 8.5205%; both still fall in the "very accurate" range (< 10%). in the detection test, comparing the application and manual calculations during the test gives car detection at 72.58% and motorcycle detection at 81.90%. these results give good performance and can be used for traffic prediction systems on low-specification computers such as the raspberry pi [31] or other single-board computers.
iv. conclusion
the findings indicate that real-time traffic density detection based on an iot system can benefit from the application of the haar cascade classifier with the des technique. a system for detecting and monitoring vehicles at a specific location in real time using a webcam and a raspberry pi was implemented and tested. the detection test showed that the haar cascade algorithm was effective in detecting vehicles, with motorcycle detection performing better than car detection. the data communication testing showed that the system can transmit data accurately every 3 seconds. the fps test on the raspberry pi demonstrated that the system's computing capabilities were sufficient for processing data in real time. the forecasting application results showed that the system's accuracy in predicting vehicles was acceptable, with small percentage errors.
the study recommends using alpha values of 0.9 for cars and 0.6 for motorcycles to reduce the percentage error in real-time monitoring of vehicles. future research can implement other exponential smoothing variants such as the holt-winters or triple exponential smoothing method.
declarations
author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest. the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering, universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.
references
[1] m. mohsin, q. abbas, j. zhang, m. ikram, and n. iqbal, "integrated effect of energy consumption, economic development, and population growth on co2 based environmental degradation: a case of transport sector," environ. sci. pollut. res., vol. 26, no. 32, pp. 32824–32835, nov. 2019.
[2] m. sala and f. soriguera, "capacity of a freeway lane with platoons of autonomous vehicles mixed with regular traffic," transp. res. part b methodol., vol. 147, pp. 116–131, may 2021.
[3] d. göhlich et al., "integrated approach for the assessment of strategies for the decarbonization of urban traffic," sustainability, vol. 13, no. 2, p. 839, jan. 2021.
[4] d. albalate and x. fageda, "congestion, road safety, and the effectiveness of public policies in urban areas," sustainability, vol. 11, no. 18, p. 5092, sep. 2019.
[5] k. pyatkova, a. s. chen, d.
butler, z. vojinović, and s. djordjević, “assessing the knock-on effects of flooding on road transportation,” j. environ. manage., vol. 244, pp. 48–60, aug. 2019. [6] c. k. tang, “the cost of traffic: evidence from the london congestion charge,” j. urban econ., vol. 121, p. 103302, jan. 2021. [7] t. afrin and n. yodo, “a survey of road traffic congestion measures towards a sustainable and resilient transportation system,” sustainability, vol. 12, no. 11, p. 4660, jun. 2020. [8] c. sun, s. xu, m. yang, and x. gong, “urban traffic regulation and air pollution: a case study of urban motor vehicle restriction policy,” energy policy, vol. 163, p. 112819, apr. 2022. [9] c. onyeneke, “modeling the effects of traffic congestion on economic activities accidents, fatalities and casualties,” biomed. stat. informatics, vol. 3, no. 2, p. 7, 2018. [10] i. kaddoura and k. nagel, “simultaneous internalization of traffic congestion and noise exposure costs,” transportation (amst)., vol. 45, no. 5, pp. 1579–1600, sep. 2018. [11] v. shepelev, s. aliukov, k. nikolskaya, and s. shabiev, “the capacity of the road network: data collection and statistical analysis of traffic characteristics,” energies, vol. 13, no. 7, p. 1765, apr. 2020. [12] d. j. pedregal and j. r. trapero, “adjusted combination of moving averages: a forecasting system for medium-term solar irradiance,” appl. energy, vol. 298, p. 117155, sep. 2021. [13] a. c. adamuthe and g. t. thampi, “technology forecasting: a case study of computational technologies,” technol. forecast. soc. change, vol. 143, pp. 181–189, jun. 2019. [14] j. f. rendon-sanchez and l. m. de menezes, “structural combination of seasonal exponential smoothing forecasts applied to load forecasting,” eur. j. oper. res., vol. 275, no. 3, pp. 916–924, jun. 2019. [15] m. b. a. 
rabbani et al., “a comparison between seasonal autoregressive integrated moving average (sarima) and exponential smoothing (es) based on time series model for forecasting road accidents,” arab. j. sci. eng., vol. 46, no. 11, pp. 11113–11138, nov. 2021. [16] m. a. castán-lascorz, p. jiménez-herrera, a. troncoso, and g. asencio-cortés, “a new hybrid method for predicting univariate and multivariate time series based on pattern forecasting,” inf. sci. (ny)., vol. 586, pp. 611–627, mar. 2022. [17] a. r. s. parmezan, v. m. a. souza, and g. e. a. p. a. batista, “evaluation of statistical and machine learning models for time series prediction: identifying the state-of-the-art and the best conditions for the use of each model,” inf. sci. (ny)., vol. 484, pp. 302–337, may 2019. [18] h. tyralis, g. papacharalampous, and a. langousis, “super ensemble learning for daily streamflow forecasting: largescale demonstration and comparison with multiple machine learning algorithms,” neural comput. appl., vol. 33, no. 8, pp. 3053–3068, apr. 2021. [19] s. smyl, “a hybrid method of exponential smoothing and recurrent neural networks for time series forecasting,” int. j. forecast., vol. 36, no. 1, pp. 75–85, jan. 2020. [20] s. dhamodharavadhani and r. rathipriya, “region-wise rainfall prediction using mapreduce-based exponential smoothing techniques,” in advances in big data and cloud computing. advances in intelligent systems and computing, j. peter, a. alavi, and b. javadi, eds. springer, 2019, pp. 229–239. [21] purwono, a. ma’arif, and a. wulandari, “face shape-based physiognomy in linkedin profiles with cascade classifier and k-means clustering,” in 2021 8th international conference on electrical engineering, computer science and informatics (eecsi), oct. 2021, pp. 347–353. [22] b. fatima, a. r. shahid, s. ziauddin, a. a. safi, and h. ramzan, “driver fatigue detection using viola jones and principal component analysis,” appl. artif. intell., vol. 34, no. 6, pp. 456–483, may 2020. 
[23] a. arunmozhi and j. park, “comparison of hog, lbp and haar-like features for on-road vehicle detection,” in 2018 ieee international conference on electro/information technology (eit), may 2018, pp. 0362–0367. [24] y. morimoto, a. masaya, and m. ueki, “high-speed 3d shape measurement by one pitch phase analysis method using brightness values in small square area of single-shot image,” opt. lasers eng., vol. 113, pp. 38–46, feb. 2019. [25] h. filali, j. riffi, a. m. mahraz, and h. tairi, “multiple face detection based on machine learning,” in 2018 international conference on intelligent systems and computer vision (iscv), apr. 2018, pp. 1–8. [26] d. kucharavy, d. damand, and m. barth, “technological forecasting using mixed methods approach,” int. j. prod. res., pp. 1–25, jul. 2022. [27] g. airlangga, a. rachmat, and d. lapihu, “comparison of exponential smoothing and neural network method to forecast rice production in indonesia,” telkomnika (telecommunication comput. electron. control., vol. 17, no. 3, p. 1367, 2019. [28] r. mehta, j. sahni, and k. khanna, “internet of things: vision, applications and challenges,” procedia comput. sci., vol. 132, pp. 1263–1269, 2018. [29] r. minerva, g. m. lee, and n. crespi, “digital twin in the iot context: a survey on technical features, scenarios, and architectural models,” proc. ieee, vol. 108, no. 10, pp. 1785–1824, oct. 2020. 
[30] a. p. wibawa, z. n. izdihar, a. b. p. utama, l. hernandez, and haviluddin, "min-max backpropagation neural network to forecast e-journal visitors," in 2021 international conference on artificial intelligence in information and communication (icaiic), apr. 2021, pp. 052–058.
[31] r. a. asmara, b. syahputro, d. supriyanto, and a. n. handayani, "prediction of traffic density using yolo object detection and implemented in raspberry pi 3b + and intel ncs 2," in 2020 4th international conference on vocational education and training (icovet), sep. 2020, pp. 391–395.
knowledge engineering and data science (keds) pissn 2597-4602 vol 4, no 2, december 2021, pp. 69–84 eissn 2597-4637 https://doi.org/10.17977/um018v4i22021p69-84
©2021 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)
keds is a sinta 2 journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by the indonesian ministry of education, culture, research, and technology
parallel approach of adaptive image thresholding algorithm on gpu
adhi prahara a, 1, andri pranolo a, b, 2, *, nuril anwar a, 3, yingchi mao b, 4
a informatics department, universitas ahmad dahlan, jl. prof. dr. soepomo, s.h., janturan, warungboto, umbulharjo, yogyakarta 55164, indonesia
b college of computer and information, hohai university, 1 xikang road, nanjing, jiangsu 210098, china
1 adhi.prahara@tif.uad.ac.id*; 2 andri@hhu.edu.cn*; 3 nuril.anwar@tif.uad.ac.id; 4 maoyingchi@gmail.com
* corresponding author
i. introduction
segmentation is a process that partitions an image into segments [1]. segmentation is useful for changing an image representation into something more meaningful and easier to analyze, e.g., finding objects and boundaries. one of the methods to perform image segmentation is image thresholding. the method partitions an image into background and foreground using a given threshold. this process is also called binarization because the segmentation result is a binary image that maps "0" pixels as background and "1" pixels as foreground.
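the binarization described above can be illustrated with a minimal sketch (pure python, fixed threshold; the adaptive methods discussed next differ only in how the threshold is chosen):

```python
# binarize a grayscale image: pixels above the threshold become foreground (1),
# the rest become background (0).

def binarize(img, threshold):
    return [[1 if px > threshold else 0 for px in row] for row in img]

img = [[10, 200, 130],
       [255, 5, 128]]
mask = binarize(img, 128)
```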
in order to perform image thresholding, the threshold value can be determined manually by observation or experiment. however, in the adaptive image thresholding method, the threshold is generated using a specific algorithm. the algorithm involves per-pixel operations, histogram calculation, and an iterative procedure to search for the optimum threshold. therefore, it can be costly for a high-resolution image. some well-known adaptive image thresholding algorithms are otsu [2], the iterative self-organizing data analysis technique (isodata) [3], and minimum cross-entropy (mcet) [4]. the otsu method iteratively searches for the threshold that maximizes inter-class variance. the isodata method iteratively updates the threshold until the average inter-class distance is less than a given threshold or the maximum number of iterations is reached. the mcet method searches for the optimal threshold by calculating the cross-entropy for all possible thresholds and selecting the one with minimum cross-entropy. the methods have been used in many image processing applications [5][6][7][8][9][10][11][12] to perform automatic image segmentation.
article info. article history: submitted 25 december 2021; revised 28 december 2021; accepted 30 december 2021; published online 31 december 2021.
abstract. image thresholding is used to segment an image into background and foreground using a given threshold. the threshold can be generated using a specific algorithm instead of a pre-defined value obtained from observation or experiment. however, the algorithm involves per-pixel operations, histogram calculation, and an iterative procedure to search for the optimum threshold, which is costly for high-resolution images. in this research, parallel implementations on gpu for three adaptive image thresholding methods, namely otsu, isodata, and minimum cross-entropy, were proposed to optimize their computational times to deal with high-resolution images.
the approach involves parallel reduction and parallel prefix sum (scan) techniques to optimize the calculation. the proposed approach was tested on various sizes of grayscale images. the result shows that the parallel implementation of the three adaptive image thresholding methods on gpu achieves a 4–6× speedup compared to the cpu implementation, reducing the computational time significantly and effectively dealing with high-resolution images. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). keywords: adaptive image thresholding; computational time; graphics processing unit; image processing; parallel computing.
a. prahara et al. / knowledge engineering and data science 2021, 4 (2): 69–84
in image processing, achieving real-time performance is necessary, especially when processing streaming video or high-resolution images. a high-resolution image is a common product of satellite, aerial, biometric, and medical imaging, which is also often used in verification and segmentation processes. it is crucial to analyze the algorithm's complexity to know where it should be optimized to achieve real-time performance. advances in high-performance computing (hpc) technology allow the algorithm to be parallelized on a graphics processing unit (gpu). parallel computation can optimize the iterative and serial procedures in an algorithm. researchers have proposed parallel adaptive image thresholding methods for image segmentation. kanungo et al. [13] proposed a parallel genetic algorithm-based adaptive thresholding for image segmentation in uneven lighting conditions.
sandeli and batouche [14] proposed image thresholding using multilevel thresholding based on a parallel generalized island model (gim). nafaji et al. [15] used parallel local adaptive thresholding for the binarization of documents. upadhyay et al. [16] proposed an adaptive thresholding approach for image segmentation on gpu. all of them gained significant speedups in computational time compared to serial implementations. this research proposes a parallel implementation on gpu for three adaptive image thresholding methods: otsu, isodata, and mcet. our contribution lies in the parallel approach of the adaptive image thresholding methods on gpu to optimize their computational times to deal with high-resolution images. this paper is organized as follows: section 2 presents the proposed approach of parallel adaptive image thresholding methods, section 3 presents the result and discussion, and section 4 presents the conclusion of this work.
ii. method
adaptive image thresholding is a method to segment images using a threshold generated from a specific algorithm. the algorithm has the purpose of obtaining an optimal threshold for segmentation. in this research, some well-known adaptive image thresholding algorithms, namely otsu, isodata, and mcet, are parallelized to optimize performance on high-resolution images.
a. otsu method
the otsu method is proposed by [2] to perform automatic thresholding on a grayscale image. the otsu method iteratively searches for the threshold that maximizes inter-class variance. the steps to apply the otsu threshold are described below:
a) an image is converted into a normalized gray-level histogram using (1) and considered as a probability distribution, where the number of pixels in the i-th gray level is ni, the total number of pixels is N, and the probability of the i-th gray level is pi.
p_i = n_i / N (1)

b) Suppose the pixels are distributed into two classes (commonly background and foreground). For all possible thresholds k = 1 … L, the probability of class occurrence ω(k), the class mean level μ₀(k), and the inter-class variance σ_B²(k) can be calculated using (2), (3), and (4), respectively. Here, ω(k) and μ(k) are the zeroth-order and first-order cumulative moments of the histogram up to the k-th gray level, and μ_T = Σ_{i=1}^{L} i·p_i is the total mean level of the image.

ω(k) = Σ_{i=1}^{k} p_i (2)

μ₀(k) = (1/ω(k)) Σ_{i=1}^{k} i·p_i = μ(k)/ω(k) (3)

σ_B²(k) = [μ_T·ω(k) − μ(k)]² / (ω(k)·[1 − ω(k)]) (4)

c) Select the threshold k* that maximizes σ_B² using (5). This is the optimal threshold.

σ_B²(k*) = max_{1≤k<L} σ_B²(k) (5)

If L is the number of gray levels and N is the number of pixels in the image, the computational complexity of the Otsu method for grayscale image segmentation is given by the following operations:

a) Histogram initialization and histogram computation have computational complexities of O(L) and O(N), respectively.
b) Searching for the optimum threshold by maximizing the inter-class variance has a computational complexity of O(L).
c) Applying the Otsu threshold to the image has a computational complexity of O(N).

B. ISODATA Algorithm

The iterative self-organizing data analysis technique (ISODATA) was proposed by [3] to compute a global image threshold. The method uses an iterative procedure to update the threshold. Image segmentation using the ISODATA algorithm proceeds as follows:

a) Compute the gray-level histogram of the image.
b) Create initial segments by splitting the histogram into background and foreground segments using an initial threshold value T_0.
c) Calculate the mean of the background pixels μ_B and the mean of the foreground pixels μ_F.
d) Calculate a new threshold T by averaging the two means using (6).
T = (μ_B + μ_F) / 2 (6)

e) Repeat steps c and d until the change in the threshold value T is less than a given tolerance or the maximum number of iterations is reached.

The computational complexity of the ISODATA method for grayscale image segmentation, where L is the number of gray levels and N is the number of pixels in the image, is given by the following operations:

a) Histogram initialization and histogram computation have computational complexities of O(L) and O(N), respectively.
b) Updating the threshold until the average inter-class distance is less than a tolerance or the maximum number of iterations is reached has a computational complexity of O(Q), where Q is the number of iterations required by the algorithm.
c) Applying the ISODATA threshold to the image has a computational complexity of O(N).

C. Minimum Cross-Entropy Method

The minimum cross-entropy (MCET) method was proposed by [4] to select an optimal threshold. The method searches for the optimal threshold by calculating the cross-entropy for all possible thresholds and selecting the one with minimum cross-entropy. The procedure to apply the minimum cross-entropy method for image segmentation is described below:

a) Compute the normalized gray-level histogram of the image using (7), where the number of pixels at the i-th gray level is n_i, the total number of pixels is N, and the probability of the i-th gray level is p_i.

p_i = n_i / N (7)

b) Initialize the global entropy H of the gray-level histogram using (8), where a and b are the minimum and maximum gray-level intensities.

H = Σ_{i=a}^{b} i·p_i·log(i) (8)

c) Suppose the pixels are distributed into two classes, background and foreground, separated by a threshold T. If the mean of the pixel distribution below the threshold (background) is μ_B(T) and the mean of the pixel distribution above the threshold (foreground) is μ_F(T), then for all possible thresholds T = a … b, calculate the cross-entropy of the pixel distributions below and above the threshold using (9).
H_CE(T) = Σ_{i=a}^{T} i·p_i·log(i/μ_B(T)) + Σ_{i=T+1}^{b} i·p_i·log(i/μ_F(T)) (9)

d) Select the optimal threshold τ corresponding to the minimum of the cross-entropy using (10).

τ = arg min_{a≤T≤b} H_CE(T) (10)

If L is the number of gray levels and N is the number of pixels in the image, the computational complexity of the MCET method for grayscale image segmentation is given by the following operations:

a) Histogram initialization and histogram computation have computational complexities of O(L) and O(N), respectively.
b) Selecting the minimum cross-entropy from all possible thresholds has a computational complexity of O(L²).
c) Applying the MCET threshold to the image has a computational complexity of O(N).

D. Parallel Computing on GPU

The GPU (graphics processing unit) is a highly parallel architecture used to perform fast operations in computer graphics; it can now also be used for tasks other than graphics, which is known as GP-GPU (general-purpose computing on GPU) [17]. The best-known general-purpose parallel computing platform and programming model is the compute unified device architecture (CUDA) from NVIDIA. A GPU is highly parallel and multithreaded, has many processor cores, and has very high memory bandwidth. The difference between how the CPU and the GPU process data is shown in Figure 1(a) and Figure 1(b): the GPU devotes more transistors to data processing than to caching and flow control. A GPU is built as an array of streaming multiprocessors (SMs), and its work is organized into grids, blocks, and threads. Data-parallel processing maps data elements to parallel processing threads. Figure 1(c) shows the parallel processing threads on a GPU. A multithreaded program is partitioned into blocks of threads that execute independently of each other. Therefore, using the GPU, the adaptive image thresholding algorithms can be processed in parallel, reducing computational time.
Using the advantages of the GPU's parallel architecture, the adaptive image thresholding methods, which involve histogram calculation, cumulative sums, and searching for the minimum or maximum value in an array, can be optimized using parallel reduction and parallel prefix sum (scan) algorithms.

1) Parallel Reduction Algorithm

A parallel reduction algorithm can optimize the computation of an array's sum, minimum, and maximum. Parallel reduction processes half of the histogram bins in parallel at each iteration, giving a computational complexity of O(log(N)) in shared memory.

Fig. 1. GPU devotes more transistors to data processing [17]: (a) CPU data processing; (b) GPU data processing; and (c) GPU parallel processing

At each step, every element in one half of the bins is summed with (sum reduction) or compared to (min or max reduction) the corresponding element in the other half. The working set is halved at every iteration until all elements have been processed. Loop unrolling can optimize the threads once the processed data fits within a thread warp. The parallel sum reduction algorithm is illustrated in Figure 2.

2) Parallel Prefix Sum (Scan) Algorithm

A parallel prefix sum (scan) algorithm can be used to calculate the cumulative sum of the histogram in shared memory. The procedure of the parallel prefix sum (scan) algorithm is as follows:

a) Up-sweep (reduction) phase: sum every bin in the histogram with the bin to its right according to its stride. This step has a computational complexity of O(log(N)). The up-sweep (reduction) phase is illustrated in Figure 3.
b) Set the last bin in the histogram to zero.
c) Down-sweep phase: propagate the partial sums back down the tree, swapping each bin with the bin to its right according to its stride and accumulating the sum. This step also has a computational complexity of O(log(N)). The down-sweep phase is illustrated in Figure 4.
The parallel prefix sum (scan) algorithm as a whole has a computational complexity of O(2 log(N)): O(log(N)) in the up-sweep phase and O(log(N)) in the down-sweep phase.

Fig. 2. Illustration of the parallel sum reduction algorithm [18]

Fig. 3. Illustration of the up-sweep (reduction) phase [19]

III. Results and Discussion

The computational time of the adaptive image thresholding algorithms on GPU was tested on the FVC2004 (Fingerprint Verification Competition) dataset [20]. The dataset consists of several fingerprint images. Selected images in the dataset were resized to various sizes using bi-cubic interpolation. The proposed approach was built in C++ with the CUDA library and run on an Intel Core i7-7700HQ 2.8 GHz processor, 16 GB of RAM, and an NVIDIA GeForce GTX 1050. The GPU has the Pascal architecture with five streaming multiprocessors and compute capability 6.1.

A. Adaptive Image Thresholding Implementation

In this research, three adaptive image thresholding algorithms are implemented on GPU: Otsu, ISODATA, and MCET. The parallel approach is similar for the three methods except for the step that finds the optimum threshold used for binarization. First, the image data must be copied from host to device memory. Several kernels compute the histogram, the probability histogram, and the cumulative histogram, find the optimal threshold, and apply the threshold to the image. Finally, the binary image result is copied back from device to host memory. The implementation of the Otsu, ISODATA, and MCET methods on GPU is shown in Algorithm 1. As shown in Algorithm 1, the parallel approach uses several kernels, each performing a specific operation; this keeps the computation in short runs on the streaming multiprocessors and increases their availability. The number of threads per block and blocks per grid can be configured to run each kernel effectively.
Fig. 4. Illustration of the down-sweep phase [19]

This structure is also convenient for error handling, because each kernel execution can be monitored separately.

Algorithm 1. Implementation of the adaptive image thresholding method on GPU.

enum method ← otsu = 1, isodata = 2, mcet = 3
read image data and method
copy image data from host (CPU) to device (GPU)
set threshold ← 0
histogram ← compute histogram from image data
probability histogram ← compute probability histogram from the histogram
cumulative histogram ← compute cumulative sum histogram from the probability histogram
switch (method)
  case otsu
    threshold ← find the threshold that maximizes inter-class variance from the cumulative sum histogram
  case isodata
    threshold ← update the threshold until the average inter-class distance is less than a given tolerance or the maximum number of iterations is reached
  case mcet
    above-threshold and below-threshold means ← compute above-threshold and below-threshold means from the cumulative sum histogram
    cross-entropy histogram ← compute cross-entropy histogram from the above-threshold and below-threshold means
    threshold ← compute the index of minimum cross-entropy from the cross-entropy histogram
end switch
binary image ← apply threshold to image data
copy binary image from device (GPU) to host (CPU)

The highest computational complexity is O(N), which lies in the histogram computation and image thresholding steps. The parallel implementation of these steps reduces the computational time because the work is distributed across the total number of threads used for computation. The parallel approach to histogram computation on GPU is shown in Algorithm 2. Histogram computation uses the atomic addition function from CUDA and utilizes shared memory to store a partial histogram, which reduces the queue at the addition instruction level to the number of threads per block.
The partial histogram in shared memory is then merged in parallel into the histogram in global memory. This operation also uses atomic addition, which reduces the queue at the addition instruction level to the number of blocks in the grid. Without partial histogram computation in shared memory, the histogram computation is likely to have long queues and be forced into serial execution: a number of additions equal to the number of data elements would all target individual bins of the histogram. For a gray-level histogram, the number of histogram bins is fixed at 256, so the queue is proportional to the data and their distribution in the image. With partial histogram computation, the queue is reduced to the number of threads and blocks used.

Algorithm 2. The computation of the histogram on GPU.

gpu configuration
  block ← 256  // block size
  grid ← 256   // grid size
function compute the histogram
  read image data and image size
  t ← threadIdx.x
  n ← the number of histogram bins
  // histogram initialization with zeros
  allocate shared memory (smem) to store the histogram
  if t < n then
    the t-th index of smem histogram ← 0
  end if
  synchronize the threads
  p ← threadIdx.x + blockIdx.x * blockDim.x
  q ← blockDim.x * gridDim.x
  // compute partial histogram in shared memory
  while p < image size do
    r ← the p-th index of image data
    atomic addition of the r-th index of smem histogram with 1
    p ← p + q
  end while
  synchronize the threads
  // merge the partial histogram in shared memory into the histogram in global memory
  if t < n then
    atomic addition of the t-th index of histogram with the t-th index of smem histogram
  end if
end function

After the histogram is obtained, the probability histogram is computed by simple division.
The Otsu method uses the zeroth-order probability histogram, computed by dividing the value in every bin by the total number of data elements, and the first-order probability histogram, computed by multiplying the zeroth-order histogram by the corresponding gray level. The MCET method uses an entropy histogram, computed by multiplying the first-order histogram by the log of the gray level. The ISODATA method uses the zeroth-order and first-order histograms. The kernel configuration is one block with 256 threads to match the 256-bin histogram. The computation of the cumulative sum of the histogram uses the parallel prefix sum (scan) algorithm, which reduces the computational complexity from O(N) to O(2 log(N)). To avoid bank conflicts, it uses half the total number of histogram bins as the thread block size together with some offsets. A bank conflict occurs when two or more threads access the same memory bank address, forcing serial access to memory; with proper offsets, bank conflicts can be avoided. The computation of the cumulative sum of the histogram is shown in Algorithm 3.

Algorithm 3. The computation of the cumulative sum of the histogram on GPU.
gpu configuration
  block ← 256 / 2
  grid ← 1
function compute the cumulative sum of the histogram
  read probability histogram
  t ← threadIdx.x
  n ← the number of histogram bins
  offset ← 1
  p ← t
  q ← t + n / 2
  offset1 ← p >> 4
  offset2 ← q >> 4
  // load data to shared memory
  allocate shared memory (smem) to store the cumulative sum of the histogram
  the (p + offset1)-th index of smem cumulative sum of histogram ← the p-th index of probability histogram
  the (q + offset2)-th index of smem cumulative sum of histogram ← the q-th index of probability histogram
  synchronize the threads
  // up-sweep (reduction) phase
  for d = n >> 1 to d > 0 do
    synchronize the threads
    if t < d then
      p ← offset * (2 * t + 1) – 1
      q ← offset * (2 * t + 2) – 1
      p ← p + p >> 4
      q ← q + q >> 4
      the q-th index of smem cumulative sum of histogram ← the q-th index of smem cumulative sum of histogram + the p-th index of smem cumulative sum of histogram
    end if
    offset ← offset * 2
    d ← d >> 1
  end for
  // set the last element to zero
  if t = 0 then
    the (n – 1)-th index of cumulative sum of histogram ← the (n – 1 + (n – 1) >> 4)-th index of smem cumulative sum of histogram
    the (n – 1 + (n – 1) >> 4)-th index of smem cumulative sum of histogram ← 0
  end if
  // down-sweep phase
  for d = 1 to d < n do
    offset ← offset >> 1
    synchronize the threads
    if t < d then
      p ← offset * (2 * t + 1) – 1
      q ← offset * (2 * t + 2) – 1
      p ← p + p >> 4
      q ← q + q >> 4
      temp value ← the p-th index of smem cumulative sum of histogram
      the p-th index of smem cumulative sum of histogram ← the q-th index of smem cumulative sum of histogram
      the q-th index of smem cumulative sum of histogram ← the q-th index of smem cumulative sum of histogram + temp value
    end if
    d ← d * 2
  end for
  // copy data from shared memory to global memory
  the p-th index of cumulative sum of histogram ← the (p + 1 + (p + 1) >> 4)-th index of smem cumulative sum of histogram
  if q < n – 1 then
    the q-th index of cumulative sum of histogram ← the (q + 1 + (q + 1) >> 4)-th index of smem cumulative sum of histogram
  end if
end function

The computation that finds the optimal threshold from the cumulative sum of the histogram differs for each method. However, because all methods are histogram-based, a block with 256 threads is used to match the number of histogram bins. The Otsu method's search for the threshold that maximizes inter-class variance can be achieved with a parallel reduction that finds the index of the maximum inter-class variance. Algorithm 4 shows the computation of the inter-class variances on GPU.

Algorithm 4.
The computation of the inter-class variances on GPU.

gpu configuration
  block ← 256
  grid ← 1
function compute the inter-class variances
  read cumulative sums of the 0th-order and 1st-order probability histograms
  t ← threadIdx.x
  n ← the number of histogram bins
  // load data to shared memory
  allocate shared memory (smem) to store the cumulative sums of the 0th-order and 1st-order probability histograms
  smem cumulative sum of 0th-order histogram ← cumulative sum of 0th-order probability histogram
  smem cumulative sum of 1st-order histogram ← cumulative sum of 1st-order probability histogram
  smem value ← 0
  smem index ← t
  synchronize the threads
  // compute the inter-class variances
  numerator ← square of (the (n – 1)-th index of smem cumulative sum of 1st-order histogram * the t-th index of smem cumulative sum of 0th-order histogram – the t-th index of smem cumulative sum of 1st-order histogram)
  denominator ← the t-th index of smem cumulative sum of 0th-order histogram * (1 – the t-th index of smem cumulative sum of 0th-order histogram) + epsilon
  the t-th index of smem value ← numerator / denominator
  synchronize the threads
  // find the index of the maximum inter-class variance using a parallel reduction algorithm
  for s = blockDim.x / 2 to s > 0 do
    if t < s and the (t + s)-th index of smem value > the t-th index of smem value then
      the t-th index of smem index ← the (t + s)-th index of smem index
      the t-th index of smem value ← the (t + s)-th index of smem value
    end if
    synchronize the threads
    s ← s >> 1
  end for
  // get the index of the maximum value and copy it to global memory
  if t = 0 then
    threshold ← the 0th index of smem index
  end if
end function

At each iteration of the ISODATA method, the threads compute the mean of the data below and above the threshold, compute the new threshold, and compare the new threshold with the previous one.
If the difference between the thresholds is less than a given tolerance or the iteration count reaches the maximum number of iterations, the optimum threshold has been obtained. Algorithm 5 shows the ISODATA computation on GPU.

Algorithm 5. The computation of ISODATA on GPU.

gpu configuration
  block ← 256
  grid ← 1
function compute the isodata
  read cumulative sums of the 0th-order and 1st-order histograms and the maximum number of iterations
  t ← threadIdx.x
  n ← the number of histogram bins
  // load data to shared memory
  allocate shared memory (smem) to store the cumulative sums of the 0th-order and 1st-order histograms
  smem cumulative sum of 0th-order histogram ← cumulative sum of 0th-order histogram
  smem cumulative sum of 1st-order histogram ← cumulative sum of 1st-order histogram
  smem means below threshold ← 0
  smem means above threshold ← 0
  smem value ← 0
  synchronize the threads
  // compute all possible below-threshold and above-threshold means
  if t < n – 1 then
    the t-th index of smem means below-threshold ← floor((the t-th index of smem cumulative sum of 1st-order histogram / (the t-th index of smem cumulative sum of 0th-order histogram + epsilon)) + 0.5)
    numerator ← the (n – 1)-th index of smem cumulative sum of 1st-order histogram – the (t + 1)-th index of smem cumulative sum of 1st-order histogram
    denominator ← the (n – 1)-th index of smem cumulative sum of 0th-order histogram – the (t + 1)-th index of smem cumulative sum of 0th-order histogram + epsilon
    the t-th index of smem means above-threshold ← floor((numerator / denominator) + 0.5)
  end if
  synchronize the threads
  // compute the average inter-class means
  the t-th index of smem value ← floor(((the t-th index of smem means below-threshold + the t-th index of smem means above-threshold) / 2) + 0.5)
  synchronize the threads
  // iterate, comparing the current threshold with the previous threshold
  if t = 0 then
    iteration ← 0
    difference ← 1
    t ← floor((the (n – 1)-th index of cumulative sum of 1st-order histogram / (the (n – 1)-th index of cumulative sum of 0th-order histogram + epsilon)) + 0.5)
    while difference > 0 and iteration < maximum number of iterations do
      difference ← absolute of (the t-th index of smem value – threshold)
      threshold ← the t-th index of smem value
      t ← the t-th index of smem value
      iteration ← iteration + 1
    end while
  end if
  synchronize the threads
end function

The cross-entropy computation uses the parallel sum reduction algorithm to compute the sums of the above-threshold and below-threshold entropies from the histogram. These sums are used to compute the entropy histogram. To compute all possible thresholds in parallel (iterating over all possible thresholds while performing the parallel sum reduction that computes the above-threshold and below-threshold sums), the configuration is set to one block of 256 threads and a grid of 256 blocks. Algorithm 6 shows the cross-entropy computation on GPU.

Algorithm 6. The computation of the cross-entropy on GPU.
gpu configuration
  block ← 256
  grid ← 256
function compute the cross-entropy
  read the 0th-order probability histogram and the cumulative sums of the 0th-order and 1st-order probability histograms
  b ← blockIdx.x
  t ← threadIdx.x
  n ← the number of histogram bins
  // load data to shared memory
  allocate shared memory (smem) to store the below-threshold and above-threshold sums and entropies
  data below-threshold ← the b-th index of cumulative sum of 1st-order probability histogram / (the b-th index of cumulative sum of 0th-order probability histogram + epsilon)
  data above-threshold ← (the (n – 1)-th index of cumulative sum of 1st-order probability histogram – the b-th index of cumulative sum of 1st-order probability histogram) / (the (n – 1)-th index of cumulative sum of 0th-order probability histogram – the b-th index of cumulative sum of 0th-order probability histogram + epsilon)
  the t-th index of smem below-threshold entropy ← 0
  the t-th index of smem above-threshold entropy ← 0
  synchronize the threads
  // compute the above-threshold and below-threshold entropies
  if t > b and data above-threshold > 0 then
    the t-th index of smem above-threshold entropy ← (t + 1) * the t-th index of 0th-order probability histogram * log of (data above-threshold)
  end if
  if t <= b and data below-threshold > 0 then
    the t-th index of smem below-threshold entropy ← (t + 1) * the t-th index of 0th-order probability histogram * log of (data below-threshold)
  end if
  synchronize the threads
  // perform the parallel sum reduction
  for s = blockDim.x / 2 to s > 0 do
    if t < s then
      the t-th index of smem above-threshold entropy ← the t-th index of smem above-threshold entropy + the (t + s)-th index of smem above-threshold entropy
      the t-th index of smem below-threshold entropy ← the t-th index of smem below-threshold entropy + the (t + s)-th index of smem below-threshold entropy
    end if
    synchronize the threads
    s ← s >> 1
  end for
  // compute the cross-entropy
  if t = 0 then
    the b-th index of cross-entropy histogram ← global entropy – the 0th index of smem above-threshold entropy – the 0th index of smem below-threshold entropy
  end if
end function

Finding the index of the minimum cross-entropy can be done using a parallel reduction that compares one half of the histogram bins with the other half, halving the number of bins at every iteration. Algorithm 7 shows the computation that finds the index of the minimum cross-entropy on GPU.

Algorithm 7. The computation to find the index of the minimum cross-entropy on GPU.
gpu configuration
  block ← 256
  grid ← 1
function find the index of minimum cross-entropy
  read cross-entropy histogram
  t ← threadIdx.x
  // load data to shared memory
  allocate shared memory (smem) to store the cross-entropy histogram
  smem cross-entropy histogram ← cross-entropy histogram
  smem index ← t
  synchronize the threads
  // find the index of the minimum value using reduction
  for s = blockDim.x / 2 to s > 0 do
    if t < s and the (t + s)-th index of smem cross-entropy histogram < the t-th index of smem cross-entropy histogram then
      the t-th index of smem cross-entropy histogram ← the (t + s)-th index of smem cross-entropy histogram
      the t-th index of smem index ← the (t + s)-th index of smem index
    end if
    synchronize the threads
    s ← s >> 1
  end for
  // copy the result to global memory
  if t = 0 then
    threshold ← the 0th index of smem index
  end if
end function

The image thresholding step is parallelized using thread-level parallelism on GPU. The approach is practical because the operation is independent for each pixel. The result of image thresholding is a binary image: "1" for pixels above the threshold and "0" for pixels below it. Algorithm 8 shows the implementation of image thresholding on GPU.

Algorithm 8. The implementation of image thresholding on GPU.

gpu configuration
  block ← 256
  grid ← 256
function apply the threshold to the image
  read image data, image size, and threshold
  t ← threadIdx.x + blockIdx.x * blockDim.x
  s ← blockDim.x * gridDim.x
  // create the binary image by thresholding
  while t < image size do
    if the t-th index of image data < threshold then
      the t-th index of binary image ← 0
    else
      the t-th index of binary image ← 1
    end if
    t ← t + s
  end while
end function

B. Adaptive Image Thresholding Results

The parallel adaptive image thresholding methods were tested on selected images from the FVC2004 (Fingerprint Verification Competition) dataset [20].
The result of the adaptive image thresholding implementation is a binary image, as shown in Figure 5, where (a) is the fingerprint image, (b) is the binary image generated by the Otsu method with threshold = 154, (c) is the binary image generated by the ISODATA method with threshold = 156, and (d) is the binary image generated by the MCET method with threshold = 123. As shown in Figure 5, the methods produce different optimal thresholds because the algorithms used to search for the optimum threshold differ.

Fig. 5. The result of the adaptive image thresholding implementation: (a) fingerprint image; (b) Otsu method with threshold = 154; (c) ISODATA method with threshold = 156; and (d) MCET method with threshold = 123

C. Computational Time Evaluation

The test was conducted on selected images from the FVC2004 (Fingerprint Verification Competition) dataset [20]. The images were resized to generate various image sizes, namely 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096. The purpose of this experiment is to measure the computational time of the proposed parallel adaptive image thresholding methods when dealing with a large number of data (pixels). The computational time evaluation on CPU and GPU is shown in Figure 6, where (a) is the Otsu method, (b) is the ISODATA method, and (c) is the MCET method. By implementing the adaptive image thresholding methods on GPU, the proposed parallel approach gains a 4-6 times speedup over the CPU implementation. The performance gain increases significantly when dealing with larger data. The results show that the parallel approach to adaptive image thresholding on GPU allows image segmentation to be processed in real-time, even for high-resolution images.

Fig. 6.
Performance evaluation of the adaptive image thresholding implementations (computational time in seconds for image sizes from 256×256 to 4096×4096): (a) comparison of the Otsu method on CPU and GPU; (b) comparison of the ISODATA method on CPU and GPU; and (c) comparison of the MCET method on CPU and GPU

IV. Conclusion

Image processing applications that perform segmentation usually take high-resolution images, such as satellite, aerial, biometric, or medical images, as input. A segmentation method that involves per-pixel operations and iterative procedures can be costly when handling the many data points (pixels) of a high-resolution image. Therefore, this research proposed a parallel approach to the adaptive image thresholding algorithms Otsu, ISODATA, and minimum cross-entropy on GPU to deal with high-resolution images. The experiment was conducted on selected fingerprint images taken from the FVC2004 (Fingerprint Verification Competition) dataset. In the experiments across various image resolutions, the GPU implementation's computational time shows a 4-6 times speedup over the CPU implementation, and the performance gain increases with larger image resolutions. This result shows that the parallel approach allows image segmentation to be processed in real-time, even with large image resolutions.
The contribution is the analysis showing that adaptive image thresholding algorithms can be optimized using a parallel approach to produce a significant speedup in computational time when dealing with high-resolution images. In future work, the proposed parallel approaches will be further optimized using multiple GPUs and applied to more complex cases, such as the segmentation of aerial or medical images.

Acknowledgment

This research is supported by LPPM Universitas Ahmad Dahlan research grant no. PF062/SP3/LPPM-UAD/VI/2018.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Prentice Hall, 2008.
[2] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst. Man Cybern., vol. 9, no. 1, pp. 62–66, Jan. 1979.
[3] G. H. Ball and D. J. Hall, ISODATA: A Method of Data Analysis and Pattern Classification. Stanford Research Institute, 1965.
[4] C. H. Li and C. K. Lee, "Minimum cross entropy thresholding," Pattern Recognit., vol. 26, no. 4, pp. 617–625, Apr. 1993.
[5] A. M. A. Talab, Z. Huang, F. Xi, and L. Haiming, "Detection crack in image using Otsu method and multiple filtering in image processing techniques," Optik, vol. 127, no.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol. 5, No. 2, December 2022, pp. 188–196. https://doi.org/10.17977/um018v5i22022p188-196
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Predicting Heart Disease Using Logistic Regression
Mochammad Anshori 1,*, M. Syauqi Haris 2
Department of Informatics, Institute of Health and Science Technology RS. dr. Soepraoen Malang, Jl. S. Supriyadi No. 22, Malang 65147, Indonesia
1 moanshori@itsk-soepraoen.ac.id*; 2 haris@itsk-soepraoen.ac.id; * corresponding author

I. Introduction
Heart disease (HD), or cardiovascular disease, is a major cause of death worldwide. Based on a World Health Organization (WHO) report, cardiovascular diseases account for an estimated 17.9 million deaths yearly, almost 32% of all deaths [1][2]. According to the WHO, these diseases include heart attack, stroke, and rheumatic heart disease. Everyone is at risk of heart disease, men more so than women. Unhealthy lifestyle factors such as smoking, high cholesterol, high blood pressure, obesity, alcohol consumption, and hereditary history are the most critical risk factors [3]. Not all cases of heart disease end in death: a controlled lifestyle, including healthy eating habits and physical activity, can reduce the risk. Symptoms that indicate heart disease include shortness of breath [4], physical fatigue [5], and pain in the chest, arms, shoulders, or back [6]. Heart disease is not easy to cure and requires special treatment, so the health of this vital organ must be carefully guarded. The simplest preventive measures are to reduce smoking, maintain a healthy diet, stay physically active, and stop consuming alcohol [7]. The variety of causes of heart disease increases the complexity of prediction.
With the development of medical data sourced from patients' health records, there is a great opportunity to use these data as raw material for improving patient health. Computers are now applied in many fields; in healthcare, they can improve decision-support systems in medicine [8]. In particular, machine learning can serve as an analytical tool to find hidden patterns in data [9]. This development supports a high degree of prediction accuracy for proper prevention.

Article Info
Article history: Received 11 September 2022; Revised 25 October 2022; Accepted 23 December 2022; Published online 30 December 2022.
Keywords: heart disease; cardiovascular disease; classification; machine learning; logistic regression.

Abstract
A common risk of death is heart disease. In the field of medicine, it is critical to be able to diagnose cardiac disease in order to adequately prevent and treat patients. An accurate prediction method has the potential both to extend the patient's life and to reduce the severity of their cardiac disease. Machine learning is one approach that can be used to generate predictions. In this study, patient medical record information was used with a logistic regression algorithm to make heart disease diagnoses. The logistic regression achieved a high level of accuracy in predicting heart disease. To obtain the model coefficients needed for the equation, the experiment uses an iterative form of the logistic regression test. Iteration 14 produced the best results, with an accuracy of 81.3495% and an average computation time of 0.020 seconds. The area under the ROC curve is 89.36%. The findings of this study have significant implications for the field of heart disease prediction and can contribute to improved patient care and outcomes. Accurate predictions obtained through logistic regression can guide healthcare professionals in identifying individuals at risk and implementing preventive measures or tailored treatment plans. The computational efficiency of the model further enhances its applicability in real-time decision support systems. This is an open-access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Prior studies have predicted and classified heart disease using machine learning techniques, exploring various features, methods, and their corresponding accuracies. Notable findings include k-nearest neighbors (KNN) with an accuracy of 74% [10], information gain combined with KNN achieving 99.65% accuracy [11], a decision tree (DT) method with 99.62% accuracy [12], and a GCSA-DCNN model with 95.34% accuracy [5]. Feature selection and classification combinations such as chi-squared with BayesNet achieved 85% accuracy [13], the FCMIM-support vector machine (SVM) method attained 92.37% [14], and PCA combined with random forest (RF) achieved 98.7% [8]. Logistic regression (LR) achieved accuracies of 92.58% [15] and 92.76% [9], while a machine learning framework using PSO with an SVM classifier achieved 84.36% [16]. Ensemble classification techniques, including naive Bayes (NB), Bayesian network (BN), random forest (RF), and multilayer perceptron (MLP), achieved an accuracy of 85.48% [11].
These prior studies contribute to the understanding and development of machine learning approaches for heart disease prediction and classification, and they show that machine learning techniques are useful for predicting heart disease. Implementing machine learning may also be more advantageous and cost-effective [17]. Various methods are used to predict heart disease with maximum accuracy, ranging from simple methods to hybrids of several methods aimed at increasing the accuracy of the classifier model. Methods that have been used include NB [18], BN [19], RF [20], MLP [21], SVM [22], KNN [23], LR [24], DT [25], and deep convolutional neural networks (DCNN) [26]. Preprocessing methods include principal component analysis (PCA), chi-squared, and information gain; optimization methods include particle swarm optimization (PSO) and ant colony optimization (ACO).

This research applies a machine learning algorithm, logistic regression, to predict heart disease risk based on risk factors from patient health records. The logistic regression used is plain logistic regression without any optimization. Previous studies using a similar dataset achieved an accuracy of 92.76% with 14 features [9] and 92.58% with 13 features [13], showing that logistic regression can provide high accuracy. The research conducted here differs from previous research in the dataset used: this study uses a dataset with 9 features. To find the best model, the logistic regression is compared with other function-based classifiers, namely SVM (support vector machine) and LDA (linear discriminant analysis). The aim of this study is to evaluate how a logistic regression model performs on this dataset.
The fundamental difference between this study and previous research lies in the dataset used: a new dataset covering symptoms of heart disease with fewer features than in previous research. The motivation behind this research stems from the pressing need to improve the accuracy of heart disease prediction models, given the significant impact of heart disease on global health. Accurate and reliable prediction models can aid healthcare professionals in identifying high-risk individuals and implementing timely preventive measures. By leveraging machine learning algorithms and exploring various features and methods, we aim to contribute to the development of more effective and efficient heart disease prediction models. The findings can potentially enhance medical decision-making, improve patient outcomes, and ultimately reduce the burden of heart disease on individuals and healthcare systems.

This research contributes to the existing body of knowledge on heart disease prediction by focusing on a specific dataset with a reduced number of features. While previous studies have achieved high accuracies using more comprehensive datasets, this research explores the potential of logistic regression with a limited feature set. By evaluating the performance of logistic regression and comparing it with other classifiers, such as SVM and LDA, we aim to provide insights into the effectiveness of logistic regression in predicting heart disease using a more compact dataset. The findings can shed light on the trade-offs between feature selection and predictive accuracy, offering valuable guidance for future research and the development of practical heart disease prediction models.

The remaining sections of this paper are organized as follows. Section II explains the methodology used, including data collection, data preparation, and the implementation of the logistic regression, SVM, and LDA classifiers. Section III presents the experimental results and performance evaluation metrics, comparing the accuracies of the different classifiers, together with a discussion of the findings and their implications. Finally, Section IV concludes the paper by summarizing the key findings and their significance for heart disease prediction, the limitations of the study, and potential areas for future research.

II. Method
This research follows a systematic methodology consisting of four stages, shown in Figure 1: dataset loading, dataset preparation, model creation using the selected method, and result evaluation.

Fig. 1. Research methodology

The initial stage involves preparing the dataset for analysis. The dataset used in this research was obtained from the Mendeley data repository [1]. It contains information on observable characteristics and risk factors associated with heart attacks, collected from patients' electronic health records. In total, the dataset comprises 1319 instances, each representing one patient. The distribution of positive and negative labels is shown in Figure 2.

Fig. 2. Target class demographics

Based on Figure 2, 61% of the instances are labeled positive and the remaining 39% negative, so the positive class outnumbers the negative class. The dataset has 9 features, detailed in Table 1. Note that all feature data types are numeric.
This indicates that nominal data has already been converted to numeric, making it easier for the model to perform calculations; it also spares researchers the conversion step during data processing.

Table 1. Dataset details

No  Feature        Data type  Range        Description
1   age            numeric    14–103       age of patient
2   gender         numeric    0, 1         1 = male, 0 = female
3   impulse        numeric    20–1111      heart rate
4   pressure high  numeric    42–223       systolic blood pressure
5   pressure low   numeric    38–154       diastolic blood pressure
6   glucose        numeric    35–541       blood sugar
7   kcm            numeric    0.321–300    CK-MB
8   troponin       numeric    0.001–10.3   troponin test
9   class          nominal    0, 1         positive/negative

The second stage splits the data into training and test sets used to build the classifier model. The splitting scheme is the k-fold cross-validation method, applied because the resulting model is more general and overfitting can be avoided [27]. Cross-validation works based on the parameter k, which determines how many segments the data is divided into for testing and training. The procedure is illustrated in Figure 3.

Fig. 3. Illustration of cross-validation with k-fold = 10

Figure 3 shows cross-validation with k = 10: the gray cells are the test data for each fold, iterating over all k folds. This study uses 10-fold cross-validation, meaning the data is divided into 10 subsets; each subset is used as the test set once, while the remaining nine subsets are combined to form the training set. This iterative process ensures that the model is evaluated on different combinations of training and test data, providing a more robust assessment of its predictive capabilities.
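The 10-fold split described above can be sketched in plain Python. This is a minimal illustration only; the paper performs cross-validation inside Weka, and the helper name below is ours:

```python
# Sketch of k-fold cross-validation: shuffle indices once, then rotate
# which fold serves as the test set. Names here are illustrative.
import random

def k_fold_indices(n_instances, k=10, seed=42):
    """Yield (train, test) index lists for each of the k folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin keeps fold sizes balanced
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With the paper's 1319 instances and k = 10, every instance appears in
# exactly one test fold; each training set holds the other nine folds.
splits = list(k_fold_indices(1319, k=10))
```

Each model is then trained on `train` and evaluated on `test`, and the k per-fold scores are averaged.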
By utilizing k-fold cross-validation, this research aims to build a classifier model that generalizes well to unseen data; the approach helps assess the model's performance and its ability to accurately predict heart disease in new cases.

The third stage is creating the LR classification model. LR is a mathematical model that estimates a probability for each class [28]. It is a supervised learning method, used here for binary classification, although LR is also reliable for multi-label classification. Its advantages are that it requires little parameter optimization and is easy to implement [29]. The LR model starts from the linear regression form in (1); the primary distinction lies in the function used. In LR, the sigmoid function (2) is applied to the linear combination, yielding (3) and, written out, (4). The logit form of (4) is known as the log-odds function, where the odds are the ratio of the probability of success to the probability of failure. The LR coefficients are estimated using the iteratively reweighted least squares (IRLS) method [30]; in each iteration, the working response is adjusted to obtain the optimal LR coefficients.

\hat{y} = E(y|x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \varepsilon   (1)

\sigma(z) = \frac{1}{1 + e^{-z}}   (2)

E(y|x) = \sigma(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)   (3)

E(y|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}   (4)

Here \hat{y} is the predicted value of the dependent variable y given the independent variables x_1, x_2, \ldots, x_n; the coefficients \beta_0, \beta_1, \ldots, \beta_n are estimated parameters that determine the relationship between the independent variables and the dependent variable; and \varepsilon is the error term (residual).
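Equations (1)–(4) amount to the following sketch: a linear combination passed through the sigmoid gives the probability of the positive class. The coefficient values below are invented for illustration; the paper's actual coefficients come from IRLS fitting in Weka.

```python
# Sketch of logistic regression prediction, equations (1)-(4).
import math

def sigmoid(z):
    """Equation (2): squash any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(beta0, betas, x):
    """Equations (3)-(4): sigmoid of the linear combination in (1)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Invented coefficients and feature values, purely for demonstration.
p = predict_proba(beta0=-1.0, betas=[0.03, 0.8], x=[60, 1.2])
label = 1 if p >= 0.5 else 0  # threshold the probability into a class
```

Thresholding the probability at 0.5 turns the regression output into the positive/negative class label used in the dataset.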
In (2), z represents the linear combination of the coefficients and independent variables.

A comparison is needed to identify the best method; the models used for comparison are SVM and LDA. SVM works by separating the data classes with a hyperplane; its dual objective function is shown in (5).

L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j   (5)

Here L_D is the SVM dual objective, \alpha_i and \alpha_j are the weights assigned to the data points, y_i and y_j are the class labels, and x_i and x_j are the feature vectors. The objective of SVM is to find the optimal weights that maximize the margin between the classes.

LDA, on the other hand, projects all data vectors linearly, maximizing the distance between classes while minimizing the scatter within each class. The LDA discriminant is shown in (8), built from the per-class covariance in (6) and the pooled covariance in (7).

c_i = \frac{(x_i^0)^T x_i^0}{n_i}   (6)

C(r, s) = \frac{1}{n} \sum_{i=1}^{g} n_i \, c_i(r, s)   (7)

f_i = \mu_i C^{-1} x_k^T - \frac{1}{2} \mu_i C^{-1} \mu_i^T + \ln(p_i)   (8)

Here c_i is the covariance for class i, n_i the number of instances in class i, x_i^0 the centered data for class i, g the total number of classes, \mu_i the mean vector of class i, C^{-1} the inverse of the pooled covariance matrix, x_k^T the transpose of the data vector being classified, and p_i the prior probability of class i.

In the fourth stage, the researcher uses accuracy as the benchmark for comparing results; its formula is shown in (9). The TPR (true positive rate) and FPR (false positive rate), shown in (10) and (11), are used to obtain the ROC curve [31], which characterizes the errors of the built classification model. TP is a positive instance predicted positive, TN a negative instance predicted negative, FP a negative instance wrongly predicted positive, and FN a positive instance wrongly predicted negative.
accuracy = \frac{TP + TN}{TP + TN + FP + FN}   (9)

TPR = \frac{TP}{TP + FN} \times 100\%   (10)

FPR = \frac{FP}{FP + TN} \times 100\%   (11)

III. Results and Discussion
The results of this study come from observing the performance of logistic regression, applied using the Weka toolkit [32]. No data preprocessing was performed because the data obtained is considered clean. An IRLS iteration test was carried out to obtain the logistic regression coefficients, with parameter values from 2 to 30 in steps of 2. The iteration test results are shown in Figure 4.

Fig. 4. Iteration parameter testing

Figure 4 plots the change in accuracy for each tested iteration count. When the iteration count is low, the accuracy is also low, and in general accuracy rises with the iteration count. At iteration = 10 there is a dip in accuracy relative to iteration = 8, indicating a local optimum where accuracy rises and then falls again. At iteration = 14 the accuracy reaches its highest value, 81.35%; beyond this point, increasing the iteration count decreases accuracy or leaves it essentially unchanged. Based on these findings, the logistic regression model achieved its best accuracy at iteration = 14. This information is crucial for selecting the optimal logistic regression coefficients and maximizing the predictive power of the model.

With the logistic regression accuracy obtained, the model was then compared against the others; the comparison is shown in Table 2, which reports accuracy, TPR, FPR, and computation time. Times are in seconds, averaged over ten trials.
Table 2 shows that logistic regression achieves an accuracy of 81.35%, SVM with a linear kernel 78.17%, and LDA 69.75%, so logistic regression gives the highest accuracy. The TPR rises in line with accuracy, while the FPR behaves inversely: it shrinks as the TPR grows. For computation time, LDA is the worst at 0.17 seconds, SVM takes about 0.06 seconds, and the best time belongs to logistic regression, which needs only 0.02 seconds for classification.

Table 2. Classification results based on the LR model and its comparison

Evaluation   Log. regression   SVM (linear)   LDA
Accuracy     81.3495           78.1653        69.7498
TPR          81.3              78.2           69.7
FPR          18.7              23.4           39.5
Time (s)     0.02              0.06           0.17

Table 2 collects several performance evaluations of the logistic regression obtained from the confusion matrix. Based on these results, logistic regression can be used to predict heart disease with high accuracy. The TPR (sensitivity) measures instances correctly identified, while the FPR measures instances incorrectly identified [33]. Computation time is also included: averaged over 10 test runs, it is 0.02 seconds, so the logistic regression prediction model has a relatively fast computation time.

Since logistic regression is the best model in this case, consider its confusion matrix. Using iteration = 14, the evaluation of the logistic regression implementation is shown in Table 3.

Table 3. Confusion matrix

                  Predicted positive   Predicted negative
Actual positive   660                  150
Actual negative   96                   413
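As a quick sanity check, the metrics of equations (9)–(11) can be recomputed from the Table 3 counts (a sketch; the variable names are ours):

```python
# Recompute the evaluation metrics from the Table 3 confusion matrix.
TP, FN = 660, 150   # actual positives: predicted positive / negative
FP, TN = 96, 413    # actual negatives: predicted positive / negative

total = TP + TN + FP + FN          # 1319 instances, matching the dataset
accuracy = (TP + TN) / total       # equation (9)
tpr = TP / (TP + FN)               # equation (10), sensitivity
fpr = FP / (FP + TN)               # equation (11)
```

The accuracy reproduces the reported 81.3495% exactly. The per-class TPR (~81.5%) and FPR (~18.9%) differ slightly from Table 2, which appears to report class-weighted averages.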
The confusion (error) matrix is used to visualize the performance of the logistic regression algorithm, relating actual values to predicted values. Table 3 gives TP = 660, TN = 413, FP = 96, and FN = 150. It shows that the classifier cannot predict all the data accurately: some misclassifications remain.

Next, consider Figure 5, which shows the ROC curve generated for the logistic regression model.

Fig. 5. ROC curve

The ROC curve in Figure 5 plots TPR against FPR, visualizing the classifier's performance in making predictions [33]. The area under the ROC curve is 0.8936 (89.36%). This value is good because it is close to 1, the best possible value; a good curve has a value between 0.5 and 1, so the curve produced by logistic regression is close to its best value. This demonstrates that the classifier's performance is suitable for predicting heart disease.

Accurately predicting heart disease risk is crucial for developing effective decision-support systems in healthcare. The findings of this research contribute to the development of such systems by providing insights into the performance and feasibility of logistic regression as a predictive model. Integrating logistic-regression-based algorithms into decision-support systems can assist healthcare professionals in identifying individuals at high risk of heart disease and making informed decisions about prevention and treatment strategies. These findings highlight the effectiveness of logistic regression as a predictive model for heart disease: despite some misclassifications, the model exhibited high accuracy, relatively fast computation time, and a good ROC curve.
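For illustration, the area under an ROC curve is the integral of TPR over FPR, often approximated with the trapezoidal rule. The operating points below are invented for the sketch and are not the paper's actual curve:

```python
# Toy AUC computation: integrate TPR over FPR with the trapezoidal rule.
def auc(points):
    """points: (fpr, tpr) pairs sorted by fpr, running from (0,0) to (1,1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between neighbours
    return area

roc_points = [(0.0, 0.0), (0.19, 0.81), (1.0, 1.0)]  # invented operating points
area = auc(roc_points)
```

An area near 1 indicates a strong classifier, while 0.5 corresponds to the diagonal of random guessing, which is why the paper's 0.8936 counts as a good result.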
These results contribute to understanding logistic regression's potential in heart disease prediction and can inform the development of more accurate and efficient prediction models.

IV. Conclusion
Referring to the results and discussion, the machine learning method logistic regression can predict heart disease based on a patient's electronic medical record. The dataset used in this study has 9 features and 1319 instances. The iteration parameter test shows that the iteration value affects the accuracy of the classifier model: increasing it raises accuracy until the optimal point is found. The best iteration, producing the highest accuracy of 81.3495%, is iteration = 14. Compared with the SVM and LDA models, logistic regression proved more reliable for prediction, with relatively high accuracy and relatively fast computation time. Feature selection could be applied in further research to obtain a better model.

Declarations
Author contribution: All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest: The authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information: Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
publisher’s note: department of electrical engineering universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.
knowledge engineering and data science (keds) pissn 2597-4602 vol 1, no 2, september 2018, pp.
55–63 eissn 2597-4637 https://doi.org/10.17977/um018v1i22018p55-63
©2018 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)
profiling and identifying individual users by their command line usage and writing style
darusalam a, 1, *, helen ashman b, 2
a department of technology, policy and management, delft university of technology, building 31 jaffalaan 5, 2628 bx delft, netherlands
b school of information technology and mathematical sciences, university of south australia, mawson lakes campus (d3-13), adelaide, south australia 5095, australia
1 d.darusalam@tudelft.nl*; 2 helen.ashman@unisa.edu.au
* corresponding author
i. introduction
profiling is a way of grouping things or individuals into categories based on characteristics such as situation, appearance, or traits [1]. the term profiling here means gathering information about a user's activities, and it is possible to perform anomaly detection over a user profile to allow user identification. much research in computer science uses social networks for user profiling. social networking is one of the applications that engages users to be more active and permits them to create and maintain their own web pages [2]. according to vosecky et al. [3], different social networks have different ways of displaying and storing user profile information on a user's web profile. the social network has thus become one of the applications used to identify user profiles. other work applies social networking technologies to identify user behavior [2]. pannell and ashman [4] evaluate an intrusion detection system (ids) that analyzes a user's activities and is proposed to help an administrator identify and quickly respond to intrusions. other work discusses the use of behavioral biometrics for intrusion detection applications [5].
another interesting study, by pepyne et al. [6], analyses user profiling for computer security for particular users such as insurance adjusters and bank tellers. another work investigates tag-based user profiling for social media recommendation [7]. a further work outlines the purpose of a user profile as gathering related information based on user interests [8]. an ontology-based semantic similarity method has been used to extend and sustain a user profile based on the web access behaviour of the user
article info
article history: received 31 may 2018, revised 10 july 2018, accepted 10 july 2018, published online 31 august 2018
a b s t r a c t
profiling and identifying individual users is an approach for intrusion detection in a computer system. user profiles are important in many applications since they record highly user-specific information. profiles are basically built to record information about users, or for users to share experiences with each other. this research extends previous research on re-authenticating users with their user profiles. this research focuses on the potential to add psychometric user characteristics into the user model so as to be able to detect unauthorized users who may be masquerading as a genuine user. there are five participants involved in the investigation of formal language user identification. additionally, we analyze the natural language of two famous writers, jane austen & william shakespeare, in their written works to determine if the same principles can be applied to natural language use. this research used the n-gram analysis method for characterizing a user's style, and can potentially provide accurate user identification. as a result, n-gram analysis of a user's typed inputs offers another method for intrusion detection, as it may be able to both positively and negatively identify users.
the contribution of this research is to assess the use of a user's writing styles in both formal language and natural language as a user profile characteristic that could enable intrusion detection where intruders masquerade as real users. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).
keywords: profiling user identifying intrusions detection identification n-gram
56 darusalam and h. ashman / knowledge engineering and data science 2018, 1 (2): 55–63
in the music domain. however, no supporting data are given to justify whether the method is effective, and no future work is discussed to develop the research. one related study identifies individual users by process profiling [9]. another study [1] examined the unix operating system to identify the user based on the login host, the login time, the command set and the command set execution time of the profiled user. a further piece of related research concerns identifying users in social networks with respect to trust and privacy [10]. another interesting study focuses on the connection between network topology and the semantic similarity of user keywords [11]; categories of keywords and a notion of distance across multiple category trees were used in a forest model. work by takeda et al. [12] outlines characteristic expressions in literary works: their problem takes the literary works of one writer as positive examples and the works of another writer as negative examples, especially for japanese waka poems and prose texts, and their method creates a ranked list of substrings according to how well each characterizes the writer. there is also research on the misuse of social networks for automated user profiling [13]: it analyzes users' exposure when they register with a social network such as facebook using their email addresses. other research investigates how to identify a user based on similar profiles [3].
they used social networks such as msn and facebook to collect user profiles. the collected profiles are used to build tools, especially for profile comparison, to decide whether similar profiles belong to the same person or not; a vector-based comparison algorithm is used to compare the user profiles. this research will focus on evaluating the potential of two psychometric user characteristics, namely writing style in both natural language (jane austen & william shakespeare) and formal language (command line histories). to evaluate this characteristic, this research will use the n-gram analysis method and will aim to identify users in two ways: positive user identification and negative user identification. the work will not implement the use of these user characteristics in an intrusion detection system; however, it establishes whether they could be used in such a system. profiling and identification can help recognize intrusions. according to [14], user profiling is already a necessary part of the personalization of information delivery, and it has been proposed as an approach for identifying attacks on a computer system by profiling program and user behaviors [15]. anomaly detection over a user profile can detect when an intruder is masquerading as a genuine user. this research extends previous research that implemented an intrusion detection system based on biometric characteristics such as keystroke analysis and mouse use, and psychometric characteristics such as user prose style and favorite web pages [14]. however, this research will focus on one potential psychometric user characteristic, and will consider whether users' writing styles in two different scenarios can be assessed with n-gram analysis in order to identify the user. users' writings may take the form of text in novels, books, blogs, tweets and emails, and this is a form of natural language.
on the other hand, users' writings also occur in the way they interact with computers, issuing commands through a command line interface, and this is a form of formal language. this research will perform exactly the same analysis on data of both types, and will determine firstly whether either form can be used successfully for user identification; if so, the research will then determine which of the two is the more effective. this research will analyze the two different forms of data in two ways: firstly, to check whether it can detect when the current user does not match the user profile and is hence an intruder – this is negative identification. the second way is to detect whether the current user can unquestionably be verified as the true user – this is positive identification. most intrusion detection systems assume that the user is genuine until anomalies or broken rules show otherwise; that is, they only make use of negative identification. however, it might be useful to constrain a user's activities until they positively identify themselves, perhaps not allowing the user to make significant changes until their current login session has been positively identified. in this research, the default position will be that the system has no evidence about the user's identity other than the fact that the user managed to log in. analyzing their activity after logging in should either give positive information that correlates strongly with the user's profile and confirms their identity, or it should mismatch the profile, and the user would then be rejected from the system.
ii. methods
this research aims to identify a user in both natural language (jane austen's and william shakespeare's writing styles) and formal language (command line histories). the implementation section explains how the application produces the n-gram frequencies.
this software application was written in the java programming language. there are two classes in this software, "n-gram.java" and "data.java". the program is run with the command "java n-gram [n]", where n is the n-gram size. the software produces the n-gram frequencies, which are placed in a comma-separated-value "csv" folder and can be loaded into microsoft excel. the 'csv' folder contains the history data, which is txt-formatted and can be read by microsoft excel or an equivalent tool, ready for n-gram analysis. we will use this software to count the n-grams of the history data drawn from each user's writing. we use the software to perform four types of n-gram analysis, namely 3-gram, 5-gram, 11-gram and 15-gram.
a. n-gram analysis
an n-gram is a contiguous sequence of n letters, words or phonemes. an n-gram of size 1 is called a unigram, size 2 a bigram, size 3 a trigram, size 4 a four-gram, and the general case an n-gram. an n-gram analysis counts the frequency of the n-grams in a given file. for example, at the character level the binary string 1110010000101010010000 has the trigrams 111, 110, 100, 001, 010, 100, 000, 000, 001, 010, 101, 010, …, 000. the sentence "in this work we aim to get the certain knowledge" has the word-level 3-grams "in this work", "this work we", "work we aim", "we aim to", "aim to get", "to get the", "get the certain" and "the certain knowledge". the phrase "in this work" has the character-level 3-grams "in ", "n t", " th", "thi", "his", "is ", "s w", " wo", "wor" and "ork" (spaces included). this project uses varying sizes of n-gram: 3-gram, 5-gram, 11-gram and 15-gram. firstly, we will evaluate the use of n-gram analysis of user-generated formal language, such as command line histories, to profile users' command usage.
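the counting the tool performs can be sketched briefly. the original software was java; this is a minimal python sketch, and the function names are ours rather than the paper's:

```python
from collections import Counter

def char_ngrams(text, n):
    # character-level n-grams: every contiguous run of n characters
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    # word-level n-grams: every contiguous run of n whitespace-separated words
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_frequencies(text, n):
    # frequency table of character-level n-grams, as the described tool outputs
    return Counter(char_ngrams(text, n))
```

for the phrase "in this work", char_ngrams(text, 3) reproduces the ten character-level trigrams listed above, spaces included.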
secondly, we will evaluate the use of n-gram analysis of natural language to profile users and to assess whether it allows accurate user identification. after that, we will compare the writing styles of the users and see how different, and how significantly different, their patterns are in both natural language and formal language. next, we will visualize their n-gram patterns graphically to view their frequency patterns.
b. t-test
the t-test is a method that can be performed to decide whether two data sets (samples) are similar or dissimilar, and hence whether they could have come from the same population. it assesses whether the means of two groups are statistically different from each other. we will use t-tests to assess both natural language and formal language samples, between two samples from the same user (for positive identification purposes) and between two samples from different users (for negative identification purposes). we next consider which form of t-test is appropriate to this research.
• one-sample t-test: the one-sample t-test is used to decide whether a specific sample comes from a specific population, for example when we want to know whether a specific sample of university students is similar to or different from university students in general. in the current research we are comparing series of words or commands, and while it may later be feasible to identify a user from a single n-gram value, at this early stage it is more appropriate to decide whether individuals can be identified from larger quantities of their writings.
• independent t-test: the independent t-test, or two-sample t-test, is used to determine whether the means of two unrelated groups are statistically similar or different.
for example, we may want to know whether female and male university students are different or similar on some psychological characteristic. in this research the samples may not be unrelated, especially when comparing two samples from the same user.
• dependent t-test: the dependent t-test is also called the paired-group t-test, correlated-group t-test, matched-groups t-test or dependent-group t-test. this t-test is used to compare two related samples (matched or related in the same way) that are both measured once, or the same sample measured on two separate occasions. for example, we may want to know whether a particular drug for insomnia leaves a patient similar to or different from before: we compare measurements of the patient before and after consuming the drug. this is highly suitable to this research, as we need to positively identify a user by comparing a current sample of the user's writing to an older sample of their writing.
from the explanation above we conclude that the dependent t-test, or paired-group t-test, is the most suitable method for our investigation. we use the t-test with the following competing hypotheses:
• the test hypothesis is that the two samples come from populations with different means.
• the null hypothesis is that the two samples come from populations with the same mean.
the t-test outputs a probability value p, which is compared to the chosen level of significance α to decide the test result. a common default is α = 0.05:
• if the probability value is less than or equal to the level of significance, we reject the null hypothesis and conclude that the two samples of writing are different;
• if the probability value is greater than the level of significance, we fail to reject the null hypothesis and conclude that the two samples of writing style are indistinguishable.
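the paired comparison can be sketched as follows. this is a minimal python sketch, not the paper's code; for simplicity it compares |t| against a two-tailed critical value taken from a t-table rather than computing p directly (for n = 5 pairs, df = 4 and α = 0.05 the critical value is 2.776), which is equivalent to comparing p against α:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(sample_a, sample_b):
    # t statistic of the dependent (paired) t-test on two equal-length samples:
    # t = mean(d) / (sd(d) / sqrt(n)), with d the pairwise differences
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def same_writer(sample_a, sample_b, t_critical):
    # fail to reject the null hypothesis (samples indistinguishable)
    # when |t| does not exceed the two-tailed critical value
    return abs(paired_t_statistic(sample_a, sample_b)) <= t_critical
```

two nearly identical samples give a small |t| and are accepted as the same writer; samples shifted by a large constant give a large |t| and are rejected.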
• normalization of samples: before performing any t-test, we will examine the distribution of the data collected for each gram to see whether it is normal or non-normal. if the data are non-normal, we will transform them towards normality. this is because the samples we are analyzing are of radically different sizes. while it would be possible to choose subsamples from each sample so that each subsample is an identical size, we elected to normalize each whole sample instead: command line users, in particular, may have different tasks at different times, and a subsample may not accurately reflect the user's command line habits. by normalizing the samples we most accurately preserve each user's writing style, while at the same time casting the samples into the same numerical range so that different sample sizes do not confound the results. fig. 1 shows the three types of normalization we use: percentage, max-min and z-score. firstly, percentage normalization divides each n-gram count by the total of all n-gram counts and multiplies by one hundred. secondly, max-min normalization divides each n-gram count, less the minimum count, by the difference between the maximum and minimum counts. lastly, z-score normalization subtracts the mean of all n-gram counts from each count and divides by their standard deviation. we will assess all three normalization methods in this research to determine which is most appropriate for the task of identifying users.
iii. results and discussions
a. natural language
fig. 2 shows how we compare both authors' writing styles. firstly, we consider one author at a time: we compare each of jane austen's writings to each other, using 3-gram, 5-gram, 11-gram and 15-gram analyses.
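the three normalization rules described in the methods section can be sketched as follows. this is a minimal python sketch under the usual definitions; since the paper gives no formulas, the max-min and z-score forms here are the standard ones and are an assumption on our part:

```python
from statistics import mean, pstdev

def percentage_normalize(counts):
    # each n-gram count as a percentage of the total of all counts
    total = sum(counts)
    return [100 * c / total for c in counts]

def max_min_normalize(counts):
    # scale counts into [0, 1]: (count - min) / (max - min)
    lo, hi = min(counts), max(counts)
    return [(c - lo) / (hi - lo) for c in counts]

def z_score_normalize(counts):
    # centre each count on the mean and divide by the standard deviation
    mu, sigma = mean(counts), pstdev(counts)
    return [(c - mu) / sigma for c in counts]
```

all three cast samples of different sizes into a common numerical range, which is the point made above about not letting sample size confound the t-tests.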
we then use the t-test to measure their similarity; if the t-test for both pairs in each comparison shows they are from the same author, we have successfully performed a positive identification. secondly, we will do the same procedure for william shakespeare's writings. we will then compare the writing styles of each of jane austen's works with each of shakespeare's, and if the t-tests indicate they are different, then we will have successfully performed a negative identification.
b. formal language
there were five users involved in this experiment (fig. 3). one example of formal language is a command line history, where the user interacts with the computer through a command line. by n-gram analysis we will identify those users' 'writing style', namely their command line usage habits. fig. 3 shows how we compare each of our formal language samples. we will follow the same procedure for formal language as for natural language: we will analyze each sample, and compare samples by the same user to see if we can achieve positive identification. we will then compare the samples from different users to see if we can achieve negative identification.
fig. 1. normalization process
fig. 2. method for comparison of natural language samples (jane austen: emma, sense and sensibility, pride and prejudice, mansfield park; william shakespeare: romeo and juliet, hamlet, julius caesar, sonnets)
c. summary of formal language
1) positive identification
table 1 shows the positive identification summary for formal language, for all six possible pairings of samples and for the four different n-gram lengths, calculated for each of the three normalization methods. for all four n-gram lengths, we find that the percentage and z-score normalization methods correctly identify that the user is the same in each case.
however, the max-min normalization method fails to identify that the user is the same in all but one case for each n-gram length. these results suggest that positive identification can be reliably achieved using the n-gram analysis method for formal language, using either the percentage or z-score normalization method. they also indicate that the max-min normalization method is not useful for positive identification in formal language samples.
fig. 3. method for comparison of formal language samples (user1-history1, user1-history2, user1-history3 and user1-history4, plus user2-history1, user3-history1, user4-history1 and user5-history1)
table 1. positive identification summary of formal language
n-gram | normalization type | correct identification | incorrect identification | rate percentage
3-gram | percentage | 6/6 | 0/6 | 100 %
3-gram | max-min | 1/6 | 5/6 | 16.6 %
3-gram | z score | 6/6 | 0/6 | 100 %
5-gram | percentage | 6/6 | 0/6 | 100 %
5-gram | max-min | 1/6 | 5/6 | 16.6 %
5-gram | z score | 6/6 | 0/6 | 100 %
11-gram | percentage | 6/6 | 0/6 | 100 %
11-gram | max-min | 1/6 | 5/6 | 16.6 %
11-gram | z score | 6/6 | 0/6 | 100 %
15-gram | percentage | 6/6 | 0/6 | 100 %
15-gram | max-min | 1/6 | 5/6 | 16.6 %
15-gram | z score | 6/6 | 0/6 | 100 %
2) negative identification
table 2 is the negative identification summary for formal language, for all possible pairings of samples and for the four different n-gram lengths, calculated for each of the three normalization methods. the results are less clear than those observed in the positive identification table. the max-min normalization method is correct between 60.71 % and 92.86 % of the time, showing an improvement over its use in positive identification. the other two normalization methods were not as reliable as in the positive identification tests.
table 2. negative identification summary of formal language
n-gram | normalization type | correct identification | incorrect identification | rate percentage
3-gram | percentage | 23/28 | 5/28 | 82.14 %
3-gram | max-min | 20/28 | 8/28 | 71.43 %
3-gram | z score | 19/28 | 9/28 | 67.86 %
5-gram | percentage | 13/28 | 15/28 | 46.43 %
5-gram | max-min | 17/28 | 11/28 | 60.71 %
5-gram | z score | 14/28 | 14/28 | 50.00 %
11-gram | percentage | 16/28 | 12/28 | 57.14 %
11-gram | max-min | 26/28 | 2/28 | 92.86 %
11-gram | z score | 16/28 | 12/28 | 57.14 %
15-gram | percentage | 23/28 | 5/28 | 82.14 %
15-gram | max-min | 24/28 | 4/28 | 85.71 %
15-gram | z score | 24/28 | 4/28 | 85.71 %
d. summary of natural language
1) positive identification
table 3 is the positive identification summary for natural language, for all possible pairings of same-author samples and for the four different n-gram lengths, calculated for each of the three normalization methods. firstly, for 3-gram the percentage and z-score normalizations give a 100 % success rate. however, max-min's rate is only 11.11 % for the paired comparisons, meaning that max-min normalization fails for positive identification. secondly, for 5-gram, 11-gram and 15-gram the percentage and z-score normalizations are 100 % successful for positive identification. on the other hand, max-min gives a different result for each gram: 5-gram shows 11.11 %, the same as 3-gram, while 11-gram is 33.33 % and 15-gram is 50 %.
table 3. positive identification summary of natural language
n-gram | normalization type | correct identification | false identification | rate percentage
3-gram | percentage | 18/18 | 0/18 | 100 %
3-gram | max-min | 2/18 | 16/18 | 11.11 %
3-gram | z score | 18/18 | 0/18 | 100 %
5-gram | percentage | 18/18 | 0/18 | 100 %
5-gram | max-min | 2/18 | 16/18 | 11.11 %
5-gram | z score | 18/18 | 0/18 | 100 %
11-gram | percentage | 18/18 | 0/18 | 100 %
11-gram | max-min | 6/18 | 12/18 | 33.33 %
11-gram | z score | 18/18 | 0/18 | 100 %
15-gram | percentage | 18/18 | 0/18 | 100 %
15-gram | max-min | 9/18 | 9/18 | 50 %
15-gram | z score | 18/18 | 0/18 | 100 %
2) negative identification
table 4 is the negative identification summary for natural language, for all possible pairings of different-author samples and for the four different n-gram lengths, calculated for each of the three normalization methods. table 4 shows an unsatisfactory result for each gram: negative identification for natural language largely fails for user identification. however, max-min normalization, especially for 3-gram, 5-gram and 11-gram, succeeds 100 % of the time in achieving negative identification. nevertheless, we cannot trust max-min normalization, since its results are inconsistent across formal language and natural language.
iv. conclusion
in this research we investigated users' writing styles with the aim of identifying users both positively and negatively. we investigated formal language and natural language using the n-gram methodology: there were five participants for formal language and two famous writers for natural language. we compared the n-gram analyses from each participant and assessed how successful each comparison was by using a t-test for paired two samples for means. the results show that formal language can identify users in terms of both positive and negative identification. for natural language, however, the n-gram analysis is successful for positive identification but not for negative identification. thus, formal language is shown to be more generally accurate. further experiments are needed to continue the investigation, as follows.
Firstly, for the formal language, we can investigate by dividing the samples by period of time, for instance per month or per week, rather than comparing across different machines in different workplaces. Secondly, we should try other gram lengths, such as 1, 2, 4, 6, 7, 8, 9, 10, 12, and 13, since every gram length appears to show a different result, and another gram length could give a more accurate result for formal and natural language.

Table 4. Negative identification summary of natural language

N-gram   Normalization type   Correct identification   False identification   Rate
3-gram   percentage           0/16                     16/16                  0%
         max-min              16/16                    0/16                   100%
         z-score              0/16                     16/16                  0%
5-gram   percentage           0/16                     16/16                  0%
         max-min              16/16                    0/16                   100%
         z-score              0/16                     16/16                  0%
11-gram  percentage           0/16                     16/16                  0%
         max-min              16/16                    0/16                   100%
         z-score              0/16                     16/16                  0%
15-gram  percentage           0/16                     16/16                  0%
         max-min              2/16                     14/16                  12.5%
         z-score              0/16                     16/16                  0%

References

[1] V. N. P. Dau et al., "Profiling users in the UNIX OS environment," 2000.
[2] M. Maia et al., "Identifying user behavior in online social networks," in Proceedings of the 1st Workshop on Social Network Systems, Glasgow, Scotland, 2008.
[3] J. Vosecky et al., "User identification across multiple social networks," in First International Conference on Networked Digital Technologies (NDT '09), 2009, pp. 360-365.
[4] G. Pannell and H. Ashman, "User modelling for exclusion and anomaly detection: a behavioural intrusion detection system," Berlin, Heidelberg, 2010, pp. 207-218.
[5] A. A. E. Ahmed and I. Traore, "Detecting computer intrusions using behavioral biometrics," 2005.
[6] D. L. Pepyne et al., "User profiling for computer security," in Proceedings of the 2004 American Control Conference, 2004, pp. 982-987, vol. 2.
[7] C. C. Hung et al., "Tag-based user profiling for social media recommendation," 2008.
[8] M. Reformat and S. K. Golmohammadi, "Updating user profile using ontology-based semantic similarity," in IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2009), 2009, pp. 1062-1067.
[9] S. McKinney and D. S. Reeves, "User identification via process profiling: extended abstract," in Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research, Oak Ridge, Tennessee, 2009.
[10] C. Dwyer et al., "Trust and privacy concern within social networking sites: a comparison of Facebook and MySpace," 2007.
[11] P. Bhattacharyya et al., "Social network model based on keyword categorization," in International Conference on Advances in Social Network Analysis and Mining (ASONAM '09), 2009, pp. 170-175.
[12] M. Takeda et al., "Discovering characteristic expressions from literary works: a new text analysis method beyond n-gram statistics and KWIC," Berlin, Heidelberg, 2000, pp. 112-126.
[13] M. Balduzzi et al., "Abusing social networks for automated user profiling," in Recent Advances in Intrusion Detection, vol. 6307, S. Jha et al., Eds. Springer Berlin/Heidelberg, 2010, pp. 422-441.
[14] G. Pannell and H. Ashman, "User modelling for exclusion and anomaly detection: a behavioural intrusion detection system," in User Modeling, Adaptation, and Personalization, vol. 6075, P. De Bra et al., Eds. Springer Berlin/Heidelberg, 2010, pp. 207-218.
[15] W. Wei et al., "Profiling program and user behaviors for anomaly intrusion detection based on non-negative matrix factorization," in 43rd IEEE Conference on Decision and Control (CDC), 2004, pp. 99-104, vol. 1.

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 6, No 1, April 2023, pp.
15–23 eISSN 2597-4637 https://doi.org/10.17977/um018v6i12023p15-23
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Ant Colony Optimization for Resistor Color Code Detection

Slamet Wibawanto a,1,*, Kartika Candra Kirana a,2, Hani Ramadhan b,3
a Department of Electrical Engineering and Informatics, Universitas Negeri Malang, Malang 65145, Indonesia
b Data Department, Pusan National University, Busan 46241, South Korea
1 slamet.wibawanto.ft@um.ac.id*; 2 kartika.candra.ft@um.ac.id; 3 hani042@pusan.ac.kr
* corresponding author

I. Introduction

Resistors are components that are often found in electronic circuits. A resistor has a resistance value designed to regulate voltage and electric current [1]. Under the EIA (Electronic Industries Association) rules, the resistance value is shown by a set of color bands [2]. Twelve colors have different value representations depending on the color position [3]. The many possible combinations of color bands raise the need for technology that can automatically measure resistor values visually. Various automatic resistor measurement methods have been proposed in previous studies, with machine learning a popular choice. Gao et al. extracted characters from chip resistors using traditional segmentation and classified the segmentation results using artificial neural networks [4]. Wu proposed gravity features and classified stroke lines with a decision-tree recognition method to recognize characters on chip resistors [5]. Li et al. developed a color-band recognition method using the Retinex algorithm and a back-propagation neural network on resistor images acquired with a black-and-white industrial camera [6]. Muminovic and Sokic developed a resistor color-band classification using the support vector machine (SVM) [7].
Chen and Wang clustered the main body color and the extracted band colors using k-nearest neighbour (k-NN) [8]. In addition, color-based segmentation approaches and statistical analyses have also been proposed. Yan et al. refined traditional segmentation results on PCB resistors using local gray-level distributions [9]. Jadon et al. proposed a morphological operation using binarization and mean shift to cluster resistor values [10]. Li et al. proposed a PCB recycling system using information retrieval based on the color of the resistors, capacitors, and integrated circuits (ICs) [11]. Abdallah et al. implemented a weighting resistor matrix (WRM) to detect resistor lines [12]. Li et al. proposed calculating the symmetrical Kullback-Leibler distance to measure the difference in the class distribution of resistor rings [13]. These previously proposed methods compute all image matrix values, including the non-resistor area, which is larger than the resistor area; furthermore, most of them work iteratively, which increases computational complexity [14]. Heuristic algorithms can be applied to select which matrix values to compute.

Article history: Received 29 February 2023; Revised 31 March 2022; Accepted 06 April 2023; Published online 30 April 2023

Abstract: In the early stages of learning about resistors, an introduction to color-based values is needed. Moreover, some combinations require an analysis of the resistor strip to identify. Unfortunately, a resistor body color is easily taken as a local solution, which often confuses resistor coloration. Ant colony optimization (ACO) is a heuristic algorithm that models a problem as the travel of a group of ants. ACO is proposed to select the matrix values to be computed while preventing local solutions. In this study, each ant explores the matrix based on pheromones and heuristic information to generate local solutions. Global solutions are selected based on their high degree of similarity with other local solutions.
The first stage of testing focuses on exploring variations of parameter values. Applying the best parameters resulted in 85% accuracy and 43 seconds for 20 resistor images. This method is expected to prevent local solutions without wasteful computation of the matrix.

Keywords: resistor color code; best parameter; ant colony optimization

S. Wibawanto et al. / Knowledge Engineering and Data Science 2023, 6 (1): 15–23

Selecting the values to compute heuristically is thus expected to reduce computational complexity [15]. Comparisons of algorithms show that ant colony optimization performs better than other heuristic methods [16][17]. Furthermore, the ant system has proven superior in various segmentation cases, such as mitotic cells [18], words [19], and traveling salesman problems [20]. In this study, resistor value estimation using ant colony optimization is proposed. The ant colony optimization algorithm is a metaheuristic inspired by the foraging behavior of ants [21]. In this context, the ants represent individual agents that traverse the resistor rods, seeking to find the other resistor rod. As they move along the nodes of the resistor rings, the ants leave behind a pheromone trail, mimicking the pheromone deposition of real ants. The concentration of pheromone on each node serves as a measure of its attractiveness or desirability. The proposed method leverages the power of ant colony optimization to estimate the resistor values. Each ant selects its path through the resistor rings based on a combination of pheromone trails and heuristic information.
The pheromone trails guide the ants towards the nodes with higher pheromone concentrations, which are likely to correspond to the locations of the resistor rods. Meanwhile, the heuristic information provides additional guidance by incorporating domain-specific knowledge or constraints into the decision-making process [22]. By iteratively applying the ant colony optimization algorithm, the pheromone trails are updated dynamically, allowing the ants to progressively refine their paths. This iterative process encourages the exploration of different paths initially and gradually favors the exploitation of the most promising paths based on the accumulated pheromone levels [23]. As a result, the ants effectively converge towards the optimal paths that lead to accurate estimation of the resistor values. One key aspect considered in this study is the selection of the node with the highest pheromone concentration in each resistor ring for distance measurement. This node is expected to be closer to the corresponding resistor rod, providing valuable information for accurate estimation. By focusing on the nodes with higher pheromone levels, the proposed method intelligently prioritizes the exploration of regions likely to contain the resistor rods, improving the efficiency and effectiveness of the estimation process. The use of ant colony optimization in resistor value estimation offers several advantages. It is a flexible [24] and adaptive [25] approach that can handle different resistor network configurations and accommodate variations in resistor characteristics. The algorithm's ability to leverage collective intelligence and distributed decision-making enables robust estimation even in the presence of noise or uncertainties in the circuit [26]. Furthermore, the method can potentially overcome the limitations of traditional techniques by providing more accurate estimations and reducing the dependency on explicit mathematical models.
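The combination described above, pheromone raised to a power α and heuristic information raised to a power β normalized into a selection probability, is the classic ACO transition rule. A minimal sketch in Python; the node values and parameter settings are illustrative, not the authors' implementation:

```python
import random

def select_node(tau, mu, alpha, beta, rng):
    """Pick the next node by roulette-wheel selection, weighting each
    candidate i by tau[i]**alpha * mu[i]**beta (the ACO transition rule)."""
    weights = [(t ** alpha) * (m ** beta) for t, m in zip(tau, mu)]
    total = sum(weights)
    if total == 0:                       # no guidance at all: pick uniformly
        return rng.randrange(len(tau))
    r = rng.uniform(0.0, total)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= r:
            return i
    return len(weights) - 1

# With alpha = 0 (the paper's best setting, found later in the testing),
# the choice is driven purely by the heuristic information mu, which here
# stands in for how "ring-like" the pixel evidence at each node is.
rng = random.Random(1)
tau = [1.0, 1.0, 1.0]                    # equal pheromone on three nodes
mu = [0.1, 0.1, 9.8]                     # node 2 looks most ring-like
picks = [select_node(tau, mu, alpha=0, beta=1, rng=rng) for _ in range(200)]
```

Under these weights node 2 is selected about 98% of the time, while setting beta = 0 instead would make the three nodes equally likely.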
II. Methods

This study employed EIA images as the training data for the proposed method. The training dataset consisted of a diverse range of EIA images, collected from various sources and curated for this study. A total of 20 test images were randomly selected from Google to evaluate the performance of the method on unseen data. To provide a comprehensive understanding of the dataset, Table 1 summarizes the characteristics and properties of the training data [27], including the number of samples, their associated labels or annotations, and relevant metadata. The distribution of the training data is visualized in Figure 1(b), where each data point represents a specific EIA image along with its corresponding label. For the test data, Figure 1(a) showcases a subset of the randomly chosen test images. These images were carefully selected to cover a wide range of scenarios and variations in resistor configurations. The test data is crucial for assessing the generalization capabilities of the proposed method and its ability to accurately estimate resistor values in real-world settings. In order to ensure a consistent and reliable analysis, certain preprocessing steps were applied to both the training and test data. Firstly, all resistor positions were standardized to a vertical orientation, with an allowable angle deviation of less than 10 degrees.

Table 1. Training data

Color    1st band  2nd band  3rd band  Multiplier  Tolerance (%)
Black    0         0         0         1 ohm
Brown    1         1         1         10 ohm      ±1% (F)
Red      2         2         2         100 ohm     ±2% (G)
Orange   3         3         3         1 kohm
Yellow   4         4         4         10 kohm
Green    5         5         5         100 kohm    ±0.5% (D)
Blue     6         6         6         1 Mohm      ±0.25% (C)
Violet   7         7         7         10 Mohm     ±0.10% (B)
Grey     8         8         8         100 Mohm    ±0.05%
White    9         9         9         1 Gohm
Gold                                   0.1 ohm     ±5% (J)
Silver                                 0.01 ohm    ±10% (K)

Fig. 1. Images: (a) testing data, (b) training data, (c) uncounted resistor body
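The mapping in Table 1 can be written directly as a lookup, and the ring-position logic (digit rings, then a multiplier ring, then a tolerance ring) as a small decoder. This is a minimal sketch using the standard EIA values, not part of the authors' implementation:

```python
# Digit value for each band colour; the multiplier of a digit colour is
# 10**digit, and gold/silver add fractional multipliers (EIA colour code).
DIGITS = {"black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
          "green": 5, "blue": 6, "violet": 7, "grey": 8, "white": 9}
MULTIPLIER = dict({c: 10 ** d for c, d in DIGITS.items()},
                  gold=0.1, silver=0.01)
TOLERANCE = {"brown": 1.0, "red": 2.0, "green": 0.5, "blue": 0.25,
             "violet": 0.1, "grey": 0.05, "gold": 5.0, "silver": 10.0}

def resistor_value(bands):
    """Decode a 4- or 5-band resistor: leading digit bands, then the
    multiplier band, then the tolerance band. Returns (ohms, tolerance %)."""
    *digit_bands, multiplier, tolerance = bands
    value = 0
    for colour in digit_bands:
        value = value * 10 + DIGITS[colour]
    return value * MULTIPLIER[multiplier], TOLERANCE[tolerance]
```

For example, a yellow-violet-red-gold resistor decodes to 4.7 kohm with ±5% tolerance.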
This orientation normalization step helps reduce variations caused by the rotation or tilt of the resistors in the images. Additionally, the ends of the resistor leads were cropped to remove any irrelevant or distracting elements that could interfere with the accurate estimation of resistor values. This cropping process focuses the analysis solely on the essential region of interest, the relevant components of the resistors. Furthermore, during the preprocessing phase, a thorough detection and removal process was implemented to eliminate any uncounted resistor body templates from the training data. Figure 1(c) provides a visual representation of the identified templates that were excluded from the training dataset. By eliminating such templates, the method avoids potential biases or distortions in the estimation process, ensuring the accuracy and reliability of the results. Overall, the use of EIA images as training data, supplemented with the randomly selected test images, provides a robust and diverse dataset for evaluating the proposed method's performance. The careful preprocessing steps, including orientation normalization, cropping of the resistor ends, and removal of uncounted resistor body templates, contribute to the accuracy and reliability of the resistor value estimation. In this case, the ant colony algorithm chooses the best ring value representation based on pheromone. The pseudocode of the ant colony algorithm is shown in Pseudocode 1. The ants alternately explore the resistor color nodes in the dimensions of the matrix until they reach the farthest resistor bar or color code (ringmax). Based on resistor theory, the maximum number of rings is 5. Ants leave pheromones at the nodes they pass through, as shown in (1).
Pseudocode 1: ACOResistor()
  init pheromone (τij), population, ringmax
  Input: resistor matrix (μij)
  Output: colourBandBest
  while not at the rod do
    for k = 1 to population do
      for l = 1 to ringmax do
        colourBand_kl ← ConstructSolution()
        if Fitness() is reached then
          colourBandBest ← colourBand_kl
        end
      end
      update τij
    end
  end
  return colourBandBest

τij = (τij^α · μij^β) / Σ (τij^α · μij^β)   (1)

Equation (1) shows the pheromones. τij is influenced by the heuristic information; in this case, the heuristic information μij is the RGB matrix of the resistor image. α and β indicate how strongly the pheromone and the heuristic information, respectively, influence the movement of the ants. The solution based on pheromone and heuristic information is computed in the ConstructSolution function shown in Pseudocode 2.

Pseudocode 2: ConstructSolution()
  while not at the rod of the ring do
    if node selected then
      update τij ← Eq. (1)
      rangeColourBand_kl ← μij
    end
  end
  if τij ≠ 0 then
    meanColourBand_kl ← mean(rangeColourBand_kl)
    colourBand_kl ← minDistance(meanColourBand_kl, μtrain)
  end
  if l is the last ring then
    set tolerance
  elseif l is the last−1 ring then
    set 10^colourBand_kl
  else
    set colourBand_kl
  end

Based on Pseudocode 2, the pheromone marks the areas passed and not passed by the ants, whereas the heuristic information gathered along the way forms a temporary solution. The solutions formed in each ring are computed with the mean function, and the distance between training and testing data is calculated to get the ring value. In addition, the ring's position is used to get the precise resistor value: the ring shows the tolerance value if it is in the last position; if the ring is in the last−1 position, it shows 10 to the power of the ring's value; otherwise, the ring's value is the digit itself, in tens or hundreds according to position. Thus, an ant creates a solution set of combined color ring values. The best solution is selected based on the fitness function shown in Pseudocode 3.
Pseudocode 3: Fitness()
  for each colourBand_kl do
    calculate similarity of colourBand_kl
    if similarity of colourBand_kl is maximum then
      colourBandBest ← colourBand_kl
    end
  end

The pheromone update function is applied each time the ant changes position, to prevent local solutions. The local pheromone update is shown in (2).

τij = (1 − ρ)τij + ρ0τ0   (2)

where ρ and ρ0 are parameters set to prevent an ant from passing through the same node as the previous ant. We tested parameters to form the best ant colony architecture. The trial variations of initialization are shown in Table 2.

Table 2. The trial variations of initialization

Variable     The set of members
α            {0, 0.5, 0.75, 1}
β            {1, 0.5, 0.25, 0}
population   {1, 2, 3}
ρ, ρ0        {1,0}, {0,1}, {1,1}, {0,0}

After getting the best parameter values, we tested the accuracy percentage as in (3).

ACC = C / A × 100%   (3)

where the accuracy (ACC) is the correct total C against all of the data A.

III. Result and Discussion

Ant colony optimization is proposed to select the matrix values to be computed while preventing local solutions. In this study, each ant explores the matrix based on pheromones and heuristic information to generate local solutions. Global solutions are selected based on their high degree of similarity with other local solutions. In order to achieve a global solution, many parameters need to be initialized; for this reason, the first testing stage focuses on exploring variations of parameter values. In the first trial we evaluated the use of α and β over the variations shown in Table 2. The results show that the heuristic information plays a major role in the classification results. Based on equation (1), pheromones are influenced by heuristic information, so if the heuristic information is omitted, the pheromone loses the information needed to decide whether a point belongs to a ring or not. Based on Table 3, the selected values of α and β are 0 and 1, respectively.
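The behaviour of the local update in equation (2) under the four (ρ, ρ0) settings of Table 2 can be checked numerically. A minimal sketch; the starting pheromone value 0.5 and τ0 = 1.0 are illustrative, not values from the paper:

```python
def local_update(tau: float, rho: float, rho0: float, tau0: float = 1.0) -> float:
    """Local pheromone update of equation (2): tau <- (1 - rho)*tau + rho0*tau0."""
    return (1.0 - rho) * tau + rho0 * tau0

# Apply the update three times (three ants passing the same node)
# for each (rho, rho0) pair tried in Table 2, starting from tau = 0.5.
results = {}
for rho, rho0 in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    tau = 0.5
    for _ in range(3):
        tau = local_update(tau, rho, rho0)
    results[(rho, rho0)] = tau
```

With ρ = 1 and ρ0 = 0 the pheromone drops to zero, so each ant is unaffected by the previous ant's path; with ρ = 0 and ρ0 = 1 the pheromone keeps growing, pulling later ants back into the same area.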
Table 3. Testing of α and β

Variable              Accuracy (%)
α = 0, β = 1          85
α = 0.5, β = 0.5      65
α = 0.75, β = 0.25    55
α = 1, β = 0          30

Table 4. Testing of ρ and ρ0

Variable           Accuracy (%)
ρ = 0, ρ0 = 0      60
ρ = 1, ρ0 = 0      85
ρ = 0, ρ0 = 1      50
ρ = 1, ρ0 = 1      60

Table 5. Testing of population

Variable     Accuracy (%)   Duration (seconds)
1 (right)    85             43
1 (left)     85             43
1 (random)   80             328
2            85             133

Table 6. The parameter setting

Variable     Value
α            0
β            1
population   1
ρ, ρ0        {1, 0}

In the second test, we evaluated the effect of ρ0 on the initial pheromone and of ρ on the additional pheromones. The best values of ρ and ρ0 are 1 and 0, respectively. Setting ρ = 1 makes the (1 − ρ) term in equation (2) zero, eliminating the effect of the accumulated pheromones, while the initialization ρ0 = 0 eliminates the effect of the initial pheromones. Thus, all pheromone values change to zero, so that an ant's path is unaffected by the previous ant's path: if the previous path is wrong, the next ant does not repeat the same error. The worst results are obtained with the setting ρ = 0, ρ0 = 1. This setting increases the pheromone value, encouraging the ants to explore the same area; when the first ant explores the wrong way, the next ant also falls into the wrong path. We then tested three variations of the population value under five conditions: (1) one ant explored the right edge, (2) one ant explored the left edge, (3) one ant was placed randomly, (4) two ants explored both edges, and (5) three ants explored the middle and the two edges of the resistor. Table 5 shows no difference between using 1 or 2 ants placed on the edge area, as reflected in the same accuracy value. This is because the edges are unaffected by the acquisition light, whereas the middle part triggers a misrepresentation of the resistor code value caused by the light effect. Things are different when the ants are placed at random.
When the ants are placed at locations other than the resistor, they circle around to find the first ring. This triggers a longer computation time, even longer than with 3 ants. Thus, further research can be allocated to setting the location of the ants. The selected parameter values, determined from the previous tests, are listed in Table 6.

Fig. 2. Result

Figure 2 shows the test results with the best-value parameter settings. Three errors occur with the same characteristic, namely when the resistor is white. The error result is shown in Figure 3; the error is caused by the ring's color resembling the background, so the ring's value is incorrectly detected as background.

Fig. 3. Error result

Meanwhile, the paths traversed by the ants are shown in Figure 4: the ants explore part of the ring, both from the right edge and from the left edge. Applying the best parameters resulted in 85% accuracy and 43 seconds for 20 images. This method is expected to prevent local solutions without wasteful computation of the matrix.

Fig. 4. Ant path: (a) right path initialization, (b) left path initialization

A resistor color detector could be used in the classroom as a teaching tool if it is developed further. The instructor would explain the resistor color code with the help of notes and handouts; the students would attend the teacher's demonstration, engage in question and answer sessions, and conduct analysis. The students would review the color code handouts and consult the instructor about any questions, and, as a last step, follow the information provided by the proposed detector. This scenario may prove instructive for students just beginning their electrical engineering education.

IV. Conclusion

In this study, ant colony optimization is proposed to select the matrix values to be computed while preventing local solutions.
Pheromones and heuristic information, in the form of RGB matrices, contribute greatly to the movement of the ants. Global solutions are selected based on their high degree of similarity with other local solutions. In order to prevent local solutions, locally updated pheromones are employed. In the testing, we explored various parameter values: α, β, ρ, ρ0, and population. We obtained the best accuracy by applying α = 0, β = 1, ρ = 1, ρ0 = 0, and population = 1. Applying the best parameters resulted in 85% accuracy and 43 seconds for 20 images. It can be concluded that the proposed method prevents local solutions without exploring all the matrix values. A future implementation of the color detector at school could benefit electrical engineering students.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering and Informatics, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] E. Murdani and S. Sumarli, "Student learning by experiment method for analyzing the dynamic electrical circuit and its application in daily life," J. Phys. Conf. Ser., vol. 1153, p. 012119, Feb. 2019.
[2] P. Ctibor, J. Sedlacek, R. Musalek, T. Tesar, and F. Lukac, "Structure and electrical properties of yttrium oxide sprayed by plasma torches from powders and suspensions," Ceram. Int., vol. 48, no. 6, pp. 7464–7474, Mar. 2022.
[3] G. J. Brouwer and D. J. Heeger, "Categorical clustering of the neural representation of color," J. Neurosci., vol. 33, no. 39, pp. 15454–15465, Sep. 2013.
[4] S. Gao, T. Qiu, G. Wang, A. Huang, and J. Yu, "Printing characters recognition of chip resistors based on the combination of image segmentation and artificial neural network," in 2021 16th International Conference on Computer Science & Education (ICCSE), 2021, pp. 643–647.
[5] T. Wu, "A degraded character of printed number recognition algorithm," in 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016, vol. 01, pp. 156–159.
[6] X. Li, Z. Zeng, M. Chen, and S. Che, "A new method of resistor's color rings detection based on machine vision," in 2017 Chinese Automation Congress (CAC), 2017, pp. 241–245.
[7] M. Muminovic and E. Sokic, "Automatic segmentation and classification of resistors in digital images," in 2019 XXVII International Conference on Information, Communication and Automation Technologies (ICAT), 2019, pp. 1–6.
[8] Y.-S. Chen and J.-Y. Wang, "Reading resistor based on image processing," in 2015 International Conference on Machine Learning and Cybernetics (ICMLC), 2015, vol. 2, pp. 566–571.
[9] H. Yan, Z. Chen, M. Liu, L. Liu, and Y. Liu, "Prior knowledge for coarse to fine PCB resistor segmentation," in 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI), 2021, pp. 985–988.
[10] A. Jadon, A. Varshney, N. G. Varshney, and M. S. Ansari, "Simple and efficient non-contact technique for resistor value estimation," in 2018 International Conference on Communication and Signal Processing (ICCSP), 2018, pp. 938–941.
[11] W. Li, B. Esders, and M. Breier, "SMD segmentation for automated PCB recycling," in 2013 11th IEEE International Conference on Industrial Informatics (INDIN), 2013, pp. 65–70.
[12] A. Abdallah, D. Felici, G. Aielli, and R. Cardarelli, "FPGA implementation of resistor network for fast segment line detector," in 2017 29th International Conference on Microelectronics (ICM), 2017, pp. 1–4.
[13] N. Li, F. Liu, L. Qiu, and X. Su, "A geometric active contour model using symmetrical Kullback-Leibler distance for SAR image segmentation," in IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, 2018, pp. 6983–6986.
[14] O. Goldreich, "Computational complexity: a conceptual perspective," SIGACT News, vol. 39, no. 3, pp. 35–39, Sep. 2008.
[15] F. Neumann and C. Witt, "Bioinspired computation in combinatorial optimization: algorithms and their computational complexity," in Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, 2013, pp. 567–590.
[16] E. Fejzagić and A. Oputić, "Performance comparison of sequential and parallel execution of the ant colony optimization algorithm for solving the traveling salesman problem," in 2013 36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2013, pp. 1301–1305.
[17] L. Haoguang, Y. Yunhua, and S. Xuefeng, "Load parameter identification based on particle swarm optimization and the comparison to ant colony optimization," in 2016 IEEE 11th Conference on Industrial Electronics and Applications (ICIEA), 2016, pp. 545–550.
[18] B. Xu, M. Lu, J. Shi, J. Cong, and B. Nener, "A joint tracking approach via ant colony evolution for quantitative cell cycle analysis," IEEE J. Biomed. Health Informatics, vol. 25, no. 6, pp. 2338–2349, 2021.
[19] G. Tambouratzis, "Using an ant colony metaheuristic to optimize automatic word segmentation for ancient Greek," IEEE Trans. Evol. Comput., vol. 13, no. 4, pp. 742–753, 2009.
[20] M. Dorigo and C. Blum, "Ant colony optimization theory: a survey," Theor. Comput. Sci., vol. 344, no. 2, pp. 243–278, 2005.
[21] S. A. Sari and K. M. Mohamad, "Recent research in finding the optimal path by ant colony optimization," Bull. Electr. Eng. Informatics, vol. 10, no. 2, pp. 1015–1023, Apr. 2021.
[22] R. Ahahmad and K. N. Mishra, "Analysis of intelligent approaches for discovery and management of knowledge: a review," SSRN Electron. J., 2022.
https://doi.org/10.1145/2464576.2466738 https://doi.org/10.1145/2464576.2466738 https://doi.org/10.1145/2464576.2466738 https://ieeexplore.ieee.org/abstract/document/6596460 https://ieeexplore.ieee.org/abstract/document/6596460 https://ieeexplore.ieee.org/abstract/document/6596460 https://doi.org/10.1109/iciea.2016.7603644 https://doi.org/10.1109/iciea.2016.7603644 https://doi.org/10.1109/iciea.2016.7603644 https://doi.org/10.1109/jbhi.2020.3032592 https://doi.org/10.1109/jbhi.2020.3032592 https://doi.org/10.1109/tevc.2009.2014363 https://doi.org/10.1109/tevc.2009.2014363 https://doi.org/10.1016/j.tcs.2005.05.020 https://doi.org/10.1016/j.tcs.2005.05.020 https://doi.org/10.11591/eei.v10i2.2690 https://doi.org/10.11591/eei.v10i2.2690 https://doi.org/10.2139/ssrn.4161379 https://doi.org/10.2139/ssrn.4161379 s. wibawanto et al. / knowledge engineering and data science 2023, 6 (1): 15–23 23 [23] e. singh and n. pillay, “a study of ant-based pheromone spaces for generation constructive hyper-heuristics,” swarm evol. comput., vol. 72, p. 101095, jul. 2022. [24] m. stighezza, v. bianchi, and i. de munari, “fpga implementation of an ant colony optimization based svm algorithm for state of charge estimation in li-ion batteries,” energies, vol. 14, no. 21, p. 7064, oct. 2021. [25] s. mishra, s. roy, s. c. swain, and a. routray, “underground cable fault tracking by ant colony optimization,” in 2022 ieee delhi section conference (delcon), feb. 2022, pp. 1–5. [26] r. k. behara and a. k. saha, “artificial intelligence methodologies in smart grid-integrated doubly fed induction generator design optimization and reliability assessment: a review,” energies, vol. 15, no. 19, p. 7164, sep. 2022. [27] pustekkom bpm semarang, “resistor,” kemdikbud, 2007. https://m-edukasi.kemdikbud.go.id/medukasi/produkfiles/kontenonline/online2007/resistor/kodewarnagelang.htm. 
knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 1, december 2022, pp. 27–40 eissn 2597-4637 https://doi.org/10.17977/um018v5i12022p27-40 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

automatic 3d cranial landmark positioning based on surface curvature feature using machine learning

putu hendra suputra a, b, 1, anggraini dwi sensusiati c, 2, myrtati dyah artaria d, 3, gijsbertus jacob verkerke e, 4, eko mulyanto yuniarno a, f, 5, *, i ketut eddy purnama a, f, 6

a department of electrical engineering, institut teknologi sepuluh nopember, gedung b, c & aj kampus its, keputih, sukolilo, surabaya 60111, indonesia
b department of informatics, universitas pendidikan ganesha, jalan udayana no.11, singaraja, bali 81116, indonesia
c department of radiology, faculty of medicine, universitas airlangga, kampus a universitas airlangga, jl. mayjen. prof. dr. moestopo 47, surabaya 60131, indonesia
d department of anthropology, universitas airlangga, room 211, building a, fisip, campus b, jl. airlangga no.4-6, gubeng, surabaya 60115, indonesia
e university medical center groningen, university of groningen, hanzeplein 1, 9713 gz groningen, netherlands
f department of computer engineering, institut teknologi sepuluh nopember, gedung b & c, kampus its sukolilo, surabaya 60111, indonesia

1 hendra.suputra@undiksha.ac.id; 2 anggraini-d-s@fk.unair.ac.id; 3 myrtati.artaria@fisip.unair.ac.id; 4 g.j.verkerke@med.umcg.nl; 5 ekomulyanto@ee.its.ac.id*; 6 ketut@te.its.ac.id
* corresponding author

article history: received 21 april 2022; revised 28 july 2022; accepted 8 august 2022; published online 7 november 2022

abstract: cranial anthropometric reference points (landmarks) play an important role in craniofacial reconstruction and identification. knowledge of how to detect the position of landmarks is critical. this work aims to locate landmarks automatically. landmark positioning using the surface curvature feature (scf) is inspired by conventional methods of finding landmarks based on morphometrical features. each cranial landmark has a unique shape. with appropriate 3d descriptors, the computer can draw associations between shapes and landmarks using machine learning. the challenge in classification and detection in three-dimensional space is to determine the model and data representation; using three-dimensional raw data in machine learning raises a serious volumetric issue. this work uses the surface curvature feature as a three-dimensional descriptor: it extracts the local surface curvature shape into a sequence of projection (depth) values. a machine learning method is developed to determine the position of landmarks based on local surface shape characteristics. classification is carried out from the top-n prediction probabilities for each landmark class, taken from a set of predictions and then filtered to get pinpoint accuracy. the landmark prediction points are hypothetically clustered in a particular area, so a cluster-based filter is appropriate to isolate them. the learning model successfully detected the landmarks, with an average distance between the prediction points and the ground truth of 0.0326 normalized units. the cluster-based filter is implemented to increase accuracy with respect to the ground truth. thus, scf is suitable as a 3d descriptor of cranial landmarks.

keywords: anatomical feature; cranial landmark; machine learning; morphological approach; surface curvature feature; three dimensions

i. introduction

craniofacial reconstruction is the art of reconstructing the soft tissue that is lost from a skull to present a lifelike visual appearance of a face. soft tissues are estimated based on the anthropometric features of the skull. forensic identification also requires landmark positioning for correspondence with reference facial landmarks [1][2]. an expert must determine the position of particular anatomical features (landmarks) as the soft tissue reference.

cranial landmarks are reference points for adding soft tissue thickness. a three-dimensional perspective poses a greater challenge in placing landmarks than positioning on two-dimensional images (image superimposition). the popular facial reconstruction methods, the american [3] and gerasimov [4], are the most intuitive for a computer-based approach. both simulate soft tissues guided by the placement of landmarks. current computerized craniofacial reconstructions generally use a three-dimensional virtual model that mimics the idea of the manual reconstruction process [5][6]. research related to computer-based craniofacial reconstruction is widely developed; however, most attention has been paid to reconstructing facial features based on soft tissue thickness.
the position of a skull landmark, however, is still determined manually. knowledge and the ability to detect the position of anatomical features are critical [7][8][9][10][11][12][13][14]. the golden rule in determining the position of anatomical features is to rely on their surface shape. the use of the surface curvature feature (scf) for classifying skull surface landmarks is inspired by these conventional methods of determining anthropometric reference points. the scf was introduced as a morphometric feature model representing the shape of a local surface [15]. several studies of automatic 3d detection have been carried out. for anatomical landmark positioning, jacinto [16] proposed a multi-atlas approach for orthopedic knee landmarks, in which the local surface shape is used as a reference for registering pre-defined landmark positions. three-dimensional annotation systems were also developed by lee [17] and lindner [18]; however, they use a 2d image or a shaded-2d image, so those methods do not measure the skull model geometrically in the way the original conventional method does. the challenge of using a three-dimensional image modality is the input dimension, which involves a very large volume of voxels. instead of employing a full 3d model, this study interprets three-dimensional shape information in context through a 3d descriptor; in other words, context-based information retrieval in three-dimensional form. each landmark has a distinctive shape, and by understanding a landmark's shape characteristics, its position on the skull can be found. the shape of a regular surface can be defined using a quantifiable model or a feature representation (3d descriptor), which eases the computation of surface shape features. the curvature model must describe a surface's shape quantitatively: whether a surface is concave or convex can be read from the sign of its curvature value.
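as a concrete illustration (not code from the paper, and with hypothetical helper names), the curvature reading described above can be sketched in python: a local patch, expressed in a frame where the surface point is the origin and the normal is the z-axis, is fitted to a quadric, and the eigenvalues of the resulting shape operator give the principal curvatures, whose mean decides convex versus concave.

```python
import numpy as np

def quadric_coeffs(local_points):
    """Least-squares fit of z = a*x^2 + b*x*y + c*y^2 to a patch given in
    a local frame (surface point at the origin, normal along z)."""
    x, y, z = local_points[:, 0], local_points[:, 1], local_points[:, 2]
    A = np.column_stack([x * x, x * y, y * y])
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
    return a, b, c

def curvature_report(local_points):
    """Principal curvatures k1, k2 (eigenvalues of the shape operator at
    the origin) and a convex/concave verdict from their mean."""
    a, b, c = quadric_coeffs(local_points)
    S = np.array([[2 * a, b], [b, 2 * c]])   # shape operator at the origin
    k1, k2 = np.linalg.eigvalsh(S)           # principal curvatures
    mean = (k1 + k2) / 2.0
    shape = "convex" if mean > 0 else ("concave" if mean < 0 else "flat")
    return k1, k2, shape
```

the sign convention (positive curvature along the outward normal read as convex) is a choice made for this sketch, not a claim about the paper's convention.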
in this work, curvature features are used as the input to obtain characteristic descriptors of the physical shape of skull surfaces at landmarks. by solving the volumetric issue of 3d data in machine learning, the learning task can be handled with a simple multi-layer perceptron that draws relationships between surface shapes and landmarks. the network model uses an activation function that accommodates the range of scf sequential values so as to maintain the three-dimensional information context. the model can classify whether a point on the surface of the skull is a landmark or not. the method proposed in this work is novel in being oriented towards geometrical measurement in 3d spatial space. a three-dimensional feature representation, the surface curvature feature (scf), is used to describe the local morphological shape of the landmarks, and machine learning is applied to detect and classify the skull surface landmarks. the contributions of the proposed method are as follows:
- the first study of the use of scf as a three-dimensional feature representation (descriptor) for detecting anatomical features of the skull surface (landmarks);
- an intuitive method for recognizing landmarks based on the shape of the surface, inspired by conventional methods;
- a simple mlp engaged to draw correlations on the implicit characteristics of a surface shape as a skull landmark.

ii. method

a. related works

working with 3d objects for shape recognition cannot be separated from the 3d shape representation process. generally, 3d representation is done using a feature-based approach, a view-based (facade) approach, or a graph-based (topological information) approach [19]. the feature-based approach uses shape models for matching or classification. the view-based approach renders a 3d view into 2d from a certain point of view. the final approach uses information on the surface topology.
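to make the view-based reduction mentioned above concrete, here is a toy sketch (not from the paper; resolution and projection axis are arbitrary choices): an orthographic depth-map projection of a point cloud along the z-axis, trading volumetric data for a 2d array at the cost of occluded detail.

```python
import numpy as np

def depth_map(points, res=64):
    """Toy view-based reduction: project a point cloud along z into a
    res x res depth image, keeping the closest (largest z) point per
    pixel. Pixels that receive no point stay NaN."""
    img = np.full((res, res), np.nan)
    xy = points[:, :2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    ij = ((xy - lo) / (hi - lo + 1e-12) * (res - 1)).astype(int)
    for (i, j), z in zip(ij, points[:, 2]):
        if np.isnan(img[j, i]) or z > img[j, i]:
            img[j, i] = z            # keep the surface nearest the viewer
    return img
```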
the feature-based approach relies on the geometric characteristics of 3d objects. geometric characteristics can be captured as global or local features, where global features are most widely used for classification work. meanwhile, the view-based approach eliminates the volumetric load of 3d objects. the computational cost of ingesting 3d data is much higher than that of a two- or one-dimensional format; therefore, several ways have been developed to convert the structure of 3d shapes into 2d. converting 3d data into 2d inevitably sacrifices some information, which is acceptable as long as the desired features are maintained. the third approach emphasizes the relationships and connections between the various components of the object representation model. computationally, the graph-based approach is considered inefficient compared to the other two. where the first approach geometrically retrieves the shape of an object, the second approach must look at the object from various views to obtain the intended feature. work related to the local-shape approach on a three-dimensional surface includes the automatic positioning proposed by jacinto [16], in which anatomical landmarks are detected by local rigid registration between a pre-defined landmark 3d model and the patient structure model. the three-dimensional annotation systems by lee [17] and lindner [18] use a 2d image approach. the cephalometric annotation performed by lee uses a shaded-2d image from which the landmark coordinates are approximated using machine learning, while lindner uses lateral cephalogram images. the latter method accurately detects landmark locations in the 2d image; however, the detection results cannot be used directly for facial reconstruction because they do not provide 3d spatial coordinate information. the orthopaedic landmark positioning proposed by jacinto [16] follows the multi-atlas approach.
he proposed a method for determining the position of landmarks on a 3d orthopaedic atlas model of the patient's bone structure. there are two three-dimensional models, namely the patient model and the expert model, and both consist of a triangulated mesh. the positioning of the landmarks is obtained by registering the patient model to the expert model: an expert identifies landmark positions on the expert model as a pre-defined landmark model, which is then registered to the patient model. the process includes two stages, computing the global mesh registration and the local registration. the initial fitting of the patient model to the expert model is carried out using the iterative-closest-point (icp) algorithm; this stage aims to attach the patient model to the expert model. the next step refines the fitting using adaptive local rigid registration, which places (transfers) the positions of the landmarks in the expert model onto the patient model. the final position of the landmarks is refined by automatically selecting a set of best-positioned landmarks. this approach increases the efficiency of unsupervised landmark positioning. lindner et al. [18] proposed a method for automatic landmark annotation on lateral cephalometric images, where the positioning of these landmarks is a standard procedure in orthodontic diagnosis and treatment. the fully automatic landmark annotation system (fala) they developed works by utilizing machine learning to determine the landmark position based on landmark positions annotated manually by experts. two stages were deployed: random forest regression, to make the system robust against variations in the image, followed by the constrained local model framework (rfrv-clm), which specifically determined the position of the landmarks. the data and the annotation results of lindner [18] were obtained on the lateral cephalogram image.
in other words, a two-dimensional image taken from the side of the object. the landmark detection was carried out on the bone contour and isolated the desired landmark position. the challenge is to distinguish whether a landmark lies on the contour of the bone or not. for this reason, adding a patch around the training sample allows the system to learn more about the environment in which the landmark is located. a three-dimensional modality for machine-learning-based landmark detection was used by lee et al. [17], driven by the shift towards three-dimensional images in surgical simulations. however, that study does not directly process three-dimensional images to detect landmarks; shaded two-dimensional renderings of the three-dimensional imagery were used as the modality, and the landmark positions were annotated on the images so that machine learning could find their relationship. lee sees the problem of processing three-dimensional data as the number of voxels. this challenge was overcome by working on two-dimensional shaded objects with variations in lighting and viewing angle. the method uses vgg-net to draw the correlation of landmark positions based on a manually marked dataset. another work on cephalometric landmark detection [20] also faced challenges related to the position of landmarks in three-dimensional space. they propose multi-stage deep reinforcement learning for detecting landmarks. the modality used is similar to lee [17]: 3d images rendered as 2d shaded images from various views. the work consists of two stages, namely the learning stage and the inferencing stage. the multi-stage design is needed because landmarks whose positions lie in empty 3d space (such as the foramen magnum or the sella) cannot be depicted by simple 2d images; both the single- and multi-stage approaches can detect the other landmarks.
the challenge in using three-dimensional data in machine learning is the enormous amount of data that must be processed. several works, such as lee [17], jacinto [16], and lindner [18], replace direct three-dimensional contour processing with a two-dimensional approach. however, machine learning does not require raw image data as input; refined or extracted information from the image can also be used if it represents the expected features. the surface curvature feature method proposed by yuniarno [15] works as a three-dimensional surface descriptor that translates the shape of the surface curvature into simpler context information for machine learning. unlike the works that use a view-based approach, yuniarno [15] uses a feature-based method for point cloud registration. he proposed an iterative closest point (icp) algorithm, as used by jacinto, but utilizing surface curvature feature (scf) estimation, a feature-based approach that works on local features. this method fits k-nn local points to the hyperbolic paraboloid equation and was originally used to perform point cloud matching. in general, scf acts as a model that describes the shape of curvature; the iterative closest point-surface curvature feature (icp-scf) algorithm can describe a local surface so that point cloud matching can be performed. from the related works, and unlike the previous works that use two-dimensional modalities, two issues need to be addressed to complete the task of classifying and detecting landmarks in 3d space: first, this work focuses on positioning landmarks at the coordinate level; second, machine learning is applied to classify and detect the position of a landmark. the first issue is solved by representing the local surface shape context using scf. unlike yuniarno, this work uses scf naively as a surface curvature descriptor, and the extraction results are then used as the learning input.
for the second issue, we hypothesize that adjacent points will have a similar surface shape. if they belong to a landmark, then several points will be predicted as the same class with a similar probability level. therefore, the top-probability prediction points are taken and then filtered based on the most significant cluster.

b. proposed method

considering the previous two issues, we propose a method aimed at detecting, classifying, and determining the position of cranial landmarks (figure 1). the modality used is multislice-ct in dicom format. cranial sections are extracted as point clouds and stored as cartesian coordinates. next, the local surface context of each point in the cloud is extracted with scf. this descriptor solves the problem of volumetric input by converting 3d data into feature representation values. associations between surface shapes and landmarks are obtained by deep learning. points with top-n probabilities are then filtered to isolate the largest cluster as a landmark area.

c. surface curvature feature

each landmark can be recognized based on the characteristics of its shape. a data representation model and a three-dimensional surface descriptor model are needed to realize this hypothesis. the cranium is represented as a point cloud. spatial data processing using a local coordinate system can provide a better perspective and simplify measurements. the surface curvature feature (scf) is used as the three-dimensional descriptor. surface curvature is the curvature of a curve on the surface that passes through a certain point [15]. surface curvature must provide consistent information about the characteristics of a curve with a certain center point p. the coordinate system serving as the parameter frame in which the curvature function works must be uniform; for spatial work in geometric spaces, local coordinate systems are often used.
the local coordinate system is built on a point p, where the normal vector n_p of the surface at point p is used as the z coordinate axis. meanwhile, the x and y coordinate axes are obtained from the principal directions of the surface at the point.

d. local coordinate system

the local coordinate system (figure 2) simplifies the marking and measurement process. this coordinate system is best used for smaller area extents. the conversion is done by finding a composite transformation that transforms the point cloud so that the normal vector of p coincides with the azimuth [0 0 1] as the local principal axis z. the transformation matrix is a composite of a translation t and a rotation r. the center of rotation must be at the point p, so the translation is t = -p. for the z-axis direction of the local coordinate system to be parallel to the surface's normal, the point cloud needs to be rotated. rodrigues' rotation formula is applied to find the rotation r; it gives an efficient method for computing the rotation matrix corresponding to a rotation by an angle θ about a fixed axis specified by a unit vector k. after the surface point cloud is transformed with the translation t and rotation r, the next task is determining the principal directions.

fig. 1. landmark positioning process based on surface curvature feature
fig. 2. local coordinate system on p. the normal vector n_p of p coincides with the azimuth as the local principal axis z

e. principal directions estimation

a regular surface has two principal directions. the principal directions of a surface are used to determine the two vectors serving as the x and y axes of the local coordinate system; they are perpendicular to each other and to the z-axis, which is parallel to the normal. a principal direction of the surface at p is a tangent direction at the point on a regular surface in which the normal curvature of the surface reaches an extremal value.
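a minimal numpy sketch of this local-frame construction (hypothetical helper names, not the paper's code): translate the cloud so p is the origin, then build the rodrigues rotation that maps the unit normal n_p onto [0, 0, 1].

```python
import numpy as np

def rotation_to_z(n):
    """Rodrigues' formula: rotation matrix mapping unit normal n onto
    the local principal axis z = [0, 0, 1]."""
    n = n / np.linalg.norm(n)
    z = np.array([0.0, 0.0, 1.0])
    k = np.cross(n, z)               # rotation axis (unnormalized)
    s = np.linalg.norm(k)            # sin(theta)
    c = np.dot(n, z)                 # cos(theta)
    if s < 1e-12:                    # n already (anti-)parallel to z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = k / s
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # R = I + sin(theta) K + (1 - cos(theta)) K^2
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

def to_local_frame(cloud, p, n):
    """Translate so p becomes the origin, then rotate so n coincides
    with the z-axis (the composite transformation described above)."""
    R = rotation_to_z(n)
    return (cloud - p) @ R.T
```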
the principal directions define the directions in which a curve on the surface must travel to obtain the minimum and maximum curvature. the minimum and maximum curvature conditions are obtained from the eigenvalues of the shape operator, while the principal directions are obtained from the eigenvectors that correlate with those eigenvalues. the maximum and minimum curvatures of the curve are referred to as the principal curvatures [15]. the principal curvatures are the eigenvalues of the weingarten matrix w, built from the coefficients l, m, n of the second fundamental form and e, f, g of the first fundamental form:

w = [l m; m n][e f; f g]⁻¹ (1)

if k1 and k2 are the eigenvalues of the weingarten matrix w, then the minimum curvature is k1 and the maximum curvature is k2, with |k1| ≤ |k2|. the vectors associated with the principal curvatures are called the principal directions. a local coordinate system combines the principal directions and the surface normal vector; together they construct a coordinate frame (darboux frame).

f. surface curvature feature extraction

the surface curvature feature is a sequence of values that describes the curvature of a surface within a certain radius. thirty-six projection starting points were prepared (figure 3). these points lie on the planar ring perpendicular to the normal of the centre point p, spaced every ten degrees around the z-axis (the normal of p). the measurement starts from the first (zero-degree) projection point, whose position is at the ring radius on the x-axis; the next points follow every 10 degrees counterclockwise. one of the direct uses of scf is to determine surface similarity: two surfaces can be said to be geometrically similar if they have similar surface curvature features, so the similarity of surface geometry at two points can be compared by looking at the scf sequence values. a deep learning approach can also be applied, considering the difficulty of finding a direct correlation between scfs. by using deep learning, machines can relate the implicit definition of a classification based on a given feature. the machine will detect points on the surface with a certain surface shape. the surface shape is characterized by several scfs which are categorized as similar (figure 4). the principal direction is applied to ensure that similar surface shapes are measured from a uniform starting point; in other words, the scf values will show a similar histogram pyramid for surfaces with a similar shape.

fig. 3. surface curvature feature extraction. this returns a sequential value for each surface with centre p. the sequential value is the projection distance of the planar ring to the surface. a planar ring has a radius as the specified parameter, with the centre at p

g. data preparation

the data sample consisted of multi-slice computed tomography (ct) scans of skulls. the contours of the skulls were separated from soft tissue using the hounsfield unit scale; ct is well suited to density-based segmentation [21]. the data are retrospective and anonymous, with an accuracy of 0.625 mm between slices of 512×512 pixels, and each skull-face sample consists of 400 to 500 slices. the skull surface was then extracted from the multi-slice ct of the head based on the density differences between bone and soft tissue, and the surface is stored as a point cloud. the surface shape of each local surface on the skull was extracted using a three-dimensional descriptor, the surface curvature feature (scf). the dataset consists of 458,000 local surfaces, extracted as scf values and split into training-validation-testing data with a 60:20:20 ratio. for predicting landmark positions, points are taken in the parts considered to be landmark areas. the landmarks are frontomalare temporale-right (fmtr), zygion-right (zygr), glabella (glb), nasion (nas), zygoorbitale-right (zygoor), zygoorbitale-left (zygool), pogonion (pog), frontomalare temporale-left (fmtl), and zygion-left (zygl) [2][8][22][23].
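the 36-direction scf extraction can be sketched as follows. this is a deliberately simplified stand-in, not the paper's implementation: the true ring-to-surface projection is approximated by looking up the nearest cloud point in the xy-plane, and the lookup tolerance is an assumed parameter.

```python
import numpy as np

def scf_sequence(local_points, ring_radius=0.25, n_dirs=36, max_value=1.0):
    """Simplified SCF: for each of 36 directions (every 10 degrees,
    counterclockwise from the x-axis), take the ring point at
    ring_radius and record the surface depth (local z) under it.
    The surface height is approximated by the z of the nearest cloud
    point in xy; max_value is recorded when no point is near enough."""
    seq = np.full(n_dirs, max_value)
    xy = local_points[:, :2]
    for i in range(n_dirs):
        theta = np.deg2rad(10 * i)
        ring_pt = ring_radius * np.array([np.cos(theta), np.sin(theta)])
        d = np.linalg.norm(xy - ring_pt, axis=1)
        j = np.argmin(d)
        if d[j] < 0.05:                  # lookup tolerance (assumed)
            seq[i] = local_points[j, 2]  # projection depth at this angle
    return seq
```

on a locally flat patch every projection depth is zero, while a curved patch produces the characteristic non-zero sequence the classifier learns from.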
the points are sampled within a radius of 0.3 units from the centre of the desired landmarks (figure 5).

fig. 4. examples of four local surfaces, extracted by the scf 3d descriptor into scf values that are categorized as similar. a group of scfs categorized as similar will be trained to detect certain surface shapes.

fig. 5. nine landmarks: (a) frontomalare temporale-right (fmtr) and zygion-right (zygr); (b) glabella (glb), nasion (nas), zygoorbitale-right (zygoor), zygoorbitale-left (zygool), and pogonion (pog); and (c) frontomalare temporale-left (fmtl) and zygion-left (zygl).

h. point clouds normalisation

a point cloud sample {p_1, ..., p_n} is a collection of points extracted from multi-slice computed tomography (ct). the surfaces of the face and the skull are obtained by the outer contour extraction method [24] as point clouds, stored in an array of cartesian coordinates ([x y z]). each skull point cloud was normalized to obtain a uniform scale. the centre point c is determined as the average of the point cloud and becomes the center of the coordinate system ([0 0 0]); each skull is then scaled so that the average distance of the point cloud to c is one. with this normalization, the new coordinates are computed using equation (2):

p_i' = (p_i - c) / ((1/n) ∑_j |p_j - c|) (2)

normalization reduces the dependence of the gradients on the scale of the initial values during training; this allows higher learning rates without the risk of divergence.

i. feature extraction

the scf sequential value is a series of values that represents the shape of a surface by measuring the distance of the ring projection to the surface. the ring radius and the point cloud sample radius are operator-defined parameters. measurements were made on the normalized skull point clouds.
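the normalization step described above (centre on the mean, scale so the average distance to the centre is one) is straightforward in numpy; the helper name is hypothetical.

```python
import numpy as np

def normalize_cloud(points):
    """Centre the cloud on its mean c and scale it so that the average
    distance of the points to c equals one."""
    c = points.mean(axis=0)
    centred = points - c
    s = np.linalg.norm(centred, axis=1).mean()  # mean distance to centre
    return centred / s
```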
in this study, the ring radius and sample radius were 0.25 and 0.5 normalized units, respectively (normalized units after application of equation (2)). the principal direction axis determines the starting point of measurement, which moves counterclockwise every ten degrees. figure 6 shows the measurement of the scf sequential value as a surface's three-dimensional descriptor. a total of 36 user-defined projection starting points were prepared. these points lie in the plane perpendicular to the normal of the point, spaced every 10 degrees around the z-axis (also the normal of p). the measurement starts from the first (zero-degree) projection point, whose position is at the ring radius on the x-axis. there are cases where the projection of the ring does not touch the surface within the predefined range; the recorded value is then not an actual projection value but a given maximum value. since scf values generally lie in a small range, such dominant values reduce the accuracy of the training. to eliminate the dominance of these large values, they are set to zero.

fig. 6. surface curvature feature extraction on a point cloud sample. the point cloud coordinate system is converted into local coordinates. the scf sequential value is the result of the projection measurements of the scf ring to the local surface.

j. neural network deployment

the input data is in the form of scf, which is already a feature; naturally, it can be processed directly using a multi-layer perceptron. a model is developed to classify landmarks into ten classes (glb, fmtr, fmtl, zygr, zygl, zygoor, zygool, pog, nasion, and non-landmark). it consists of an input layer, hidden layers, and an output layer, and is trained to acquire the probability of membership of an scf in the ten landmark classes. the probability value shows how certain the data is predicted to be a class member.
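a forward-pass sketch of such an mlp is given below. the 36-value scf input, the ten output classes, and the categorical cross-entropy loss come from the text (the loss is named in the fig. 7 caption); the layer width (64) and the tanh hidden activation are assumptions for illustration only, since the paper's exact layer sizes and activation are not recoverable here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 36 SCF values in, one hidden layer of 64 units,
# 10 classes out (nine landmarks + non-landmark).
W1, b1 = 0.1 * rng.normal(size=(36, 64)), np.zeros(64)
W2, b2 = 0.1 * rng.normal(size=(64, 10)), np.zeros(10)

def forward(scf):
    """One forward pass: tanh hidden layer (keeps the negative SCF
    magnitudes; an assumed choice), softmax output giving
    class-membership probabilities."""
    h = np.tanh(scf @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

def categorical_cross_entropy(probs, onehot):
    """The loss named in the paper's Fig. 7 caption."""
    return -np.sum(onehot * np.log(probs + 1e-12))
```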
the scf sequential value is the projection distance of the ring with radius on the surface of the skull, work using 36 equidistant projection points. for example, on a convex surface, all scf sequences will be positive in extreme cases. however, with variations in surface shape, they could be in the range of negative to positive. it takes an appropriate activation function to accommodate the range of values. the activation function is considered because it can maintain a negative magnitude in the training input. using the activation function in combination with the gives significantly better results than the linear function. the loss function used in this classification model is , as it is commonly used in multi-label classification. iii. results and discussions an artificial neural network model has been created to classify nine landmarks into ten classes (figure 7). non-landmark points are treated as separate labels beside the nine landmarks. the dataset consists of scf values of 458.000 local surfaces labeled with nine landmark classes plus one non-landmark class. a non-landmark class is included to increase confidence in the prediction results. most of the skull surface is not a landmark. so, adding a class that says "not a landmark" really helps improve accuracy. the neural network model was trained for 200 epochs to ensure the training results (figure 8). it can be seen that around the 22nd epoch, there has been convergence. in some epochs, insignificant spikes occur, and the learning returns to convergence. an untrained cranial point cloud was tested on the neural network model. the point cloud consists of 200,000 evenly distributed samples (local surfaces). predictions for each landmark are scattered in several areas but mostly clustered in the designated target. the class prediction in figure 9 was fired by the top hundred highest predicted probability values for each landmark class. fig. 7. model summary with categorical cross entropy loss 36 p.h. 
an expert annotates ground truths on the cranial surface. the coordinates of the annotation points are extracted and compared with the predicted results. the accuracy of each prediction point is measured as the three-dimensional euclidean distance (equation 3) between the ground truth and the prediction point. seven of the nine landmarks give promising results compared to the ground truth. the predictions for zygoor and zygool have multiple clusters, with the largest clusters closer to the ground truth. accuracy is then improved by implementing a distance radius-based filter and a cluster-based filter.

d(g, p) = √((x_g − x_p)² + (y_g − y_p)² + (z_g − z_p)²) (3)

radius-based filters aim to select the points closest to the centre of mass of the prediction points. the mass centre is determined by the average coordinates of the predicted points. each prediction point's distance (equation 3) from the centre is then calculated, and the points are sorted in ascending order. the filter aims to leave candidate prediction points close to the ground truth; the hypothesis is that the prediction points should converge close to the ground truth position, so the radius-based filter assumes that the centre of mass lies at the ground truth position. this filter works well on several landmark classes such as fmtl, glb, pog, nas, and zygl, but for other classes it does not give consistent results. the problem is that the prediction points are spread over a wide range, and this filter is also unable to handle multi-cluster predictions. filtering by cluster is carried out to cope with multi-cluster predictions. this filter is a simplification of the neighborhood-based filtering techniques reviewed by han [25]. it relies on the number of neighbors per prediction point within a normalized 0.05-unit radius. the idea is to eliminate isolated points or

fig. 8. training (top) and loss (bottom) history of the learning process with 200 epochs.
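the radius-based filter described above admits a short sketch. the helper names, the sample points, and the kept-point count are illustrative, not taken from the paper.

```python
import math

def euclidean(p, q):
    # three-dimensional distance, as in equation (3)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def radius_filter(points, keep):
    # keep the `keep` prediction points closest to the centre of mass
    n = len(points)
    centre = tuple(sum(p[i] for p in points) / n for i in range(3))
    return sorted(points, key=lambda p: euclidean(p, centre))[:keep]

# hypothetical predictions: a tight cluster plus one far outlier
preds = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
top3 = radius_filter(preds, keep=3)
assert (5.0, 5.0, 5.0) not in top3  # the outlier is discarded
```

note the limitation discussed in the text: because the centre of mass is pulled toward every cluster, this filter degrades when the predictions form more than one cluster.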
convergence is reached at approximately the 22nd epoch; in some epochs, insignificant spikes occur before learning returns to convergence.

small clusters (points with few neighbors). thus, the tighter the filter, the more only points belonging to larger clusters remain. in the case of multi-cluster predictions, radius-based filters are not optimal: the average distance from the ground truth widens as the filter tightens. in contrast, the cluster-based filter consistently provides better results than the radius-based filter (table 1). on the zygool and zygoor landmarks, the prediction points are spread over a wide range, and the zygool landmark in particular forms more than one cluster (figure 9). radius-based filters failed to improve accuracy when selecting the top-20 points, while cluster-based filters cope with this problem. the performance of cluster-based filters is generally better than that of mass-radius-based filters (figure 10 and figure 11). referring to the position of the actual landmark, figure 9 shows the visual predictions of each landmark and the filtered results. the mean distance from the ground truth is reduced (better); the filter improves accuracy by eliminating points scattered too far from the center of mass. the results shown in figure 10, figure 11, and table 1 confirm that the filters can eliminate less relevant points, and that the cluster-based filter consistently provides better results. this also strengthens the hypothesis that the predictions of a landmark will be clustered because they share a similar scf, so that neighborhood-based filtering techniques [25] can group them.

fig. 9. results of top-hundred landmark detection, compared with cluster-based filtering side by side.

using scf with the iterative closest point (icp-scf) successfully registered point clouds based on local surface scf [15].
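the cluster-based filter described earlier (counting neighbors within a normalized 0.05-unit radius and discarding isolated points) can be sketched as follows. the minimum-neighbor threshold and the sample data are assumed illustrative values, not the paper's settings.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_filter(points, radius=0.05, min_neighbors=2):
    # keep only points with enough neighbours inside the given radius,
    # eliminating isolated points and very small clusters
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and dist(p, q) <= radius)
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept

# hypothetical data: one dense cluster of four points plus two isolated points
cluster = [(0.00, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0), (0.03, 0.0, 0.0)]
isolated = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
kept = cluster_filter(cluster + isolated)
assert kept == cluster  # isolated points are eliminated
```

raising `min_neighbors` tightens the filter, which matches the text's observation that a tighter filter leaves only points belonging to larger clusters.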
a similar idea is used in this work with a machine learning approach. learning converges, and the prediction results show that scf can act as a surface descriptor whose landmark characteristics can be recognized. the main challenge in using 3d modalities is the large number of voxels [17]. this work addresses the issue by using three-dimensional descriptors to capture the shape context of a three-dimensional surface: the learning input is information in the form of a 3d surface shape context, so the workload of the learning process is much reduced.

table 1. the average distance between the top-n prediction points and the ground truth points

landmark / filter | top-100 | top-90 | top-80 | top-70 | top-60 | top-50 | top-40 | top-30 | top-20 | top-10 | top-5
glb cluster filter 0.1420 0.0615 0.0457 0.0406 0.0384 0.0348 0.0316 0.0308 0.0287 0.0259 0.0228
glb radius filter 0.1420 0.0531 0.0449 0.0417 0.0379 0.0353 0.0318 0.0289 0.0251 0.0280 0.0246
fmtr cluster filter 0.0973 0.0382 0.0336 0.0294 0.0258 0.0223 0.0199 0.0163 0.0134 0.0104 0.0125
fmtr radius filter 0.0973 0.0387 0.0372 0.0361 0.0378 0.0387 0.0390 0.0418 0.0478 0.0611 0.0651
zygr cluster filter 0.1933 0.0483 0.0435 0.0393 0.0371 0.0359 0.0324 0.0277 0.0243 0.0206 0.0162
zygr radius filter 0.1933 0.0484 0.0440 0.0417 0.0379 0.0354 0.0326 0.0334 0.0337 0.0322 0.0324
pog cluster filter 0.2150 0.1444 0.0623 0.0582 0.0554 0.0503 0.0506 0.0525 0.0527 0.0553 0.0512
pog radius filter 0.2150 0.1093 0.0623 0.0545 0.0474 0.0412 0.0357 0.0305 0.0249 0.0185 0.0136
fmtl cluster filter 0.0917 0.0395 0.0350 0.0338 0.0311 0.0301 0.0278 0.0244 0.0227 0.0184 0.0218
fmtl radius filter 0.0917 0.0400 0.0378 0.0359 0.0336 0.0310 0.0287 0.0267 0.0229 0.0225 0.0291
zygl cluster filter 0.1717 0.0944 0.0489 0.0414 0.0388 0.0353 0.0316 0.0315 0.0313 0.0331 0.0304
zygl radius filter 0.1717 0.0798 0.0492 0.0452 0.0433 0.0384 0.0397 0.0421 0.0429 0.0467 0.0562
zygoor cluster filter 0.4486 0.3882 0.3394 0.2567 0.1774 0.0501 0.0350 0.0230 0.0206 0.0240 0.0316
zygoor radius filter 0.4486 0.3385 0.2580 0.1953 0.1205 0.0781 0.0671 0.0806 0.1153 0.1968 0.2798
zygool cluster filter 0.7019 0.6982 0.6783 0.6244 0.5983 0.5412 0.4521 0.2040 0.0219 0.0177 0.0212
zygool radius filter 0.7019 0.6237 0.5780 0.5062 0.5040 0.5786 0.6806 0.7681 0.7726 0.6747 0.6177
nas cluster filter 0.3305 0.2705 0.2427 0.1393 0.0666 0.0708 0.0719 0.0731 0.0721 0.0717 0.0695
nas radius filter 0.3305 0.2150 0.0900 0.0634 0.0584 0.0551 0.0505 0.0494 0.0511 0.0675 0.0700

fig. 10. filtering landmark prediction points by distance radius, with the center of mass of the predicted points as the center of the radius.

the feature-based approach is not only good for classification based on global features; a good understanding of local surface features can also be used to detect the unit shape of a landmark. so far, such detection and prediction have only been carried out using view-based approaches, such as the works by lindner [18], lee [17], jacinto [16], or kang [20]. a method for automatically detecting the position of landmarks is necessary in craniofacial reconstruction. especially in computer-assisted reconstruction, the morphological approach presents a computational opportunity because it can be simulated by a computer based on available medical modalities (ct, mri, usg).

iv. conclusions

the hypothesis that inspires this work is that each landmark has a unique surface shape characteristic. scf is used as a feature representation of the shape of the landmark surface. the neural network model shows convincing performance, with an average distance to the ground truth of 0.0326 normalized units. several prediction points with the highest probability values are taken for each landmark; they are scattered but tend to cluster in the desired area. cluster-based filters are better than mass-radius-based filters, consistently giving better pinpoint accuracy, especially in multi-cluster cases.
precision is measured based on the average distance of the top predictions to the ground truths. the cluster-based filter is needed for multi-cluster distributions to isolate the largest cluster, and it successfully copes with the multi-cluster case. based on these results, scf was successful as a 3d descriptor representing local surface features. with the success of this work, the next research step is its implementation as part of a craniofacial reconstruction framework. landmark detection is helpful not only for forensics but also in simulating medical reconstruction and plastic surgery. future research based on this work includes organ damage reconstruction with landmark-based template transfer; with a similar concept, research on implant reconstruction is also possible.

declarations

author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement. our thanks to lembaga pengelola dana pendidikan (lpdp) / indonesia endowment fund for education for making this research possible. lpdp grant number [201902210113890].

conflict of interest. the authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering, universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

fig. 11. filter based on the largest cluster

references
[1] p. t. jayaprakash, "conceptual transitions in methods of skull-photo superimposition that impact the reliability of identification: a review," forensic sci. int., vol. 246, pp. 110–121, 2015.
[2] m. a. lenza, a. a.
de carvalho, e. b. lenza, m. g. lenza, h. m. de torres, and j. b. de souza, "radiographic evaluation of orthodontic treatment by means of four different cephalometric superimposition methods," dental press j. orthod., vol. 20, no. 3, pp. 29–36, 2015.
[3] c. c. snow, b. p. gatliff, and k. r. mcwilliams, "reconstruction of facial features from the skull: an evaluation of its usefulness in forensic anthropology," am. j. phys. anthropol., vol. 33, no. 2, pp. 221–227, 1970.
[4] h. ullrich and c. n. stephan, "mikhail mikhaylovich gerasimov's authentic approach to plastic facial reconstruction," anthropologie, vol. liv/2, pp. 97–107, 2016.
[5] p. claes, d. vandermeulen, s. de greef, g. willems, j. g. clement, and p. suetens, "computerized craniofacial reconstruction: conceptual framework and review," forensic sci. int., vol. 201, no. 1–3, pp. 138–145, 2010.
[6] p. h. suputra, a. d. sensusiati, e. m. yuniarno, m. h. purnomo, and i. k. e. purnama, "3d laplacian surface deformation for template fitting on craniofacial reconstruction," in icccm'20: proceedings of the 8th international conference on computer and communications management, 2020, pp. 27–32.
[7] m. de buhan and c. nardoni, "a facial reconstruction method based on new mesh deformation techniques," forensic sci. res., vol. 3, no. 3, pp. 256–273, 2018.
[8] t. gietzen et al., "a method for automatic forensic facial reconstruction based on dense statistics of soft tissue thickness," plos one, vol. 14, no. 1, pp. 1–19, 2019.
[9] p. guyomarc'h, f. santos, b. dutailly, p. desbarats, c. bou, and h. coqueugniot, "three-dimensional computer-assisted craniometrics: a comparison of the uncertainty in measurement induced by surface reconstruction performed by two computer programs," forensic sci. int., vol. 219, no. 1–3, pp. 221–227, 2012.
[10] l. jiang, j. zhang, b. deng, h. li, and l. liu, "3d face reconstruction with geometry details from a single image," ieee trans. image process., vol. 27, no. 10, pp.
4756–4770, 2018.
[11] a. lodha, m. mehta, m. n. patel, and s. k. menon, "facial soft tissue thickness database of gujarati population for forensic craniofacial reconstruction," egypt. j. forensic sci., vol. 6, no. 2, pp. 126–134, 2016.
[12] b. rosario campomanes-álvarez et al., "modeling facial soft tissue thickness for automatic skull-face overlay," ieee trans. inf. forensics secur., vol. 10, no. 10, pp. 2057–2070, 2015.
[13] l. j. short, b. khambay, a. ayoub, c. erolin, c. rynn, and c. wilkinson, "validation of a computer modelled forensic facial reconstruction technique using ct data from live subjects: a pilot study," forensic sci. int., vol. 237, pp. 147.e1-147.e8, 2014.
[14] y. zhao et al., "laplacian musculoskeletal deformation for patient-specific simulation and visualisation," in 2013 17th international conference on information visualisation, jul. 2013, pp. 505–510.
[15] e. mulyanto yuniarno, m. hariadi, and m. hery purnomo, "point cloud registration for a non-deformable object using surface curvature features," j. theor. appl. inf. technol., vol. 51, no. 3, pp. 506–514, 2013.
[16] h. jacinto, s. valette, and r. prost, "multi-atlas automatic positioning of anatomical landmarks," j. vis. commun. image represent., vol. 50, pp. 167–177, 2018.
[17] s. m. lee, h. p. kim, k. jeon, s. h. lee, and j. k. seo, "automatic 3d cephalometric annotation system using shadowed 2d image-based machine learning," phys. med. biol., vol. 64, no. 5, 2019.
[18] c. lindner, c. w. wang, c. t. huang, c. h. li, s. w. chang, and t. f. cootes, "fully automatic system for accurate localisation and analysis of cephalometric landmarks in lateral cephalograms," sci. rep., vol. 6, pp. 1–10, 2016.
[19] h. kanaan and a. behrad, "three-dimensional shape recognition and classification using local features of model views and sparse representation of shape descriptors," j. inf. process. syst., vol. 16, no. 2, pp. 343–359, 2020.
[20] s. h. kang, k. jeon, s. h.
kang, and s. h. lee, "3d cephalometric landmark detection by multiple stage deep reinforcement learning," sci. rep., vol. 11, no. 1, pp. 1–13, 2021.
[21] n. chowdhury et al., "concurrent segmentation of the prostate on mri and ct via linked statistical shape models for radiotherapy planning," med. phys., vol. 39, no. 4, p. 2214, 2012.
[22] m. bayome, j. hyun park, and y. a. kook, "new three-dimensional cephalometric analyses among adults with a skeletal class i pattern and normal occlusion," korean j. orthod., vol. 43, no. 2, pp. 62–73, 2013.
[23] a. h. ross, d. e. slice, and s. e. williams, "geometric morphometric tools for the classification of human skulls. research report," pp. 1–59, 2010.
[24] m. a. ulinuha, e. m. yuniarno, s. m. s. nugroho, and m. hariadi, "outer contour extraction of skull from ct scan images," iop conf. ser. mater. sci. eng., vol. 185, no. 1, 2017.
[25] x. f. han, j. s. jin, m. j. wang, w. jiang, l. gao, and l. xiao, "a review of algorithms for filtering the 3d point cloud," signal process. image commun., vol. 57, pp. 103–112, 2017.
knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 2, december 2022, pp. 109–121 eissn 2597-4637 https://doi.org/10.17977/um018v5i22022p109-121 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

hybrid artificial bee colony and improved simulated annealing for the capacitated vehicle routing problem

farhanna mar'i a,1,*, hafidz ubaidillah a,2, wayan firdaus mahmudy b,3, ahmad afif supianto c,d,4
a informatics engineering, faculty of engineering, university of muhammadiyah gresik, gresik, 61121
b faculty of computer science, brawijaya university, malang, 65145
c department of ict and natural sciences, norwegian university of science and technology, torgarden, 8900
d research center for data and information sciences, national research and innovation agency (brin), jakarta, 10310
1 farhannamar@umg.ac.id *, 2 hafidz21ub@gmail.com, 3 wayanfm@ub.ac.id, 4 ahmad.a.supianto@ntnu.no
* corresponding author

i. introduction

the vehicle routing problem (vrp) is one of the most important problems in distribution management operations. vrp is faced by all organizations or companies involved in shipping and logistics. the primary purpose of vrp is to minimize travel costs for each vehicle route that serves customer requests at different location coordinates. each delivery route starts and ends at a depot or warehouse, and each customer is visited only once [1][2].
vrp is one of the topics in optimizing complex combinatorial problems that computer science researchers most often discuss. vrp solutions have specific objectives and limitations in real applications, giving vrp several categories or variants [3]. the variants of vrp include vrp with time windows (vrptw) [4], multiple depot vrp (mdvrp) [5], vrp with backhauls [6], and capacitated vrp (cvrp) [7]. this study discusses one of the most popular vrp variants, namely the capacitated vehicle routing problem (cvrp). cvrp is an np-hard combinatorial problem that requires a high computational process [8]. in the case of cvrp, there is an additional constraint in the form of a capacity limit on each vehicle, so the complexity of cvrp lies in finding the optimum route pattern that minimizes travel costs while respecting customer demand and vehicle capacity for distribution [7]. cvrp can be solved by implementing a meta-heuristic algorithm, a class of methods suited to complex combinatorial optimization problems [9][10]. in recent years, meta-heuristic algorithms have become popular among researchers because of their effectiveness, efficacy, and flexibility [11]. a meta-heuristic algorithm is an optimization technique that uses an iterative approach to produce the best solution by exploring the local optimum solution [12]. meta-heuristic algorithms can be used to find the optimum solution within a predetermined time or number of iterations [13].

article history: received 12 may 2022; revised 14 november 2022; accepted 9 december 2022; published online 30 december 2022.

abstract: capacitated vehicle routing problem (cvrp) is a type of np-hard combinatorial problem that requires a high computational process. in the case of cvrp, there is an additional constraint in the form of a capacity limit owned by the vehicle, so the complexity of the problem from cvrp is to find the optimum route pattern for minimizing travel costs which are also adjusted to customer demand and vehicle capacity for distribution. one method of solving cvrp can be done by implementing a meta-heuristic algorithm. in this research, two meta-heuristic algorithms have been hybridized: artificial bee colony (abc) with improved simulated annealing (sa). the motivation behind this idea is to complement the strengths and weaknesses of the two algorithms when exploring and exploiting the optimal solution. hybridization is done by running the abc algorithm, and then the output solution at this stage is used as an initial solution for the improved sa method. parameter testing for both methods has been carried out to produce an optimal solution. in this study, the test was carried out using the cvrp benchmark dataset generated by augerat (dataset 1) and the recent cvrp dataset from uchoa (dataset 2). the result shows that hybridizing the abc algorithm and improved sa could provide a better solution than the basic abc without hybridization. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

in previous studies, several types of meta-heuristic algorithms have been applied to solve cvrp, including simulated annealing [7], genetic algorithm [8], particle swarm optimization [9], firefly algorithm [14], and artificial bee colony [15][16]. based on these studies, artificial bee colony (abc) is one of the meta-heuristic algorithms that can produce the best average solution output.
besides, abc is the most popular variant of swarm intelligence, being the most widely used in optimization research despite being the youngest algorithm compared with other swarm intelligence methods [17]. abc is an algorithm inspired by swarm intelligence (si), especially in bees. bees have intelligence that they use to select food source locations, evaluating quality through a dance movement called the waggle dance. the quality of a food source is assessed from the quality of the nectar in the flowers (pollen), as well as the distance and direction of the food source from the nest. abc is not only used to optimize cvrp; it has also succeeded in overcoming other optimization problems such as cost-efficiency optimization for sizing and composition of arctic offshore drilling support fleets [18], multi-objective optimization of scheduling for palletizing tasks using a robotic arm [19], multi-objective land-use allocation [20], and other optimization projects [21][22][23]. the main advantages of the abc algorithm are that it has fewer control parameters than other si algorithms, can handle stochastic objective functions, and is easy to hybridize with other algorithms. however, the abc algorithm also has a drawback: it can fail to produce an optimal solution because it becomes trapped in a local optimum [24]. besides that, its ability to search for a better solution is also poor [25]. so in this study, the performance of the abc algorithm is improved by hybridizing it with another meta-heuristic algorithm. hybridization is proven to improve the performance of an algorithm, especially for optimization problems. in previous studies, the abc algorithm has been hybridized with several other meta-heuristic algorithms such as tabu search [26], genetic algorithm [27], particle swarm optimization [28], monarchy butterfly optimization [29], and quantum computing [30].
those studies show that hybridization of the abc algorithm can yield significant improvements in producing the optimal solution. in this research, we propose hybridizing the abc algorithm with another popular meta-heuristic algorithm, simulated annealing (sa). the motivation behind this hybridization is to increase the performance of the two meta-heuristic algorithms by utilizing both algorithms' strengths and compensating for their weaknesses to provide a better solution for solving cvrp. sa is a probability-based meta-heuristic algorithm used to solve combinatorial optimization problems, adapted from the cooling process of metals or materials in thermodynamics [31][32][33]. sa is an attractive method for solving optimization problems because of its ability to deal with arbitrary systems and cost functions, and it is easy to implement. sa has been used in several optimization problems in different fields such as statistical physics [34], discrete structures [35], biotechnology [36], and others [37][38]. however, sa also has a disadvantage: its parameters are difficult to control, especially the initial temperature and annealing rate. the weaknesses of sa can be handled by hybridizing, and it is proven that sa is a method whose settings are easy to modify and hybridize with other algorithms. in earlier research, sa has also been hybridized with particle swarm optimization for solving cvrp [39], with genetic algorithms for optimizing assembly sequences, and with neural networks for melanoma classification [32]. in order to maximize the results of hybridization, we implement a new approach to the sa method that is proven to produce the best solution. there are several modifications of sa to improve its performance, such as using a crossover operator [40], adding two new operators, folding and reheating [41], and adding a very fast simulated annealing with a two-stage annealing plan [42]. these improvements have succeeded in increasing the performance of sa.
this research uses one of the improved sa variants proposed by yuxin et al. (2018) to prove its performance in solving cvrp. tests are carried out on the cvrp benchmark dataset generated by augerat et al. (1998) [43] and the latest cvrp dataset from uchoa et al. (2016) [44] to prove the reliability of the hybridization of the two algorithms. our main contributions: first, we show that hybridizing two different meta-heuristic algorithms can produce better performance than a single meta-heuristic algorithm implementation for solving cvrp. second, we demonstrate that our proposed algorithm can achieve a minimum distance for cvrp. in addition, we used a novel cvrp dataset that has high complexity and is close to the original problems. this research is structured into four sections. section i covers the background and related research. section ii illustrates the research methods, and section iii analyzes the results of the implemented algorithm with parameter testing. finally, the main findings and future research directions are outlined in section iv.

ii. methods

in solving the cvrp problem, the expected solution value is the minimum distance of the entire vehicle trip in one dataset group. thus, several cvrp benchmark datasets are used to test the parameters and algorithms used in this study. in this research, hybridization was carried out by first finding the best parameters through parameter testing of the two methods. the best parameters are then used in the hybridization process by running the artificial bee colony (abc) method first; the solution output from that process is used as the initial solution of the improved simulated annealing (sa) method. the methodology of this research is depicted in figure 1.

fig. 1. research method

a. datasets

dataset 1 used in this study is the dataset generated by augerat [43], shown in table 1, and the new cvrp benchmark dataset from uchoa (dataset 2) [44] is shown in table 2.

b. artificial bee colony (abc)

the abc algorithm is an example of swarm intelligence that tries to adopt intelligent behavior from animals, especially honey bees looking for food sources. in a swarm of bees, there are three types of bees: employed bees, onlooker bees, and scout bees [2]. the three types of bees share the same goal when looking for food sources: to find those of the highest quality. the flowchart of the abc algorithm in this study is depicted in figure 2. the values of the best-known solutions for the datasets used in this study are shown in table 3. the implementation of the abc algorithm is described in the following pseudocode.

// initialization
b = the number of bees
i = the number of iterations
st = set of stages {s1, s2, s3, ..., sm}
// find any solution x of the problem r
for i = 1 to i
    for j = 1 to m
        for b = 1 to b
            // forward step: allow bees to fly from the hive and choose b
            // partial solutions from the set of partial solutions sj at stage stj
        // backward step: send all bees back to the hive and allow bees to
        // exchange information about the quality of the partial solutions
        if r > x then x = r

abc's parameter to be tested to produce the optimum solution is the population size. in this study, a solution is calculated for each dataset with populations of 100 and 1000. the evaluation of a solution is the calculation of the total distance that must be traveled by all vehicles using the euclidean distance formula shown in (1).
d_t = Σ √((x_i − x_j)² + (y_i − y_j)²) (1)

where:
x_i, x_j = x position of customers i, j
y_i, y_j = y position of customers i, j
d_t = total distance covered by all vehicles

table 1. dataset 1
no. | set of problem | number of customers | number of vehicles | capacity
1. an32k5 32 5 100
2. an69k9 69 9 100
3. an80k10 80 10 100
4. bn31k5 31 5 100
5. bn50k7 50 7 100
6. bn78k10 78 10 100
7. e-n51-k5 51 5 8000
8. pn101k4 101 4 400

table 2. dataset 2
no. | set of problem | number of customers | number of vehicles | capacity
1. xn200k36 200 36 402
2. xn359k29 359 29 68
3. xn627k43 627 43 110
4. xn876k59 876 59 764

the fitness value is obtained by comparing the distance value of the solution with the best-known solution distance of each dataset, as formulated in (2).

f = 1 / (1 + (d_t − bks)) (2)

where:
f = fitness value
bks = best-known solution

fig. 2. artificial bee colony (abc) algorithm

c. improved simulated annealing

after getting the solution with the best fitness from the abc algorithm, the solution is refined using the improved simulated annealing (sa) algorithm. the improved simulated annealing algorithm used in this study is the sa algorithm improved using the very fast simulated annealing (vfsa) concept, applied to cvrp [7]. the annealing plan of the improved simulated annealing method in stage 1 is formulated in (3).

T1(k) = T0 exp(−c k^(1/N)) (3)

the initial temperature is T0, the iteration number is k, c is a given constant, and N is the number of inversion parameters. if the temperature exceeds the specified t value, step 2 is carried out with (4).

T2(k) = T0 exp(−α (k − k0^β)^(1/2)) (4)

the number of iterations in step 1 is k0, and the temperature rise factor is β.
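as an illustration, the stage-1 cooling schedule in (3) can be sketched in python. the parameter values (T0, c, N) below are hypothetical defaults, not the tuned values used in this study.

```python
import math

def t_stage1(k, t0=1000.0, c=1.0, n=2):
    # equation (3): T1(k) = T0 * exp(-c * k**(1/N))
    return t0 * math.exp(-c * k ** (1.0 / n))

# temperature decreases monotonically as the iteration count k grows
temps = [t_stage1(k) for k in range(1, 51)]
assert all(a > b for a, b in zip(temps, temps[1:]))
```

the exponential in k^(1/N) is what makes vfsa cool much faster than classical geometric schedules, which is why the stage-2 plan in (4) exists to raise the temperature again when needed.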
T and β are inversely proportional: when β is small, the value of T becomes larger. The parameters of the improved SA tested in this study are the temperature reduction factor (α), the random value for the acceptance probability (r), the parameters c and N, and the maximum number of iterations t_max.

III. Results and Discussion

In this study, hybridization is carried out by first running the ABC algorithm and then the improved SA, but parameter testing has to be done first. For the ABC algorithm, the population size is tested with 100 individuals (Table 4) and 1000 individuals (Table 5). Each test consists of five trials, summarized by the minimum, maximum, and average values over the whole experiment.

Table 3. Best-known solution (BKS) values of the datasets

No.  Dataset        BKS
1.   A-n32-k5       784
2.   A-n69-k9      1763
3.   A-n80-k10     1174
4.   B-n31-k5       672
5.   B-n50-k7      1032
6.   B-n78-k10     1221
7.   E-n51-k5       521
8.   P-n101-k4      681
9.   X-n200-k36   58578
10.  X-n359-k29   51509
11.  X-n627-k43   62366
12.  X-n876-k59   99715

Table 4. Results with a population of 100

Dataset        Min     Max     Avg
A-n32-k5      2128    2300    2231
A-n69-k9      4264    4554    4387
A-n80-k10     5522    5778    5653
B-n31-k5      1292    1420    1365
B-n50-k7      2878    3016    2370
B-n78-k10     4734    5076    4900
E-n51-k5      1770    2022    1908
P-n101-k4     3832    4022    3946
X-n200-k36  134866  141286  138000
X-n359-k29  244534  250950  248529
X-n627-k43  405618  412564  409079
X-n876-k59  553752  567470  562342

The minimum results in Table 4 and Table 5 are visualized in Figure 3 for Dataset 1 and Figure 4 for Dataset 2.

Fig. 3. Result of population-size testing on Dataset 1

As visualized in Figure 3, the results differ with the population size, showing that for Dataset 1 a population of 100 could produce a better distance than a population of 1000.
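As a concrete reference for the evaluation used throughout these tests, the total Euclidean distance in (1) and the fitness in (2) can be sketched in Python as follows. The depot index, the coordinate layout, and the function names are illustrative assumptions, not the authors' implementation.

```python
import math

def total_distance(routes, coords):
    """Total Euclidean distance over all vehicle routes (Eq. 1).
    Each route is assumed to start and end at the depot (index 0)."""
    d = 0.0
    for route in routes:
        path = [0] + route + [0]
        for i, j in zip(path, path[1:]):
            (xi, yi), (xj, yj) = coords[i], coords[j]
            d += math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2)
    return d

def fitness(dt, bks):
    """Fitness relative to the best-known solution (Eq. 2)."""
    return 1.0 / (1.0 + (dt - bks))

# Toy instance: depot at the origin, two customers on 3-4-5 triangles.
dt = total_distance([[1, 2]], {0: (0, 0), 1: (3, 4), 2: (6, 8)})  # 5 + 5 + 10 = 20
f = fitness(dt, bks=20.0)  # 1.0 when the solution matches the BKS
```

Note that the fitness approaches 1 as the solution's distance approaches the BKS, which is why the averages in Table 10 shrink toward zero on the large X-series instances.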
The same holds for Dataset 2, shown in Figure 4, although the differences are only slight. Therefore, the population size used in this research is 100.

Fig. 4. Result of population-size testing on Dataset 2

After obtaining the best parameter for the ABC algorithm, the next step is to find the best parameters for the improved SA. The temperature reduction factor is used to reduce the temperature value of the improved SA method; this parameter determines the temperature value and thereby affects the number of iterations. The test results for the temperature reduction factor (α) on the A-n69-k9 dataset are presented in Table 6.

Table 5. Results with a population of 1000

Dataset        Min     Max     Avg
A-n32-k5      1984    2168    2098
A-n69-k9      4132    4282    4179
A-n80-k10     5296    5466    5396
B-n31-k5      1210    1554    1322
B-n50-k7      1166    1298    1227
B-n78-k10     3834    4884    4604
E-n51-k5      1702    1842    1810
P-n101-k4     3792    3950    3874
X-n200-k36  131506  137816  135147
X-n359-k29  242058  246610  245205
X-n627-k43  399734  403684  402032
X-n876-k59  549460  559772  556476

Table 6 shows the results of testing the temperature reduction factor on the A-n69-k9 dataset over ten trials. Increasing the reduction factor also increases the computation time, so this study uses a temperature reduction factor of 0.9. Next, the parameters c and N of the annealing schedule are tested. In addition, the random value parameter, which determines whether a worse solution is accepted in the SA method and was previously only generated randomly, is also tested ten times. The trial results for the random value in accepting worse solutions, measured against the objective value (the minimum total distance) on Augerat's A-n69-k9 dataset, are shown in Table 7.
Table 7 shows that the random value for the acceptance probability that produces the most optimal minimum distance is 0.9. Parameters c and N were then tested to determine their optimal values; this experiment is performed ten times, the number of trials needed to obtain a stable average of the minimum total distance. The results of testing parameters c and N are presented in Table 8.

Table 6. Results for the temperature reduction factor

α    Min   Max   Avg
0.1  3064  3546  3302
0.2  3038  3392  3182
0.3  2890  3112  3007
0.4  2830  3046  2947
0.5  2376  2966  2726
0.6  2480  2802  2678
0.7  2312  2556  2442
0.8  2230  2738  2416
0.9  2064  2434  2169

Table 7. Testing the value of r

r    Min   Max      Avg
0.1  2516  200768   2711
0.2  2610  204512   2757
0.3  2562  203384   2793
0.4  2554  204380   2757
0.5  2562  1999018  2791
0.6  2344  202698   2728
0.7  2632  204198   2747
0.8  2540  202536   2744
0.9  2478  205346   2650

Table 8. Test results for parameters c and N (average minimum total distance)

c\N     0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
0.1  2781.0  2822.6  2891.4  2840.2  2829.4  2824.4  2804.2  2819.2  2862.8
0.2  2900.8  2857.8  2799.6  2833.4  2845.0  2835.8  2765.0  2767.8  2809.8
0.3  2814.2  2727.6  2736.2  2847.6  2753.4  2827.7  2738.4  2913.0  2841.2
0.4  2784.4  2808.0  2781.0  2835.4  2782.4  2846.6  2716.4  2800.8  2809.0
0.5  2848.0  2765.8  2867.6  2758.2  2791.2  2855.2  2774.4  2848.4  2821.2
0.6  2863.4  2897.4  2793.2  2813.2  2779.4  2791.8  2771.4  2864.0  2857.6
0.7  2776.4  2858.4  2828.0  2900.4  2768.4  2793.4  2798.2  2850.8  2864.8
0.8  2778.8  2811.2  2719.4  2736.6  2884.2  2755.2  2805.6  2884.8  2775.8
0.9  2860.4  2813.6  2789.4  2796.6  2814.6  2823.4  2808.0  2777.0  2821.0

As can be seen in Table 8, the parameters c and N that produce the optimal solution, i.e., the minimum average total distance, are c = 0.4 and N = 0.7.
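With the tuned values reported here (c = 0.4, N = 0.7, α = 0.9), the two-stage annealing schedule in (3) and (4) can be sketched as below. The initial temperature T0, the β value, and the exact grouping ((k − k0)/β)^(1/2) inside (4) are assumptions, since the extracted formula is ambiguous.

```python
import math

def t1(k, t0=1000.0, c=0.4, n=0.7):
    """Stage-1 schedule (Eq. 3): T1(k) = T0 * exp(-c * k**(1/N))."""
    return t0 * math.exp(-c * k ** (1.0 / n))

def t2(k, k0, t0=1000.0, alpha=0.9, beta=2.0):
    """Stage-2 schedule (Eq. 4), with the assumed grouping ((k - k0) / beta)**0.5."""
    return t0 * math.exp(-alpha * math.sqrt((k - k0) / beta))

temps = [t1(k) for k in range(1, 6)]
assert all(a > b for a, b in zip(temps, temps[1:]))  # temperature keeps falling
```

Both schedules decay monotonically in k, and a smaller β in t2 makes the argument of the exponential larger in magnitude, i.e., the temperature falls faster, matching the inverse relation between T and β described in the text.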
For the number of iterations (t_max), the experiment is carried out on the A-n69-k9 dataset with 100, 1000, 10,000, 100,000, and 1,000,000 iterations. Table 9 presents the solutions obtained for each iteration count. As shown in Table 9, increasing the number of iterations of SA on the CVRP problem yields an increasingly minimal average total distance, indicating a better solution, but the average computation time also increases. At 1,000,000 iterations, the decrease in the average minimum distance compared with 100,000 iterations is not significant, while the computation time differs vastly (1103.243 s versus 106.392 s). This research therefore uses 100,000 iterations.

After testing the required parameters, the hybridization of the ABC algorithm and improved SA is run ten times using the best parameter settings. The results on Dataset 1 and Dataset 2 are shown in Table 10. These results are compared with the results of implementing the ABC algorithm without hybridization to determine whether hybridization produces a more optimal solution; the comparison is visualized in Figure 5 for Dataset 1 and Figure 6 for Dataset 2. Based on the results in Figure 5, the comparison on Dataset 1 shows that ABC-SA minimizes the total distance significantly compared with a single ABC. This means that SA can reconsider a solution based on the acceptance probability, because not every bad solution found in an early iteration leads to a poor final result.
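The ABC-then-SA pipeline described above can be outlined as follows. This is a minimal sketch: it assumes a geometric cooling rule in place of the paper's two-stage VFSA schedule, `neighbour` stands for any move operator (e.g., a swap), and the Metropolis acceptance test plays the role of the tuned random value r in reconsidering worse solutions.

```python
import math
import random

def hybrid_abc_sa(initial, cost, neighbour, t0=1000.0, alpha=0.9, iters=1000, rng=None):
    """Refine the best solution found by the ABC phase (`initial`) with a
    simulated-annealing loop; worse candidates are accepted with the
    Boltzmann probability exp(-delta / T), so early bad moves can still
    lead to a better final solution."""
    rng = rng or random.Random()
    best = current = initial
    t = t0
    for _ in range(iters):
        cand = neighbour(current, rng)
        delta = cost(cand) - cost(current)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = cand
            if cost(current) < cost(best):
                best = current
        t = max(t * alpha, 1e-12)  # guard against underflow to zero
    return best

# Toy usage: minimise |x - 3| over the integers, starting from 10.
result = hybrid_abc_sa(
    10,
    cost=lambda x: abs(x - 3),
    neighbour=lambda x, rng: x + rng.choice([-1, 1]),
    iters=2000,
    rng=random.Random(1),
)
```

Since `best` is only replaced by strictly better candidates, the returned solution is never worse than the ABC phase's output, which mirrors why the hybridization in Table 10 can only improve on the single-ABC distances.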
Augerat's datasets used in this research come from different problem sets, so the proposed method demonstrates its performance across data variations.

Table 9. Test results for the number of iterations

t_max      Min   Max   Avg   Avg computation time (s)
100        4032  4596  4222     0.481
1000       3006  3356  3144     1.460
10000      2076  2388  2205    10.737
100000     1656  1910  1754   106.392
1000000    1630  1690  1619  1103.243

Table 10. Result of hybridization

Dataset       Distance min  Distance max  Distance avg  Avg fitness  Avg comp. time (s)
A-n32-k5          1082          1352          1246        0.002          24.6
A-n69-k9          1860          2164          2023        0.004          48.7
A-n80-k10         2788          2888          2833        0.0006         56
B-n31-k5           804           874           844        0.005          22.9
B-n50-k7          1138          1378          1244        0.005          35.8
B-n78-k10         1816          2156          1986        0.001          54.6
E-n51-k5           898           978           944        0.002          34.9
P-n101-k4         1156          1446          1274        0.001          62.5
X-n200-k36       96358        108048        102063        2.3e-05       145.1
X-n359-k29      196890        209652        203881        6.5e-06       236
X-n627-k43      349086        365972        358497        3.3e-06       407.3
X-n876-k59      503928        522220        516263        2.4e-06       566.9

Fig. 5. Result of comparison of Dataset 1 on the average distance

Figure 6 shows the average total distance of single ABC and ABC-SA hybridization on Dataset 2. These results also show that the more complex the dataset, the smaller the difference in the average distance results. This happens because, naturally, the more complex a problem is, the more difficult it is for an algorithm to converge [45].

Fig. 6. Result of average distance on Dataset 2

Performance evaluation is not limited to the difference in distance generated by each algorithm; computation time is also compared to show the impact of the hybridization. The computation-time results are shown in Figure 7 for Dataset 1 and Figure 8 for Dataset 2.

Fig. 7.
Result of average distance on Dataset 1

Comparing single ABC and ABC-SA hybridization on the average computation time (in seconds) on Dataset 1 shows varied differences: on the A-n32-k5, A-n69-k9, B-n31-k5, E-n51-k5, and P-n101-k4 datasets, the ABC-SA hybridization actually requires less computation time than single ABC, and an increase in computation time occurs only on the A-n80-k10 dataset. This is because SA helps ABC perform local searches and thus converge faster. Apart from that, another interesting observation on the A-n69-k9 dataset in Figure 7 is that a difference of 13.5 seconds in computation time comes with a reduction of 2404 in the resulting distance; the hybridization thus cuts almost 50% of the total distance of a single ABC.

Fig. 8. Result of comparison on Dataset 2

As seen in Figure 8, the difference in computation time between single ABC and ABC-SA hybridization on Uchoa's dataset is quite different compared to Dataset 1. On the X-n359-k29 dataset in particular, the ABC-SA hybridization significantly increases the computation time. This is because Dataset 2 is highly complex, so the ABC-SA hybridization also requires more time. However, compared with the distance minimization obtained on Dataset 2, the ABC-SA hybridization is superior in minimizing the average distance. For example, on the X-n359-k29 dataset it takes 47 seconds longer, but the hybridization reduces the distance by 47644, about 20% of the single-ABC distance.

IV. Conclusion

In this research, two meta-heuristic algorithms have been hybridized, namely, artificial bee colony (ABC) and improved simulated annealing (SA), to solve the capacitated vehicle routing problem (CVRP) using the classic Dataset 1 and the recent CVRP Dataset 2.
The hybridization results show good performance compared to implementing the ABC algorithm without hybridization. Parameter testing of the two algorithms has also been carried out to produce an optimal solution. Based on the results of the study, the hybridization of the two meta-heuristic algorithms provides more optimal performance on the CVRP optimization problem, as seen in how the total average distance is minimized. In addition, the impact on computation time is not very significant, and on some light datasets the hybrid is even proven to require less time. In future research, the artificial bee colony can be modified so that the hybridization produces even better performance; moreover, further hybridization experiments with other meta-heuristic algorithms can be carried out on the CVRP to find out which combination provides the most optimal solution.

Acknowledgment

We thank the Department of Informatics Engineering, University of Muhammadiyah Gresik, for supporting this research.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] A. Islam, Y. Gajpal, and T. Y.
ElMekkawy, "Hybrid particle swarm optimization algorithm for solving the clustered vehicle routing problem," Appl. Soft Comput., vol. 110, p. 107655, 2021.
[2] B. Yao, Q. Yan, M. Zhang, and Y. Yang, "Improved artificial bee colony algorithm for vehicle routing problem with time windows," PLoS ONE, vol. 12, e0181275, pp. 1–18, 2017.
[3] D. G. Rossit, D. Vigo, and F. Tohmé, "Visual attractiveness in routing problems: A review," Comput. Oper. Res., vol. 103, pp. 13–34, 2019.
[4] R. Y. Pratama and W. F. Mahmudy, "Optimization of vehicle routing problem with time window (VRPTW) for food product distribution using genetics algorithm," J. Inf. Technol. Comput. Sci., vol. 2, no. 2, pp. 77–84, 2017.
[5] K. Leungsubthawee and S. Saranwong, "Multiple depot vehicle routing problems on clustering algorithms," J. Math., pp. 205–216, 2018.
[6] C. Koc and G. Laporte, "Vehicle routing with backhauls: Review and research perspectives," Comput. Oper. Res., vol. 91, pp. 79–91, 2018.
[7] F. Mar'i, W. F. Mahmudy, and P. B. Santoso, "An improved simulated annealing for the capacitated vehicle routing problem (CVRP)," J. Ilm. Kursor, vol. 9, no. 3, pp. 119–128, 2018.
[8] H. Awad, R. Elshaer, A. Abdelmo, and G. Nawara, "An effective genetic algorithm for capacitated vehicle routing problem," in Proceedings of the International Conference on Industrial Engineering and Operations Management, 2018, pp. 374–384.
[9] C. Pornsing, "A particle swarm optimization for the vehicle routing problem," University of Rhode Island, 2014.
[10] A. Rahmi, W. F. Mahmudy, and S. Anam, "A crossover in simulated annealing for population initialization of genetic algorithm to optimize the distribution cost," J. Telecommun. Electron. Comput. Eng., vol. 9, no. 2–8, pp. 177–182, 2017.
[11] P. Lu, L. Ye, Y. Zhao, B. Dai, M. Pei, and Y. Tang, "Review of meta-heuristic algorithms for wind power prediction: Methodologies, applications and challenges," Appl. Energy, vol. 301, p. 117446, 2021.
[12] V.
Ganesan et al., "Quantum inspired meta-heuristic approach for optimization of genetic algorithm," Comput. Electr. Eng., vol. 94, p. 107356, 2021.
[13] D. Połap and M. Woźniak, "Meta-heuristic as manager in federated learning approaches for image processing purposes," Appl. Soft Comput., vol. 113, p. 107872, 2021.
[14] A. M. Altabeeb, A. M. Mohsen, and A. Ghallab, "An improved hybrid firefly algorithm for capacitated vehicle routing problem," Appl. Soft Comput. J., vol. 84, p. 105728, 2019.
[15] W. Y. Szeto, Y. Wu, and S. C. Ho, "An artificial bee colony algorithm for the capacitated vehicle routing problem," Eur. J. Oper. Res., vol. 215, no. 1, pp. 126–135, 2011.
[16] H. Ding, H. Cheng, and X. Shan, "Modified artificial bee colony algorithm for the capacitated vehicle routing problem," in 2nd International Conference on Advances in Management Science and Engineering, 2018, pp. 197–201.
[17] B. Akay, D. Karaboga, B. Gorkemli, and E. Kaya, "A survey on the artificial bee colony algorithm variants for binary, integer and mixed integer programming problems," Appl. Soft Comput., vol. 106, p. 107351, 2021.
[18] A. A. Kondratenko, M. Bergström, M. Suominen, and P. Kujala, "An artificial bee colony optimization-based approach for sizing and composition of Arctic offshore drilling support fleets considering cost-efficiency," Sh. Technol. Res., pp. 1–24, 2022.
[19] R. Szczepanski, K. Erwinski, M. Tejer, A. Bereit, and T. Tarczewski, "Optimal scheduling for palletizing task using robotic arm and artificial bee colony algorithm," Eng. Appl. Artif. Intell., vol. 113, p. 104976, 2022.
[20] L. Yang, A. Zhu, J. Shao, and T. Chi, "A knowledge-informed and Pareto-based artificial bee colony optimization algorithm for multi-objective land-use allocation," ISPRS Int. J. Geo-Information, vol. 7, no. 2, 2018.
[21] B. K. Dedeturk and B. Akay, "Spam filtering using a logistic regression model trained by an artificial bee colony algorithm," Appl. Soft Comput.
J., vol. 91, p. 106229, 2020.
[22] Y. Boudouaoui, H. Habbi, C. Ozturk, and D. Karaboga, "Solving differential equations with artificial bee colony programming," Soft Comput., vol. 24, no. 23, pp. 17991–18007, 2020.
[23] Y. Deng, H. Xu, and J. Wu, "Optimization of blockchain investment portfolio under artificial bee colony algorithm," J. Comput. Appl. Math., vol. 385, p. 113199, 2021.
[24] Ş. Öztürk, R. Ahmad, and N. Akhtar, "Variants of artificial bee colony algorithm and its application in medical image processing," Appl. Soft Comput., vol. 97, pp. 1–50, 2020.
[25] L. Ge and E. Ji, "An improved artificial bee colony algorithm and its application in machine learning," J. Phys. Conf. Ser., vol. 1650, no. 3, 2020.
[26] F. Ye, D. Zhang, Y. Whar Si, X. Zeng, and T. T. Nguyen, "A hybrid algorithm for a vehicle routing problem with realistic constraints," Inf. Sci., vol. 394, pp. 167–182, 2017.
[27] X. Huang, X. Zeng, R. Han, and X. Wang, "An enhanced hybridized artificial bee colony algorithm for optimization problems," IAES Int. J. Artif. Intell., vol. 8, no. 1, pp. 87–94, 2019.
[28] Y. Wang, "Improving artificial bee colony and particle swarm optimization to solve TSP problem," in Proc. 2018 Int. Conf. Virtual Real. Intell. Syst. (ICVRIS 2018), 2018, pp. 179–182.
[29] B. Rambabu, A. V. Reddy, and S. Janakiraman, "Hybrid artificial bee colony and monarchy butterfly optimization algorithm (HABC-MBOA)-based cluster head selection for WSNs," J. King Saud Univ. Comput. Inf. Sci., 2019.
[30] F. Barani and H. Nezamabadi-pour, "BQIABC: A new quantum-inspired artificial bee colony algorithm for binary optimization problems," J. AI Data Min., vol. 6, no. 1, pp. 133–143, 2018.
[31] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol.
220, no. 4598, pp. 671–680, 1983.
[32] E. J. Kusuma, I. Pantiawati, and S. Handayani, "Melanoma classification based on simulated annealing optimization neural network," Knowl. Eng. Data Sci., vol. 4, no. 2, pp. 97–104, 2021.
[33] H. Lv, X. Chen, and X. Zeng, "Optimization of micromixer with Cantor fractal baffle based on simulated annealing algorithm," Chaos, Solitons and Fractals, vol. 148, p. 111048, 2021.
[34] J. Adler and E. N. Ribak, "Simulated annealing in application to telescope phasing," Phys. A Stat. Mech. Its Appl., vol. 572, p. 125900, 2021.
[35] X. Liu et al., "Simulated annealing for optimization of graphs and sequences," Neurocomputing, vol. 465, pp. 310–324, 2021.
[36] S. Dhagat and S. E. Jujjavarapu, "Simulated annealing and artificial neural network as optimization tools to enhance yields of bioemulsifier and exopolysaccharides by thermophilic Brevibacillus borstelensis," J. Environ. Chem. Eng., vol. 9, no. 4, p. 105499, 2021.
[37] S. Peng, D. Rippel, M. Becker, and H. Szczerbicka, "Scheduling of offshore wind farm installation using simulated annealing," IFAC-PapersOnLine, vol. 54, no. 1, pp. 325–330, 2021.
[38] G. Çetin and A. Keçebaş, "Optimization of thermodynamic performance with simulated annealing algorithm: A geothermal power plant," Renew. Energy, vol. 172, pp. 968–982, 2021.
[39] F. Mar'i, W. F. Mahmudy, and P. B. Santoso, "Hybrid particle swarm optimization and simulated annealing for capacitated vehicle routing problem," in Proc. 2019 4th Int. Conf. Sustain. Inf. Eng. Technol. (SIET 2019), 2019, pp. 66–71.
[40] İ. İlhan, "An improved simulated annealing algorithm with crossover operator for capacitated vehicle routing problem," Swarm Evol. Comput., vol. 64, p. 100911, 2021.
[41] B. Morales-Castañeda, D. Zaldívar, E. Cuevas, O. Maciel-Castillo, I. Aranguren, and F. Fausto, "An improved simulated annealing algorithm based on ancient metallurgy techniques," Appl. Soft Comput. J., vol. 84, p. 105761, 2019.
[42] T. Yuxin, Y.
Hairong, G. Hang, S. Yuying, and L. Gang, "Application of SVR optimized by modified simulated annealing (MSA-SVR) air conditioning load prediction model," J. Ind. Inf. Integr., vol. 15, pp. 247–251, 2018.
[43] P. Augerat et al., "Computational results with a branch-and-cut code for the capacitated vehicle routing problem," pp. 1–24, 1995.
[44] E. Uchoa, D. Pecin, A. Pessoa, M. Poggi, T. Vidal, and A. Subramanian, "New benchmark instances for the capacitated vehicle routing problem," Eur. J. Oper. Res., vol. 257, no. 3, pp. 845–858, 2017.
[45] F. Wang, H. Zhang, K. Li, Z. Lin, J. Yang, and X. L. Shen, "A hybrid particle swarm optimization algorithm using adaptive learning strategy," Inf. Sci., vol. 436–437, pp. 162–177, 2018.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637
Vol 5, No 2, December 2022, pp. 129–136
https://doi.org/10.17977/um018v5i22022p129-136
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

An Accurate Real-Time Method for Face Mask Detection Using CNN and SVM

Shili Hechmi *
University of Tabuk, Tabuk 741, Saudi Arabia
asuhaili@ut.edu.sa *
* corresponding author

I. Introduction

Respiratory infections have been the leading cause of mortality worldwide for many years. Deaths from pneumonia occur in all countries: mainly among the elderly in high-income nations, whereas children are the primary casualties in low-income ones, while fatalities in both population categories are documented in most middle-income nations. Since the end of 2019, a novel beta-coronavirus has caused several viral pneumonia episodes in the Wuhan region of China [1][2][3] before spreading over the world, resulting in the worst contagious epidemic since the 1918 Spanish flu.
This coronavirus, SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), is responsible for a clinical picture called COVID-19 by the WHO (for coronavirus disease 2019), involving attacks on various organs, most notably the upper and lower airways. The COVID-19 epidemic has claimed almost 5 million lives worldwide since the first outbreak [4]. Before the COVID-19 pandemic, over 2.5 million adults and children perished from pneumonia every year; no other virus resulted in as many fatalities. After the emergence of COVID-19, matters became more complicated, and respiratory diseases became a significant challenge to global health, especially given their rapid spread and danger to all segments of society. Studies on influenza, influenza-like illnesses, and human coronaviruses prove that a medical mask helps lessen the spread of contagious droplets from an infected individual and the possible contamination of the environment with these droplets [5]. These particles are expelled when a COVID-19 patient talks, sneezes, or coughs. People nearby may breathe in these contagious droplets through their mouths and noses and even inhale them into the lungs. The WHO highly recommends wearing masks to avoid infection [4]. Many nations have mandated that individuals wear medical masks in public spaces, such as squares, to keep the sickness from spreading. In contrast, most of the time, the verification of mask wearing is still done manually, which requires substantial human effort and leads to many errors in identifying people who do not adhere to wearing a mask, especially in crowded places. In response to this need, tools for detecting the wearing of masks without using facial recognition are becoming increasingly common. Much recent research has been carried out to identify people without a medical mask [6][7][8][9][10], and the challenge remains to achieve the highest recognition rate.
Article history:
Received 3 March 2023
Revised 29 March 2023
Accepted 30 March 2023
Published online 30 December 2022

Abstract: Infectious respiratory diseases, including COVID-19, pose a significant challenge to humanity and a potential threat to life due to their severity and rapid spread. Using a surgical mask is among the most significant safety precautions that can help keep this sort of pandemic from spreading, and manual monitoring of large crowds in public places for face masks is problematic. In this research, we suggest a real-time approach for face mask detection. First, we use a multi-scale deep neural network to extract features; as a result, the attributes are better suited for training the detection system. We then employ SVM post-processing in the classification stage to make the face mask detection method more robust. According to the experimental findings, our strategy considerably decreased the percentage of false positives and undetected cases. This is an open-access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: Face mask detection; COVID-19; CNN; SVM

S. Hechmi / Knowledge Engineering and Data Science 2022, 5 (2): 129–136

Detecting individuals not using medical masks has posed a significant challenge to researchers, since a person can wear a mask incorrectly and not completely cover the nose and mouth. Below we review the most critical research conducted in this field. Preeti Nagrath et al. [6] introduced SSDMNV2, a technique based on deep learning, TensorFlow, Keras, and OpenCV for recognizing face masks. The single shot multibox detector detects faces in this approach, and the MobileNetV2 architecture acts as the classifier's framework. S. Sanjaya and S.
Rakhmawan present a machine-learning method for recognizing face masks in Indonesia [7]. With it, authorities can plan for COVID-19 mitigation, evaluation, prevention, and response. The suggested model may be used with a security camera to help halt the COVID-19 epidemic by recognizing individuals not wearing medical masks. The authors applied a preprocessing step before training and testing the data, and a regional analysis displays the cities with the highest and lowest percentages of mask wearing. Gui Ling Wu developed a masked face identification strategy based on an attention mechanism [8] to enhance the efficacy of covered face image recognition. The covered face image is extracted first, followed by the face image component, using a locally restricted dictionary learning approach. Dilated convolution is then used to compensate for the resolution loss during subsampling. Finally, an attention-based neural network exploits the relevant feature information in the face image to reduce information loss during subsampling and raise the rate of facial identification. Hariri proposed an efficient method for masked face recognition [2]. The author employs three pre-trained deep convolutional neural networks (CNNs), VGG-16, AlexNet, and ResNet-50, to extract deep features from the resulting regions. Bag-of-features (BoF) is used for in-depth feature extraction and masked face classification, and a multilayer perceptron (MLP) performs the classification. In [9], G. Yang et al. suggested a deep learning system based on YOLOv5 to replace manual checking. The technique is applied to look for face masks. The system is divided into four parts: face mask image improvement, face mask image segmentation, face mask image identification, and interface interaction. GIoU loss and center loss are combined to identify whether a face mask is worn. Anirudh et al.
proposed in [11] a face mask detection system based on image processing. The system has three phases: (1) image preprocessing, (2) face recognition and cropping, and (3) a face mask classifier. The technology can identify faces with and without masks, may be used with webcam cameras, promotes the use of face masks, helps identify safety infractions, and supports a safe working environment. In [10], the authors proposed a new hybrid method to automatically detect whether someone is protecting himself by wearing a mask. It combines convolutional neural network (CNN)-extracted visual characteristics with an image histogram that conveys information about pixel intensity. The authors present several pre-trained models for building feature extraction systems using CNNs and several kinds of image histograms. In [12], the authors presented a system using convolutional neural networks to detect facial masks, following COVID-19 precautions, in images and videos. A complete experiment on the dataset and an effectiveness assessment of the suggested method are presented; the authors also preserve inter- and intra-class facial mask detection variability using a symbolic approach, exploring different classifiers, including support vector machines and symbolic classifiers. The work is being prototyped to monitor temperature readings and find masks on individuals: one component employs a temperature sensor that measures body temperature and immediately sprays disinfectant, while the other provides people with safety systems to avoid COVID-19, using deep learning concepts to monitor people's conditions continuously.
Jiang Mingjie and Fan Xinqi introduced RetinaFaceMask [13], a single-stage detector that employs a feature pyramid network to combine high-level semantic data with a novel context attention module focused on face mask recognition. Furthermore, the authors present a novel cross-class object elimination method for rejecting hypotheses with high intersection over union and low confidence. We have reviewed several works related to masked and unmasked face identification. Through this study, we find a gradual and consistent increase in the accuracy of these systems. Nevertheless, most studies use face databases rather than real-world images, which prevents them from being deployed directly in surveillance systems to identify individuals not wearing medical masks in public areas and in real time. Our method, detailed in the next section, tries to solve this problem. This research introduces a novel medical mask detection model that uses a CNN to produce multi-scale in-depth features. The deep network automatically extracts multi-layer attributes from the original image; such a network usually extracts many characteristics that perform better than standard hand-crafted attributes. Accordingly, we used an image dataset to train a CNN-based feature extraction model; in this work, we refer to the derived characteristics as CNN features. The following is a summary of our paper's primary contributions: (a) We propose a novel method for identifying medical masks that employs multi-scale CNN characteristics collected from layered windows using a deep neural network with several fully connected layers; an SVM classifier is trained using the detection score from every detection window. (b) We present a complete medical mask detection method that is simple to use, efficient, and reliable. (c) We identify people not wearing medical masks in real-world images rather than well-framed face images.
Our automatic mask detection method can easily be implemented on an existing video surveillance system. This article is organized as follows: Section 2 summarizes the most recent approaches to detecting persons not wearing medical masks. Section 3 contains a complete overview of our suggested technique. The experimental results are outlined in Section 4. We conclude this study in Section 5 and suggest possible future steps.

II. Proposed Method

Deep learning uses many interconnected neurons, each receiving the output of lower-level neurons. Based on the nonlinear relation between outputs and inputs, low-level characteristics are merged into a higher-level abstract representation that describes the distributed properties of the observed data. A bottom-up procedure is used to create this multi-layer abstract representation. Multi-layer feature learning is a fully automated technique that requires no human participation. Based on the learned network structure, the deep learning technique transfers the inputs through several feature levels and then classifies or identifies the top layer's output using a matching algorithm or classifier. We propose a technique for detecting persons not wearing medical masks based on a CNN. We can determine whether someone wears a face mask when analyzing surveillance camera images. First, the human body positioning module receives real-world images containing multiple possible human body areas. The facial positioning module is then used to identify several potential face areas. The face mask detection module uses this data to identify several face mask detection zones. Finally, using an SVM classifier for post-processing, we determine the final face mask identification result. Figure 1 presents a structural representation of the system.

Fig. 1. Face mask identification method using a convolutional neural network (CNN)
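The cascade just described, body positioning → face positioning → mask detection → SVM post-processing, can be sketched as follows. This is only an illustrative skeleton: the detector callables are hypothetical placeholders standing in for the CNN modules, not the authors' code.

```python
def detect_mask_pipeline(image, detect_body, detect_face, detect_mask, svm_keep):
    """Run the three-stage cascade, then refine candidates with an SVM filter.

    Each detect_* callable maps a region to a list of candidate sub-regions;
    svm_keep is the post-processing classifier's accept/reject decision.
    """
    body_regions = detect_body(image)
    face_regions = [f for body in body_regions for f in detect_face(body)]
    mask_regions = [m for face in face_regions for m in detect_mask(face)]
    # Only candidates the SVM post-processor accepts survive.
    return [m for m in mask_regions if svm_keep(m)]
```

With toy detectors that tag each region name, the pipeline simply narrows the candidate set stage by stage.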
A. Dataset Description

There are several databases for evaluating facial mask detection methods. To provide a relevant evaluation of our proposed method, we wanted to use more than one database in the experimental tests, but we observed that each dataset's annotation format was distinct and that some data did not fit our needs. To solve this problem, we built a database of images collected from available databases with a new annotation and added images from the internet. The constructed dataset contains 10229 images distributed as follows: 3250 images from WIDER FACE [14], 4108 images from MAFA [15], 1521 images from RMFRD [16], and 1350 images obtained from the internet. We use 7412 images for training, 772 for validation, and 2045 for testing.

B. CNN Feature Extraction

Consider an n-layer structure S = (S1, …, Sn). The system's input is I, and the output is O; the flow may be written as I → S1 → S2 → … → Sn → O. If the input I and the output O are equal, O carries the same information as the initial input, indicating that no information was lost through the layers (Si). In other words, O is an alternative representation of the input, whereas I carries the original information. The essential principle of deep learning is that the input and output should be equal for each layer of an n-layer neural network; in an ideal world, no human assistance would be required during the learning process. A CNN is a multi-layered neural network. Every layer consists of numerous two-dimensional planes, each with its own set of neurons. Simple (S-) and complex (C-) neurons comprise the network. The S-neurons form an S-plane, and the S-planes form an S-layer, which we denote Us; C-neurons, C-planes, and C-layers (Uc) are defined analogously. S- and C-layers connect each intermediate network level. The input layer, in contrast, comprises just one layer with direct access to the two-dimensional visual features.
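As a quick sanity check, the dataset composition described above is internally consistent; a small sketch (all counts copied directly from the text):

```python
# Dataset composition reported in the text, per source and per split.
sources = {"WIDER FACE": 3250, "MAFA": 4108, "RMFRD": 1521, "internet": 1350}
splits = {"training": 7412, "validation": 772, "testing": 2045}

total = sum(sources.values())
# Both the per-source counts and the train/validation/test split sum to 10229.
assert total == sum(splits.values()) == 10229

# Approximate split proportions: roughly 72% / 8% / 20%.
proportions = {name: round(n / total, 3) for name, n in splits.items()}
```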
The techniques for extracting features from the sample are embedded in the CNN model's connection structure. In a CNN, the input connections between S-neurons are adjustable, whereas the rest are fixed. The output of an S-neuron on the kl-th S-plane of level l is denoted u_sl(kl, n), and the output of a C-neuron on the kl-th C-plane is denoted u_cl(kl, n), where n is a two-dimensional coordinate giving the receptive field's position in the input layer. The receptive field is small initially and grows with the level l. The S-neuron output is given by (1) and (2):

u_sl(k, n) = r_l(k) · Φ[ (1 + Σ_{k_{l−1}} Σ_{v∈A_l} a_l(v, k_{l−1}, k) u_{c,l−1}(k_{l−1}, n+v)) / (1 + (r_l(k) / (r_l(k) + 1)) b_l(k) u_{vl}(n)) − 1 ]   (1)

Φ(x) = x for x ≥ 0, and 0 for x < 0   (2)

Here a_l(v, k_{l−1}, k) and b_l(k) are the connection coefficients of the excitatory and inhibitory inputs, respectively. r_l(k) is a constant that regulates the selectivity of feature extraction: a higher value indicates a lower tolerance for noise and feature distortions. Φ(x) is a nonlinear function. v is a vector indicating the position of the preceding neuron relative to the receptive field n. A_l determines the size of the S-neuron's feature extraction area, reflecting the receptive field of n. Thus the sum over v covers all neurons in the defined region, and the sum over k_{l−1} covers all sub-planes of the previous level. The numerator's sum term is therefore often called the excitatory term: the outputs of the neurons feeding the receptive field are multiplied by their weights and summed. u_{vl}(n) is an assumed inhibitory neuron v in the S-plane that models the network's inhibitory effect. Its output is given by (3):

u_{vl}(n) = ( Σ_{k_{l−1}} Σ_{v∈A_l} c_l(v) (u_{c,l−1}(k_{l−1}, n+v))² )^{1/2}   (3)

where c_l(v) represents the weights of the v-neurons. The C-neuron output is given by (4) and (5):
u_cl(k_l, n) = φ[ (1 + Σ_{k_{l−1}=1}^{K_{l−1}} j_l(k_l, k_{l−1}) Σ_{v∈D_l} d_l(v) u_sl(k_l, n+v)) / (1 + V_sl(n)) − 1 ]   (4)

φ(x) = x / (β + x) for x ≥ 0, and 0 for x < 0   (5)

where β is a fixed value and K_{l−1} indicates how many S-sub-planes are present at the previous level. D_l is the C-neuron's receptive field and is therefore related to the feature size. The weight of the fixed link above is d_l(v), a monotonically decreasing function of |v|. If the S-neurons of sub-plane k_l receive signals from sub-plane k_{l−1}, then j_l(k_l, k_{l−1}) equals 1; otherwise it equals 0.

C. Multi-Scale Detection Method

The choice of a suitable observation scale is essential for identifying and understanding targets, because the characteristics of an object vary with scale. Since images contain objects of various sizes, choosing an ideal scale for image analysis in advance is not feasible; the image's content must therefore be considered at several scales. We created a multi-scale feature extraction model with three CNNs. Each CNN model has eight layers: five convolution layers and three fully connected layers. Three nested, increasingly large rectangular windows (the mask region, the face region, and the human body region) automatically extract features from each image. The CNNs extract three features, which are then passed to two fully connected layers, with the output of the second fully connected layer delivered to the output layer. Finally, a linear SVM classifier is used to categorize all of the sub-blocks. A CNN is used to extract characteristics from each image. We begin by selecting candidates for the mask region, where the mask information is immediately visible. Using the CNN model, we extract features for this area; this vector is known as feature A. We would obtain numerous incorrect detection regions if we identified the mask region by extracting its features alone.
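As a concrete illustration, the S-cell and C-cell responses of (1)–(5) can be sketched in NumPy for a single connected input plane. This is a simplified, illustrative reading of the equations (one sub-plane, explicit weight arrays), not the authors' implementation:

```python
import numpy as np

def phi(x):
    """Rectifier of (2): phi(x) = x for x >= 0, else 0."""
    return np.maximum(x, 0.0)

def varphi(x, beta=1.0):
    """Saturating nonlinearity of (5): x / (beta + x) for x >= 0, else 0."""
    return x / (beta + x) if x >= 0 else 0.0

def v_cell(u, c):
    """Inhibitory neuron of (3): root of the c-weighted sum of squared inputs."""
    return float(np.sqrt(np.sum(c * u ** 2)))

def s_cell(u, a, b, c, r):
    """Simplified S-neuron of (1) for a single input plane.
    u: C-cell outputs in the receptive field A_l; a: excitatory weights;
    b: inhibitory weight; c: v-neuron weights; r: selectivity constant."""
    excitation = 1.0 + float(np.sum(a * u))
    inhibition = 1.0 + (r / (r + 1.0)) * b * v_cell(u, c)
    return r * phi(excitation / inhibition - 1.0)

def c_cell(u_s, d, v_s, beta=1.0):
    """Simplified C-neuron of (4) with one connected S-plane (j = 1).
    u_s: S-cell outputs in the pooling field D_l; d: fixed pooling weights
    (decreasing in |v|); v_s: the inhibitory term V_sl(n)."""
    pooled = 1.0 + float(np.sum(d * u_s))
    return varphi(pooled / (1.0 + v_s) - 1.0, beta)
```

A larger `r` makes the S-cell fire only when excitation clearly outweighs inhibition, matching the selectivity role described in the text.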
A second feature vector is extracted from a rectangular neighborhood to boost accuracy. This neighborhood comprises the mask region and its immediately adjoining regions, namely the face region; this vector is known as feature B. We locate the body region using the detected face and mask portions, and feature C refers to the features extracted by the CNN from the human body area. To train the body identification model, we employ feature C. B and C are combined to produce W, a new feature used to train the face detection model. W is defined as

W = vB + μC   (6)

where v and μ denote the confidences of B and C. The properties of human observation suggest that different items attract different amounts of attention: to recognize an object in an image, we designate the object region as the target region, and the weight of a position decreases the further it lies from the target zone. As a result, we set v = 0.7 and μ = 0.3. We employ a mix of A, B, and C (feature S) to train the mask identification model. S is defined as

S = αA + βB + γC   (7)

where α = 0.6, β = 0.3, and γ = 0.1 denote the confidences of A, B, and C.

D. Post-Processing Using SVM

The previous section discussed the CNN approach for finding the coarse locations of the human body, face, and mask region. To create the feature vector, we compute their detection scores. Finally, we employ the SVM technique for post-processing to remove the inaccurate areas. In this research, we apply three CNN detection algorithms (D1, D2, and D3): D1 detects the human body, D2 detects the face, and D3 detects the medical mask. Each (b; s) ∈ Di is a five-dimensional feature vector with b = (x1, y1; x2, y2), where (x1, y1) denotes the position of the top-left corner of the detection box and (x2, y2) the position of the bottom-right corner; s represents each part's detection score.
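The weighted fusions of (6) and (7) above are plain linear combinations of the three feature vectors; a minimal sketch with the weights given in the text:

```python
import numpy as np

def fuse_face_feature(B, C, v=0.7, mu=0.3):
    """Feature W of (6): confidence-weighted sum of face (B) and body (C)."""
    return v * B + mu * C

def fuse_mask_feature(A, B, C, alpha=0.6, beta=0.3, gamma=0.1):
    """Feature S of (7): weighted sum of mask (A), face (B), and body (C)."""
    return alpha * A + beta * B + gamma * C
```

The weights encode the attention argument above: the region closest to the target contributes the most, and contribution decays with distance from it.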
The coordinates b are normalized by the dimensions of the candidate image, so that (x1, y1; x2, y2) ∈ [0, 1]. We create a 15-dimensional feature vector (D1, D2, D3) for the mask area and a 10-dimensional feature vector (D1, D2) for the face region. A linear SVM is then used to classify the feature vectors of the mask, face, and human body regions. The training data are the labeled values (D1, D2, D3) discovered by the CNN algorithm.

E. Performance Evaluation Metrics

The following measures are employed to evaluate the proposed model; accuracy, precision, and recall are given in (8) to (10):

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)   (8)

Precision = Tp / (Tp + Fp)   (9)

Recall = Tp / (Tp + Fn)   (10)

Tp, Tn, Fp, and Fn denote true positives, true negatives, false positives, and false negatives, respectively. True positives are images appropriately categorized as positive, whereas images mistakenly classified as positive are false positives. True negatives are correctly predicted to belong to the negative class, while false negatives are wrongly classified as negative.

III. Results and Discussion

In this part, we run a thorough benchmark of our technique and seven face mask detectors on five well-known datasets. This benchmark's main objective is to ascertain how these face detectors behave when identifying masked faces and in which instances they are likely to succeed or fail. We then discuss in depth how to create face detectors capable of handling faces obscured by various sorts of masks by evaluating the experimental findings of these face mask detectors. All experimental tests were conducted on a laptop running Windows 10 with an AMD Ryzen 7 5700X processor and 32 GB of RAM.
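Equations (8)–(10) above translate directly into code; a small helper, with the confusion counts as plain integers:

```python
def accuracy(tp, tn, fp, fn):
    """Equation (8): fraction of all images classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Equation (9): fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (10): fraction of actual positives that are recovered."""
    return tp / (tp + fn)
```

For example, with tp = 90, tn = 85, fp = 10, and fn = 15, these give an accuracy of 0.875, a precision of 0.9, and a recall of about 0.857.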
In this research, PyCharm with Python 3.9.12 was selected for the creation and execution of the experimental tests, using a variety of libraries including OpenCV 3.0 and Darknet [17]. The suggested method was tested against various pre-existing models on the same datasets, and the findings are presented in this study. Five contemporary approaches [18][19][20][13][21] were chosen for this purpose. Table 1 compares these models with the suggested one using the accuracy metric. From Table 1, the first four techniques (YOLO v4, R-CNN, ResNet50, and RetinaFaceMask) were published between 2020 and 2021, each achieving an accuracy between 84% and 89%. The fifth technique, CenterFace, was published in 2022 and achieved the highest baseline accuracy of 0.91. The last row is our proposed technique, which combines CNN and SVM; the results exhibit the enhanced accuracy of our method over other recent methods. In particular, the suggested model attained a greater accuracy of 94% than previous techniques: the accuracy is improved by 3% over [21], 5% over [19], 6% over [13], 8% over [18], and 10% over [20].

Table 1. Accuracy evaluation of various techniques
Technique            Year  Accuracy  Improvement (ours)
YOLO v4 [18]         2020  0.86      +0.08
R-CNN [19]           2021  0.89      +0.05
ResNet50 [20]        2021  0.84      +0.10
RetinaFaceMask [13]  2021  0.88      +0.06
CenterFace [21]      2022  0.91      +0.03
CNN and SVM (ours)   2022  0.94      —

Table 2 compares several models with the suggested one using the precision metric. In Table 2, the precision of the selected models is analyzed; the precision scores range from 87% to 93%, with our CNN and SVM technique achieving a precision of 92%.
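The improvement column of Table 1 is simply our accuracy (0.94) minus each baseline's; the figures can be reproduced directly:

```python
# Accuracy figures from Table 1.
ours = 0.94
baselines = {
    "YOLO v4 [18]": 0.86,
    "R-CNN [19]": 0.89,
    "ResNet50 [20]": 0.84,
    "RetinaFaceMask [13]": 0.88,
    "CenterFace [21]": 0.91,
}

# Improvement of our method over each baseline, rounded to two decimals.
improvements = {name: round(ours - acc, 2) for name, acc in baselines.items()}
```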
The improvement column shows the gain in precision of our method over each listed technique. The results show that our model outperforms [19][20][13][21]. Table 3 compares several models with the suggested one using the recall metric. The results indicate that our strategy outperforms other recent approaches in terms of recall: the proposed model achieved a higher recall of 93%, an improvement of 1% over [21], 2% over [13], 3% over [19], 5% over [20], and 6% over [18]. Overall, based on the information presented in Table 1, Table 2, and Table 3, various techniques for object detection, such as YOLO v4, R-CNN, ResNet50, RetinaFaceMask, CenterFace, and our combination of CNN and SVM, have been developed and evaluated, and our proposed CNN and SVM technique achieved the highest accuracy, precision, and recall scores among the techniques evaluated in this study.

IV. Conclusion

Automatic identification of people not wearing face masks is a significant research issue. Using a CNN and an SVM, we propose an accurate real-time technique for detecting face masks. The CNN enables the extraction of attributes better suited for training the detection model, whereas the SVM is used for classification. The findings show that our approach significantly outperforms the other recent techniques used in the comparisons. In the future, we will address the issue of incorrectly worn face masks, making the identification system more intelligent.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

Table 2. Precision evaluation of various techniques
Technique            Year  Precision  Improvement
YOLO v4 [18]         2020  0.88       +0.04
R-CNN [19]           2021  0.91       +0.01
ResNet50 [20]        2021  0.87       +0.05
RetinaFaceMask [13]  2021  0.91       +0.01
CenterFace [21]      2022  0.93       +0.01
CNN and SVM (ours)   2022  0.92       —

Table 3. Recall evaluation of various techniques
Technique            Year  Recall  Improvement
YOLO v4 [18]         2020  0.87    +0.06
R-CNN [19]           2021  0.90    +0.03
ResNet50 [20]        2021  0.88    +0.05
RetinaFaceMask [13]  2021  0.91    +0.02
CenterFace [21]      2022  0.92    +0.01
CNN and SVM (ours)   2022  0.93    —

References
[1] C. Huang et al., "Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China," Lancet, vol. 395, no. 10223, pp. 497–506, Feb. 2020.
[2] W. Hariri, "Efficient masked face recognition method during the COVID-19 pandemic," Signal, Image Video Process., vol. 16, no. 3, pp. 605–612, Apr. 2022.
[3] N. Zhu et al., "A novel coronavirus from patients with pneumonia in China, 2019," N. Engl. J. Med., vol. 382, no. 8, pp. 727–733, Feb. 2020.
[4] WHO, "Infection prevention and control during health care when coronavirus disease (COVID-19) is suspected or confirmed," WHO, 2021 (accessed 29 July 2022).
[5] WHO, "Infection prevention and control of epidemic- and pandemic-prone acute respiratory infections in health care," WHO, 2014 (accessed 29 July 2022).
[6] P. Nagrath, R. Jain, A. Madan, R. Arora, P. Kataria, and J. Hemanth, "SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2," Sustain. Cities Soc., vol. 66, p. 102692, Mar. 2021.
[7] S. A. Sanjaya and S.
Adi Rakhmawan, "Face mask detection using MobileNetV2 in the era of COVID-19 pandemic," in 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI), Oct. 2020, pp. 1–5.
[8] G. Wu, "Masked face recognition algorithm for a contactless distribution cabinet," Math. Probl. Eng., vol. 2021, pp. 1–11, May 2021.
[9] G. Yang et al., "Face mask recognition system with YOLOV5 based on image recognition," in 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Dec. 2020, pp. 1398–1404.
[10] E. Ryumina, D. Ryumin, D. Ivanko, and A. Karpov, "A novel method for protective face mask detection using convolutional neural networks and image histogram," Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci., vol. XLIV-2/W1, pp. 177–182, Apr. 2021.
[11] K. Anirudh, A. Ravi, V. S. Charan, and V. Chaurasiya, "Face mask detection using machine learning," in 2022 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), Feb. 2022, pp. 1–5.
[12] G. K. J. Hussain, R. Priya, S. Rajarajeswari, P. Prasanth, and N. Niyazuddeen, "The face mask detection technology for image analysis in the COVID-19 surveillance system," J. Phys. Conf. Ser., vol. 1916, no. 1, p. 012084, May 2021.
[13] X. Fan and M. Jiang, "RetinaFaceMask: A single stage face mask detector for assisting control of the COVID-19 pandemic," in Conf. Proc. IEEE Int. Conf. Syst. Man Cybern., 2021, pp. 832–837.
[14] S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 5525–5533.
[15] S. Ge, J. Li, Q. Ye, and Z. Luo, "Detecting masked faces in the wild with LLE-CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2682–2690.
[16] B.
Huang et al., "Masked face recognition datasets and validation," in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Oct. 2021, pp. 1487–1491.
[17] A. Farhadi, "Darknet: Open source neural networks in C" (accessed 29 July 2022).
[18] K. Bhambani, T. Jain, and K. A. Sultanpure, "Real-time face mask and social distancing violation detection system using YOLO," in 2020 IEEE Bangalore Humanitarian Technology Conference (B-HTC), Oct. 2020, pp. 1–6.
[19] J. Zhang, F. Han, Y. Chun, and W. Chen, "A novel detection framework about conditions of wearing face mask for helping control the spread of COVID-19," IEEE Access, vol. 9, pp. 42975–42984, 2021.
[20] S. Sethi, M. Kathuria, and T. Kaushik, "A real-time integrated face mask detector to curtail spread of coronavirus," Comput. Model. Eng. Sci., vol. 127, no. 2, pp. 389–409, 2021.
[21] C. W. Yang, T. H. Phung, H. H. Shuai, and W. H. Cheng, "Mask or non-mask? Robust face mask detector via triplet-consistency representation learning," ACM Trans. Multimed. Comput. Commun. Appl., vol. 18, no. 1s, pp. 1–19, 2022.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 6, No 1, April 2023, pp. 69–78. https://doi.org/10.17977/um018v6i12023p69-78. ©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

K-Means Clustering and Multilayer Perceptron for Categorizing Student Business Groups

Miftahul Walid a,1,*, Norfiah Lailatin Nispi Sahbaniya a,2, Hozairi a,3, Fajar Baskoro b,4, Arya Yudhi Wijaya b,5
a Department of Informatics, Faculty of Engineering, Madura Islamic University, Jl. Pondok Pesantren Miftahul Ulum Bettet, Pamekasan 69317, Indonesia
b Department of Informatics, Faculty of Electrical Technology and Intelligent Informatics, Sepuluh Nopember Institute of Technology, Jl. Teknik Kimia, Surabaya 60117, Indonesia
1 miftahul.walid@uim.ac.id*; 2 norfiah210@gmail.com; 3 dr.hozairi@gmail.com; 4 baskoro@gmail.com; 5 arya.wijaya@gmail.com
* corresponding author

I. Introduction

SMA Double Track is a flagship program of East Java Province in the field of education, packaged as an extracurricular activity in senior high schools and aimed at developing students' entrepreneurial skills. In this activity, students learn about the skills they are interested in and the ins and outs of business, and gain real-life experience in running a business.
So, even if they cannot continue to higher education, they can establish their own businesses or work in their local area according to their acquired skills [1]. However, it is necessary to pay attention to the challenges of these activities, which include infrastructure, resources, curriculum management, and the community's perception and understanding of the educational program. The training method for SMA Double Track uses a group system called the Student Business Group (KUS), with each KUS consisting of 5–6 students. The aim is for each student to have a role and responsibilities in running their business. The target is for each KUS to be capable of selling the products or services resulting from their training to the community, enabling them to generate transactions and revenue. Each year, the number of transactions of each KUS per topic is recorded by the East Java Provincial Education Office to determine the potential for developing students' businesses into start-up companies that will receive business capital assistance. KUS plays a crucial role in the SMA Double Track system.

Article Info — Article history: received 10 June 2023; revised 10 July 2023; accepted 19 August 2023; published online 18 September 2023.

Abstract: The research conducted in this study was driven by the East Java provincial government's requirement to assess the transaction levels of the Student Business Groups (KUS) in the SMA Double Track program. These transaction levels serve as a basis for allocating supplementary financial aid to each business group. The system's primary objective is to assist the provincial government of East Java in making well-informed choices about the distribution of supplementary capital to the KUS. The classification technique employed in this study is the multilayer perceptron.
However, because target data were not readily available, the k-means clustering method is utilized to generate them for the classification process, dividing the transaction-level attribute into three distinct groups: (0) low transactions, (1) medium transactions, and (2) high transactions. The clustering process encompasses three features: (1) income, (2) spending, and (3) profit; these three traits are also used as input data throughout the categorization procedure. The classification procedure employing the multilayer perceptron technique involved processing a dataset of 1383 data points. The training data constituted 80% of the dataset, while the remaining 20% was allocated for testing. To evaluate the efficacy of the constructed model, the training error was assessed using k-fold cross-validation, yielding an average accuracy score of 0.92. In the present study, the categorization technique yielded an accuracy of 0.96. This model is aimed at classification scenarios in which the dataset lacks prior target data. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). Keywords: student business groups; k-means clustering; multilayer perceptron; transaction level classification.

The KUS program provides students with practical experience and entrepreneurial opportunities. The activity has a positive impact on students: it gives them the opportunity to apply the knowledge they have acquired in real-life situations, helps them develop entrepreneurial skills crucial for their future, and creates a collaborative environment where students can work together in teams, share ideas, and build strong networks.
students can also enhance their self-confidence by facing challenges and taking risks. kus can serve as a means to implement integrated learning among the subjects taught in the academic and vocational tracks. the sma double track program has many kus, making it difficult for the provincial government to assess the volume of transactions, which influences the decision to provide more cash to each kus. consequently, the provincial government needs a transaction classification system to make decision-making easier. the convolutional neural network-recurrent neural network (cnn-rnn) resulted in an accuracy value of 75% [2]. long short-term memory (lstm) yielded satisfactory precision, recall, and f1-score values [3]. a comparison of k-nearest neighbors (k-nn), cnn, and lstm indicated that the k-nn algorithm performed the best among the machine learning algorithms in that case, achieving an accuracy of 83.82% [4]. supervised machine learning models, including linear, non-linear, and ensemble models, have classified harmful and non-harmful activities; that study showed that linear and non-linear machine learning outperformed ensemble learning in classifying ethereum blockchain addresses [5]. all of these methodologies are contingent upon the availability of data. the classification research that has been conducted [6][7] requires both input data and target data. however, the transaction data from the sma dt kus does not yet have target data. therefore, a method is needed to create target data. hence, in this study, a combination of methods is employed to classify the transaction levels of the kus. the methods used in this research are k-means clustering and the multilayer perceptron [8]. the k-means clustering [9][10] method is utilized to create transaction-level classes with three levels: low, medium, and high transaction levels [11]. meanwhile, the multilayer perceptron (mlp) [12][13] is employed to determine the transaction level of the double-track student business groups.
k-means clustering is one of the popular algorithms in data analysis used to group data into different clusters based on similarities in features or attributes [14][15]. the mlp, on the other hand, is a structured artificial neural network (ann) [16] architecture that uses a supervised learning method [17], known as backpropagation, for classification purposes [18][19]. the mlp is chosen because it is highly effective, easy to implement, and provides good results in many cases [20][21][22]. compared to several other methods, such as support vector machines (svm) [23], mlp can yield better results [24][25]. in addition, compared to the decision tree and random forest methods, mlp can achieve a higher accuracy rate of 80% [25]. mlp has also outperformed cnn [26] and, compared to the polynomial regression method, shows better performance [27]. python was chosen as the programming language for this study since it is considered one of the easiest to learn and use [28]. the machine learning model developed in this study can execute target labeling and categorization with high precision and accuracy. this model is anticipated to help the east java provincial education office determine transaction amounts and streamline the decision-making process for each kus receiving financial support. ii. method the research methodology is the framework researchers use when conducting a study, encompassing the stages from data collection to data analysis. these stages are carried out in a structured and systematic manner. the research stages are presented in figure 1. fig. 1. research stages the data used as testing material for this research was collected from the official website of the sma double track program, www.smadt.id.
the data obtained consists of 1547 records with 16 feature attributes: school, district, topic name, topic, income, expenditure, profit, catalog, screenshot, online shop link, instagram link, product poster, description, chairman's name, chairman's phone number, and chairman's address. of these 16 features, only 4 are utilized: topic, income, expenditure, and profit. the dataset is stored in excel file format to facilitate the calculation process. the next step is data preprocessing, which consists of two processes. first, data cleaning is performed to handle outliers and missing values. second, the attribute data are normalized using the min-max method to a value range between 0 and 1. the min-max method is calculated as in (1).

x' = (x − x_min) / (x_max − x_min) (1)

where x' is the normalized value, x is the actual data value to be normalized, x_min is the minimum value of the actual data, and x_max is the maximum value of the actual data. after completing the data preprocessing, the next step is to perform data clustering using k-means clustering with three clusters: low, medium, and high. k-means clustering is chosen for its advantages, as it can efficiently group large sets of objects, thereby expediting the clustering process. this capability has been demonstrated in several studies, such as [29][30][31][32][33]. the pseudocode for the k-means algorithm can be seen in pseudocode 1.

pseudocode 1: k-means clustering
input  : d = {d1, d2, ..., dn} // set of data items
         k                     // number of desired clusters
output : a set of k clusters
steps  :
1. arbitrarily choose k data items from d as initial centroids;
2. repeat
     assign each item di to the cluster with the closest centroid;
     calculate the new mean for each cluster;
   until the convergence criterion is met.

the classification system will use the clustering results as labels or target attributes. after the labeling process, the data is divided into training and testing data.
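as a concrete illustration, equation (1) and pseudocode 1 can be sketched in plain python. the values below are toy stand-ins; the real study normalizes the income, expenditure, and profit columns of 1383 records.

```python
import random

def min_max_normalize(column):
    """equation (1): x' = (x - x_min) / (x_max - x_min), mapping a column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def kmeans(points, k, iterations=100, seed=0):
    """pseudocode 1: pick k initial centroids, assign each point to the
    nearest centroid, recompute cluster means, repeat until convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    def nearest(p):
        return min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # convergence criterion met
            break
        centroids = new
    return [nearest(p) for p in points], centroids

# toy (income, expenditure, profit) rows, already normalized to [0, 1]
rows = [(0.05, 0.03, 0.03), (0.06, 0.04, 0.02), (0.21, 0.16, 0.13),
        (0.20, 0.15, 0.14), (0.61, 0.51, 0.27), (0.60, 0.50, 0.28)]
labels, centers = kmeans(rows, k=3)
print(labels)  # each record is assigned one of the cluster ids 0, 1, 2
```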
the dataset is randomly split, with 80% as training data and 20% as testing data. next, model validation is conducted to assess the performance of the built model using k-fold cross-validation on the training data. after that, the classification process is performed using the multilayer perceptron method. the multilayer perceptron is chosen for its ease of implementation and good results in various cases; this capability has been demonstrated in several studies [34][35][36]. this test uses hidden layer sizes of (5, 2) and max_iter = 300 to achieve good accuracy values. next, the model's performance is evaluated using k-fold cross-validation with five splits. afterward, the classification model's performance is evaluated using the confusion matrix to obtain precision, recall, and f1-score values. the pseudocode for the mlp algorithm can be seen in pseudocode 2.

pseudocode 2: mlp
input : the feature vector of each user
start with random initial weights (e.g., uniform random in [-5, +5])
do {
  for all patterns p {
    for all output nodes j {
      calculate activation(j)
      error_j = target_value_j_for_pattern_p - activation_j
      for all input nodes i to output node j {
        delta_weight = learning_constant * error_j * activation_i
        weight = weight + delta_weight
      }
    }
  }
} until error is sufficiently small or "time_out"
output : the user id identification result

iii. result and discussion the dataset obtained from the sma double track website consists of 1547 records with 16 feature attributes: school, district, topic name, topic, income, expenditure, profit, catalog, screenshot, online shop link, instagram link, product poster, description, chairman's name, chairman's phone number, and chairman's address. of these 16 features, only three are used: income, expenditure, and profit.
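the weight-update loop in pseudocode 2 is essentially the delta rule for a single output node; a minimal pure-python sketch follows. the and-gate data, the learning constant of 0.1, and the smaller initial weight range are illustrative assumptions, not the paper's settings.

```python
import random

def train_delta_rule(patterns, targets, lr=0.1, epochs=200, seed=0):
    """pseudocode 2 for one output node: activation = b + w.x,
    error = target - activation, each weight += lr * error * input."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in patterns[0]]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            activation = b + sum(wi * xi for wi, xi in zip(w, x))
            error = t - activation
            b += lr * error  # bias treated as a weight on a constant input of 1
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w, b

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]  # logical and, a linearly separable toy task
w, b = train_delta_rule(patterns, targets)
outputs = [1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0.5 else 0 for x in patterns]
print(outputs)  # [0, 0, 0, 1]
```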
the dataset contains missing values and outliers, necessitating data cleaning and normalization so that they do not interfere with the calculation process. after completing the preprocessing, the next step is data clustering using k-means clustering with a total of 1383 data points and three attributes: income, expenditure, and profit. the distribution of the data used in this research by topic can be seen in figure 2. figure 2 shows the number of records for each topic: designing muslim fashion has 48, fashion design has 95, bridal hijab makeup has 163, hair styling has 17, stage makeup has 63, photography has 60, video editing has 37, graphic design has 245, pastry bakery processing has 340, indonesian food preparation has 59, snacks and beverages has 167, motorcycle tune-up has 66, and electronic equipment maintenance and repair has 23. fig. 2. distribution of data by topic these 1383 data points will be clustered into three classes, namely (0) low, (1) medium, and (2) high, with centroid values as shown in table 1. the visualization of the clustering results based on the three features used is presented in figure 3.

table 1. centroid values based on income, spending, and benefits
transaction class | income     | spending   | benefits
0                 | 0.0427982  | 0.02958055 | 0.03180455
1                 | 0.20932486 | 0.1558142  | 0.13110358
2                 | 0.61264435 | 0.50793811 | 0.27338189

fig. 3. clustering results figure 3 shows the visualization of the clustering results based on the attributes used, namely income, expenditure, and profit.
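given the centroids in table 1, the transaction class of any normalized record can be read off as the nearest centroid; a small sketch (the example records are hypothetical):

```python
# centroids from table 1 (income, spending, benefits), already min-max normalized
CENTROIDS = {
    0: (0.0427982, 0.02958055, 0.03180455),   # low
    1: (0.20932486, 0.1558142, 0.13110358),   # medium
    2: (0.61264435, 0.50793811, 0.27338189),  # high
}

def transaction_class(record):
    """assign a record to the class whose centroid is closest (squared euclidean)."""
    return min(CENTROIDS,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(record, CENTROIDS[c])))

print(transaction_class((0.05, 0.03, 0.03)))  # 0 (low)
print(transaction_class((0.60, 0.50, 0.28)))  # 2 (high)
```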
the visualization based on transaction classes is shown in figure 4. fig. 4. visualization based on transaction class figure 4 illustrates the number of data points in each transaction class: there are 1180 data points in class 0 (low), 163 in class 1 (medium), and 40 in class 2 (high). this shows that class 0 (low) has a higher number of data points than the other transaction classes. figure 5 shows the comparison of transaction classes for each topic. fig. 5. comparison of transaction classes in each topic figure 5 indicates that class 0 or 'low' transactions are most prominent in topics 9 (pastry bakery processing), 8 (graphic design), and 3 (bridal hijab makeup). class 1 or 'medium' transactions are most abundant in topics 10 (indonesian food preparation), 9 (pastry bakery processing), and 11 (snacks and beverages). class 2 or 'high' transactions are most prevalent in topics 9 (pastry bakery processing), 11 (snacks and beverages), and 8 (graphic design). the clustering results are used as labels or target attributes in the classification system. these results are divided to create training data and testing data: 80% of the dataset is randomly selected as training data, while the remaining 20% is used as testing data. next, model validation is performed using k-fold cross-validation on the training data, and the results are shown in table 2.

table 2. k-fold model validation results
fold | cross-validation score
1    | 0.94
2    | 0.97
3    | 0.94
4    | 0.86
5    | 0.88
average cv score | 0.92

after that, the classification process is performed using the mlp. this mlp test uses hidden layer sizes of (5, 2) and max_iter = 300 to achieve good loss results. the loss curve graph is shown in figure 6. fig. 6.
graph of the loss curve next is the comparison between actual values and the classification prediction results. the comparison can be seen in figure 7. fig. 7. comparison of actual values and predicted values the validation of the multilayer perceptron model for classifying transactions resulted in an accuracy of 0.96. after the classification process, the model was evaluated using a confusion matrix, with calculations based on the classification report. the results are shown in table 3.

table 3. matrix testing results based on class
class        | precision | recall | f1-score | support
0            | 0.97      | 1.00   | 0.99     | 219
1            | 0.85      | 0.85   | 0.87     | 39
2            | 1.00      | 0.71   | 0.83     | 14
accuracy     |           |        | 0.96     | 277
macro avg    | 0.96      | 0.85   | 0.90     | 277
weighted avg | 0.96      | 0.96   | 0.96     | 277

from table 3, it can be concluded that the overall accuracy is 0.96, the macro-average f1-score is 0.90, and the weighted-average f1-score is 0.96. these results indicate that the multilayer perceptron method is highly effective in classifying data for the double track student business groups. iv. conclusion the present study has produced a promising framework that integrates two separate methodologies, enabling the combined execution of clustering and classification tasks. this novel framework has significant use in situations with a dearth of predetermined target data. the model presented in this study holds significant potential for a wide range of applications, primarily aimed at providing valuable assistance to the east java provincial education office. its main objective is to acquire insights into the transaction behaviors exhibited by the sma double track student business groups. these observations can provide a basis for formulating policies to provide more money to student-led firms, thus enhancing their entrepreneurial initiatives' quality and long-term viability.
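the per-class figures in table 3 come from the standard classification-report formulas; a sketch with toy confusion-matrix counts (hypothetical, not the paper's data):

```python
def prf(tp, fp, fn):
    """precision, recall and f1 from confusion-matrix counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf(tp=80, fp=20, fn=20)            # hypothetical counts for one class
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```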
this study showcases the strong accuracy of the k-means clustering and multilayer perceptron algorithms in identifying the transactions of the double track student business groups, highlighting their synergy. the k-means clustering technique was crucial in producing the target dataset by categorizing transaction levels into three classes: (0) low transactions, (1) medium transactions, and (2) high transactions. the clustering procedure took into account three fundamental features, namely (1) revenue, (2) spending, and (3) profit. the classification outcomes obtained using the multilayer perceptron exhibited a noteworthy accuracy rate. to evaluate the model's overall performance, a comprehensive analysis of training errors was carried out using k-fold cross-validation. looking ahead, it is crucial to improve both the k-means clustering and multilayer perceptron models to fully harness their capabilities and advance their effectiveness. furthermore, it is suggested that model development be expanded to encompass comparative analyses using various approaches, which will aid in establishing benchmarks for assessing the quality and comprehensiveness of the model. this prospective investigation presents an intriguing undertaking with the potential to transform transaction analysis and policy development for student-led enterprises in the educational sector. acknowledgment the authors wish to sincerely thank all individuals and organizations who made valuable contributions to the successful completion of this research. special recognition is owed to sma double track for their gracious permission to perform the study. declarations author contribution all authors contributed equally as the main contributors of this paper.
all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher’s note: department of electrical engineering and informatics universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations. references [1] a. yulikah, m. a. faizin, and a. e. sujianto, "implementation of islamic entrepreneurship concept in double track sma program," indones. econ. rev., vol. 1, no. 2, pp. 98–108, 2021. [2] j. yu, y. qiao, n. shu, k. sun, s. zhou, and j. yang, "neural network based transaction classification system for chinese transaction behavior analysis," proc. 2019 ieee int. congr. big data, bigdata congr. 2019 part 2019 ieee world congr. serv., pp. 64–71, 2019. [3] t. hu, x. liu, t. chen, x. zhang, and x. huang, "transaction-based classification and detection approach for ethereum smart contract," vol. 58, no. may 2020, 2021. [4] b. karunachandra, n. putera, s. r. wijaya, d. suryani, j. wesley, and y. purnama, "on the benefits of machine learning classification in cashback fraud detection," procedia comput. sci., vol. 216, no. 2022, pp. 364–369, 2023. [5] r. saxena, d. arora, and v. nagar, "classifying transactional addresses using supervised learning approaches over ethereum blockchain," procedia comput. sci., vol. 218, no. 2022, pp. 2018–2025, 2023. [6] r. sukumaran, "improved customer transaction classification using semi-supervised knowledge distillation," 2021. [7] a. mardanshahi, v. nasir, s. kazemirad, and m. m. 
shokrieh, "detection and classification of matrix cracking in laminated composites using guided wave propagation and artificial neural networks," compos. struct., vol. 246, no. april, p. 112403, 2020. [8] u. orhan, m. hekim, and m. ozer, "eeg signals classification using the k-means clustering and a multilayer perceptron neural network model," expert syst. appl., vol. 38, no. 10, pp. 13475–13481, 2011. [9] y. rong and y. liu, "staged text clustering algorithm based on k-means and hierarchical agglomeration clustering," proc. 2020 ieee int. conf. artif. intell. comput. appl. icaica 2020, pp. 124–127, 2020. [10] t. gupta and s. p. panda, "clustering validation of clara and k-means using silhouette dunn measures on iris dataset," proc. int. conf. mach. learn. big data, cloud parallel comput. trends, prespectives prospect. com. 2019, pp. 10–13, 2019. [11] p. sharath chander, j. soundarya, and r. priyadharsini, brain tumour detection and classification using k-means clustering and svm classifier. springer singapore, 2020. [12] a. iqbal and s. aftab, "a classification framework for software defect prediction using multi-filter feature selection technique and mlp," int. j. mod. educ. comput. sci., vol. 12, no. 1, pp. 18–25, 2020. [13] r. sharma, m. kim, and a. gupta, "motor imagery classification in brain-machine interface with machine learning algorithms: classical approach to multilayer perceptron model," biomed. signal process. control, vol. 71, no. pa, p. 103101, 2022. [14] h. shi and m. xu, "a data classification method using genetic algorithm and k-means algorithm with optimizing initial cluster center," 2018 ieee int. conf. comput. commun. eng. technol. ccet 2018, pp. 224–228, 2018. [15] n. sapkota, a. alsadoon, p. w. c. prasad, a. elchouemi, and a. k. singh, "data summarization using clustering and classification: spectral clustering combined with k-means using nfph," proc. int. conf. mach. learn. big data, cloud parallel comput. trends, prespectives prospect. 
com. 2019, pp. 146–151, 2019. [16] g. zhou, h. moayedi, and l. k. foong, "teaching–learning-based metaheuristic scheme for modifying neural computing in appraising energy performance of building," eng. comput., vol. 37, no. 4, pp. 3037–3048, 2021. [17] s. brownfield and j. zhou, "sentiment analysis of amazon product reviews," adv. intell. syst. comput., vol. 1295, no. 1, pp. 739–750, 2020. [18] j. naskath, g. sivakamasundari, and a. a. s. begum, "a study on different deep learning algorithms used in deep neural nets: mlp som and dbn," wirel. pers. commun., vol. 128, no. 4, pp. 2913–2936, 2023. [19] m. y. chuttur and y. parianen, "a comparison of machine learning models to prioritise emails using emotion analysis for customer service excellence," knowl. eng. data sci., vol. 5, no. 1, p. 41, 2022.
[20] s. talatian azad, g. ahmadi, and a. rezaeipanah, "an intelligent ensemble classification method based on multilayer perceptron neural network and evolutionary algorithms for breast cancer diagnosis," j. exp. theor. artif. intell., vol. 34, no. 6, pp. 949–969, 2022. [21] i. tolstikhin et al., "mlp-mixer: an all-mlp architecture for vision," adv. neural inf. process. syst., vol. 29, no. neurips, pp. 24261–24272, 2021. [22] m. wang, y. lu, and j. qin, "a dynamic mlp-based ddos attack detection method using feature selection and feedback," comput. secur., vol. 88, p. 101645, 2020. [23] n. salankar, p. mishra, and l. garg, "emotion recognition from eeg signals using empirical mode decomposition and second-order difference plot," biomed. signal process. control, vol. 65, no. august 2020, p. 102389, 2021. [24] a. kurani, p. doshi, a. vakharia, and m. shah, "a comprehensive comparative study of artificial neural network (ann) and support vector machines (svm) on stock forecasting," ann. data sci., vol. 10, no. 1, pp. 183–208, 2023. [25] t. s. bressan, m. kehl de souza, t. j. girelli, and f. c.
junior, "evaluation of machine learning methods for lithology classification using geophysical data," comput. geosci., vol. 139, p. 104475, 2020. [26] p. benz, s. ham, c. zhang, a. karjauv, and i. s. kweon, "adversarial robustness comparison of vision transformer and mlp-mixer to cnns," 2021. [27] j. hwang, j. lee, and k. s. lee, "a deep learning-based method for grip strength prediction: comparison of multilayer perceptron and polynomial regression approaches," plos one, vol. 16, no. 2 february, pp. 1–12, 2021. [28] m. b. tamam, m. walid, j. freitas, and a. bernardo, "classification of sign language in real time using convolutional neural network," vol. 6, no. 1, pp. 39–46, 2023. [29] x. s. tan, z. yang, y. benlimane, and e. liu, "using classification with k-means clustering to investigate transaction anomaly," pp. 171–174, 2020. [30] c. f. yang, g. j. liu, and c. g. yan, “a k-means-based and no-super-parametric improvement of adaboost and its application to transaction fraud detection,” in 2020 ieee international conference on networking, sensing and control (icnsc), oct. 2020, pp. 1–5. [31] t. amarasinghe and n. krishnarajah, "critical analysis of machine learning based approaches for fraud detection in financial transactions," pp. 12–17, 2018. [32] a. r. khan, m. harouni, and r. abbasi, "brain tumor segmentation using k-means clustering and deep learning with synthetic data augmentation for classification," no. february 2020, pp. 1–11, 2021. [33] c. usha kumari, s. jeevan prasad, and g. mounika, "leaf disease detection: feature extraction with k -means clustering and classification with ann," proc. 3rd int. conf. comput. methodol. commun. iccmc 2019, no. iccmc, pp. 1095–1098, 2019. [34] m. h. santoso, d. a. larasati, u. medan, and a. sumatera, "wayang image classification using mlp method and glcm feature extraction," j. comput. sci. inf. technol. telecommun. eng., vol. 1, no. 2, pp. 111–120, 2020. [35] i. t. um, j. h. ra, and m. h. 
kim, "comparison of clustering methods for mlp-based speaker verification," proc. int. conf. pattern recognit., vol. 15, no. 2, pp. 475–478, 2000. [36] s. fekri-ershad, "bark texture classification using improved local ternary patterns and multilayer neural network," expert syst. appl., vol. 158, p. 113509, 2020. knowledge engineering and data science (keds) pissn 2597-4602 vol 5, no 1, december 2022, pp.
41–52 eissn 2597-4637 https://doi.org/10.17977/um018v5i12022p41-52 ©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) a comparison of machine learning models to prioritise emails using emotion analysis for customer service excellence mohammad yasser chuttur 1, *, yashinee parianen 2 department of software and information systems, university of mauritius 2nd floor phase ii building university of mauritius reduit mu, 80837, mauritius 1 y.chuttur@uom.ac.mu *; 2 yashinee.parianen@umail.uom.ac.mu * corresponding author i. introduction the problems caused by email overload in organizational settings have been well documented in [1][2][3]. on average, a mid-size organization may receive thousands of emails per day, and employees often struggle to respond to queries in a timely fashion [3]. in many cases, employees adopt a 'last come, first served' strategy, where the emails received last are responded to first. others may adopt a 'first come, first served' strategy and respond to the emails that came in first before proceeding to the others. when it comes to customer service excellence, however, such a strategy does not represent the most effective way to address important customer concerns. customers with complaints or urgent issues are treated with the same priority level as those with no complaints or minor issues. in other words, frustrated and dissatisfied customers who need to be prioritized are disregarded. a more effective approach to better serve customers would be to give more attention to customers who urgently need a response than to a customer who is more likely to wait for an answer [4]. however, doing so requires that recipients manually filter all new incoming emails and select only those that need to be attended to in priority.
with recent technological advancements and widespread adoption of smartphones, however, a rising number of people use email extensively. manual filtering of emails, thus, is not only laborious but also a non-productive task. employees are overwhelmed with emails and often face many difficulties when doing manual email triage to determine which emails are to be treated with higher priority. according to [5], people's psychological resources are depleted by work-related emails, especially incoming work-related emails, leading to experiences of job overload, compulsive use, article info a b s t r a c t article history: received 22 july 2022 revised 2 august 2022 accepted 15 august 2022 published online 7 november 2022 there has been little research on machine learning for email prioritization for customer service excellence. to fill this gap, we propose and assess the efficacy of various machine learning techniques for classifying emails into three degrees of priority: high, low, and neutral, based on the emotions inherent in the email content. it is predicted that after emails are classified into those three categories, recipients will be able to respond to emails more efficiently and provide better customer service. we use the nrc emotion lexicon to construct a labeled email dataset of 517,401 messages for our proposal. following that, we train and test four prominent machine learning models, mnb, svm, logr, and rf, and an ensemble of mnb, lsvc, and rf classifiers, on the labeled dataset. our main findings suggest that machine learning may be used to classify emails based on their emotional content. however, some models outperform others. during the testing phase, we also discovered that the logr and lsvc models performed the best, with an accuracy of 72%, while the mnb classifier performed the poorest. furthermore, classification performance differed depending on whether the dataset was balanced or imbalanced. 
we conclude that machine learning models that employ emotions for email classification are a promising avenue that should be explored further. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). keywords: emotion analysis; machine learning; email prioritization; customer service 42 m.y. chuttur and y. parianen / knowledge engineering and data science 2022, 5 (1): 41-52 stress, and work-family imbalance. email overload has direct negative consequences on employee productivity and must be addressed. in various contexts, emotion detection from written text, such as emails, may be used to improve work performance and customer relationships [6]. emotion indicates the psychological state, which is impacted by the discernment of someone's surroundings, health, and intent [7], and email contents are often filled with emotional cues. through automatic emotion analysis, it is possible to obtain valuable information on how a specific audience feels about a given product, person, or service offered by a business. in other words, automated emotion detection systems can be employed by businesses to track and recognize emotional reactions to their goods and services. for instance, in power marketing, users' feelings have been analysed from speech data for improved customer service [8]. in other cases, customer service agents can use automated anger detection in customer care emails to recognize unhappy consumers more quickly and take the necessary prompt actions to boost customer retention rates [9]. without measures that track customer emotions, businesses risk adverse consequences for their reputation and related financial impacts, such as the loss of clients [10]. emotion analysis differs from sentiment analysis, which categorizes textual data as positive, neutral, or negative.
instead, emotion analysis provides information about an individual's feelings or emotions through a series of "emotional connotations" like joy, sadness, or anger. many proposed emotion models are reported in [11][12][13]. each of those emotion models proposes a list of emotions that humans express. a popular emotion model is the wheel of emotions defined by robert plutchik [14]. as shown in figure 1, the wheel of emotions lists several emotions that an individual usually expresses. each emotion can have a different intensity, as illustrated by the different wheel cones. robert plutchik also noted that individuals could express one or more of eight primary emotions, as shown in table 1.

fig. 1. the "wheel of emotions" by robert plutchik

table 1. robert plutchik's eight basic emotions

positive emotions    polar opposite emotions
joy                  sadness
anticipation         surprise
trust                disgust
fear                 anger

following the reasoning that frustrated customers will express primarily negative emotions, it should be possible for machine learning to detect email contents with negative content and classify them as high priority compared to emails that contain neutral or positive emotions. to date, however, not much attention has been given to the use of emotions to classify emails according to different priority levels. instead, more attention has been given to spam detection. for instance, the works of [15][16][17][18] demonstrate machine learning techniques for email spam detection. a hybrid approach to spam detection is further found in the work of [19] and [20], and [21] evaluated the use of semantic features for spam detection in emails. in addition, a detailed review of spam detection techniques can be found in the works of [22][23][24][25]. filtering spam emails targets unwanted emails but does not set any priority scheme for emails [15].
as stated by [26], there is a clear distinction between spam detection and email prioritization. the prioritization of emails aims at personalizing non-spam emails by estimating their relevance. wang [26] also states that email prioritization can be split into two main groups depending on the targeted outcome: action prediction and priority label prediction, both of which require a classification task. to the researchers' knowledge, research on using machine learning and emotion analysis for email prioritization is scarce. one such study can be found in [27]. the authors used naïve bayes to categorize several emails according to their importance. [27] hypothesized that, by assigning different weights to selected terms from email contents, it is possible to calculate the overall importance or priority of these emails. however, the authors did not report any implementation results.

in this study, we investigate the possibility of using machine learning to analyse the emotions expressed in emails to set a priority ranking for different emails. it is posited that customers will send emails containing different expressed emotions, which, when detected, can further help classify those emails into three main groups: high priority, neutral, and low priority. our work contrasts with previous studies in that most works on email classification have focused on spam detection. the main contributions of this work are as follows. we create a labelled dataset of emails using emotions from the nrc emotion lexicon; there is currently no email dataset labelled with emotions. we then devise a novel algorithm to assign three levels of priority, namely high, low, and neutral, to the messages in our dataset. once the priority labels are assigned, we subject our dataset to some preprocessing stages.
we then train, test, and compare different supervised machine learning models for their ability to correctly classify different email messages according to the three priority levels set for this study.

the rest of the paper is organized as follows. in section ii, we provide details on our proposed methodology to use emotions and machine learning to classify emails according to three levels of priority. in section iii, we present and discuss the results obtained. in section iv, we conclude our work with some future recommendations.

ii. method

this study aims to evaluate the efficacy of machine learning to prioritize emails based on the emotional content of the texts within. the general process flow for our proposal is depicted in figure 2.

a. data acquisition

no publicly accessible email dataset is labelled with emotions like happiness, sadness, or anger. hence, a labelled dataset had to be created for this study. to this end, the enron email dataset was selected because it is a large email dataset that has already been used in several related studies such as [19], [20], [28], [29], and [30]. the enron email dataset at https://www.cs.cmu.edu/~./enron/ includes 517,401 emails sent by enron corporation employees. the "federal energy regulatory commission" collected it as part of its inquiry into enron's downfall. the dataset is saved as a csv file and was obtained from kaggle.

b. data cleaning and pre-processing

the process of data cleaning aims to eliminate irrelevant content from the dataset. in the context of this project, irrelevant content refers to any part of the email that is not valuable when the learning algorithm assigns a class to the email. not only will data cleaning make the task of classification easier for the classification model, but it may also significantly reduce processing time in the training stage. as stated by [20], data pre-processing is essential to yield a better outcome.
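as an illustration, a cleaning and tokenization pass of the kind applied in this study might look like the following minimal sketch. the tiny stop-word list here is a stand-in for illustration only; the study itself uses nltk's full stop-word list and adds lemmatization with spacy.

```python
import re
import string

# illustrative stop-word subset; the study uses nltk's full list
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def clean_email(text: str) -> list[str]:
    """lower-case, strip noise (html tags, punctuation, digits),
    remove stop words, and tokenize."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                            # crude html tag removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", " ", text)                                # drop digits
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean_email("The <b>invoice</b> is attached to this email!"))
# ['invoice', 'attached', 'this', 'email']
```

in practice each of these steps shrinks the vocabulary, which is what helps against the dimensionality curse discussed next.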
data preprocessing aims at curtailing noise and can help tackle the dimensionality curse reported by [31] and [32]. for data cleaning, duplicate and irrelevant fields were removed from the raw dataset. as for data pre-processing, the following was applied to the cleaned email dataset: lower casing, noise removal, stop word removal, and tokenization. the curse of dimensionality is dealt with by including text normalization and lemmatization techniques in the pre-processing phase to help with dimensionality reduction. these steps have been curated and adapted from [19] and [20].

c. annotation and priority labeling

annotation preparation is a crucial step, as the emails in the dataset must be labelled with their relevant emotions to enable the use of supervised machine learning. it was reported by [20] that lexicon labelling provides clear and uniform results. several existing sentiment lexicons have been employed in developing different systems and algorithms; some examples are vader, afinn, and sentiment140. in this study, the nrc word-emotion association lexicon at https://saifmohammad.com/webpages/nrc-emotion-lexicon.htm is used for the emotion detection process since it is a list of words associated with different emotions. it should be noted that the nrc word-emotion association lexicon provides multiple emotions, each associated with a polarity (positive/negative number) weight based on the contents of the analysed text. once labeled, each email is tagged with a priority label according to the emotion detected. the pseudocode for assigning the labels "high priority", "low priority", and "neutral" is as follows.
start
  calculate weight sum of good emotions ('anticipation', 'trust', 'joy', 'positive', 'surprise')
  calculate weight sum of bad emotions ('anticipation', 'surprise', 'anger', 'disgust', 'fear', 'sadness', 'negative')
  if weight sum good_emotion > weight sum bad_emotion then
    return "low priority"
  else if weight sum bad_emotion > weight sum good_emotion then
    return "high priority"
  else
    return "neutral"
end

fig. 2. general process flow for email prioritization based on emotions and machine learning

an example of the emotion polarity weights obtained for different messages from the nrc lexicon is shown in figure 3.

d. feature extraction and selection

machine learning algorithms are unable to work directly on raw text. hence, feature extraction methods, otherwise known as vectorization, are conducted to transform text into numerical data, more specifically into a vector of features, using term frequency-inverse document frequency (tf-idf), which was initially designed for text categorization [33]. tf-idf classifiers use frequency feature vectors as input and assess the weight of the features/words by using both tf and idf. term frequency (tf) is the number of times a term appears in a text, and inverse document frequency (idf) assesses a term's significance [34]. the formulas used to calculate tf and idf are given by (1) and (2):

tf(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)   (1)

idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t   (2)

tf-idf classifiers rely on a computational statistical approach that works by filtering the features, weighting and rating each unigram and n-gram based on the number of times certain words appear in the text [35]. in this study, tf-idf is used to execute this conversion, as recommended by [18][19][20][35]. table 2 provides some more details on the hyperparameters used for the tfidfvectorizer available in python.
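the tf and idf quantities of equations (1) and (2) can be sketched in plain python as follows. this is an illustration only, with a made-up three-message corpus; the study itself uses sklearn's tfidfvectorizer with the hyperparameters in table 2.

```python
import math
from collections import Counter

def tf(term, doc):
    # (1) term frequency: occurrences of the term divided by document length
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # (2) inverse document frequency: log of total docs over docs containing the term
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

# hypothetical tokenized emails, for illustration
corpus = [["refund", "please", "refund"],
          ["order", "shipped"],
          ["refund", "delayed", "order"]]

weight = tf("refund", corpus[0]) * idf("refund", corpus)  # tf-idf weight, ≈ 0.27
```

a vectorizer simply computes this weight for every (term, document) pair to build the feature matrix fed to the classifiers.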
e. model training

in this step, the vectors generated during the feature extraction phase are used to train and test the machine learning models selected for this study. the dataset is uniformly and randomly split into an 80% train set and a 20% test set. we train and test the performance of the following popular machine learning models: svm, nb, logr, and rf. those classifiers have been chosen for their good performance scores as reported in [35][36][37][38][39]. as recommended by [40], we also investigate whether an ensemble method may yield better performance than the selected machine learning algorithms alone. stacking is an ensemble method which learns to optimally integrate the predictions from several machine learning models. here, the mnb, lsvc, and rf models are stacked to build a new ensemble model. the ensemble method chooses the best classification model to use on the test set after each one has been evaluated on the training set. the main goal of the ensemble method is to integrate the outputs of several classifiers to build a strong one [41].

fig. 3. nrc emotions and polarity weights

table 2. tf-idf hyperparameters

hyperparameter           description
max_df = 0.90            ignore words with document frequency greater than 0.90
min_df = 2               ignore words with document frequency lower than 2
max_features = 1000      consider only the top 1000 features in the corpus
stop_words = stop_words  remove the words in the stop words list
ngram_range = (1, 2)     extract features composed of unigrams and bigrams

f. model evaluation

the selected machine learning models are trained and tested on the enron email dataset labelled with the nrc lexicon. for evaluation purposes, the accuracy and f1-score obtained for each model are used to compare the performance of the implemented algorithms.
accuracy refers to the ratio of correctly categorized data to the overall classifications. the formula used to calculate accuracy is:

accuracy = (tp + tn) / (tp + tn + fp + fn)   (3)

the f1-score, also termed the f-measure, is the harmonic mean of precision and recall. in other words, the f1-score indicates what percentage of the observed positive predictions were correct:

f1-score = 2 × (precision × recall) / (precision + recall)   (4)

precision, in this study with respect to the neutral class, refers to the number of cases where the expected and actual results are both neutral:

precision = tp / (tp + fp)   (5)

recall, in the context of this study with respect to the neutral class, refers to the capacity of the model to predict the emails of the neutral class:

recall = tp / (tp + fn)   (6)

iii. results and discussions

we used python 3.9.2, jupyter notebook, and the anaconda distribution to implement our proposed email prioritization approach. table 3 lists the different python libraries we used to execute some of the main processes described in section ii.

table 3. python libraries used

library           purpose
pandas            transform data into tabular format
nrclex            measure emotional affect from a body of text
nltk              stop word removal, tokenization
spacy             lemmatization
numpy             calculate average scores
scikit-learn      import selected machine learning classification models
keras             deep learning
imbalanced-learn  import sampling modules
beautifulsoup     remove html tags from emails
string            noise removal
pickle            save and load trained machine learning models

fig. 4. nrc raw emotion scores results

a. calculating raw emotion scores for annotation and priority labeling

once we obtained the enron email dataset, as explained in section ii.a, we cleaned the data and applied several pre-processing operations as described in section ii.b. we then used the "top_emotion" module from nrclex to view the highest polarities from the email text for training our machine learning models.
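equations (3) through (6) can be computed directly from the confusion-matrix counts, as in this minimal sketch (the count values below are made up for illustration):

```python
def scores(tp, tn, fp, fn):
    """compute accuracy (3), f1-score (4), precision (5), and recall (6)
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # (3)
    precision = tp / (tp + fp)                   # (5)
    recall = tp / (tp + fn)                      # (6)
    f1 = 2 * precision * recall / (precision + recall)  # (4)
    return accuracy, precision, recall, f1

# hypothetical counts for one class of a confusion matrix
acc, p, r, f1 = scores(tp=40, tn=30, fp=10, fn=20)
print(acc)  # 0.7
```

for the three-class problem in this study, these per-class scores are then macro-averaged so that each priority class is weighted equally.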
a snapshot of the resulting email messages and the associated emotions is shown in figure 4. the "raw_emotion_scores" module from nrclex was used to obtain the polarities of the different emotions. the results were then transformed into a pandas dataframe, and the arrays of the different polarities were classified according to each emotion using the "pandas.dataframe.from_records" module. the score obtained for each emotion set was then used to decide on the priority label (high, low, neutral) to assign to each email message according to the algorithm described in section ii.c.

the resulting dataset was then inspected for its data distribution. figure 5 shows the class sizes of the complete dataset and of the dataset after removing duplicates. as observed, the pre-processing phase and priority labels were applied to two groups of the enron email dataset: in one group we kept all the records, while in the second group we removed all duplicate messages. both data groups were imbalanced, which can further influence classification performance. in other words, the classifiers may try to improve the accuracy of the larger class to the detriment of the smaller classes. the data was further sampled to balance the dataset, as recommended by [29], to address the issue of the classifier biasing towards the majority class. the sampling methods used were random oversampling, where data from the "minority class" were duplicated randomly, and random undersampling, where data from the "majority class" were randomly removed.

fig. 5. (a) bar chart showing the class sizes of the complete dataset; (b) bar chart showing the class sizes of the dataset after removing duplicates

fig. 6. class distribution of the dataset with no duplicates after (a) undersampling and (b) oversampling
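the two sampling strategies can be illustrated with a small stdlib-only sketch. this is a toy version for clarity; the study itself uses the sampling modules from the imbalanced-learn library, and the class names and sizes below are made up.

```python
import random

def oversample(groups, seed=0):
    """random oversampling: duplicate minority-class records at random
    until every class matches the majority-class size."""
    rng = random.Random(seed)
    target = max(len(g) for g in groups.values())
    return {label: g + [rng.choice(g) for _ in range(target - len(g))]
            for label, g in groups.items()}

def undersample(groups, seed=0):
    """random undersampling: drop majority-class records at random
    until every class matches the minority-class size."""
    rng = random.Random(seed)
    target = min(len(g) for g in groups.values())
    return {label: rng.sample(g, target) for label, g in groups.items()}

# hypothetical imbalanced class sizes
data = {"high": list(range(6)), "neutral": list(range(3)), "low": list(range(2))}
print({k: len(v) for k, v in oversample(data).items()})   # {'high': 6, 'neutral': 6, 'low': 6}
print({k: len(v) for k, v in undersample(data).items()})  # {'high': 2, 'neutral': 2, 'low': 2}
```

oversampling keeps every record at the cost of duplicates, while undersampling discards information but yields a smaller, faster training set; the study evaluates both.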
the same sampling techniques were applied to both the complete (full) dataset and the dataset with duplicates removed. figure 6 shows the distribution for the dataset with no duplicates after undersampling and oversampling, respectively. moreover, a similar balanced class distribution was obtained for the entire dataset.

b. feature extraction and selection

for feature extraction, the "tfidfvectorizer()" function from the "scikit learn" module was employed. the lemmatized text is fitted into the tfidfvectorizer. the main purpose of this approach was to improve the computation and training processes. once the tf-idf representation of the dataset was generated, the dataset was split into an 80% train set and a 20% test set using sklearn's "train_test_split" function. the feature vectors generated by the tfidfvectorizer are then used as input to train the ml classification models. as mentioned earlier, the following classifiers are used to fit the training data: nb, svm, logr, and rf. thus, the inbuilt classes multinomialnb, linearsvc, logisticregression, and randomforestclassifier from the "scikit learn" library are used to train the models on the dataset, both before and after the removal of duplicates, to evaluate whether performance on a larger dataset is improved.

c. model training and evaluation

in python, we used sklearn's "train_test_split" function to split our dataset uniformly and randomly into an 80% train set and a 20% test set. the feature vectors generated by the tfidfvectorizer and the labeled datasets were used as input to train all the selected ml classification models. the vectorizer and models were then pickled using the python pickle library to enable saving and loading of the classifiers.

fig. 7. confusion matrix for (a) mnb, (b) logr, and (c) lsvc classifiers for the full oversampled testing set (balanced dataset)
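the 80/20 split and the pickling step can be sketched with the standard library alone. the split function below is an illustrative stand-in for sklearn's train_test_split, and the "model" is a placeholder dictionary rather than a fitted classifier.

```python
import pickle
import random

def train_test_split(records, test_size=0.2, seed=42):
    """uniform random split: shuffle, then cut at the (1 - test_size) mark.
    a stdlib stand-in for sklearn's train_test_split, for illustration."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_size))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20

# pickling a trained object so it can be reloaded later without retraining
model = {"weights": [0.1, 0.2]}   # placeholder for a fitted classifier
blob = pickle.dumps(model)
assert pickle.loads(blob) == model
```

fixing the seed makes the split reproducible, which matters when comparing several classifiers on the same train/test partition.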
we then obtained the training and testing classification scores for the different datasets and models when classifying emails into different priority categories using emotions. the relevant confusion matrix was generated for each model to calculate the corresponding tp, tn, fp, and fn values. the f1-score and overall accuracy for each model and the corresponding dataset were calculated from those values. the confusion matrices for the mnb, logr, and lsvc classifiers corresponding to the full oversampled testing set are shown in figure 7. similar confusion matrices were obtained for the other datasets.

we used different performance scores to match the dataset used. for an imbalanced dataset, the f1-score gives a more representative idea of the performance of a classifier model, whereas for balanced datasets we used the accuracy metric. we also prefer to consult the macro average for the f1-score, as this metric treats all classes equally. the classification performance scores obtained for the full imbalanced dataset with and without duplicates are shown in table 4. table 5 provides the accuracy results for all the models for the balanced datasets with and without duplicates.

the performance scores for the rf and stacking classifiers exhibit model overfitting, with a perfect 100% score in training but a reduced performance score for the testing set. similarly, as seen in table 5, the rf and stacking classifiers obtained 100% accuracy on the training set for all the balanced datasets; however, depending on the dataset, the testing accuracy drops to between 72% and 99%, which creates a misleading sense of high accuracy that can mostly be attributed to model overfitting. in other words, both the rf and stacking models overfit the training set at the expense of an inferior performance on the testing set. to recall, the stacking model was built using the mnb, lsvc, and rf classifiers.
therefore, it is safe to assume that the output of the rf classifier in the stacking model has resulted in overfitting, and hence the model fails to perform well on new data. in contrast, the performance scores obtained for the other models, i.e., mnb, lsvc, and logr, appear to be more reliable. for the imbalanced datasets (table 4), the logr classifier gives a slightly better performance score of 0.67 compared to mnb and lsvc. overall, all the models gave close performance scores during their training and testing phases. likewise, for the balanced datasets (table 5), the logr classifier is again seen to provide a good classification performance: a maximum accuracy of 0.73, close to that of the lsvc classifier, was observed across the balanced datasets, making logr and lsvc the two most suitable priority classifiers for emails using emotions. since the mnb classifier gave the worst performance for both the balanced and imbalanced datasets, we deduce that it is not the most suitable model for this type of task.

table 4. f1 macro average score for the full dataset with and without duplicates (imbalanced dataset)

dataset                                 mnb   lsvc  rf    logr  stacking
full dataset                 training   0.42  0.67  1     0.68  1
                             testing    0.42  0.66  0.92  0.67  0.93
full dataset,                training   0.43  0.67  1     0.68  0.99
duplicates removed           testing    0.43  0.66  0.68  0.67  0.74

table 5. accuracy score on the training and testing sets (balanced dataset)

dataset                                 mnb   lsvc  rf    logr  stacking
full dataset (oversampled)   training   0.60  0.73  1     0.73  1
                             testing    0.60  0.73  0.99  0.73  0.99
full dataset (undersampled)  training   0.60  0.73  1     0.73  1
                             testing    0.60  0.72  0.89  0.72  0.90
duplicates removed           training   0.60  0.72  1     0.73  0.97
(oversampled)                testing    0.59  0.72  0.96  0.72  0.75
duplicates removed           training   0.60  0.73  1     0.73  0.98
(undersampled)               testing    0.60  0.72  0.74  0.72  0.76

in general, therefore, it is found that machine learning models are good candidates for classifying emails into different priority levels based on the emotional content in the email. previous studies have mostly focused on using machine learning techniques for spam detection. this study used the nrc emotion lexicon to label an otherwise unlabeled email dataset. the best performance score obtained is good but not good enough to be deployed in a real organizational setting. several improvements can still be made to obtain a better-performing email prioritization solution to the email overload problem. for instance, as discussed in [12], other emotion models can be used for the data labeling step. using fewer emotion categories could also increase accuracy, as observed by [6]. last but not least, as investigated by [42], other machine learning models like rnn can be evaluated for their performance in detecting emotions in email contents.

iv. conclusion

email overload is a growing organizational problem that has been overlooked. for businesses, it represents a considerable loss in productivity, poor customer service, and increasing psychological stress imposed on employees. to address this problem, the efficacy of four machine learning models, namely mnb, lsvc, rf, and logr, and an ensemble of mnb, lsvc, and rf classifiers, was evaluated for their performance in prioritising messages from the enron email dataset.
the dataset was labelled using the nrc emotion lexicon and, following several experiments on both imbalanced and balanced datasets, it was discovered that supervised machine learning could be used to detect emotions in email contents and assign priorities to emails accordingly. it was also noticed that data balancing influenced the classification performance and that the rf and the ensemble methods tended to overfit the data. in parallel, it was found that the logr and lsvc classifiers gave the best classification scores while the mnb classifier performed the poorest. however, the highest performance scores obtained in this study are not considered good enough to be effective in a real-life organizational setting. thus, there is a need for more research into the use of emotions in email content when setting up a priority reply list. in future work, it is recommended that other deep learning models and alternative emotion lexicons be tested for the possibility of achieving better performance scores. in addition, the approach discussed in this paper considered email content written in the english language only. the same techniques may not work well for other written languages, which may require other considerations for text cleaning and preprocessing; in this case, further research is warranted.

declarations

author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest. the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
publisher's note: department of electrical engineering, universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references

[1] b. graf and c. h. antoni, "the relationship between information characteristics and information overload at the workplace-a meta-analysis," european journal of work and organizational psychology, vol. 30, no. 1, pp. 143–158, 2021.
[2] b. mannion, "information overload," risk management, vol. 69, no. 4, pp. 26–29, 2022.
[3] r. kong, h. zhu, and j. a. konstan, "learning to ignore: a case study of organization-wide bulk email effectiveness," proceedings of the acm on human-computer interaction, vol. 5, no. cscw1, pp. 1–23, 2021.
[4] t. ravichandran and c. deng, "effects of managerial response to negative reviews on future review valence and complaints," information systems research, 2022.
[5] e. russell, s. a. woods, and a. p. banks, "tired of email? examining the role of extraversion in building energy resources after dealing with work-email," european journal of work and organizational psychology, vol. 31, no. 3, pp. 440–452, 2022.
[6] z. halim, m. waqar, and m. tahir, "a machine learning-based investigation utilizing the in-text features for the identification of dominant emotion in an email," knowledge-based systems, vol. 208, p. 106443, nov. 2020.
[7] z. shao, r. chandramouli, k. p. subbalakshmi, and c. t. boyadjiev, "an analytical system for user emotion extraction, mental state modeling, and rating," expert systems with applications, vol. 124, pp. 82–96, jun. 2019.
[8] x. li and r. lin, "speech emotion recognition for power customer service," in 2021 7th international conference on computer and communications (iccc), 2021, pp. 514–518.
[9] s. angel deborah, t. t. mirnalinee, and s. m. rajendram, "emotion analysis on text using multiple kernel gaussian...," neural processing letters, vol. 53, no. 2, pp. 1187–1203, 2021.
[10] m. haberzettl and b. markscheffel, "a literature analysis for the identification of machine learning and feature extraction methods for sentiment analysis," in 2018 thirteenth international conference on digital information management (icdim), sep. 2018, pp. 6–11.
[11] y. chuttur and l. pokhun, "an evaluation of deep learning networks to extract emotions from yelp reviews," in progress in advanced computing and intelligent engineering, springer, 2021, pp. 55–67.
[12] l. pokhun and m. y. chuttur, "emotions in texts," bulletin of social informatics theory and application, vol. 4, no. 2, pp. 59–69, 2020.
[13] v. ahire and s. borse, "emotion detection from social media using machine learning techniques: a survey," in applied information processing systems, springer, 2022, pp. 83–92.
[14] r. plutchik, "the nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice," american scientist, vol. 89, no. 4, pp. 344–350, 2001.
[15] a. a. alurkar et al., "a proposed data science approach for email spam classification using machine learning techniques," in 2017 internet of things business models, users, and networks, nov. 2017, pp. 1–5.
[16] s. r. gomes et al., "a comparative approach to email classification using naive bayes classifier and hidden markov model," in 2017 4th international conference on advances in electrical engineering (icaee), sep. 2017, pp. 482–487.
[17] e. g. dada, j. s. bassi, h. chiroma, s. m. abdulhamid, a. o. adetunmbi, and o. e. ajibuwa, "machine learning for email spam filtering: review, approaches and open research problems," heliyon, vol. 5, no. 6, p. e01802, jun. 2019.
[18] f. jáñez-martino, e. fidalgo, s. gonzález-martínez, and j. velasco-mata, "classification of spam emails through hierarchical clustering and supervised learning," arxiv:2005.08773 [cs], may 2020, accessed: dec. 12, 2020.
[19] s. liu and i. lee, "email sentiment analysis through k-means labeling and support vector machine classification," cybernetics and systems, vol. 49, no. 3, pp. 181–199, apr. 2018.
[20] r. s. h. ali and n. e. gayar, "sentiment analysis using unlabeled email data," in 2019 international conference on computational intelligence and knowledge economy (iccike), dec. 2019, pp. 328–333.
[21] n. saidani, k. adi, and m. s. allili, "a semantic-based classification approach for an enhanced spam detection," computers & security, vol. 94, p. 101716, jul. 2020.
[22] n. ahmed, r. amin, h. aldabbas, d. koundal, b. alouffi, and t. shah, "machine learning techniques for spam detection in email and iot platforms: analysis and research challenges," security and communication networks, vol. 2022, 2022.
[23] r. mansoor, n. d. jayasinghe, and m. m. a. muslam, "a comprehensive review on email spam classification using machine learning algorithms," in 2021 international conference on information networking (icoin), 2021, pp. 327–332.
[24] i. amin and m. k. dubey, "hybrid ensemble and soft computing approaches for review spam detection on different spam datasets," materials today: proceedings, 2022.
[25] p. garg and n. girdhar, "a systematic review on spam filtering techniques based on natural language processing framework," in 2021 11th international conference on cloud computing, data science & engineering (confluence), 2021, pp. 30–35.
[26] b. wang, "personalized broadcast message prioritization," thesis, applied sciences: school of computing science, 2018. accessed: jan. 10, 2021.
[27] s. choudhari, n. choudhary, s. kaware, and a. shaikh, "email prioritization using machine learning," ssrn journal, 2020.
[28] e. m. bahgat, s. rady, and w. gad, "an email filtering approach using classification techniques," in the 1st international conference on advanced intelligent system and informatics (aisi2015), november 28-30, 2015, beni suef, egypt, cham, 2016, pp. 321–331.
[29] n. chhaya, k. chawla, t. goyal, p. chanda, and j. singh, "frustrated, polite, or formal: quantifying feelings and tone in email," in proceedings of the second workshop on computational modeling of people's opinions, personality, and emotions in social media, new orleans, louisiana, usa, jun. 2018, pp. 76–86.
[30] e. m. bahgat, s. rady, w. gad, and i. f. moawad, "efficient email classification approach based on semantic methods," ain shams engineering journal, vol. 9, no. 4, pp. 3259–3269, dec. 2018.
[31] m. a. naser and a. h. mohammed, "emails classification by data mining techniques," journal of babylon university: pure and applied sciences, vol. 22, no. 2, 2014.
https://doi.org/10.1109/confluence51648.2021.9377042 https://doi.org/10.1109/confluence51648.2021.9377042 http://summit.sfu.ca/item/19096 http://summit.sfu.ca/item/19096 https://dx.doi.org/10.2139/ssrn.3568518 https://dx.doi.org/10.2139/ssrn.3568518 https://doi.org/10.1007/978-3-319-26690-9_29 https://doi.org/10.1007/978-3-319-26690-9_29 https://doi.org/10.1007/978-3-319-26690-9_29 http://dx.doi.org/10.18653/v1/w18-1111 http://dx.doi.org/10.18653/v1/w18-1111 http://dx.doi.org/10.18653/v1/w18-1111 https://doi.org/10.1016/j.asej.2018.06.001 https://doi.org/10.1016/j.asej.2018.06.001 https://www.iasj.net/iasj/download/529a71d66b7651cd https://www.iasj.net/iasj/download/529a71d66b7651cd 52 m.y. chuttur and y. parianen / knowledge engineering and data science 2022, 5 (1): 41-52 [32] x. fang and j. zhan, “sentiment analysis using product review data,” j big data, vol. 2, dec. 2015. [33] xiao-lin wang and cloete, “learning to classify email: a survey,” in 2005 international conference on machine learning and cybernetics, aug. 2005, vol. 9, pp. 5716-5719 vol. 9. [34] b. ray, a. garain, and r. sarkar, “an ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews,” applied soft computing, vol. 98, p. 106935, jan. 2021. [35] p. barnaghi, p. ghaffari, and j. g. breslin, “opinion mining and sentiment polarity on twitter and correlation between events and sentiment,” in 2016 ieee second international conference on big data computing service and applications (bigdataservice), mar. 2016, pp. 52–57. [36] v. l. miguéis, a. freitas, p. j. v. garcia, and a. silva, “early segmentation of students according to their academic performance: a predictive modelling approach,” decision support systems, vol. 115, pp. 36–51, nov. 2018. [37] q. umer, h. liu, and y. sultan, “emotion based automated priority prediction for bug reports,” ieee access, vol. 6, pp. 35743–35752, 2018. [38] t. bokaba, w. doorsamy, and b. s. 
paul, “comparative study of machine learning classifiers for modelling road traffic accidents,” applied sciences, vol. 12, no. 2, art. no. 2, jan. 2022. [39] k. y. win, n. maneerat, s. choomchuay, s. sreng, and k. hamamoto, “suitable supervised machine learning techniques for malignant mesothelioma diagnosis,” in 2018 11th biomedical engineering international conference (bmeicon), nov. 2018, pp. 1–5. [40] c. k. hiramath and g. c. deshpande, “fake news detection using deep learning techniques,” in 2019 1st international conference on advances in information technology (icait), jul. 2019, pp. 411–415. [41] d. k. renuka, t. hamsapriya, m. r. chakkaravarthi, and p. l. surya, “spam classification based on supervised learning using machine learning techniques,” in 2011 international conference on process automation, control and computing, jul. 2011, pp. 1–7. [42] m. b. abbas and m. khan, “sentiment analysis for automated email response system,” in 2019 international conference on communication technologies (comtech), mar. 2019, pp. 65–70. 
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 5, No 2, December 2022, pp. 179–187 eISSN 2597-4637
https://doi.org/10.17977/um018v5i22022p179-187

©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Associated Patterns in Open-Ended Concept Maps within E-Learning

Didik Dwi Prasetya a,1,*, Tsukasa Hirashima b,2
a Department of Electrical Engineering, Universitas Negeri Malang, Malang 65145, Indonesia
b Department of Information Engineering, Hiroshima University, Hiroshima 739-8527, Japan
1 didikdwi@um.ac.id *; 2 tsukasa@lel.hiroshima-u.ac.jp
* Corresponding author

I. Introduction

Concept maps are widely known as graphical tools that facilitate the representation of individual cognitive knowledge.
It has been demonstrated that concept maps are beneficial for instruction, learning, and evaluation [1][2]. A concept map comprises concepts (nodes), connecting or linking lines, and connecting or linking words; concept maps can be created by the students or directly provided by the instructor [3]. Even though the elements of a concept map consist only of concepts and links, they can capture individual knowledge accurately [2][4]. Learners may benefit from concept maps when they want to memorize material meaningfully and develop more valuable reading comprehension skills. In a concept map, ideas are depicted as nodes, and connections between concepts are established using linking labels to construct propositions [5]. Propositions can be statements about real objects or things in the universe that either occur naturally or are artificially created [6][7]. They give a concept map its logical and meaningful structure by emphasizing how concepts are connected to each other. The essential components of a concept map are propositions, its smallest semantic units; as a result, a proposition might be seen as an essential part of cognitive knowledge [8][9]. The proposition is the unit of declarative knowledge used to shape meaningful information.

Two basic methods for creating concept maps are closed-ended and open-ended [1][2]. In open-ended concept map construction, individuals can generate and add their own concepts and relationships without any predetermined structure or constraints. This approach allows for creativity and flexibility, as learners can explore diverse connections and expand the concept map based on their unique understanding and insights. On the other hand, closed-ended concept map construction follows a predefined structure or template with specific concepts and relationships already provided [10].
Article Info

Article history: received 23 November 2022; revised 6 December 2022; accepted 27 December 2022; published online 30 December 2022.

Abstract: A concept map is a diagram that visualizes the structure of individual cognitive knowledge. An approach to creating a concept map structure that allows users to freely contribute concepts and linkages expressing their understanding is known as an "open-ended concept map." It has been demonstrated that an open-ended concept map accurately depicts student knowledge structures and reveals student differences. However, manually analyzing open-ended maps is difficult and time-consuming, especially in a big classroom, because they include many propositions. Educational data mining could be used to further process and analyze a collection of concept maps. However, many works have employed data mining to produce concept map structures from text documents rather than to examine the knowledge representation. This study aimed to identify hidden combination patterns in students' knowledge representations using association rules analysis. The dataset used in this study consisted of 27 open-ended concept maps created by university students. This study found interesting patterns that reveal students' knowledge in understanding the material given by the teacher. This is an open-access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: data mining; association rules; concept map; open-ended; e-learning

This style provides a more guided and structured learning experience, focusing on specific concepts and relationships the instructor or curriculum deems essential.
While closed-ended concept maps may be more efficient in conveying specific knowledge or following a particular learning objective, open-ended concept maps foster critical thinking, creativity, and a deeper exploration of the subject matter [6]. Compared to the closed-ended strategy, the open-ended approach presents more significant evaluation challenges. Judging students' open-ended concept maps may involve rubrics and expert assessment [11][12]. However, rubrics have their limitations, particularly with regard to the diversity of open-ended learners' concept maps: scoring with a rubric takes time and does not allow for a thorough assessment of learners' knowledge structure [12]. Because of the variety of concept maps, it is challenging for teachers to identify specific patterns in students' mental models.

Data mining is one technique that could be used to manage and analyze concept maps automatically. Data mining, commonly called knowledge discovery in databases (KDD), is the systematic exploration and analysis of extensive datasets to unveil patterns and extract valuable insights; it is synonymous with extracting significant hidden knowledge from large datasets. Unlike text mining, which extracts patterns from natural language texts, data mining extracts patterns from structured databases [13]. Educational data mining (EDM) is a widespread application of data mining in education and has been demonstrated to have numerous advantages. EDM is an emerging multidisciplinary field of research that focuses on providing methods for examining data generated in an educational context [14][15]. It uses computational approaches to examine educational data in studying education topics, and it focuses on creating tools for examining the special forms of data present in educational settings to improve understanding and learning environments. Numerous studies have previously examined the effects of data mining techniques on maintaining concept maps.
For instance, a valuable study [16] described a process known as concept map mining (CMM) that involves autonomously creating concept maps from a text. CMM is a process for automatically or partially creating concept maps from a source text. Another relevant study attempted to generate concept maps automatically by using text documents as input and applying association rules mining [17]. Still related to semi-automatic concept map construction, a recent study offered easy concept map making on English reading materials and showed satisfactory results [18]. Rather than reviewing the concept map that the learners created, CMM strives to build a concept map that offers reliable information about the student's comprehension. Although several studies on mining concept maps have been carried out, only a few have analyzed the knowledge structure.

The current study processed open-ended map structures and examined students' comprehension of study topics using data mining techniques. A previous study that resembles this one was conducted by Yoo and Cho [12]. They used association analysis and sub-graph mining to investigate students' understanding quickly and accurately. In practice, they provided a predefined concept map and asked learners to draw by hand; the hand-drawn concept maps were then digitized and processed using a data mining approach. However, that strategy was designed for closed-ended concept maps that refer to students' understanding, not for open-ended concept maps. This study investigated the application of association rules analysis to uncover crucial hidden information that quickly represents students' understanding of a database topic.

II. Method

In this study, utilizing data mining operations on open-ended (or low-directed) concept maps to swiftly determine learners' knowledge is called "mining concept maps."
In particular, the concept map data were subjected to association rules analysis to uncover significant hidden information. However, open-ended concept maps created by students cannot be interpreted as readily as closed-ended ones, which provide predefined components to be reconstructed. Figure 1 shows the general process required in open-ended concept map mining activities.

Fig. 1. General mining concept maps process

The association rule is an appropriate data mining operation for assessing the degree of correlation between different database variables [19]. By identifying particular patterns, it seeks to find regularity in the data. Finding crucial connections between the elements in each transaction is the core task of association rules; the relationship may show how strong a rule is inside the group. Depending on the characteristics and data requirements, most association rule problems are solved using the Apriori algorithm [20].

A. Dataset and Preprocessing

The data source used in this study was a dataset generated from open-ended concept maps created by 27 university students. Learning was carried out using a web-based e-learning system. For the relational database material, students were asked to demonstrate their comprehension after the teacher delivered the lecture material. The students' concept maps were then stored in the database as a concept map corpus. The concept map extraction stage produced a dataset of 27 open-ended concept maps describing students' knowledge of relational database topics.

Preprocessing was done before carrying out the main activity, as is typical for data mining in general. Preparing the data is a crucial phase in the data mining process; it includes both data preparation and form-fitting transformation.
The techniques used in the data preparation stage are designed to enhance the dataset used in the modeling step. To ensure that the modeling step can produce the best results consistent with expectations, a preprocessing approach, a cleaning process, and data selection are carried out at this point. The primary objectives of preprocessing include normalizing data, reducing data size, determining how the data are connected, removing outliers, and extracting features from the data [21]. The preprocessing phase was applied to the concept pairings that make up the relationships on the concept maps; it creates a useful and effective structure from the unprocessed concept and proposition data. The set of propositions on the concept maps created by the students was processed first during this step. Initial processing involves reading and analyzing concept maps from the data source, cleaning up the data, and tokenization [22].

B. Association Rules Mining

The stages of association rules mining in this study are shown in Figure 2. Preparation is the initial stage in open-ended concept map mining, preparing the data for further analysis. Concept recognition involves the identification of potential candidates for key ideas in a student's knowledge structure; the knowledge structure is recognized and retrieved from the relational database system. The frequent 1-itemset method was used to ascertain the occurrence of concepts in each student's knowledge. The group's organizational structure was analyzed since the concept is a distinctive component of each learner's knowledge. Additionally, the results of the concept identification function can be checked to find out which concepts students understand through the developed knowledge structure. The support value for a single itemset X is computed by dividing the number of concept maps that contain X by the total number of maps: support(X) = count(X) / N.
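Under this definition, the frequent 1-itemset step can be sketched in Python. This is a minimal illustration, not the study's implementation: the function name, concept labels, and minimum-support threshold below are all illustrative.

```python
from collections import Counter

def frequent_one_itemsets(concept_maps, min_support):
    """Return {concept: support} for concepts meeting min_support.

    concept_maps: list of sets, one set of concept labels per student map.
    support(X) = count(X) / N, with count(X) the number of maps containing X.
    """
    total = len(concept_maps)
    counts = Counter()
    for cmap in concept_maps:
        counts.update(cmap)  # a set, so each map counts a concept at most once
    return {c: n / total for c, n in counts.items() if n / total >= min_support}

# Illustrative toy data: four student maps over relational-database concepts.
maps = [
    {"relation", "attribute", "tuple"},
    {"relation", "attribute", "domain"},
    {"relation", "tuple"},
    {"attribute", "cardinality"},
]
print(frequent_one_itemsets(maps, min_support=0.5))
# {'relation': 0.75, 'attribute': 0.75, 'tuple': 0.5}
```

Here "relation" appears in 3 of 4 maps (support 0.75), while "domain" and "cardinality" (support 0.25) fall below the threshold and are pruned.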
Relationships refer to statements explaining how linking labels relate two particular ideas. During the relationship detection phase, each concept wholly linked to another idea is acknowledged as a sub-concept map. Association rules, or collections of frequently occurring elements, illustrate the relationships. For instance, the frequent itemset "A – B" shows that there was a close connection between the terms "A" and "B" in the learning content.

Fig. 2. The flow of association rules analysis

The association rules phase reveals interesting connections or links among various data points [2][23]. It shows attributes that have a high likelihood of coexisting in a dataset. The primary aim of association rules is to discover patterns that demonstrate the simultaneous occurrence of features within a database. An association rule is a formula of the form X => Y, where X and Y are two sets of items; it suggests that database transactions containing X are more likely to also contain Y [19]. For instance, suppose the itemsets (A, B) => (C, D) appear together in 5 out of 10 learners' concept maps. It can then be said that 50% of the learners have the same understanding of topics (A, B) and (C, D) in the learning materials.

Support and confidence criteria are frequently used to evaluate the validity of association analysis results [24][25][26]. The lower the minimum support value, the more items appear in the frequent itemsets [27][28][29]; a low minimum support value also increases the number of frequent itemsets at higher levels. Conversely, the higher the minimum support value, the fewer items appear. It is therefore possible to improve the recognition of special patterns by raising the support value. For a two-item set, support is commonly expressed in probability notation as support(X => Y) = P(X ∪ Y). The equation indicates the frequency of the rule inside transactions.
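Both measures can be computed directly from a list of transactions. The sketch below is illustrative only (the function names and toy data are assumptions, mirroring the (A, B) => (C, D) example above):

```python
def support(transactions, itemset):
    """support(X): fraction of transactions (concept maps) containing every item in X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """confidence(X => Y) = support(X ∪ Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Toy data: the four items co-occur in 5 of 10 maps; (a, b) alone appears in 3 more.
tx = [{"a", "b", "c", "d"}] * 5 + [{"a", "b"}] * 3 + [{"e"}] * 2
print(support(tx, {"a", "b", "c", "d"}))       # 0.5, i.e. 5 of 10 maps
print(confidence(tx, {"a", "b"}, {"c", "d"}))  # 0.5 / 0.8
```

So the rule (a, b) => (c, d) has support 50%, as in the example above, but confidence 62.5%, because (a, b) also occurs in three maps without (c, d).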
After frequent itemsets with high support values are obtained, the confidence value is further tested using association rules. For a directional connection, the formula is confidence(X => Y) = P(Y | X); it gives the proportion of transactions containing X that also contain Y. This study examined frequent-association concept map mining using the Apriori method.

III. Results and Discussion

The preparation phase, carried out on a dataset of 27 open-ended concept maps, resulted in 505 concepts and 283 propositions. Next, the association rules mining stage began by uncovering the frequent itemsets of concepts in a class group consisting of 27 learners. This analysis was based on the unique collection of concepts defined by each group member. The goal of this operation was to find the concepts in the group that have the highest frequency. The frequent 1-itemset combinations formed with a minimum support value of 40% are shown in Table 1. The frequent 1-itemset concepts show the ideas most expressed by students in the relational database material.

Table 1. Concepts frequent itemset

Concepts                Frequency   Support (%)
Relational database     22          81
Relation                22          81
Attribute               22          81
Tuple                   22          81
Domain                  21          78
Cardinality             20          74
Two-dimensional table   17          70
Degrees                 14          52
Super key               13          48
Row and column          13          48
Candidate key           12          44
Primary key             12          44
Simple                  12          44
Term                    12          44

Table 1 shows that the concepts "relational database", "relation", "attribute", and "tuple" are the most frequently occurring ideas, with a frequency value of 22 (support 81%). This value states that 22 learners jointly defined these ideas on their concept maps. Viewed against the substance of the relational database material, these ideas are closely related to the topic of the material. Nevertheless, further analyses must be conducted to uncover other, more obvious association patterns.
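As a quick arithmetic check, the support percentages in Table 1 follow directly from the reported frequencies over the 27 maps (only a subset of rows is recomputed here):

```python
total_maps = 27  # number of open-ended concept maps in the dataset

# Frequencies reported in Table 1 (a subset of rows).
frequency = {"relational database": 22, "domain": 21, "cardinality": 20, "degrees": 14}

for concept, count in frequency.items():
    print(f"{concept}: support = {round(100 * count / total_maps)}%")
# relational database: 81%, domain: 78%, cardinality: 74%, degrees: 52%
```

The rounded values match the 81%, 78%, 74%, and 52% entries in Table 1.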
In the open-ended concept map, the concept is an essential element representing an individual's original idea. However, concept analysis alone cannot reveal the essential, hidden information in concept maps. This condition is very different from the market basket case, which treats items as independent data to be analyzed directly. In the case of concept maps, the appearance of concepts A and B in some concept maps does not necessarily indicate that they are important terms unless they are related: a concept has no meaning when it is not associated with other concepts to form a proposition. Therefore, an analysis that focuses on the propositions the learners have created is required.

Further investigation was done to identify patterns of frequent relationships that represent learners' knowledge of the propositions they made. Table 2 lists the relationships with a minimum support level of 20%. The pattern of frequent relationships appears to be consistent with the frequent itemsets of concepts. The relationship "two-dimensional table – row and column" appears 10 times, with a support value of 37%; that is, 10 learners defined the proposition "two-dimensional table – row and column" on their concept maps. Referring to the frequent itemset of concepts, the concept "two-dimensional table" has a frequency value of 17 (support 70%), while "row and column" has a frequency value of 13 (support 48%). This confirms that the proposition "two-dimensional table – row and column" is formed from concepts that often appear in the concept map collection. Thus, the relation "two-dimensional table – row and column" became the dominant proposition representing learners' understanding of the learning materials.

Table 2.
Relationships frequent itemset

Relationships                             Frequency   Support (%)
Two-dimensional table – row and column    10          37
Term – relation                           8           30
Term – attribute                          8           30
Term – tuple                              8           30
Relational database – profit              7           26
Term – domain                             7           26
Relational database – term                7           26
Relational key – super key                6           22
Term – degrees                            6           22
Relational key – candidate key            6           22
Relational database – relational key      6           22

Further analysis was carried out on the frequent itemset relationships by identifying their concept pairs. The aim is to reveal whether the concept pairs in each proposition are themselves concepts in the frequent itemset list of concepts with at least 40% support. In Table 3, concepts marked "Yes" are in the list of frequent itemset concepts. The identification results in Table 3 emphasize again that the propositions formed by learners consist of concepts with a high occurrence value. Almost all concepts are found in the frequent concept itemsets with a minimum support value of 40%, except for the "profit" and "relational key" concepts. Even so, the concept "profit" occurs 9 times (support 33%) and the concept "relational key" 10 times (support 37%).

Table 3. Relationships in concepts frequent itemset

Relationships                             Concept 1   Concept 2
Two-dimensional table – row and column    Yes         Yes
Term – relation                           Yes         Yes
Term – attribute                          Yes         Yes
Term – tuple                              Yes         Yes
Relational database – profit              Yes         No
Term – domain                             Yes         Yes
Relational database – term                Yes         Yes
Relational key – super key                No          Yes
Term – degrees                            Yes         Yes
Relational key – candidate key            No          Yes
Relational database – relational key      Yes         No

The next analysis combines the support and confidence parameters to obtain association rule patterns. Table 4 depicts the results of the investigation of the concept association rules, applying a minimum support of 50% and a minimum confidence of 90%.
The minimum support was set at 50% because a lower value would produce more patterns. To obtain interesting patterns in a manageable number, some researchers advise using a minimum confidence of 90% [30]; too many patterns make extracting valuable insights from the dataset difficult.

Table 4. Concepts association rules

Association rules                Min-support (%)   Min-confidence (%)
Degrees => domain                50                100
Degrees => relation              50                100
Degrees => attribute             50                100
Degrees => tuple                 50                100
Degrees => relational database   50                100
Attribute => tuple               50                96
Tuple => attribute               50                96
Domain => attribute              50                96
Domain => tuple                  50                96
Cardinality => domain            50                95
Cardinality => relation          50                95
Cardinality => attribute         50                95
Cardinality => tuple             50                95
Relation => attribute            50                91
Relation => tuple                50                91
Attribute => domain              50                91
Attribute => relation            50                91
Tuple => domain                  50                91
Tuple => relation                50                91
Domain => relation               50                91

The frequent sub-concept maps were then examined to see whether any new hidden patterns might emerge. Frequent sub-concept maps illustrate the link between propositions A and B on the concept maps. Further analysis was conducted using association rules to find patterns in the propositions formed. Two settings were applied to the pattern search for relationships: the first used a minimum support of 30% and 100% confidence, and the second a minimum support of 20% and 90% confidence. This processing stage was based on the fact that no relationship patterns were found when the support value was more than 30%. Table 5 shows the results of association rules analysis on propositions from the relational database material.

Table 5.
Propositions association rules

Association rules                        Min-support (%)   Min-confidence (%)
Term – attribute => term – relation      30                100
Term – attribute => term – tuple         30                100
Term – relation => term – attribute      30                100
Term – relation => term – tuple          30                100
Term – tuple => term – relation          30                100
Term – tuple => term – attribute         30                100
Profit – simple => term – relation       20                90
Profit – simple => term – attribute      20                90
Profit – simple => term – tuple          20                90
Profit – simple => term – domain         20                90
Profit – simple => term – degrees        20                90
Term – attribute => term – relation      20                90
Term – attribute => term – tuple         20                90
Term – relation => term – attribute      20                90
Term – relation => term – tuple          20                90
Term – tuple => term – relation          20                90
Term – tuple => term – attribute         20                90
Term – domain => term – relation         20                90
Term – domain => term – attribute        20                90
Term – domain => term – tuple            20                90
Term – degrees => term – relation        20                90
Term – degrees => term – attribute       20                90
Term – degrees => term – tuple           20                90
Term – degrees => term – domain          20                90
Term – cardinality => term – relation    20                90
Term – cardinality => term – attribute   20                90
Term – cardinality => term – tuple       20                90
Term – cardinality => term – domain      20                90

In line with prior studies [31], data mining approaches can be useful for revealing hidden information in the context of education. This study revealed that data mining techniques made it possible to determine how well students understood an open-ended concept map construction. In particular, teachers can immediately identify recurring ideas, common connections, and relationships developed by students through association rules. Analysis of association rules applied to a collection of open-ended concept maps reveals new concept and proposition formation patterns. Teachers can use these findings to understand learners' knowledge in capturing material. Furthermore, teachers gain important insights regarding the association of concepts and propositions formed in electronic learning activities using concept maps.
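The level-wise search behind Tables 4 and 5 can be sketched with a compact Apriori-style miner. This is an illustrative sketch, not the study's implementation: the function name, toy transactions, and thresholds are assumptions, and a production analysis would typically use a dedicated library.

```python
from itertools import combinations

def apriori_rules(transactions, min_support, min_confidence):
    """Enumerate association rules X => Y meeting both thresholds.

    Frequent itemsets are grown level by level (the Apriori idea); every rule
    splits a frequent itemset into a non-empty antecedent X and consequent Y.
    """
    n = len(transactions)
    sup = lambda items: sum(set(items) <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items if sup([i]) >= min_support]
    all_frequent = list(frequent)
    while frequent:
        # Candidate (k+1)-itemsets from unions of frequent k-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        frequent = [c for c in candidates if sup(c) >= min_support]
        all_frequent += frequent

    rules = []
    for itemset in all_frequent:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for x in combinations(itemset, r):
                x, y = frozenset(x), itemset - frozenset(x)
                conf = sup(itemset) / sup(x)
                if conf >= min_confidence:
                    rules.append((set(x), set(y), sup(itemset), conf))
    return rules

# Toy transactions loosely echoing Table 4's concept labels.
tx = [{"degrees", "domain", "relation"}] * 6 + [{"domain", "relation"}] * 3 + [{"tuple"}]
for x, y, s, c in apriori_rules(tx, min_support=0.5, min_confidence=0.9):
    print(x, "=>", y, round(s, 2), round(c, 2))
```

On this toy data the miner emits rules such as degrees => domain (support 0.6, confidence 1.0) while discarding, for example, domain => degrees, whose confidence (0.67) falls below the 90% threshold.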
A weight is given to each concept on the concept maps to determine its frequency. The frequent itemsets of concepts make it easier for teachers to identify original ideas related to the learning materials. Exposing the frequent itemsets of relationships also benefited the teacher greatly, because relationships are illustrations of propositions on the concept map [1][2]. A weight was assigned to each occurrence of a proposition in the concept maps to make its role clear. Finally, teachers could record their students' understanding patterns through the generated association rules. This information is significant because it reveals what students believe after learning new information.

This study also identified several issues that need to be considered to improve the operation of association rules. Since the concept maps were formed with an open-ended approach, several terms may have the same meaning but be written differently, for example, "two-dimensional table", "2-dimensional table", and "2d table". As a result, manual spelling correction was required to obtain the correct expression. Additionally, a semantic similarity approach might be adopted to produce more accurate results.

IV. Conclusion

The current study uses data mining techniques to uncover hidden data in e-learning built on open-ended concept maps. Association rule analysis using the Apriori algorithm was applied to identify the patterns of association knowledge among students studying relational databases. This study found that data mining methods can effectively interpret sizable collections of open-ended concept maps. The analysis showed that forming association rule patterns and frequent itemsets for concepts, relationships, and relationships between concepts can reveal the learners' level of understanding.
Association analysis quickly uncovers valuable insight compared to laborious and time-consuming manual processing. The extent to which students comprehend the information the teacher has presented can also be determined using the discovered association rules.

Some restrictions on this study should be considered. First, the experiment used a relatively small number of concept maps; the dataset should be expanded to obtain more trustworthy results. Second, this study used the Apriori algorithm, which is typically less effective on large datasets, so other algorithms such as FP-Growth, hash-based methods, or generalized rule induction could be tested. Additionally, the current study focused on association rules analysis that merely expresses the percentage of proposition combinations; applying other operations, such as comparing concept maps made by students and teachers, would be an interesting extension.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] T. Hirashima, "Reconstructional concept map: automatic assessment and reciprocal reconstruction," Int. J. Innov. Creat. Chang., vol. 5, no. 5, pp. 669–682, 2019.
[2] D. D. Prasetya, A. Pinandito, Y. Hayashi, and T.
Hirashima, "Analysis of quality of knowledge structure and students' perceptions in extension concept mapping," Res. Pract. Technol. Enhanc. Learn., vol. 17, no. 1, p. 14, Dec. 2022.
[3] K. E. de Ries, H. Schaap, A.-M. M. J. A. P. van Loon, M. M. H. Kral, and P. C. Meijer, "A literature review of open-ended concept maps as a research instrument to study knowledge and learning," Qual. Quant., vol. 56, no. 1, pp. 73–107, Feb. 2022.
[4] A. Whitelock-Wainwright, N. Laan, D. Wen, and D. Gašević, "Exploring student information problem solving behaviour using fine-grained concept map and search tool data," Comput. Educ., vol. 145, p. 103731, Feb. 2020.
[5] A. J. Cañas and J. D. Novak, "Concept mapping using CmapTools to enhance meaningful learning," in Advanced Information and Knowledge Processing, 2014, pp. 23–45.
[6] J. D. Novak and A. J. Cañas, "The theory underlying concept maps and how to construct them," Tech. Rep. IHMC C. 2006-01, pp. 1–31, 2006.
[7] D. D. Prasetya, T. Hirashima, and Y. Hayashi, "Comparing two extended concept mapping approaches to investigate the distribution of students' achievements," IEICE Trans. Inf. Syst., vol. E104.D, no. 2, pp. 337–340, Feb. 2021.
[8] V. Aleven et al., "Example-tracing tutors: intelligent tutor development for non-programmers," Int. J. Artif. Intell. Educ., vol. 26, no. 1, pp. 224–269, Mar. 2016.
[9] J. Pailai, W. Wunnasri, K. Yoshida, Y. Hayashi, and T. Hirashima, "The practical use of KIT-Build concept map on formative assessment," Res. Pract. Technol. Enhanc. Learn., vol. 12, no. 1, p. 20, Dec. 2017.
http://journal2.um.ac.id/index.php/keds https://www.ijicc.net/images/vol5iss5/part_2/55225_hirashima_2020_e_r.pdf https://www.ijicc.net/images/vol5iss5/part_2/55225_hirashima_2020_e_r.pdf https://doi.org/10.1186/s41039-022-00189-9 https://doi.org/10.1186/s41039-022-00189-9 https://doi.org/10.1007/s11135-021-01113-x https://doi.org/10.1007/s11135-021-01113-x https://doi.org/10.1007/s11135-021-01113-x https://doi.org/10.1016/j.compedu.2019.103731 https://doi.org/10.1016/j.compedu.2019.103731 https://doi.org/10.1007/978-1-4471-6470-8_2 https://doi.org/10.1007/978-1-4471-6470-8_2 https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e4cb974d91e6f373c77c0560f5ad3139ca1f90c8 https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e4cb974d91e6f373c77c0560f5ad3139ca1f90c8 https://search.ieice.org/bin/summary.php?id=e104-d_2_337 https://search.ieice.org/bin/summary.php?id=e104-d_2_337 https://search.ieice.org/bin/summary.php?id=e104-d_2_337 https://doi.org/10.1007/s40593-015-0088-2 https://doi.org/10.1007/s40593-015-0088-2 https://doi.org/10.1186/s41039-017-0060-x https://doi.org/10.1186/s41039-017-0060-x d.d prasetya / knowledge engineering and data science 2022, 5 (2): 179–187 187 [10] a. cañas, e. martínez-ortigosa, b. prieto, b. pino, and a. prieto, “swad, an open learning management system: results and challenges.,” in edmedia+ innovate learning, 2019, pp. 1496–1509. [11] e. m. taricani and r. b. clariana, “a technique for automatically scoring open-ended concept maps,” educ. technol. res. dev., vol. 54, no. 1, pp. 65–82, feb. 2006. [12] j. s. yoo and m.-h. cho, “mining concept maps to understand university students’ learning,” 2012. [13] s. inzalkar and j. sharma, “a survey on text mining-techniques and application,” int. j. res. sci. eng., vol. 24, pp. 1– 14, 2015. [14] c. romero and s. ventura, “educational data mining and learning analytics: an updated survey,” wires data min. knowl. discov., vol. 10, no. 3, may 2020. [15] r. s. 
baker, “challenges for the future of educational data mining: the baker learning analytics prizes,” j. educ. data min., vol. 11, no. 1, pp. 1–17, 2019. [16] j. j. villalon and r. a. calvo, “concept map mining: a definition and a framework for its evaluation,” in 2008 ieee/wic/acm international conference on web intelligence and intelligent agent technology, dec. 2008, pp. 357–360. [17] z. shao, y. li, x. wang, x. zhao, and y. guo, “research on a new automatic generation algorithm of concept map based on text analysis and association rules mining,” j. ambient intell. humaniz. comput., vol. 11, no. 2, pp. 539 – 551, feb. 2020. [18] a. pinandito, d. d. prasetya, y. hayashi, and t. hirashima, “design and development of semi-automatic concept map authoring support tool,” res. pract. technol. enhanc. learn., vol. 16, no. 1, p. 8, dec. 2021. [19] a. ait-mlouk, f. gharnati, and t. agouti, “an improved approach for association rule mining using a multi-criteria decision support system: a case study in road safety,” eur. transp. res. rev., vol. 9, no. 3, p. 40, sep. 2017 . [20] k. li et al., “impact factors analysis on the probability characterized effects of time of use demand response tariffs using association rule mining method,” energy convers. manag., vol. 197, p. 111891, oct. 2019. [21] s. a. alasadi and w. s. bhaya, “review of data preprocessing techniques in data mining,” j. eng. appl. sci., vol. 12, no. 16, pp. 4102–4107, 2017. [22] m. a. depositario, g. g. t. noangay, j. m. f. melchor, c. c. abalorio, and j. c. m. bustillo, “automated categorization of research papers with mono supervised term weighting in recapp,” int. j. adv. comput. sci. appl., vol. 14, no. 2, 2023. [23] m. karabatak, m. c. ince, and a. sengur, “wavelet domain association rules for efficient texture classification,” appl. soft comput., vol. 11, no. 1, pp. 32–38, jan. 2011. [24] r. agrawal, t. imieliński, and a. 
swami, “mining association rules between sets of items in large databases,” in proceedings of the 1993 acm sigmod international conference on management of data sigmod ’93, 1993, pp. 207–216. [25] t. m. alam et al., “a novel framework for prognostic factors identification of malignant mesothelioma through association rule mining,” biomed. signal process. control, vol. 68, p. 102726, jul. 2021. [26] m. kaushik, r. sharma, s. a. peious, m. shahin, s. ben yahia, and d. draheim, “a systematic assessment of numerical association rule mining methods,” sn comput. sci., vol. 2, no. 5, p. 348, sep. 2021. [27] s. yacoubi, g. manita, h. amdouni, s. mirjalili, and o. korbaa, “a modified multi-objective slime mould algorithm with orthogonal learning for numerical association rules mining,” neural comput. appl., vol. 35, no. 8, pp. 6125– 6151, mar. 2023. [28] t. berteloot, r. khoury, and a. durand, “association rules mining with auto-encoders,” arxiv prepr. arxiv2304.13717, 2023. [29] m. sun, r. zhou, and c. jiao, “analysis of hazmat truck driver fatigue and distracted driving with warning-based data and association rules mining,” j. traffic transp. eng. (english ed., vol. 10, no. 1, pp. 132–142, feb. 2023. [30] i. h. witten, e. frank, m. a. hall, c. j. pal, and m. data, “practical machine learning tools and techniques,” in data mining, 2005, vol. 2, no. 4. [31] y. xu, “research of association rules algorithm in data mining,” int. j. database theory appl., vol. 9, no. 6, pp. 119– 130, 2016. 
https://www.learntechlib.org/p/210166/ https://www.learntechlib.org/p/210166/ https://doi.org/10.1007/s11423-006-6497-z https://doi.org/10.1007/s11423-006-6497-z https://eric.ed.gov/?id=ed537216 http://www.ttcenter.ir/articlefiles/enarticle/3771.pdf http://www.ttcenter.ir/articlefiles/enarticle/3771.pdf https://doi.org/10.1002/widm.1355 https://doi.org/10.1002/widm.1355 https://jedm.educationaldatamining.org/index.php/jedm/article/view/432 https://jedm.educationaldatamining.org/index.php/jedm/article/view/432 https://doi.org/10.1109/wiiat.2008.387 https://doi.org/10.1109/wiiat.2008.387 https://doi.org/10.1109/wiiat.2008.387 https://doi.org/10.1007/s12652-018-0934-9 https://doi.org/10.1007/s12652-018-0934-9 https://doi.org/10.1007/s12652-018-0934-9 https://doi.org/10.1186/s41039-021-00155-x https://doi.org/10.1186/s41039-021-00155-x https://doi.org/10.1007/s12544-017-0257-5 https://doi.org/10.1007/s12544-017-0257-5 https://doi.org/10.1016/j.enconman.2019.111891 https://doi.org/10.1016/j.enconman.2019.111891 https://d1wqtxts1xzle7.cloudfront.net/54509277/4102-4107-libre.pdf?1506113528=&response-content-disposition=inline%3b+filename%3dreview_of_data_preprocessing_techniques.pdf&expires=1685243604&signature=aunbmi~vkhzzzqbgn6guxhcnytzl6zewrd3kwhzjgcfsryqonnvcig2q6fwwyrvs7wobdip4akpj7bfqawlxvenml~pzbt1qweeocfxbx4-dib4b7ldrnurr6rgpl5efmejx6czmq~njm6qmqqq52sxunrfzm9umxwzy5l5e4p3jbkr158d8y9dgz3v44edey0luley~m7ah1vldcdsmb0lkn6j3ln7ar~8th9w2iftqctidp-80fphp7i77av2p8dcboerhu1ttq2u2hfr0vwkt6bvfa9cunokwgerut121-pu4kvlnzm1podw~jgk3toeutvr7srh~-chmkwbbpq__&key-pair-id=apkajlohf5ggslrbv4za 
https://d1wqtxts1xzle7.cloudfront.net/54509277/4102-4107-libre.pdf?1506113528=&response-content-disposition=inline%3b+filename%3dreview_of_data_preprocessing_techniques.pdf&expires=1685243604&signature=aunbmi~vkhzzzqbgn6guxhcnytzl6zewrd3kwhzjgcfsryqonnvcig2q6fwwyrvs7wobdip4akpj7bfqawlxvenml~pzbt1qweeocfxbx4-dib4b7ldrnurr6rgpl5efmejx6czmq~njm6qmqqq52sxunrfzm9umxwzy5l5e4p3jbkr158d8y9dgz3v44edey0luley~m7ah1vldcdsmb0lkn6j3ln7ar~8th9w2iftqctidp-80fphp7i77av2p8dcboerhu1ttq2u2hfr0vwkt6bvfa9cunokwgerut121-pu4kvlnzm1podw~jgk3toeutvr7srh~-chmkwbbpq__&key-pair-id=apkajlohf5ggslrbv4za https://www.researchgate.net/profile/cristopher-abalorio/publication/368850313_automated_categorization_of_research_papers_with_mono_supervised_term_weighting_in_recapp/links/63fdf5ea57495059454f3126/automated-categorization-of-research-papers-with-mono-supervised-term-weighting-in-recapp.pdf https://www.researchgate.net/profile/cristopher-abalorio/publication/368850313_automated_categorization_of_research_papers_with_mono_supervised_term_weighting_in_recapp/links/63fdf5ea57495059454f3126/automated-categorization-of-research-papers-with-mono-supervised-term-weighting-in-recapp.pdf https://www.researchgate.net/profile/cristopher-abalorio/publication/368850313_automated_categorization_of_research_papers_with_mono_supervised_term_weighting_in_recapp/links/63fdf5ea57495059454f3126/automated-categorization-of-research-papers-with-mono-supervised-term-weighting-in-recapp.pdf https://doi.org/10.1016/j.asoc.2009.10.009 https://doi.org/10.1016/j.asoc.2009.10.009 https://doi.org/10.1145/170035.170072 https://doi.org/10.1145/170035.170072 https://doi.org/10.1145/170035.170072 https://doi.org/10.1016/j.bspc.2021.102726 https://doi.org/10.1016/j.bspc.2021.102726 https://doi.org/10.1007/s42979-021-00725-2 https://doi.org/10.1007/s42979-021-00725-2 https://doi.org/10.1007/s00521-022-07985-w https://doi.org/10.1007/s00521-022-07985-w https://doi.org/10.1007/s00521-022-07985-w https://arxiv.org/abs/2304.13717 
knowledge engineering and data science (keds) pissn 2597-4602 vol 6, no 1, april 2023, pp. 103–113 eissn 2597-4637 https://doi.org/10.17977/um018v6i12023p103-113 ©2023 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

optimizing random forest algorithm to classify player's memorisation via in-game data

akmal vrisna alzuhdi a,1, harits ar rosyid a,2,*, mohammad yasser chuttur b,3, shah nazir c,4
a department of electrical engineering and informatics, universitas negeri malang, jl. semarang no. 5, malang 65145, indonesia
b department of software and information systems, university of mauritius, 2nd floor phase ii building, reduit 80837, mauritius
c department of computer science, university of swabi, ambar, swabi, khyber pakhtunkhwa 94640, pakistan
1 akmal.vrisna.1705356@students.um.ac.id; 2 harits.ar.ft@um.ac.id; 3 y.chuttur@uom.ac.mu; 4 nasirshah@uoswabi.edu.pk
* corresponding author

article history: received 13 april 2023, revised 30 august 2023, accepted 25 september 2023, published online 4 october 2023
keywords: classification, game education, memorization, random forest

abstract: assessment of a player's knowledge in game education has been around for some time. traditional evaluation in and around a gaming session may disrupt the players' immersion. this research uses an optimized random forest to construct a non-invasive prediction of a game education player's memorization via in-game data. firstly, we obtained the dataset from a 3-month survey recording the in-game data of 50 players who played 4–15 game stages of chem fight (a test case game). next, we generated three variants of datasets via the preprocessing stages: a resampling method (smote), normalization (min-max), and a combination of resampling and normalization. then, we trained and optimized random forest (rf) classifiers to predict the player's memorization. we chose rf because it generalizes well given a high-dimensional dataset. we used rf as the classifier, subject to optimization via its hyperparameter n_estimators. we implemented a grid search cross-validation (gscv) method to identify the best value of n_estimators, and utilized the statistics of the gscv results to reduce n_estimators by observing the region of interest shown by the performance graphs of the classifiers. overall, the classifiers fitted using the best n_estimators (i.e., 89, 31, 89, and 196 trees) from gscv performed well, with around 80% accuracy. moreover, we successfully identified smaller (optimal) values of n_estimators, at most half of the best values. all classifiers were retrained using the optimal n_estimators (37, 12, 37, and 41 trees). we found that the performances of the classifiers remained relatively steady at ~80%. this means that we successfully optimized the random forest in predicting a player's memorization when playing the chem fight game. the automated technique presented in this paper can monitor student interactions and evaluate their abilities based on in-game data. as such, it can offer objective data about the skills used.

i. introduction

a game is a work of art based on specific rules. these rules drive the end of the game based on the player's actions within the game. a player should use the tools and objects provided in the game to achieve victory. entertainment is paramount in games, but a game is also a potential vehicle for training and education through basic thinking skills to solve conflicts or problems [1][2]. educational games integrate complex principles, including knowledge, pedagogy, decision-making, collaboration, and gaming [3]. the primary aim of an educational game is learning and having fun in unity [4]. presenting knowledge in an educational game is often wrapped or decorated at the game level. for instance, the number munchers game (popular in the 1980s and 1990s) [5] represents dots with math equations. the player aims to collect equations that produce a particular answer (the mission objective). here, players can learn math equations full of joy (e.g., an experience after evading an enemy). number munchers showcases an educational game with learning and gaming elements that make it fun and motivating [6]. in addition, by playing an educational game, players can repetitively play a game level (the learning task as the mission objective) they failed. this repetitiveness in educational games removes the fear of losing marks. bibliometric analysis has allowed for the investigation of serious gaming research trends [7][8][9][10][11]. the data showed a rise in serious game publications in recent years, highlighting the growing significance of serious games in education.
many academic fields, including education technology, psychology, the medical sciences, the environmental sciences, and corporate economics, have studied serious games. research has also focused on the use of serious games to help persons with disabilities [9], with education and computer games being the most popular game genre and game platform, respectively. collecting data for serious game analytics has proven difficult, with pre-game, in-game, and post-game data being the most common. digital games and gamification have proven helpful in nursing education in fostering active involvement, elevating satisfaction levels, and imparting skills [10]. a player undergoes complex experiences during game playing. affective experiences were reported to exist in educational games, such as emotion [12], motivation [13][14][15][16], and enjoyment [17][18]. those articles show that affective experiences can carry the same significance as the learning goal. however, the player's knowledge reportedly dominated this research topic, with examples such as the travel in europe and sea games [17][19], math games [20], the crystal island narrative-based game [21][22], and many more. as discussed by hainey et al. [23], serious game players are empowered to become more actively involved, not only in the learning process but also in the design and development of cutting-edge formative assessment tools, and serious games are becoming increasingly popular as alternative supplementary learning approaches across all disciplines and levels. hence, assessing the knowledge a player gains from playing an educational game is the core task. an assessment based on only one or a small number of the game's attributes or indicators (e.g., final result, total failures, or duration) often leaves unclear what has actually been learned.
for instance, a victory in a game session cannot distinguish whether the player understood the knowledge with commitment, guessed luckily, or was just playing around. to solve such a problem, one can apply a traditional assessment method via a questionnaire or pre- and post-game exams. this is undoubtedly a reliable and effective assessment method for a user's learning [24]; the numeric difference between pre- and post-game examinations quantitatively measures the learning gain. however, a test administered within a game session may disrupt the enjoyment; one example is a questionnaire to self-report an affective experience such as enjoyment [25]. these assessment methods interrupt the gaming experience: players must set aside the exciting gaming experience for a seriously thoughtful test. few players can deal with such conditions, which can lead to disengagement from the game. observing player behavior is not practical either, because of the observer's subjectivity. meanwhile, there exist in-game actions and the corresponding game level (as inputs) that can represent an experience (as an output) [26][27][28]. however, identifying which information is relevant to the player's learning is specific to each game. more importantly, optimally correlating the input values and the output is the goal of this research. once we optimally train the prediction model, it should non-intrusively and accurately assess the player's knowledge while maintaining immersion in gaming [28]. considering the vast amount of information one can retrieve from a game, a data mining approach should fit the task at hand [29][30][31][32]. say the player's knowledge is categorized as a memorization type; a classification technique can solve that, such as [33].
a potential solution is implementing a random forest classification to predict the player's memorization, since it is robust to high-dimensional data and offers high accuracy and good generalization [34]. the data acquired from human players is often unreliable and generally imbalanced. optimally categorizing the player's memorization should provide an unbiased evaluation, which is important for customized learning experiences, adaptable game mechanics, or customized feedback to improve learning outcomes. thus, we need an optimization method for the random forest classifier to handle such data. to accommodate that, we experiment with dataset variants that preprocess the dataset using a resampling method via smote, normalize the data using the min-max method, or combine both preprocessing steps. we expect our approach to be replicable in other educational games, since the procedures are straightforward and clear. with the seamless prediction of the player's memorization, we can identify more insights about the correlation between gaming actions and the learning experience. a greater variety of educational games with seamless assessment can lead to standardized in-game data that contribute to understanding learning experiences, thus providing a reliable guideline for designing educational games based on the most relevant in-game data. the following section discusses the proposed methodology to develop an in-game assessment of the player's knowledge when playing an educational game. it starts by describing the test case educational game and follows with the proposed methodology, including the data collection method and the optimization experiment.

ii. method

a. overview of the educational game as the case study

this study uses a game called chem dungeon (chemical dungeon) as the test case [35]. it is an educational game in introductory chemistry that helps players memorize atoms and chemical compounds.
the game genre of chem dungeon is a roguelike in a labyrinth. chem dungeon's labyrinth comprises paths, walls, intersections, and dead-end alleys (figure 1, reproduced from [35]). an avatar starts from a spawn point and then collects and forms a chemical compound to reveal an escape portal at the bottom-right of the labyrinth. the avatar should evade non-player characters (npcs) and avoid constructing incorrect chemical compounds. the avatar has an atomic shield (an atom ready for bonding with others), and details of the atom are readable near the spawn point. when the avatar strikes an atomic mine (blue shield), the game reports the compound-forming result or atom properties, readable at the top center of the labyrinth. game attributes on the right side of the maze include lives (heart icon), experience as a red bar, the remaining ammunition (number), total bonds made (number), and the countdown timer. inside the labyrinth there are bullets (yellow items), atoms (blue items), and life potions (red items) that are collectible by the avatar. each collected bullet increases the avatar's ammunition. a life potion can restore the avatar's life.

fig 1. chem dungeon game

chem dungeon's goals are to find the right element to create a compound and pass through the escape portal within 90 seconds. the avatar initially spawns in its residence, and the npcs start in the diagonal pathways of the labyrinth (bottom-left to top-right). players navigate the avatar with the w, a, s, and d keys (up, left, down, and right). the character must stay clear of npcs and atomic mines while exploring the labyrinth. it loses one life whenever it collides with a weak opponent or a bad atomic mine. the character can also shoot an atomic mine to clear the way. when a bullet strikes a strong opponent, it changes that enemy's state to one of weakness (white-colored npc).
then, the avatar can capture a weak npc to make it respawn at its house, opening a new path for the avatar. as a result, the avatar can search for and gather the appropriate element (mine), creating a compound with other atoms. a piece of educational information on the chemical compound appears at this time, so this game condition should engage participants to memorize and understand the learning materials. the escape portal opens once the avatar has gathered the correct atom ten times. finally, the avatar achieves victory by passing through the escape portal; otherwise, a defeat results from losing all lives or running out of time. the chem dungeon game contains 100 chemical compounds, each constructed from at least two atoms. a compound is shown as a character string representing the symbol, name, and bonding atoms. for instance, two hydrogen atoms and one oxygen atom construct h2o, representing the water compound. in the game, an atom is a collectible object labeled with its atomic symbol, e.g., c, ag, n; if more than one atom of the same type is needed, it appears as a combination of the count and the atomic symbol, e.g., 2h, 6b. the following provides some useful game-playing advice. although each game has different element options, the objective is to create a single compound (repeatedly). players new to the game frequently use a trial-and-error approach while remaining fully conscious of not wasting their remaining lives. the player should, therefore, attentively read the text message corresponding to the most recent outcome of the compound-forming effort. whenever players lose a life, they can recover it by gathering potions. alternatively, they can fill the experience (xp) bar by weakening foes with bullets and capturing them; one extra life is awarded once the xp meter is full. such an endeavor should, however, consider the remaining ammunition and the 90-second time restriction.
these restrictions prevent players from exploiting such tactical strategies purely for amusement while ignoring the main objective of the game, which is to keep compound formation in the player's memory. according to [35], the game can procedurally generate up to 486,000 playable stages. each stage consists of a combination of learning material and a game map. this vast number of game stages allows players to experience different challenges, categorized into three difficulty levels. the game map data, the player's actions, and achievements are recorded during game sessions; together, these data are called in-game data.

b. proposed methodology

this research follows the procedures shown in figure 2. the first step collects datasets from chem dungeon game sessions using the procedures shown in figure 3. a survey was conducted in which participants followed the data collection steps, and each participant played at least ten cycles of data collection. each cycle produces a sample comprising in-game data and a label. the label (a.k.a. memorization performance, mp) is the score difference between the pre-game (m0) and post-game (m1) questionnaires about the learning material presented in the game stage. given that each response to the questionnaire is a binary value, there are four possible mp combinations, shown in table 1, of which only three categories are used. first, if a player scores 0 on both the pre- and post-game questionnaires, the sample is categorized as mp0; it represents a player who failed to memorize the new knowledge. second, mp1 is the label when the difference between pre- and post-game is 1; it means recognition (successful memorization) of the new knowledge. the third category is mp2, when players recall knowledge they already knew. the fourth combination, a negative difference between pre- and post-game scores, indicates a decrease in memory and is not used as a label.
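the labelling rules above can be sketched as a small function; the function and variable names are illustrative, not taken from the authors' code:

```python
def mp_label(m0: int, m1: int):
    """Label a sample from binary pre-game (m0) and post-game (m1) scores.

    Returns "MP0", "MP1", "MP2", or None for the unused negative
    difference (m0=1, m1=0), which is treated as an outlier.
    """
    if m0 == 0 and m1 == 0:
        return "MP0"  # failed to memorize the new knowledge
    if m0 == 0 and m1 == 1:
        return "MP1"  # recognition: successful memorization
    if m0 == 1 and m1 == 1:
        return "MP2"  # recall of already-known knowledge
    return None       # m0=1, m1=0: outlier, filtered out in preprocessing

# filtering a toy batch of (m0, m1) questionnaire responses
responses = [(0, 0), (0, 1), (1, 1), (1, 0)]
labels = [mp_label(m0, m1) for m0, m1 in responses]
clean = [lab for lab in labels if lab is not None]
```

the final filtering line mirrors the outlier removal described in the preprocessing stage: samples with a negative pre/post difference simply drop out of the dataset.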
negative differences occurred because a player responded arbitrarily to the pre- or post-game questionnaire, so such samples are categorized as outliers [36]. this method is simple; however, the risk of irrelevant gaming actions or arbitrary responses to the questionnaire remains. therefore, we must preprocess the resulting in-game data before the modeling stage.

fig 2. research procedures

fig 3. data collection procedures

table 1. memorization performance

         m1 = 0   m1 = 1
m0 = 0   mp0      mp1
m0 = 1   n/a      mp2

once the dataset is collected, the next step is the preprocessing stage. preprocessing aims to identify outliers by filtering out samples with a negative label or missing values. this results in a clean raw dataset, the first dataset s0. the following preprocessing steps are resampling for the imbalanced dataset, min-max normalization, or their combination. as such, preprocessing yields three more datasets: the smoted dataset sr, the normalized dataset sm, and the smoted-normalized dataset srm. each dataset is split into a 70% training set (i.e., r0, rr, rm, rrm) and a 30% test set (i.e., t0, tr, tm, trm). each training set (i.e., r0, rr, rm, rrm) is used to construct an optimized classifier using the random forest algorithm. random forest can generalize well given a high-dimensional dataset, with higher accuracy than other algorithms [34][37]. research in [34] shows that the random forest classifies behavior-related data well; the in-game data falls into this category. the first optimization targets the random forest parameter using grid search cross-validation (gscv). the second optimization is to evaluate whether preprocessing affects the classification result.

the following are the gscv configurations for the random forest classifier using the training dataset:
• parameter grid: n_estimators = {2, 3, …, 201},
• 5-fold cross-validation (considering the imbalanced dataset),
• scorer: weighted f1-score (considering the imbalanced dataset).

then, we trained the random forest (rf) using the training set over the best values of n_estimators; we call these values n_searched. we measured the rf performance using the f1-score because it accommodates both precision and recall: f1-score = 2 * precision * recall / (precision + recall). the resulting list of rfs trained using the training sets is called rf_searched. however, the best rf should not be confirmed solely from the gscv peak performance. so, we delved deeper into the gscv statistics to see the overall picture of rf_searched based on the average mean scores (a_mean) and the average standard deviation of the scores (a_std). we observed the region of interest in both graphs and chose the best rf as the one closest to a_mean with the lowest a_std. then, we retrained the random forest classifier using the best n_estimators on the training sets (i.e., r0, rr, rm, and rrm), and used the test sets (i.e., t0, tr, tm, and trm) to evaluate the performance of each classifier. the goal is to reproduce the training stage from the gscv. by testing each optimal random forest classifier using the test set (i.e., t0, tr, tm, trm), one can compare the effect of normalization, balancing, balancing-normalization, and n_estimators on the classification. in the final stage, we tested the performance difference between optimized classifiers using mcnemar's test [38].

iii. result and discussion

from the survey, following figure 3, we collected 540 samples of in-game data labeled with mp0, mp1, and mp2 (this is the raw dataset s0). we distributed 90, 219, and 231 samples to mp0, mp1, and mp2.
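the split, normalization, and grid search configuration described above can be sketched with scikit-learn. this is a sketch under stated assumptions: the data here is synthetic (the real dataset has 540 samples, 30 features, three classes), the smote step (imbalanced-learn's `SMOTE` class) is omitted to keep the sketch dependency-light, and the grid is coarser than the paper's range(2, 202) so it runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the in-game dataset: 300 samples, 30 features,
# three classes with shifted means so the task is learnable
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
y = rng.integers(0, 3, size=300)
X += y[:, None] * 0.5

# 70% training / 30% test split, stratified on the labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# min-max normalization, fitted on the training set only
scaler = MinMaxScaler()
X_tr_n = scaler.fit_transform(X_tr)
X_te_n = scaler.transform(X_te)

# GSCV configuration from the text: grid over n_estimators,
# 5-fold CV, weighted F1 scorer
gscv = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [2, 12, 37, 89, 196]},
    cv=5,
    scoring="f1_weighted",
)
gscv.fit(X_tr_n, y_tr)
best_n = gscv.best_params_["n_estimators"]
test_f1 = gscv.score(X_te_n, y_te)  # weighted F1 on the held-out 30%
```

fitting the scaler on the training portion only (rather than the whole dataset) avoids leaking test-set statistics into the model; the source does not state which convention was used.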
each sample contains 30 independent variables of mixed types. subsequently, we generated sr, consisting of 744 samples equally distributed via smote from the original dataset (s0). next, we developed sm, consisting of 540 min-max normalized samples from the original dataset (s0). subsequently, the dataset srm (744 samples) was generated from sm by resampling the normalized dataset. the s0, sr, sm, and srm datasets were split into a 70% training set (i.e., r0, rr, rm, and rrm) and a 30% test set (i.e., t0, tr, tm, and trm). the optimization stage using gscv identified four classifiers, c0, cr, cm, and crm, best constructed using 89, 31, 89, and 196 trees, respectively. see figure 4 for a comparison of these four classifiers predicting a player's memorization given the test sets (i.e., t0, tr, tm, and trm). all classifiers successfully predicted the memorization of the players via in-game data with at least 80% accuracy. however, cr was the best, with overall scores of ~86%. from this graph, we can see that the balanced dataset is slightly better than the imbalanced dataset. we used these performance rates as the baseline for optimizing the classifiers via n_estimators. we analyzed the gscv results further for the improvements made with various values of n_estimators. figure 5 to figure 8 show the comprehensive results of each random forest classifier when the gscv searched for the best n_estimators, or the number of trees. there are four line graphs:
• the blue solid line represents the mean f1 scores of the rf_searched (using the left vertical axis),
• the blue dotted line represents the average of the mean f1 scores (using the left vertical axis).
we denote this as avg_f1,
• the red solid line represents the standard deviation of the f1 scores between cross-validated predictors of the rf_searched (using the right vertical axis),
• the red dashed line represents the average of the standard deviation of the f1 scores between cross-validated predictors of the rf_searched (using the right vertical axis). we denote this as avg_std.

fig 4. performance comparison between classifiers using the best n_estimators

in addition, there are two rectangles (transparent blue and transparent red) representing the regions of interest (roi_blue and roi_red) we observed regarding the candidate values of n_estimators that optimize the scores. the right bound of each rectangle is set to the best n_estimators found in the gscv stage, while the left bound is set to the lowest possible n_estimators value where the f1 score is greater than or equal to the blue dotted line. the rules for choosing the optimal n_estimators are:
• choose the value of n_estimators (on the x-axis) from the leftmost part of roi_blue. we denote fsx as the mean f1 score of the rf selected using that n_estimators value;
• keep the current value of n_estimators if the fsx of the next n_estimators is less than or equal to the current fsx;
• if some neighboring n_estimators have the same fsx, choose the n_estimators with the smallest standard deviation.

fig 5. gscv random forest using the raw training set
fig 6. gscv random forest using the normalized training set
fig 7. gscv random forest using the smoted raw training set
fig 8. gscv random forest using the smoted normalized training set

these graphs show that the lowest f1 scores were around 0.70–0.73 and quickly stabilized between 0.86 and 0.89.
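a simplified reading of the selection rules above can be sketched as a small function: among the grid of n_estimators values, start at the left edge of the blue region of interest (the first value whose mean CV score reaches the average score), advance only while the score strictly improves, and break ties by the smaller standard deviation. the arrays and names below are illustrative, not the paper's data:

```python
def optimal_n_estimators(n_grid, mean_f1, std_f1):
    """Pick an optimal n_estimators following the selection rules.

    n_grid: increasing n_estimators values from the search grid;
    mean_f1 / std_f1: the cross-validated mean and standard deviation
    of the F1 score for each grid value.
    """
    avg_f1 = sum(mean_f1) / len(mean_f1)
    # left edge of the blue region of interest: the first grid value
    # whose mean score reaches the average of all mean scores
    i = next(k for k, s in enumerate(mean_f1) if s >= avg_f1)
    # advance only while the next value strictly improves the score
    while i + 1 < len(n_grid) and mean_f1[i + 1] > mean_f1[i]:
        i += 1
    # tie with the next neighbour: prefer the smaller deviation
    if (i + 1 < len(n_grid) and mean_f1[i + 1] == mean_f1[i]
            and std_f1[i + 1] < std_f1[i]):
        i += 1
    return n_grid[i]

# toy grid: scores plateau after 37 trees, so 37 is selected
chosen = optimal_n_estimators(
    [2, 12, 37, 89], [0.70, 0.84, 0.845, 0.843], [0.03, 0.02, 0.02, 0.03])
```

this captures the intent (stop growing the forest once extra trees stop paying off) rather than the exact graphical procedure the authors applied to figures 5 to 8.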
this indicates that the random forest classifiers were effective under 5-fold cross-validation on the training sets. based on the rules above, we identified optimal n_estimators values for c0, cr, cm, and crm of 37, 12, 37, and 41 trees, respectively (table 2).

table 2. comparison of best and optimal rf based on n_estimators

gscv random forest   n_estimators (best/optimal)   f1 score (best/optimal)   std (best/optimal)
raw                  89 / 37                       0.8718 / 0.8448           0.0267 / 0.0202
smoted raw           31 / 12                       0.8758 / 0.8748           0.0119 / 0.0257
normalized raw       89 / 37                       0.8690 / 0.8448           0.0249 / 0.0202
smoted normalized    196 / 41                      0.8844 / 0.8677           0.0129 / 0.0153

next, we retrained c0, cr, cm, and crm with 37, 12, 37, and 41 trees, respectively, on the training sets. figure 9 compares the classifiers using the optimal values of n_estimators. the graph shows that the random forests maintained their prediction performance while using significantly fewer trees. the smoted datasets make the classifiers slightly steadier than the imbalanced datasets (s0 and sm). in addition, the classifier fitted on the smoted-normalized dataset maintained its performance using only 41 trees, compared to the 196 trees initially found by gscv. based on these performances, we ran mcnemar's test to check for significant differences between the classifiers using the best and the optimal n_estimators. all p-values were at least 0.05, indicating that the best and optimal classifiers behave similarly. this means the random forest algorithm is a robust classifier, optimizable via the total number of decision trees, for predicting the memorization performance of chem fight players. these experiments strengthen our confidence in using random forest as the classifier to predict the player's memorization. even the raw dataset, neither resampled nor normalized, can be classified well.
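mcnemar's test used above compares two classifiers on the same test set by counting the discordant predictions (cases where exactly one classifier is correct). a minimal stdlib sketch (the helper name is ours), using the continuity-corrected statistic and the chi-square(1) survival function:

```python
import math

def mcnemar_p(y_true, pred_a, pred_b):
    """continuity-corrected mcnemar test on two classifiers' predictions.

    returns the p-value for the null hypothesis that both classifiers
    have the same error rate (chi-square with 1 degree of freedom).
    """
    # b: a right / b wrong; c: a wrong / b right (the discordant pairs)
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 1.0  # no disagreements: identical behaviour
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # chi-square(1) survival function: p = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2))
```

a p-value of at least 0.05, as reported in the experiments, means the discordant counts are balanced enough that the two classifiers cannot be distinguished statistically.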
upon optimization, the experiment results show that resampling the dataset using smote improves the performance of the random forest by at least 4%. we also showed that the gscv method can slightly optimize the performance of the random forest; this is because we only optimized the random forest's n_estimators, while more parameters are tunable, such as max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features ({"sqrt", "log2", none}), and max_leaf_nodes.

fig 9. performance comparison between classifiers using optimal n_estimators

iv. conclusion

assessing an educational game player's memorization non-intrusively is practical using in-game data. our approach applies a data mining classification technique using the random forest (rf) algorithm. we experimented with variants of the dataset to train the rf. since rf is a complex classifier, we used a grid search cross-validation (gscv) technique to identify the classifiers and picture their development over a vector of n_estimators values. our approach successfully optimized the classifiers to use at most half the total trees inside the rf. the classifiers predict the player's memorization with around 80% accuracy using the imbalanced dataset and 37 decision trees (optimal). the classifiers performed better (~86% accuracy) when fitted with a balanced dataset, and the most effective optimization occurred when the classifier used the balanced and normalized dataset. in our experiments, the n_estimators found by gscv corresponds to the peak performance of the classifiers, yet we observed that the classifiers maintained their performance using at most half that value. our experiments demonstrated random forest's suitability for predicting player memorization without data preprocessing.
however, applying smote to the dataset boosted random forest's performance by at least 4%, and gscv showed slight optimization potential. further optimization possibilities include parameters such as max_depth, min_samples_split, and more. our approach optimized the random forest over n_estimators using gscv; however, when multiple hyperparameters are considered, applying gscv may become far more complex. hence, we suggest using a more sophisticated search algorithm, such as genetic-algorithm search cross-validation, together with a mechanism that stops the search early whenever classifier performance converges. we are confident that other researchers can replicate our procedure to determine the optimal classification of players in other games. however, we know that the selection of in-game actions can bias the classification model, for instance, when the in-game actions are too few or chosen arbitrarily. for now, these in-game actions are human-observed ones. in contrast, low-level in-game activities, such as player positions and time-based events, are too noisy to be classified using rf and are not interpretable by humans; hence, a neural network or a deep neural network is a potential candidate for this classification problem. given that low-level in-game data are preferable, we can identify such actions computationally via behavior recognition.

declarations

author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest. the authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.

publisher's note: department of electrical engineering and informatics, universitas negeri malang, remains neutral with regard to jurisdictional claims and institutional affiliations.
knowledge engineering and data science (keds) pissn 2597-4602 eissn 2597-4637
vol 5, no 1, december 2022, pp. 53–66
https://doi.org/10.17977/um018v5i12022p53-66
©2022 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

optimized three deep learning models based-pso hyperparameters for beijing pm2.5 prediction

andri pranolo a, b, 1, *, yingchi mao a, 2, aji prasetya wibawa c, 3, agung bella putra utama c, 4, felix andika dwiyanto c, 5
a department of computer and technology, college of computer and information, hohai university, 1 xikang road, nanjing, jiangsu 211100, china
b department of informatics, faculty of industrial technology, universitas ahmad dahlan, jl. prof. dr. soepomo, s.h., janturan, warungboto, umbulharjo, yogyakarta 55164, indonesia
c department of electrical engineering, faculty of engineering, universitas negeri malang, jl. semarang 5, malang, east java 65145, indonesia
1 andri.pranolo@tif.uad.ac.id *; 2 maoyingchi@gmail.com; 3 aji.prasetya.ft@um.ac.id; 4 agungbpu02@gmail.com; 5 felix@ascee.org
* corresponding author

i. introduction

in air quality monitoring systems, pm2.5 concentration is a crucial measure. as public awareness rises, analyzing and anticipating pollution levels is vital. monitoring stations alone can play only a small role in pm2.5 pollution control due to the nonlinear character of pm2.5 concentrations in both time and space.
as a result, improving the accuracy of pm2.5 concentration prediction is crucial for preventing and controlling air pollution. several studies have applied machine learning techniques, such as neural networks, to environmental science issues. deep learning, a branch of neural networks that achieves high performance in applications such as natural language processing, visual recognition, and forecasting, has recently gained attention in the machine learning field. in their application, machine learning models are characterized by large hyperparameter spaces and lengthy training times. these properties, combined with the growth of parallel computing and the increasing demand for machine learning workloads, make it vital to develop mature hyperparameter optimization functionality for distributed computing environments. in most cases, machine learning provides more sensible advice than humans can, yet the design and training of neural networks, sometimes called alchemy, is tricky and unpredictable [1]. therefore, hyperparameter tuning has been extensively studied to lower entry barriers for non-technical users.

article info: article history: received 4 august 2022; revised 15 august 2022; accepted 17 august 2022; published online 7 november 2022.

abstract: deep learning is a machine learning approach that produces excellent performance in various applications, including natural language processing, image identification, and forecasting. deep learning network performance depends on the hyperparameter settings. this research optimizes the deep learning architectures of long short-term memory (lstm), convolutional neural network (cnn), and multilayer perceptron (mlp) for forecasting tasks using particle swarm optimization (pso), a swarm intelligence-based metaheuristic optimization methodology, in three proposed models: m-1 (pso-lstm), m-2 (pso-cnn), and m-3 (pso-mlp).
the beijing pm2.5 dataset was analyzed to measure the performance of the proposed models. pm2.5, the target variable, is affected by dew point, pressure, temperature, cumulated wind speed, hours of snow, and hours of rain. the deep learning network inputs follow three different scenarios: daily, weekly, and monthly. the results show that the proposed m-1 with three hidden layers produces the best rmse and mape compared to the proposed m-2, m-3, and all the baselines. recommendations for air pollution management could be generated using these optimized models. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: air pollution; beijing pm2.5; deep learning; forecasting; hyperparameter tuning

a hyperparameter is a parameter that cannot be changed during machine learning training. it can be part of the model structure, such as the number of hidden layers and the activation function. two recent developments in deep learning models have made hyperparameter tuning an increasingly important technique. the first is the scaling up of neural networks to achieve greater accuracy [2], and the second is the development of intricate lightweight models that achieve greater accuracy with less data and fewer parameters [3][4]. hyperparameter tuning plays an essential role in both cases. in practice, there are more hyperparameters to tune in a model with a complex structure than in a model with a well-defined structure. several hyperparameters of an lstm model must be set to improve performance, such as the number of hidden layers and neurons, the dense layers, and the weight initialization. the first consideration is the number of nodes and hidden layers. hidden layers are the layers between the input and output layers, and no specific number of hidden layers is prescribed.
the choice therefore depends on each problem and a trial-and-error tuning approach. one hidden layer will suffice for most simple problems, and two layers are recommended for more complex ones. although many nodes within a layer can improve accuracy, too few nodes may result in underfitting [5]. the next consideration is the number of units in a dense layer, the most commonly used layer type, in which every neuron receives input from all neurons of the previous densely connected layer; more units can increase accuracy, and 5–10 units per layer is an ideal starting point. as a result, the shape of the final dense layer is determined by the number of neurons/units specified [6]. then, a dropout layer should be present between lstm layers: a layer that reduces the network's sensitivity to the specific weights of individual neurons. a dropout layer can follow an input layer, but it cannot be applied to the output layer, since doing so would corrupt the model's predictions. dropout can alleviate the risk of overfitting when complexity is added by increasing the number of nodes in dense layers or adding more dense layers, which otherwise results in poor validation accuracy [7]. weight initialization is another hyperparameter that should be considered. ideally, the weight initialization scheme should differ depending on the activation function; in practice, weight values are often drawn from a uniform distribution. setting all weights to 0.0 is not viable because the optimization algorithm relies on asymmetry in the error gradient. different initial weights lead to different starting points for the optimization process, and hence to different final weight sets with different performance characteristics [8]. stochastic optimization assumes that weights are randomly assigned small values at the start of the search. weight decay can also be included in the weight update rule.
with weight decay, the weights are multiplied by a factor slightly less than one to limit weight growth; as a reference, an initial value of 0.97 should be sufficient. moreover, the output of a node is defined by its activation function, which switches it on or off. using these functions, deep learning models can learn nonlinear prediction boundaries. although it is technically possible to include activation functions in the dense layers, it is preferable to separate them into their own layers so that the raw dense layer output remains accessible. the choice of activation layer depends on the application, but the most popular activation function is the rectifier [8]. the next hyperparameter is the learning rate, which controls how quickly the network updates its parameters. increasing the learning rate speeds up learning but may cause the model to diverge or even fail to converge; with a lower learning rate, training takes longer, but the model converges smoothly [9]. this hyperparameter is used in the training phase, typically with values between 0.0 and 0.1. the number of epochs (an integer) is also a hyperparameter: training beyond the point where validation accuracy decreases while training accuracy still increases risks overfitting. an ideal move is to use the early stopping method rather than a fixed number of epochs, halting training when the performance of the approach on held-out data stops improving beyond a pre-set threshold. the last consideration of hyperparameter tuning is the batch size, which specifies the number of samples processed before the internal model parameters are updated. a larger batch size produces more significant gradient steps than a smaller one. a common initial batch size is 32; multiples of 32, such as 64, 128, and 256, can be tried to determine which works better [8].
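the early stopping method described above can be sketched as a small helper. this is an illustrative implementation, not the authors' code; the class and parameter names (patience, min_delta) are our assumptions, mirroring the conventions of common deep learning libraries:

```python
class EarlyStopping:
    """stop training when the validation score stops improving.

    patience: how many epochs to wait after the last improvement;
    min_delta: the smallest change that still counts as an improvement.
    """
    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.wait = 0

    def should_stop(self, val_score):
        if val_score > self.best + self.min_delta:
            self.best = val_score  # improvement: remember it, reset counter
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience
```

in a training loop, should_stop is called once per epoch with the validation score, and training halts after patience epochs without improvement.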
the research reveals that the pso-optimized deep learning models (lstm, cnn, and mlp) for beijing pm2.5 multivariate time series prediction achieve minimal error and improved accuracy. the seven tuned hyperparameters are the optimizer, activation function, loss function, batch size, hidden layers, neurons, and epochs. the contributions of the research are: 1) to improve the accuracy of multivariate time-series forecasting on the beijing pm2.5 dataset using the proposed models m-1 (pso-lstm), m-2 (pso-cnn), and m-3 (pso-mlp); 2) to generate a computer-based forecasting model that could serve as a recommendation for governmental regulations such as pollution prevention, clean air technology centers, and transportation-emissions reduction. the research presents an alternative use of pso, as a hyperparameter tuner for deep learning rather than as a feature selector. the automatic tuning process may reduce computational time compared to random parameter selection. finally, this paper determines the best optimized deep learning approaches for predicting beijing pm2.5 concentrations.

ii. method

the proposed hyperparameter tuning of deep learning for forecasting is shown in figure 1. the selected dataset is first preprocessed using normalization, and pso carries out the hyperparameter selection. the best-selected hyperparameter values are then used in forecasting, which is performed by a deep learning method: lstm, cnn, or mlp. finally, the proposed models and the baselines are evaluated using mape and rmse.

fig. 1. the proposed hyperparameter tuning of deep learning for forecasting

a.
dataset

in this study, an evaluation of the pso-based hyperparameter settings of the deep learning methods was carried out using the beijing pm2.5 dataset, obtained from the uci machine learning repository [10]. this dataset represents the weather conditions and pollution levels in beijing, china, reported hourly by the us embassy from 2010 to 2014, comprising 43,825 instances; 2,068 rows with missing values were removed during data preprocessing. pre-processing is the initial processing of the dataset to improve data quality and select features in order to obtain high-performance results. the preprocessing used here consists of feature selection and data normalization. following a similar study by zhang [11], the feature selection process selects seven attributes: pm2.5 concentration (pm2.5), dew point (dewp), temperature (temp), pressure (pres), cumulated wind speed (iws), cumulated hours of snow (is), and cumulated hours of rain (ir), as shown in figure 2. normalization is a technique for reducing errors by converting real numbers to a value range of 0 to 1; the min-max scaling approach is used [12]. equation (1) presents the min-max normalization:

x' = (x − x_min) / (x_max − x_min) (1)

where x' is the normalization result, x represents the data to be normalized, and x_min and x_max are the minimum and maximum values of the entire data. in this study, three scenarios of the dataset were used as testing data: monthly, weekly, and daily.

b. hyperparameter optimization using pso

developing an efficient machine learning model is a complex process that requires selecting a suitable algorithm and modifying the model's hyperparameters [13]. the primary goal of hyperparameter optimization is to simplify the selection of parameters so as to obtain optimal results and enable users to implement efficient machine learning models for practical issues [14]. the process of hyperparameter optimization predicts the best machine learning (ml) architecture [15].
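for reference, the min-max normalization of eq. (1) used during preprocessing can be sketched as follows (the helper name is ours):

```python
def min_max_normalize(values):
    """scale values into [0, 1] via x' = (x - min) / (max - min), as in eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(x - lo) / (hi - lo) for x in values]
```

for example, min_max_normalize([10, 20, 30]) yields [0.0, 0.5, 1.0]; in practice each feature column is normalized independently.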
it decreases the amount of human work necessary, enhances machine learning models' performance, and increases models' reproducibility. particle swarm optimization (pso) is a swarm optimization model that can be used to select hyperparameters; in this research, it is used as an integrated approach with the baseline deep learning models. pso belongs to a family of evolutionary algorithms frequently used to solve optimization problems and has been effectively applied as a parameter optimization technique [16]. pso takes its inspiration from biological populations that exhibit individual and social behavior. pso works by allowing a swarm of particles to navigate the search space semi-randomly; through information sharing between the individual particles in the group, the pso algorithm determines the optimal solution.

fig. 2. visualization of the beijing pm2.5 dataset

in pso, a swarm s consists of a group of n particles [17], as in (2), and a vector is used to represent each particle p_i, as in (3):

s = {p_1, p_2, ..., p_n} (2)

p_i = (x_i, v_i, pb_i) (3)

where x_i denotes the current position, v_i denotes the current velocity, and pb_i denotes the particle's best-known position. after initializing each particle's position and velocity, the current positions and records are evaluated with their performance scores. each following iteration modifies the velocity v_i of each particle according to the current global optimal position gb and the prior best position pb_i, as in (4):

v_i ← v_i + c1·r1·(pb_i − x_i) + c2·r2·(gb − x_i) (4)

where r1 and r2 are drawn from continuous uniform distributions and c1 and c2 are acceleration constants. equation (5) states that the particles then move following their new velocity vectors:

x_i ← x_i + v_i (5)

the technique outlined above is repeated until convergence or termination constraints are met. the computational complexity of the pso algorithm is analyzed in [18].
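the pso loop of eqs. (2)–(5) can be sketched as below. this is an illustrative stdlib implementation, not the authors' code; the velocity is clamped to a maximum magnitude (vmax), a common safeguard in classic pso, and all names and default values here are our assumptions:

```python
import random

def pso_minimize(f, dim, n_particles=10, iters=50, c1=1.5, c2=2.0,
                 vmax=1.0, bounds=(-5.0, 5.0), seed=42):
    """minimal pso following eqs. (2)-(5), with velocity clamped to [-vmax, vmax]."""
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbs = [x[:] for x in xs]                 # personal best positions pb_i
    pb_vals = [f(x) for x in xs]
    g = min(range(n_particles), key=pb_vals.__getitem__)
    gb, gb_val = pbs[g][:], pb_vals[g]       # global best position gb
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # eq. (4): cognitive pull toward pb_i, social pull toward gb
                v = (vs[i][d] + c1 * r1 * (pbs[i][d] - xs[i][d])
                              + c2 * r2 * (gb[d] - xs[i][d]))
                vs[i][d] = max(-vmax, min(vmax, v))
                xs[i][d] += vs[i][d]         # eq. (5)
            val = f(xs[i])
            if val < pb_vals[i]:             # update personal and global bests
                pbs[i], pb_vals[i] = xs[i][:], val
                if val < gb_val:
                    gb, gb_val = xs[i][:], val
    return gb, gb_val
```

for hyperparameter tuning, f would build and cross-validate a model from a decoded particle position; here the sketch minimizes a plain numeric objective to show the mechanics.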
Additionally, this approach can be parallelized to increase model efficiency because PSO particles act independently and share information only after each iteration. PSO's primary limitation is that it requires adequate population initialization; otherwise, it may reach a local rather than a global optimum in discrete hyperparameter spaces [19]. Appropriate population initialization can be achieved by using dedicated initialization techniques or by drawing on the developer's experience. Numerous population initialization strategies, such as the opposition-based optimization algorithm [20] and the space transformation search approach [21], have been developed to increase the performance of evolutionary algorithms. Thus, execution time and resource usage can be improved by performing an extra population initialization step. Through hyperparameter selection, PSO can improve the quality of deep learning (DL) models. DL is based on artificial neural network (ANN) theory. Multilayer perceptrons (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), deep neural networks (DNN), and long short-term memory (LSTM) are deep learning designs derived from the standard ANN [22]. The DL hyperparameters that PSO can optimize include the optimizer, activation function, loss function, batch size, number of neurons, and number of epochs. Hyperparameter tuning with PSO can be done by calling the optimal 'particle swarm' configuration in the opportunity function of the TensorFlow Keras package. The PSO parameters used consist of 10 particles in the swarm, 5 generations (iterations), a minimum velocity of 0, a maximum velocity of 1, acceleration constants of 1.5 and 2.0, and 10 permitted function evaluations. The hyperparameters optimized by PSO tuning and retested using the deep learning methods can be seen in Table 1; a dropout value of 0.2 was applied. The tuned parameters are those shared by all deep learning methods in general. Table 1.
Deep Learning Method Hyperparameter Space

No.  Hyperparameter        Search space             Type
1    Hidden layers (HL)    [2, 10]                  Continuous
2    Neurons               [1, 100]                 Continuous
3    Activation function   linear, sigmoid, ReLU    Discrete with step=1
4    Loss function         MSE, MAE                 Discrete with step=1
5    Optimizer             Adam, RMSprop            Discrete with step=1
6    Batch size            [32, 64, 128]            Discrete with step=1
7    Epoch                 [5, 100]                 Continuous

C. Multilayer Perceptron (MLP)

MLP is a forecasting method often used in research [23]. MLP belongs to the class of feedforward networks. Its characteristic advantages are that it determines weight values better than other methods, it can be used without prior knowledge, the algorithm can be implemented quickly, and it can solve both linear and nonlinear problems [24]. These characteristics make MLP forecasts more accurate. MLP has been used for time-series forecasting [25] and stock prices [26][27]. As illustrated in Figure 3, the MLP model architecture consists of three layers of nodes: an input layer, a hidden layer, and an output layer. Each layer is connected within the network architecture: nodes in the input layer are connected to nodes in the hidden layer, and the hidden layer's nodes are directly connected to nodes in the output layer. The elements of a multilayer perceptron are the network architecture, the learning algorithm, and the activation functions [28]. The activation of a hidden neuron can be defined as in (6):

$h_j = f\left(\sum_i w_{ij} x_i\right)$ (6)

where $h_j$ is hidden neuron $j$, $f$ denotes a link function that adds non-linearity to the relationship between the input and hidden layers, $w_{ij}$ denotes a weight in the input weight matrix, and $x_i$ represents an input value. The output value $y$ is computed as in (7):

$y = g\left(\sum_j w_j h_j\right)$ (7)

D.
Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) is developed from the recurrent neural network (RNN) and can be implemented to improve accuracy in time-series prediction. LSTM can handle long-term dependencies in its inputs [29]. LSTM extends RNN architectures to resolve learning challenges associated with information linkage. In an RNN, old memory becomes increasingly ineffective as new memory overwrites it [30].

Fig. 3. MLP architecture

However, RNNs suffer from vanishing and exploding gradients, which occur when the range of values across layers in the architecture changes. The LSTM was developed and designed to address the RNN's vanishing and exploding gradient problems [31]. LSTM has been used for time-series predictions [32], both short-term loads [33] and long-term [34], weather predictions [35], and price movements [36][37][38][39]. The LSTM uses memory cells and gate units to manage memory at each input, with an architecture similar to the RNN. In LSTM, the hidden layer comprises memory cells with three gates: input, forget, and output, as illustrated in Figure 4. The input gate specifies the amount of data stored in the cell state and keeps the cell from holding extraneous data. The forget gate limits the time a value remains in a memory cell. The output gate determines the amount of data or value stored in a memory cell and calculates the output. In the LSTM, a gate is a special network structure with an input vector and an output in the interval [0, 1]: no information is permitted to flow when the output is 0, while all information is permitted to pass when it is 1 [40]. If the input vector $x_t$ and output vector $h_t$ are defined, then a gate can be formulated as in (8):

$g_t = \sigma(W x_t + b)$ (8)

where $\sigma$ is the sigmoid function, $W$ denotes the weights, and $b$ denotes the bias vector.
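The gate computation in (8) is just an affine transform squashed through a sigmoid; a hypothetical minimal sketch (the weights, bias, and input below are made up, not the paper's):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(W, x, b):
    """Compute one LSTM gate activation g = sigmoid(W.x + b), as in Eq. (8).
    Outputs near 0 block information; outputs near 1 pass it through."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# hypothetical 2x2 weight matrix, zero bias, and input vector
g = gate([[1.0, 0.0], [0.0, 1.0]], [0.0, 100.0], [0.0, 0.0])
print(g)  # first component 0.5 (neutral), second ~1.0 (fully open)
```

The input, forget, and output gates of the LSTM all share this form, differing only in their weight matrices and bias vectors.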
The candidate cell state, representing the current condition of the cell, is determined as in (9):

$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ (9)

where $W_c$ denotes the cell state's weight matrix and $b_c$ the cell state's bias vector. The input gate admits new information, while the forget gate assists the network in forgetting input information held in the memory cells. The input and forget gates can be computed using formulas (10) and (11):

$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ (10)

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ (11)

where $W_i$ and $W_f$ denote the weights of the input and forget gates, respectively, while $b_i$ denotes the bias vector of the input gate and $b_f$ the bias vector of the forget gate. The output gate of the LSTM regulates the amount of information passed from the latest cell state into the output.

Fig. 4. LSTM memory cells

The output can be estimated using the formula in (12):

$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (12)

where $W_o$ denotes the output gate's weight matrix and $b_o$ the output gate's bias vector. The ultimate output of the LSTM process is computed as in (13):

$h_t = o_t \otimes \tanh(c_t)$ (13)

This output is then used for forecasting the next chosen time step.

E. Convolutional Neural Network (CNN)

CNN is part of the DL approach, a sub-field of ML, and applies the basic concepts of the ANN algorithm with more layers [41]. CNN is a feedforward network because information flows in one direction only, from inputs to outputs. CNNs were applied and became extremely popular in image classification research; they can also be implemented for 1-dimensional (1D) problems, such as forecasting the following values in a time-series dataset [42]. The model used here is a 1D CNN, with the architecture shown in Figure 5. Many types of CNN models can be used for each problem in time-series prediction: univariate, multivariate, multi-step, and multivariate multi-step [43].
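The operation at the heart of such a 1D CNN is a kernel sliding along the series; a minimal "valid"-mode sketch (not the paper's architecture, and the kernel values are hypothetical):

```python
def conv1d_valid(series, kernel):
    """'Valid' 1D convolution (cross-correlation, as in CNN layers):
    slide the kernel over the series and take dot products at each offset.
    The output has len(series) - len(kernel) + 1 values."""
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

# a kernel of [0.5, 0.5] acts as a 2-point moving average over the series
out = conv1d_valid([1.0, 3.0, 5.0, 7.0], [0.5, 0.5])
print(out)  # [2.0, 4.0, 6.0]
```

In a trained CNN, many such kernels are learned from data, and pooling and fully connected layers then map their outputs to the forecast.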
CNN forecasting on time-series data is often used to estimate stock prices [44][45], gold prices [46][47][48], health [49][50][51], general time series [52][53][54], and solar cells and weather forecasts [55].

F. Evaluation

The mean absolute percentage error (MAPE) and the root mean square error (RMSE) [56] were used as error evaluation metrics to evaluate and compare the implemented methods' performances. MAPE expresses errors that represent accuracy, while RMSE detects irregularities or outliers in the designed projection system. The formulas are given in (14) and (15), where $A_t$ is the actual value, $F_t$ is the forecast value, and $n$ is the number of observations:

$MAPE = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$ (14)

$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (A_t - F_t)^2}$ (15)

Fig. 5. 1D CNN architecture

From the calculated MAPE and RMSE values, the model with the best forecasting performance can be identified: the smaller the MAPE and RMSE values produced, the better the forecasting results and the better the method [57].

III. Results and Discussion

The original deep learning (LSTM, CNN, and MLP) architectures have 7 inputs, 2 to 10 hidden layers (HL), and 1 output layer, with the same parameter settings. The parameters are 32 neurons, dropout 0.2, MSE as the loss function, the Adam optimizer, 100 epochs, and a batch size of 72. Unlike LSTM and MLP, CNN used these parameters in the fully connected layer. The specific CNN architecture uses a 1D convolution layer with kernel size 2, 64 filters, ReLU as the activation function, a max-pooling (MaxPooling1D) layer of size 1, and a dropout of 0.2, followed by 1 flatten layer and a fully connected layer. The PSO tuning results were then tested using the deep learning methods with the settings shown in Table 2. PSO hyperparameter tuning was integrated with the deep learning models (LSTM, CNN, and MLP) to produce the proposed models M-1 (PSO-LSTM), M-2 (PSO-CNN), and M-3 (PSO-MLP).

Table 2. PSO hyperparameter search results per deep learning method

No.  Hyperparameter        Proposed M-1   Proposed M-2   Proposed M-3
1    Hidden layers (HL)    3              4              3
2    Neurons               24             41             61
3    Activation function   sigmoid        ReLU           linear
4    Loss function         MSE            MAE            MSE
5    Optimizer             Adam           RMSprop        RMSprop
6    Batch size            32             32             64
7    Epoch                 46             60             68

Table 3. MAPE forecasting results

Monthly         HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM            9.1216   8.8909   9.1385   9.1935   9.2448   9.2544    9.2612    9.2711    9.3865
CNN             8.6255   8.6195   8.5849   8.9762   9.1778   10.3037   10.7662   10.8264   11.1073
MLP             9.3308   9.2286   9.5347   9.6395   10.6010  10.6280   10.6702   10.6008   10.6035
Proposed M-1*   8.4576
Proposed M-2**  8.5281
Proposed M-3*   9.0930

Weekly          HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM            8.9777   8.8327   10.1041  10.2722  10.3538  11.5553   11.5812   11.5852   11.5940
CNN             9.8238   8.9021   8.8092   8.9096   9.1951   10.2191   11.7623   12.7759   13.3261
MLP             9.9057   9.7078   10.0382  11.6556  11.6180  11.6290   11.6234   11.6228   11.6118
Proposed M-1*   8.6379
Proposed M-2**  8.6987
Proposed M-3*   9.2903

Daily           HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM            5.5329   5.5306   5.5343   5.5351   7.7324   8.7756    9.0327    10.1076   10.4688
CNN             6.7490   6.9270   6.4845   6.8979   6.9833   6.8275    6.8986    6.8067    8.9088
MLP             6.4448   6.2857   7.4463   8.5707   8.5792   8.5765    8.5684    8.5739    8.5703
Proposed M-1*   5.4676
Proposed M-2**  6.3742
Proposed M-3*   6.0990

* The best selected parameter was hidden layer 3 (HL-3)
** The best selected parameter was hidden layer 4 (HL-4)

MAPE and RMSE measured the performances of the proposed models against the baselines, as shown in Table 3 and Table 4, respectively. In general, all proposed models have better accuracy for the monthly, weekly, and daily scenarios, as indicated by the minimum MAPE (Table 3) and RMSE (Table 4) values obtained by the three proposed models compared to the other models. More specifically, in the monthly scenario, Proposed M-1 has the best performance of the three proposed models, followed by M-2 and M-3, with MAPE values of 8.4576, 8.5281, and 9.0930, respectively.
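The MAPE figures discussed here (and the RMSE in Table 4) follow the metrics in (14) and (15); a minimal pure-Python sketch with made-up values, not the paper's data:

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, Eq. (14): average of |(A_t - F_t)/A_t| in %."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))

def rmse(actual, forecast):
    """Root mean square error, Eq. (15): penalizes large deviations quadratically."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)

# hypothetical actual vs forecast values
actual, forecast = [100.0, 200.0], [110.0, 180.0]
m = mape(actual, forecast)    # 10% off on each point -> 10.0
r = rmse(actual, forecast)    # sqrt((10^2 + 20^2)/2)
print(m, r)
```

Lower values of both metrics indicate better forecasts, which is how the models in Tables 3 and 4 are ranked.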
In addition, the RMSE values for the three proposed models in the monthly scenario are 0.0250, 0.0346, and 0.0259, respectively. The same pattern holds in the weekly and daily scenarios. However, when sorted by scenario, the three proposed models perform best in the daily scenario, followed by the weekly and monthly ones. The increasing amount of data and the precision of outliers (distance precision within values) in the dataset contributed to the proposed models' performance. Proposed M-1 (PSO-LSTM) reduces the RMSE and MAPE values below those of LSTM as the baseline model. The tuning results for M-2 (PSO-CNN) have better RMSE and MAPE values than CNN when the hidden layer is 4 (HL-4). As for Proposed M-3 (PSO-MLP), using HL-3 gives a better evaluation value compared to MLP. From the overall results in Table 3 and Table 4, the best results are visualized in Figure 6 and Figure 7. Figure 6 demonstrates that, compared to all other models, the proposed model has the best MAPE value in every scenario. In the monthly scenario, Proposed M-1 outperforms regular LSTM, CNN, and MLP with a MAPE of 8.4576. In the weekly scenario, Proposed M-1 has a superior MAPE to the previous techniques, with a score of 8.6379. The MAPE generated by Proposed M-1 in the daily scenario was 5.4676, likewise better and more effective than the other techniques. Figure 7 shows that every proposed model has the best RMSE in every scenario.

Table 4. RMSE forecasting results

Monthly         HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM            0.0260   0.0257   0.0263   0.0265   0.0270   0.0952    0.0952    0.0952    0.0953
CNN             0.0362   0.0357   0.0351   0.0369   0.0429   0.0636    0.0668    0.0764    0.0773
MLP             0.0263   0.0262   0.0265   0.0266   0.0945   0.0944    0.0944    0.0945    0.0945
Proposed M-1*   0.0250
Proposed M-2**  0.0346
Proposed M-3*   0.0259

Weekly          HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM            0.0299   0.0297   0.0302   0.0303   0.0311   0.1182    0.1183    0.1183    0.1183
CNN             0.0523   0.0437   0.0412   0.0475   0.0497   0.0556    0.0927    0.1019    0.1092
MLP             0.0304   0.0302   0.0310   0.1185   0.1184   0.1184    0.1184    0.1184    0.1184
Proposed M-1*   0.0232
Proposed M-2**  0.0362
Proposed M-3*   0.0301

Daily           HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM            0.0041   0.0039   0.0043   0.0049   0.0091   0.0844    0.0845    0.0848    0.0852
CNN             0.0192   0.0172   0.0101   0.0188   0.0157   0.0178    0.0168    0.0181    0.0241
MLP             0.0056   0.0049   0.0109   0.0785   0.0772   0.0776    0.0788    0.0780    0.0786
Proposed M-1*   0.0023
Proposed M-2**  0.0031
Proposed M-3*   0.0031

* The best selected parameter was hidden layer 3 (HL-3)
** The best selected parameter was hidden layer 4 (HL-4)

Compared to the other models, the monthly scenario's RMSE of 0.0250, which belongs to Proposed M-1, is the best value. The best RMSE for Proposed M-1 in the weekly scenario is 0.0232, lower than the RMSE of the other models. The RMSE values for the daily data range from 0.0023 (Proposed M-1) to 0.0039 (LSTM), 0.0101 (CNN), and 0.0049 (MLP). Overall, PSO hyperparameter tuning in this case study improves the baseline models' performance. The RMSE and MAPE evaluation values of M-1 are the best in all scenarios (monthly, weekly, and daily) compared to the other proposed models and the baselines. The government may use this research finding as a reference for regulations. The first regulation is pollution-prevention approaches aiming to minimize, remove, and avoid pollution.
The government promotes the use of less hazardous raw resources or fuels, less toxic industrial operations, and increased process efficiency. The second policy is to establish a clean air technology center, which would provide information on technologies for preventing and controlling air pollution, including mechanical collectors, fabric filtration, combustion systems, wet scrubbers, and biological degradation, together with their use, cost, and effectiveness. The third regulation reduces transportation-related emissions by requiring car emission controls and cleaner fuels. Finally, economic incentives for air pollution control agencies, such as emissions banking and trading, can be created.

IV. Conclusion

This paper proposed improved deep learning approaches based on PSO hyperparameter tuning to select the best parameters. The experiments show that all proposed models outperformed the baseline models. Proposed M-1 (PSO-LSTM) showed the best performance, outperforming the other proposed models, M-2 (PSO-CNN) and M-3 (PSO-MLP), as well as the baseline models LSTM, CNN, and MLP. Governmental regulations such as pollution prevention, a clean air technology center, and transportation-emissions reduction could be informed by this promising finding. The proposed models' good performance in this study applies only to the dataset used.

Fig. 6. Comparison of MAPE in all scenarios
Fig. 7. Comparison of RMSE in all scenarios

Therefore, future research will use various datasets to produce a model generally applicable to all time-series datasets.

Acknowledgment

The authors are grateful for the support provided by the Chinese Government Scholarship (CGS), which contributed funding to this research. In addition, the authors appreciate Hohai University, Universitas Ahmad Dahlan, and Universitas Negeri Malang, which contributed laboratory facilities.
Declarations

Author contribution: All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement: This work is supported by the Chinese Government Scholarship (CGS) received by the corresponding author with CSC number 2018GBJ006341 and by Universitas Ahmad Dahlan under grant number PD-226/SP3/LPPM-UAD/VII/2022.

Conflict of interest: The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information: Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] T. Yu and H. Zhu, "Hyper-parameter optimization: a review of algorithms and applications," arXiv preprint arXiv:2003.05689, Mar. 2020.
[2] M. Tan and Q. V. Le, "EfficientNet: rethinking model scaling for convolutional neural networks," arXiv preprint, May 2019.
[3] N. Ma, X. Zhang, H. T. Zheng, and J. Sun, "ShuffleNet V2: practical guidelines for efficient CNN architecture design," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, "MobileNetV2: inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[5] X. Zhang, X. Chen, L. Yao, C. Ge, and M. Dong, "Deep neural network hyperparameter optimization with orthogonal array tuning," in Neural Information Processing, T. Gedeon, K. Wong, and M. Lee, Eds. Springer, 2019, pp. 287–295.
[6] N. Gorgolis, I. Hatzilygeroudis, Z. Istenes, and L. N. G. Gyenne, "Hyperparameter optimization of LSTM network models through genetic algorithm," in 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), Jul.
2019, pp. 1–4.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, Jul. 2012.
[8] A. Farzad, H. Mashayekhi, and H. Hassanpour, "A comparative performance analysis of different activation functions in LSTM networks for classification," Neural Comput. Appl., vol. 31, no. 7, pp. 2507–2521, Jul. 2019.
[9] M. D. Zeiler, "ADADELTA: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, Dec. 2012.
[10] X. Liang et al., "Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating," Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 471, no. 2182, 2015.
[11] M. Zhang, D. Wu, and R. Xue, "Hourly prediction of PM2.5 concentration in Beijing based on Bi-LSTM neural network," Multimed. Tools Appl., vol. 80, no. 16, pp. 24455–24468, 2021.
[12] S. E. Buttrey, "Data mining algorithms explained using R," J. Stat. Softw., vol. 66, book review 2, 2015.
[13] R. Elshawi, M. Maher, and S. Sakr, "Automated machine learning: state-of-the-art and open challenges," Jun. 2019.
[14] L. Yang and A. Shami, "On hyperparameter optimization of machine learning algorithms: theory and practice," Neurocomputing, vol. 415, pp. 295–316, Nov. 2020.
[15] F. Hutter, L. Kotthoff, and J. Vanschoren, Automated Machine Learning. Cham: Springer International Publishing, 2019.
[16] N. Xue, I. Triguero, G. P. Figueredo, and D. Landa-Silva, "Evolving deep CNN-LSTMs for inventory time series prediction," 2019 IEEE Congr. Evol. Comput. (CEC) Proc., pp. 1517–1524, 2019.
[17] M.-A. Zöller and M. F. Huber, "Benchmark and survey of automated machine learning frameworks," Apr. 2019.
[18] X.-H. Yan, F.-Z. He, and Y.-L. Chen, "A novel hardware/software partitioning method based on position disturbed particle swarm optimization with invasive weed optimization," J. Comput. Sci. Technol., vol. 32, no. 2, pp. 340–355, Mar. 2017.
[19] M.-Y. Cheng, K.-Y. Huang, and M. Hutomo, "Multiobjective dynamic-guiding PSO for optimizing work shift schedules," J. Constr. Eng. Manag., vol. 144, no. 9, p.
04018089, Sep. 2018.
[20] S. Rahnamayan, H. R. Tizhoosh, and M. M. A. Salama, "A novel population initialization method for accelerating evolutionary algorithms," Comput. Math. with Appl., vol. 53, no. 10, pp. 1605–1614, May 2007.
[21] H. Wang, Z. Wu, J. Wang, X. Dong, S. Yu, and C. Chen, "A new population initialization method based on space transformation search," in 2009 Fifth International Conference on Natural Computation, 2009, pp. 332–336.
[22] M. Hiransha, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, "NSE stock market prediction using deep-learning models," in Procedia Computer Science, 2018, vol. 132, pp. 1351–1362.
[23] Y. S. Park and S. Lek, Artificial Neural Networks: Multilayer Perceptron for Ecological Modeling, vol. 28. Elsevier, 2016.
[24] T. Marwala, "Multi-layer perceptron," Handb. Mach. Learn., pp. 23–42, 2018.
[25] J. Gamboa, "Deep learning for time-series analysis," arXiv, 2017.
[26] P. Gao, R. Zhang, and X. Yang, "The application of stock index price prediction with neural network," Math. Comput. Appl., vol. 25, no. 3, 2020.
[27] W. Lu, J. Li, Y. Li, A. Sun, and J. Wang, "A CNN-LSTM-based model to forecast stock prices," Complexity, vol. 2020, 2020.
[28] J. M. Nazzal, I. M. El-Emary, and S. A. Najim, "Multilayer perceptron neural network (MLPs) for analyzing the properties of Jordan oil shale," World Appl. Sci. J., vol. 5, no. 5, pp. 546–552, 2008.
[29] G. Van Houdt, C. Mosquera, and G. Nápoles, "A review on the long short-term memory model," Artif. Intell. Rev., vol. 53, no. 8, pp. 5929–5955, Dec. 2020.
[30] Ferdiansyah, S. H. Othman, R. Zahilah Raja Md Radzi, D. Stiawan, Y. Sazaki, and U. Ependi, "A LSTM-method for bitcoin price prediction: a case study Yahoo Finance stock market," ICECOS 2019 3rd Int. Conf. Electr. Eng. Comput. Sci. Proceeding, pp. 206–210, 2019.
[31] M. Lechner and R. Hasani, "Learning long-term dependencies in irregularly-sampled time series," arXiv, 2020.
[32] H. Wang, Z. Yang, Q. Yu, T.
Hong, and X. Lin, "Online reliability time series prediction via convolutional neural network and long short term memory for service-oriented systems," Knowledge-Based Syst., vol. 159, pp. 132–147, 2018.
[33] J. Lu, Q. Zhang, Z. Yang, and M. Tu, "A hybrid model based on convolutional neural network and long short-term memory for short-term load forecasting," IEEE Power Energy Soc. Gen. Meet., vol. 2019-August, 2019.
[34] A. K. Jain, C. Grumber, P. Gelhausen, I. Häring, and A. Stolz, "A toy model study for long-term terror event time series prediction with CNN," Eur. J. Secur. Res., vol. 5, no. 2, pp. 289–309, 2020.
[35] S. S. Baek, J. Pyo, and J. A. Chun, "Prediction of water level and water quality using a CNN-LSTM combined deep learning approach," Water (Switzerland), vol. 12, no. 12, 2020.
[36] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, "Stock price prediction using LSTM, RNN and CNN-sliding window model," in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1643–1647.
[37] C. Yang, J. Zhai, G. Tao, and P. Haajek, "Deep learning for price movement prediction using convolutional neural network and long short-term memory," Math. Probl. Eng., vol. 2020, 2020.
[38] S. Mehtab and J. Sen, "Stock price prediction using CNN and LSTM-based deep learning models," 2020 Int. Conf. Decis. Aid Sci. Appl. (DASA), pp. 447–453, 2020.
[39] J. M. T. Wu, Z. Li, N. Herencsar, B. Vo, and J. C. W. Lin, "A graph-based CNN-LSTM stock price prediction algorithm with leading indicators," Multimed. Syst., 2021.
[40] A. J. Dautel, W. K. Härdle, S. Lessmann, and H.-V. Seow, "Forex exchange rate forecasting using deep recurrent neural networks," Digit. Financ., vol. 2, no. 1, pp. 69–96, 2020.
[41] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on MRI," Z. Med. Phys., vol. 29, no. 2, pp. 102–127, May 2019.
[42] E. Lewinson, Python for Finance Cookbook: Over 50 Recipes for Applying Modern Python Libraries to Financial Data Analysis, 1st ed. Packt Publishing, 2020, p. 434.
[43] K. Wang, K. Li, L. Zhou, Y. Hu, and Z. Cheng, "Multiple convolutional neural networks for multivariate time series prediction," Neurocomputing, vol. 360, pp. 107–119, 2019.
[44] E. Hoseinzade and S. Haratizadeh, "CNNpred: CNN-based stock market prediction using a diverse set of variables," Expert Syst. Appl., vol. 129, pp. 273–285, 2019.
[45] L. Ni, Y. Li, X. Wang, J. Zhang, J. Yu, and C. Qi, "Forecasting of forex time series data based on deep learning," Procedia Comput. Sci., vol. 147, pp. 647–652, 2019.
[46] I. Halimi, G. I. Marthasari, and Y. Azhar, "Prediksi harga emas menggunakan univariate convolutional neural network [Gold price prediction using a univariate convolutional neural network]," J. Repos., vol. 1, no. 2, p. 105, 2019.
[47] A. Vidal and W. Kristjanpoller, "Gold volatility prediction using a CNN-LSTM approach," Expert Syst. Appl., vol. 157, 2020.
[48] I. E. Livieris, E. Pintelas, and P. Pintelas, "A CNN–LSTM model for gold price time-series forecasting," Neural Comput. Appl., vol. 32, no. 23, pp. 17351–17360, 2020.
[49] R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, "Convolutional neural networks: an overview and application in radiology," Insights Imaging, vol. 9, no. 4, pp. 611–629, Aug. 2018.
[50] S. Singhal, H. Kumar, and V. Passricha, "Prediction of heart disease using DNN," Am. Int. J. Res. Sci. Technol. Eng. Math., pp. 257–261, 2018.
[51] G. T. Taye, H. J. Hwang, and K. M. Lim, "Application of a convolutional neural network for predicting the occurrence of ventricular tachyarrhythmia using heart rate variability features," Sci. Rep., vol. 10, no. 1, pp. 1–7, 2020.
[52] M. Afrasiabi, H. Khotanlou, and M. Mansoorizadeh, "DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features," Vis. Comput., vol. 36, no. 6, pp. 1127–1139, 2020.
[53] P. Liu, J. Liu, and K. Wu, "CNN-FCM: system modeling promotes stability of deep learning in time series prediction," Knowledge-Based Syst., vol. 203, p. 106081, 2020.
[54] Z. Zhang, Y. Dong, and Y. Yuan, "Temperature forecasting via convolutional recurrent neural networks based on time-series data," Complexity, vol. 2020, 2020.
[55] A. G. Salman, B. Kanigoro, and Y. Heryadi, "Weather forecasting using deep learning techniques," ICACSIS, pp. 281–285, 2015.
[56] T. T. Kieu Tran, T. Lee, J. Y. Shin, J. S. Kim, and M. Kamruzzaman, "Deep learning-based maximum temperature forecasting assisted with meta-learning for hyperparameter optimization," Atmosphere (Basel), vol. 11, no. 5, pp. 1–21, 2020.
[57] Z. Alameer, M. A. Elaziz, A. A. Ewees, H. Ye, and Z. Jianhua, "Forecasting gold price fluctuations using improved multilayer perceptron neural network and whale optimization algorithm," Resour. Policy, vol. 61, pp. 250–260, 2019.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 6, No 1, April 2023, pp. 57–68, https://doi.org/10.17977/um018v6i12023p57-68. ©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Maximum Marginal Relevance and Vector Space Model for Summarizing Students' Final Project Abstracts

Gunawan a,1,*, Fitria a,2, Esther Irawati Setiawan a,3, Kimiya Fujisawa b,4
a Institut Sains dan Teknologi Terpadu Surabaya, Surabaya 60284, Indonesia
b Tokyo University of Technology, Tokyo, Japan
1 gunawan@stts.edu*; 2 fitriatahir@gmail.com; 3 esther@stts.edu; 4 fujisawa@stf.teu.ac.jp
* Corresponding author

I.
Introduction

A summary represents an article's overview and conveys its essential ideas to the reader [1]. Automatic text summarization uses a computer program to reduce a text document to a summary that retains the essential parts of the original [2][3]. As the amount of data keeps increasing, automatic summarization is necessary to deal with information overload [4]. Summarization can be applied to single or multiple documents and to multiple languages [5]. An automatic summarizer can therefore ease the task of summarizing data from web pages [6][7], as well as final project and thesis abstracts [8]. Maximum Marginal Relevance (MMR) is an extractive summarization method used to summarize single or multiple documents [9][10]. MMR summarizes documents by calculating the similarity between parts of the text [11][12]; to do so, the document is first segmented into sentences. MMR combines a cosine similarity matrix with the vector space model (VSM) to rank sentences in response to a query [13][14]. Most modern information retrieval (IR) search engines produce ranked lists of documents ordered by decreasing relevance to the user's query [15][16]. The first step in assessing summary relevance is to measure the relationship between the information in the document and the query given by the user, added as a linear combination into a matrix; this linear combination is called marginal relevance [17]. NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources [18].
The paper describes the datasets and standardized data loaders brought together through this initiative and discusses the quality of the datasets, which were assessed manually and automatically. We compared the performance of our approach with a summarization initiative from NusaCrowd. This article consists of four sections. The introduction and context are covered in the first section. The second section describes the research method. The third section presents the results and discussion, while the final section summarizes the conclusions.

Article history: Received 13 June 2023; Revised 03 July 2023; Accepted 28 July 2023; Published online 31 July 2023.

Abstract: Automatic summarization is reducing a text document with a computer program to create a summary that retains the essential parts of the original document. Automatic summarization is necessary to deal with information overload as the amount of data keeps increasing. A summary is needed to convey the contents of an article briefly. A summary is an effective way to present extended information in a concise form of the main contents of an article, and its aim is to tell the reader the essence of a central idea; the simple concept of a summary is to take the essential part of the entire contents of the article and then present it back in summary form. The steps in this research start with the user selecting or searching for the text documents to be summarized, with the keywords in the abstract as a query. The proposed approach performs text preprocessing on the documents: sentence splitting, case folding, word tokenizing, filtering, and stemming. The preprocessed text is weighted by term frequency-inverse document frequency (TF-IDF), then weighted for query relevance using the vector space model and for sentence similarity using cosine similarity. The next stage is maximum marginal relevance for sentence extraction. The proposed approach provides more comprehensive summarization than another approach. The test results are compared with manual summaries, producing an average precision of 88%, recall of 61%, and F-measure of 70%. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: summary; query relevance; sentence similarity; maximum marginal relevance.

II. Method

This research summarizes a document and generates its abstract using an automatic summary system [19][20]. The stages in this research are preprocessing, TF-IDF weighting, query relevance weighting, sentence similarity weighting, and MMR for summary extraction [21], as displayed in Figure 1 [22][23].

Fig. 1. System architecture

The abstract documents are first preprocessed (sentence splitting, tokenization, case folding, stopword removal, and stemming). After preprocessing, TF-IDF weighting is carried out, namely automatic weighting based on the number of occurrences of a word in a document (term frequency) and the number of documents in the collection in which it occurs (inverse document frequency) [24]. The TF-IDF weights are then used to calculate the query relevance and sentence similarity weights: query relevance uses the vector space model and sentence similarity uses cosine similarity [25]. The query relevance weight is the result of comparing the similarity between the query (keywords) and each document.
At the same time, the sentence similarity weight is the result of comparing similarities between documents. The next stage iteratively applies maximum marginal relevance, comparing query relevance and sentence similarity to select the relevant sentences for the summary [26].

The first step in the text preprocessing stage is sentence splitting: breaking the document down into sentences. Sentence splitting breaks a long document string into a collection of sentences. The document is split using a split() function, with the period ".", question mark "?", and exclamation point "!" as delimiters; the end-of-sentence marks (delimiters) are removed. From the results of sentence splitting, the following steps are tokenization, case folding, stop word removal, stemming, TF-IDF weighting, VSM, cosine similarity, and MMR to obtain a summary.

Tokenization cuts or separates a row of words in a sentence, paragraph, or page into tokens, or single-word chunks. This stage also removes certain characters in the form of punctuation marks. Sentences are split into single words by scanning them with whitespace delimiters (spaces, tabs, and newlines). Case folding is a text processing step in which all text is converted into the same case; here, the text is represented in all lowercase letters, so orthographic variation is normalized by changing all letters to lowercase [27][28]. Stop words are unimportant words, for example, "in", "by", "on", "a", "because", and so on [29][30]. Stop words are removed to discard words that have no connection with the documents contained in the database.
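The preprocessing steps above (sentence splitting on ".", "?", "!", whitespace tokenization, case folding, and stopword removal) can be sketched in Python. This is a minimal illustration: the stopword list is a tiny hypothetical stand-in for the paper's full list, and the Indonesian stemming step is omitted because the paper does not specify which stemmer is used.

```python
import re

# Hypothetical illustrative stopword list; the paper's full Indonesian
# stopword list is not specified.
STOPWORDS = {"di", "oleh", "pada", "sebuah", "karena", "yang", "dan"}

def split_sentences(document: str) -> list[str]:
    # Sentence splitting: break the text on ".", "?" and "!" delimiters,
    # discarding the end-of-sentence marks, as described above.
    parts = re.split(r"[.?!]", document)
    return [p.strip() for p in parts if p.strip()]

def preprocess(sentence: str) -> list[str]:
    # Case folding: represent all text in lowercase.
    folded = sentence.lower()
    # Tokenization: scan on whitespace and strip punctuation from each token.
    tokens = [re.sub(r"[^\w]", "", t) for t in folded.split()]
    # Stopword removal (stemming omitted; the paper's stemmer is unspecified).
    return [t for t in tokens if t and t not in STOPWORDS]

doc = "Sistem meringkas dokumen. Ringkasan dibuat oleh program komputer!"
sentences = split_sentences(doc)
tokens = [preprocess(s) for s in sentences]
```

Each sentence becomes a list of lowercase content tokens, ready for TF-IDF weighting in the next stage.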
Other examples of stop words are "there", "is", "while", "somewhat", "he", "I", "how", and others. Stemming removes a word's prefix or suffix to obtain the base word form; for example, "registered" and "registration" share a common stem taken from a stem list [31]. The TF-IDF weight is obtained from the number of occurrences of a term in a document (TF) and the number of documents in the collection containing the term (IDF) [32]. The more frequently a word appears in a document, the greater its weight; the more documents it appears in across the collection, the smaller its weight. The TF-IDF weight is calculated with the formulas in (1) and (2). The IDF value of a term is calculated as in (1):

IDF_t = log(D / df_t) (1)

where D is the number of documents in the collection and df_t is the number of documents containing the term t. The weight W of each document against the keywords (query) is then calculated as

W_{d,t} = tf_{d,t} × IDF_t (2)

where d is the d-th document, t is the t-th term of the query, tf_{d,t} is the frequency of term t in document d, and W_{d,t} is the weight of the d-th document for that term. After each document's weight W is known, a sorting process is carried out: the greater the value of W, the greater the degree of similarity of the document to the word being searched for, and vice versa. After calculating each document's weight, the query relevance weighting is computed using (2). From the query relevance values and rankings in Table 1, the documents with the highest query relevance weights, in order of rank, are D3, D4, D6, D1, D2, D8, D7, and D5. The query relevance value will be compared with the sentence similarity value for summary extraction.

Table 1.
Query relevance values

          D1      D2      D3      D4      D5      D6      D7      D8
Cosine    0.468   0.459   0.678   0.669   0       0.574   0.139   0.150
Rank      #4      #5      #1      #2      #8      #3      #7      #6

The vector space model measures the similarity between a document and a query [33]. In this model, queries and documents are treated as vectors in an n-dimensional space, where n is the number of terms in the lexicon; the lexicon is the list of all terms in the index. A limitation of the vector space model can be addressed by expanding the vectors; the expansion can be performed on the query vector, the document vectors, or both. The model thereby captures the relationship between words in the database, the documents, and the keywords [34].

Cosine similarity is used to calculate the relevance of a query to the documents. Determining the relevance of a query to a document is seen as measuring the similarity between the query vector and the document vector: the greater the similarity value between the two vectors, the more relevant the query is to the document [35][36]. When the engine receives a query, it builds a vector Q = (w_q1, w_q2, ..., w_qt) from the terms in the query and a vector D_i = (d_i1, d_i2, ..., d_it) of size t for each document. In general, cosine similarity (CS) is calculated using the cosine measure formula [37][38]. This study uses the same cosine similarity as the similarity measure between documents, measuring the distance between two documents (d_i and d_j). In the vector space model a document is represented as d = {w_1, w_2, w_3, ..., w_n}, where d is the document and w is the weight value of each term in the document. The cosine of 0° is 1, and it is less than 1 for every other angle; thus two vectors with the same orientation have a cosine similarity of 1, and two vectors at 90° have a similarity of 0.
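The TF-IDF weighting of (1) and (2) and the cosine-based query relevance ranking of Table 1 can be sketched as follows. This is a minimal illustration over sparse dictionary vectors, assuming a base-10 logarithm; it is not the paper's actual implementation, and the demo documents are hypothetical.

```python
import math

def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """TF-IDF weight vector per document, following eqs. (1) and (2)."""
    n_docs = len(docs)
    df: dict[str, int] = {}                   # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term)
            idf = math.log10(n_docs / df[term])   # IDF_t = log(D / df_t)
            w[term] = tf * idf                    # W_{d,t} = tf_{d,t} * IDF_t
        weights.append(w)
    return weights

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical preprocessed documents (token lists).
docs = [["sistem", "ringkasan", "dokumen"],
        ["ringkasan", "teks", "otomatis"],
        ["dokumen", "teks", "teks"]]
weights = tfidf_weights(docs)
```

Query relevance is then `cosine(query_vector, doc_vector)` for each document, and ranking the documents by this value yields a table like Table 1; the query vector is weighted with the same scheme, an assumption since the paper does not spell this out.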
Cosine similarity is mainly used in positive space, where the result is bounded by [0, 1]:

sim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_{i=1..t} (w_ij · w_iq) / ( sqrt(Σ_{i=1..t} w_ij²) · sqrt(Σ_{i=1..t} w_iq²) ) (3)

where t is a word in the database, d_j is a document resulting from sentence splitting, and q is a keyword in the abstract. In the implementation this is computed as

Cosine(D_i) = sum(kk2 · D_i) / [sqrt(kk2) × sqrt(D_i²)] (4)

Cosine similarity is used to calculate the sentence similarity weights, where each document is compared to the others. The flow of calculating sentence cosine similarity is the same as for the query relevance weights, using (4). Table 2 shows the sentence similarity weights resulting from the cosine similarity calculation. These sentence similarity weight values are used in the MMR iteration, which compares the query relevance weights and the sentence similarities.

Table 2. Sentence similarity weight values

       D1      D2      D3      D4      D5      D6      D7      D8
D1     –       0.255   0.458   0.375   0.000   0.207   0.215   0.270
D2     0.355   –       0.000   0.000   0.000   0.000   0.000   0.000
D3     0.458   0.000   –       0.000   0.000   0.000   0.000   0.000
D4     0.375   0.000   0.000   –       0.204   0.000   0.100   0.097
D5     0.000   0.000   0.000   0.204   –       0.219   0.128   0.107
D6     0.207   0.000   0.000   0.000   0.219   –       0.162   0.302
D7     0.215   0.000   0.000   0.100   0.128   0.162   –       0.000
D8     0.270   0.000   0.000   0.097   0.107   0.302   0.000   –

Summary extraction is performed using (5). The MMR calculation compares the query relevance results and the sentence similarity results. A document has high marginal relevance if it is relevant to the contents of the document collection and has maximum weight similarity with the query. The final value given to sentence S_i in MMR is calculated by (5):

MMR = argmax_{S_i} [ λ · Sim1(S_i, Q) − (1 − λ) · max_{S_j ∈ Summ} Sim2(S_i, S_j) ] (5)

where S_i is a sentence in the document and Summ is the set of sentences already selected (extracted).
The coefficient λ adjusts the combination to emphasize sentence relevance and reduce redundancy. In this study, Sim1 and Sim2 are two similarity functions representing the similarity of sentences across the documents, used to choose each sentence for the summary: Sim1 is the similarity matrix of sentence S_i to the query, while Sim2 is the similarity matrix of sentence S_i to the other sentences [31]. The parameter λ ranges over the interval [0, 1]. When λ = 1, the MMR value obtained tends to be relevant to the original document; when λ = 0, it tends to be relevant to the previously extracted sentences. A linear combination of the two criteria is therefore optimized when λ lies inside the interval [0, 1]. For summarizing small documents, such as news, λ = 0.7 or λ = 0.8 is used because it produces a good summary [39]. To obtain relevant summary results, we set λ to a value closer to 1. The sentence with the highest MMR value is repeatedly selected into the summary until the desired summary size is reached, as in Table 3.

Table 3. MMR iteration results (selected documents marked –)

Iteration   D1       D2       D3       D4       D5       D6       D7       D8
1           0.283    0.296    0.451    0.460    -0.044   0.399    0.068    0.060
2           0.135    0.166    0.269    –        -0.079   0.259    0.012    -0.013
3           0.016    0.062    –        –        -0.107   0.147    -0.034   -0.071
4           -0.079   -0.022   –        –        -0.129   –        -0.070   -0.117

Because the MMR calculation only takes iteration values greater than 0, the iteration stops at the 4th iteration, where all remaining values are less than 0. The values from documents D4, D3, and D6 are then considered relevant for the summary results.

Table 4. The maximum MMR weight per iteration

MMR weight iteration   ID    MMR
MMRmax(1)              D4    0.460
MMRmax(2)              D3    0.269
MMRmax(3)              D6    0.147

Table 4 shows the maximum MMR weight obtained in each iteration.
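The iterative selection described above, which stops once every remaining MMR value drops below zero, can be sketched as follows. The sentence ids and similarity values in the demo are hypothetical, not the paper's Table 1 and Table 2 data; λ = 0.8 follows the paper's choice.

```python
def mmr_extract(query_rel: dict[str, float],
                sent_sim: dict[str, dict[str, float]],
                lam: float = 0.8) -> list[str]:
    """Iterative MMR sentence extraction following eq. (5).

    query_rel: Sim1(Si, Q) per sentence id (as in Table 1).
    sent_sim:  Sim2(Si, Sj) matrix between sentences (as in Table 2).
    lam:       the lambda trade-off; the paper uses 0.8.
    """
    candidates = set(query_rel)
    summary: list[str] = []
    while candidates:
        def score(si: str) -> float:
            # Redundancy term: max similarity to any already-selected sentence.
            redundancy = max((sent_sim[si].get(sj, 0.0) for sj in summary),
                             default=0.0)
            return lam * query_rel[si] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        if score(best) <= 0:          # stop once all MMR values fall below zero
            break
        summary.append(best)
        candidates.remove(best)
    return summary

# Hypothetical example: d2 is penalized for being redundant with d1,
# and d3 is dropped because its MMR score is not positive.
query_rel = {"d1": 0.5, "d2": 0.4, "d3": 0.0}
sent_sim = {"d1": {"d2": 0.9, "d3": 0.0},
            "d2": {"d1": 0.9, "d3": 0.0},
            "d3": {"d1": 0.0, "d2": 0.0}}
summary = mmr_extract(query_rel, sent_sim, lam=0.8)
```

The selected ids come out in extraction order, mirroring how the paper's iterations pick D4, then D3, then D6 from its own tables.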
Iterations are carried out as many times as there are documents resulting from sentence splitting, but only those with a positive value (between 0 and 1) are taken into the summary. The MMR calculation results determine which documents form the summary, ordered by the highest sentence MMR weight, as in Table 5. From the maximum MMR iterations, the order of the relevant documents for the summary has been determined; these documents are sorted from highest to lowest value between weights 0 and 1, and higher results take the earlier positions in the summary. Because the maximum marginal relevance calculation takes the highest value from all iterations, documents D4, D3, and D6 are the most relevant and are considered the sentences that best match the keywords or query. The maximum marginal relevance procedure for summary extraction can be seen in Algorithm 1.

Lines 1 to 6 of Algorithm 1 delete and create tables, starting with the cosine, MMR, and summary tables. The next stage calls the data in the cosine table, as in program line 7, and then reads the records in a loop over the number of data rows in line 8. Lines 9 to 15 determine the number of documents stored in the cosine table, calling the value field based on the document's value. In program line 16, this is repeated for the total number of documents; in this iteration, the cosine table is called with an SQL command in line 17. Lines 18 to 24 perform the calculation, and the results of the MMR calculation are stored in the MMR table in line 25. In addition, data updates in the table are also performed.

Table 5.
Summary extraction

D4: In the student affairs section of SMA Negeri 1 Tarakan, the process of class promotion and student majors is still carried out in a simple manner by holding meetings, and data is processed and stored using Microsoft Excel; it therefore takes a long time to calculate the process of class promotion and student majors due to the large number of students that must be handled by SMA Negeri 1 Tarakan, so the quality of the results of the promotion and majors process is less accurate, slow, and tends to produce different decisions between students.

D3: The student affairs section is also a center for processing student data and is also tasked with determining the process of grade promotion and student majors at SMA Negeri 1 Tarakan.

D6: The application in this program starts from the decision tree process for class promotion; the decision tree process for the science (natural science), social science, and language majors; the student entry process; the class promotion process; the majors process; and the class promotion and majors reports.

Lines 26 to 31 call the MMR table, reading the document field and storing it in the summary table. The cosine table is then deleted based on the document field, the MMR table is deleted, and the MMR table is re-created, as in program lines 32 to 34. Lines 35 to 37 call and read the summary tables, accommodated in the variable data_mmr in the form of arrays. Program line 38 initializes a sentence variable with an empty value for concatenating the following sentences. Lines 39 to 45 read the summary table with SQL commands based on the final field being greater than 0, repeated for the number of summary rows returned by the SQL search; each sentence is then repeatedly appended to the previous ones. The summary evaluation is measured by comparing the manual and automated summaries [41].
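The comparison of an automatic summary against a manual one can be sketched as set overlap on sentence ids, computing precision, recall, and F-measure; the K098 example below reproduces the 67% figures the paper reports for that abstract.

```python
def evaluate(model: set[int], manual: set[int]) -> tuple[float, float, float]:
    """Precision, recall and F-measure over summary sentence ids."""
    overlap = len(model & manual)              # sentences in both summaries
    precision = overlap / len(model) if model else 0.0
    recall = overlap / len(manual) if manual else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Abstract K098 (Table 6): the model picked sentences {2, 3, 5},
# the manual summary picked {2, 4, 5}; the overlap is {2, 5}.
p, r, f = evaluate({2, 3, 5}, {2, 4, 5})
```

Averaging these three values over all evaluated abstracts gives the aggregate figures reported in the results tables.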
Manual summaries were obtained from 20 respondents and compared using the precision, recall, and F-measure values in (6) to (8):

Precision = |summarization sentences ∩ manual summary sentences| / |summarization sentences| (6)

Recall = |summarization sentences ∩ manual summary sentences| / |manual summary sentences| (7)

F-measure = (2 × Precision × Recall) / (Precision + Recall) (8)

Algorithm 1: Maximum marginal relevance for summary extraction
 1: alter file cosine drop field value
 2: alter file cosine and add field value as real
 3: drop file mmr
 4: create file mmr has field document, last
 5: drop file summary
 6: create file summary has field document, last
 7: read all fields and count as field name total (from file cosine)
 8: while (row is not empty)
 9:   total ← get field total
10: end while
11: for no ← 1 until no <= total as no = no + 1
12:   read document, value where field value in
13:     read maximum value of field value (from file cosineperkal)
        where field document has a string like "no"
14:   while (row is not empty)
15:     value ← get field value
16:     document ← "d".no
17:   end while
18:   write file cosine set field valuekal = value where field document = document
19:   for no1 ← 1 until no1 <= total as no1 = no1 + 1
20:     read all fields (from file cosine)
21:     while (row is not empty)
22:       document ← get field document
23:       value ← get field value
24:       valuekal ← get field valuekal
25:       left ← 0.8 * value
26:       right ← (1 - 0.8) * valuekal
27:       finalresult ← left - right
28:     end while
29:     write file mmr set field document = document, field last = finalresult
30:     write file cosine set field value = finalresult where field document = document
31:     read document, last where field last in
32:       read maximum value of field last (from file mmr)
33:     while (row is not empty)
34:       document ← get field document
35:       last ← get field last
36:     end while
37:     write file summary set field document = document, field last = last
38:     delete file cosine where field document = document
39:     drop file mmr
40:     create table mmr has fields document, last
41:     read all fields from summary
42:     while (row is not empty)
43:       data_mmr[] ← get field document + " = " + field last
44:       set sentence1 ← ""
45:     end while
46:     read all fields where field last > 0 (from file summary)
47:     while (row is not empty)
48:       document ← get field document
49:     end while
50:     read all fields where field code = document (from sentence)
51:     while (row is not empty)
52:       sentence ← get field sentence
53:       sentence1 ← sentence1 + " " + sentence + "."
54:     end while
55:   end for
56: end for

III. Results and Experiments

The data used in the experiments consisted of 200 final project and student thesis abstract documents obtained from the STMIK PPKIA Tarakan library. Testing is done by entering the contents of the student's final project abstract and the abstract keywords; the query is a keyword of the abstract. Sentences taken as a summary represent the query and have a maximum MMR weight [40] between a maximum of 1 and a minimum of 0. The more words similar to the query, the greater the chance for a sentence to be retrieved into the summary. Table 6 shows an example of an evaluation calculation using three documents taken randomly from the abstract document data.

Table 6. Automated summarization and manual summarization

Abstract ID   Summarization of our model   Manual summarization
K098          2, 3, 5                      2, 4, 5
K101          3, 4                         1, 3
K104          11, 12, 13                   12, 13

From the summarization results, a comparison was made with the respondents' manual summaries. The recall, precision, and F-measure can be seen in Table 7, which shows the results of the precision, recall, and F-measure calculations. These examples produce an average precision of 61%, recall of 72%, and F-measure of 66%.

Table 7.
Results of example calculations of precision, recall, and F-measure comparing summarization and manual summaries

Abstract ID   Precision   Recall   F-measure
K098          67%         67%      67%
K101          50%         50%      50%
K104          67%         100%     80%
Average       61%         72%      66%

Table 8 summarizes the results for all 200 abstract documents of student final assignments. The summary of this model is then compared with the manual summaries made by 20 people for the same 200 abstract documents.

Table 8. Summarization results

Abstract ID   Docs   Summary (doc ID)   |   Abstract ID   Docs   Summary (doc ID)
K001          2      7, 8               |   K101          2      3, 4
K002          1      10                 |   K102          1      1
K003          2      1, 2               |   K103          2      1, 7
K004          1      1                  |   K104          3      11, 12, 13
K005          2      1, 4               |   K105          2      5, 15
:             :      :                  |   :             :      :
K095          2      1, 3               |   K195          2      1, 4
K096          1      3                  |   K196          3      2, 3, 5
K097          2      1, 4               |   K197          1      1
K098          3      2, 3, 5            |   K198          3      1, 2, 6
K099          2      6, 10              |   K199          2      1, 11
K100          2      1, 2               |   K200          3      3, 5, 8

Table 9. Comparison results

Code      Overlap   Precision   Recall   F-measure
K001      2         100%        67%      80%
K002      1         100%        50%      67%
K003      1         50%         33%      40%
K004      1         100%        33%      50%
K005      2         100%        50%      67%
:         :         :           :        :
K195      1         50%         50%      50%
K196      2         67%         100%     80%
K197      1         100%        50%      67%
K198      2         67%         67%      67%
K199      2         100%        67%      80%
K200      2         67%         100%     80%
Average             88%         61%      70%

Table 9 shows that the comparison between the summarization and the manual summaries has an average precision of 88%, recall of 61%, and F-measure of 70%. Table 10 shows the comparison between the MMR summary result and another model [41]. As seen in Table 10, BERT2-GPT-ID produces a shorter summary than MMR; however, the MMR summary is more comprehensive than the baseline. In other words, MMR performs better than BERT2-GPT-ID in this respect.

Table 10. An example of an MMR test compared to BERT2-GPT-ID

Abstract: Modeling is a real-system representation of objects in a mathematical form with logical relations.
In general, a simulation is defined as a dynamic representation of a portion of the real world using a computer, running over a certain time. One of the modeling techniques is discrete event simulation (DES), which models a system that changes every unit of time; this method is stochastic, dynamic, and discrete-event. Many fast food restaurants offer a variety of menus and services to satisfy consumers. The Kentucky Fried Chicken restaurant, Tarakan branch, is one of the most popular fast food restaurants. With the increasing number of delivery service users and the different distances involved, travel times also differ, giving rise to new problems in the delivery process. The problem that often occurs at the Tarakan branch KFC restaurant is that at certain times KFC receives orders from very many consumers, which can slow the process of sending orders to consumers due to the limited number of employees who specifically handle the delivery service; this creates a queue in the order delivery process. In this final project, a discrete event simulation model will be implemented using a combination of fixed-increment time advance and next-event time advance to overcome the problems that occur at the Kentucky Fried Chicken restaurant, Tarakan branch, using the Delphi 7.0 programming language.

MMR summary: In this final project, a discrete event simulation model will be implemented using a combination of fixed-increment time advance and next-event time advance to overcome problems that occur at the Kentucky Fried Chicken restaurant, Tarakan branch, using the Delphi 7.0 programming language.

BERT2-GPT-ID summary: The simulation is based on the subject of a real system of objects using a computer and runs at a certain time.

Keywords: discrete event simulation, message delivery, Kentucky Fried Chicken

IV. Conclusion

Several conclusions are obtained from the discussion and experiments previously conducted in this study.
Documents with the highest maximum marginal relevance value from the calculation are taken as the summary. Sentences taken as a summary represent sentences in the documents that are similar to the query as well as similar to other sentences in the documents. The maximum marginal relevance calculation iterates over combinations of the query relevance and sentence similarity matrices. The query relevance weight is the result of comparing the similarity between the query and the documents, while the sentence similarity weight is the result of comparing similarities between documents. Vector space modeling is used for query relevance, and cosine similarity for sentence similarity. From the lambda test comparing lambda values of 0.8, 0.3, and 0.9, it can be concluded that a lambda value closer to 1 produces a more relevant summary. Based on a comparison between the summarization and the manual summaries of 200 student final assignment and thesis documents, the experiments yield an average precision of 88%, recall of 61%, and F-measure of 70%. Moreover, the time needed to summarize one document depends on the number of sentences obtained from document splitting: the more sentences in the document, the longer the summarization takes. As for future work, the comparison with the manual summaries shows that several abstracts have a low F-measure value because the query sometimes does not describe the content, and the retrieved sentences are not in good sentence order; it is therefore recommended to use a generator for abstract keywords. Quality measurement with other parameters, such as F-score and NMI, is also possible in future research.
Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering and Informatics, Universitas Negeri Malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] E. I. Setiawan, V. Natalie, J. Santoso, and K. Fujisawa, "Sequential pattern mining to support customer relationship management at beauty clinics," Bulletin of Social Informatics Theory and Application, vol. 6, no. 2, pp. 168–176, 2022.
[2] M. F. Mridha, A. A. Lima, K. Nur, S. C. Das, M. Hasan, and M. M. Kabir, "A survey of automatic text summarization: progress, process and challenges," IEEE Access, vol. 9, pp. 156043–156070, 2021.
[3] M. Wang, X. Wang, and C. Xu, "An approach to concept-obtained text summarization," in IEEE International Symposium on Communications and Information Technology (ISCIT 2005), 2005, pp. 1337–1340.
[4] E. S. Negara and D. Triadi, "Topic modeling using latent Dirichlet allocation (LDA) on Twitter data with Indonesia keyword," Bulletin of Social Informatics Theory and Application, vol. 5, no. 2, pp. 124–132, 2021.
[5] E. Hovy, "Text summarization chapter 32," Information Sciences Institute of the University of Southern California, 2003.
[6] H. Haviluddin and R. Alfred, "Big data: issues, trends, problems, controversies in ASEAN perspective," Bulletin of Social Informatics Theory and Application, vol. 3, no. 2, pp. 80–93, 2019.
[7] B. Prasetyo, F. S. Aziz, K. Faqih, W. Primadi, R. Herdianto, and W. Febriantoro, "A review: evolution of big data in developing country," Bulletin of Social Informatics Theory and Application, vol. 3, no. 1, pp. 30–37, 2019.
[8] J. K. Lê and T. Schmid, "The practice of innovating research methods," Organizational Research Methods, vol. 25, no. 2, pp. 308–336, 2022.
[9] H. C. Manh, H. Le Thanh, and T. L. Minh, "Extractive multi-document summarization using k-means, centroid-based method, MMR, and sentence position," in Proceedings of the 10th International Symposium on Information and Communication Technology, 2019, pp. 29–35.
[10] S. Tuhpatussania, E. Utami, and A. D. Hartanto, "Comparison of LexRank algorithm and maximum marginal relevance in summary of Indonesian news text in online news portals," Jurnal Pilar Nusa Mandiri, vol. 18, no. 2, pp. 187–192, 2022.
[11] J. Goldstein and J. G.
carbonell, “summarization:(1) using mmr for diversity-based reranking and (2) evaluating summaries,” in tipster text program phase iii: proceedings of a workshop held at baltimore, maryland, october 13-15, 1998, 1998, pp. 181–195. [12] d. gunawan, s. h. harahap, and r. f. rahmat, “multi-document summarization by using textrank and maximal marginal relevance for text in bahasa indonesia,” in 2019 international conference on ict for smart society (iciss), 2019, pp. 1–5. [13] d. p. purbawa, r. n. e. anggraini, r. sarno, and others, “automatic text summarization using maximum marginal relevance for health ethics protocol document in bahasa,” in 2021 13th international conference on information & communication technology and system (icts), 2021, pp. 324–329. [14] y. mao, y. qu, y. xie, x. ren, and j. han, “multi-document summarization with maximal marginal relevance-guided reinforcement learning,” arxiv preprint arxiv:2010.00117, 2020. [15] p. gupta, s. nigam, and r. singh, “a ranking based language model for automatic extractive text summarization,” in 2022 first international conference on artificial intelligence trends and pattern recognition (icaitpr), 2022, pp. 1–5. [16] a. mahajani, v. pandya, i. maria, and d. sharma, “ranking-based sentence retrieval for text summarization,” in smart innovations in communication and computational sciences: proceedings of icsiccs-2018, 2019, pp. 465– 474. [17] x. jiang, x.-z. fan, z.-f. wang, and k.-l. jia, “improving the performance of text categorization using automatic summarization,” in 2009 international conference on computer modeling and simulation, 2009, pp. 347–351. [18] s. cahyawijaya et al., “nusacrowd: open source initiative for indonesian nlp resources,” arxiv preprint arxiv:2212.09648, 2022. [19] m. hassel, “evaluation of automatic text summarization,” licentiate thesis, stockholm, sweden, pp. 1–75, 2004. [20] e. hovy and c.-y. 
lin, “automated text summarization in summarist, advances in automatic text summarization.” mit press, 1999. [21] m. o. el-haj and b. h. hammo, “evaluation of query-based arabic text summarization system,” in 2008 international conference on natural language processing and knowledge engineering, 2008, pp. 1–7. [22] y. mao, “guided text summarization with limited supervision,” 2022. [23] a. p. widyassari et al., “review of automatic text summarization techniques & methods,” journal of king saud university-computer and information sciences, vol. 34, no. 4, pp. 1029–1046, 2022. [24] r. a. garcía-hernández and y. ledeneva, “word sequence models for single text summarization,” in 2009 second international conferences on advances in computer-human interactions, 2009, pp. 44–48. [25] r. m. losee, “term dependence: a basis for luhn and zipf models,” journal of the american society for information science and technology, vol. 52, no. 12, pp. 1019–1025, 2001. [26] b. toth, d. hakkani-tür, and s. yaman, “summarization-and learning-based approaches to information distillation,” in 2010 ieee international conference on acoustics, speech and signal processing, 2010, pp. 5306–5309. [27] s. basak, m. d. d. h. gazi, and s. m. mazharul hoque chowdhury, “a review paper on comparison of different algorithm used in text summarization,” intelligent data communication technologies and internet of things: icici 2019, pp. 114–119, 2020. [28] d. yadav et al., “qualitative analysis of text summarization techniques and its applications in health domain,” comput intell neurosci, vol. 2022, 2022. [29] s. xie, automatic extractive summarization on meeting corpus. the university of texas at dallas, 2010. [30] s. xie and y. liu, “using corpus and knowledge-based similarity measure in maximum marginal relevance for meeting summarization,” in 2008 ieee international conference on acoustics, speech and signal processing, 2008, pp. 4985–4988. [31] n. yusliani, r. primartha, and m. d. 
marieska, “multiprocessing stemming: a case study of indonesian stemming,” international journal computer and applications (ijca), vol. 182, no. 40, pp. 15–19, 2019. [32] m. i. aziz, “development program application to the measurement of documents resemblance text mining, tfidf, and vector space model algoritm,” undergraduate program, faculty of industrial engineering, gunadarma university, 2010. http://pubs.ascee.org/index.php/businta/article/view/239 http://pubs.ascee.org/index.php/businta/article/view/239 https://www.pubs.ascee.org/index.php/businta/article/view/162 https://www.pubs.ascee.org/index.php/businta/article/view/162 https://doi.org/10.1177/1094428120935498 https://doi.org/10.1177/1094428120935498 https://doi.org/10.1145/3368926.3369688 https://doi.org/10.1145/3368926.3369688 https://doi.org/10.1145/3368926.3369688 https://ejournal.nusamandiri.ac.id/index.php/pilar/article/view/3190 https://ejournal.nusamandiri.ac.id/index.php/pilar/article/view/3190 https://ejournal.nusamandiri.ac.id/index.php/pilar/article/view/3190 https://aclanthology.org/x98-1025.pdf https://aclanthology.org/x98-1025.pdf https://aclanthology.org/x98-1025.pdf https://doi.org/10.1109/iciss48059.2019.8969785 https://doi.org/10.1109/iciss48059.2019.8969785 https://doi.org/10.1109/iciss48059.2019.8969785 https://doi.org/10.1109/icts52701.2021.9607951 https://doi.org/10.1109/icts52701.2021.9607951 https://doi.org/10.1109/icts52701.2021.9607951 https://arxiv.org/abs/2010.00117 https://arxiv.org/abs/2010.00117 https://doi.org/10.1109/icaitpr51569.2022.9844187 https://doi.org/10.1109/icaitpr51569.2022.9844187 https://doi.org/10.1109/icaitpr51569.2022.9844187 https://doi.org/10.1007/978-981-13-2414-7_43 https://doi.org/10.1007/978-981-13-2414-7_43 https://doi.org/10.1007/978-981-13-2414-7_43 https://doi.org/10.1109/iccms.2009.29 https://doi.org/10.1109/iccms.2009.29 https://arxiv.org/abs/2212.09648 https://arxiv.org/abs/2212.09648 
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=8c5384d4a5b346b000789ba2210089a5e0cef2ec https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=automated+text+summarization+in+summarist%2c+advances+in+automatic+text+summarization&btng= https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=automated+text+summarization+in+summarist%2c+advances+in+automatic+text+summarization&btng= https://doi.org/10.1109/nlpke.2008.4906790 https://doi.org/10.1109/nlpke.2008.4906790 https://www.ideals.illinois.edu/items/124644 https://doi.org/10.1016/j.jksuci.2020.05.006 https://doi.org/10.1016/j.jksuci.2020.05.006 https://doi.org/10.1109/achi.2009.58 https://doi.org/10.1109/achi.2009.58 https://doi.org/10.1002/asi.1155 https://doi.org/10.1002/asi.1155 https://doi.org/10.1109/icassp.2010.5494971 https://doi.org/10.1109/icassp.2010.5494971 https://books.google.co.id/books?hl=id&lr=&id=l1s9dwaaqbaj&oi=fnd&pg=pa114&dq=a+review+paper+on+comparison+of+different+algorithm+used+in+text+summarization&ots=wzmksmxhs8&sig=thhof7iithfbrwhiipumtfyif3c&redir_esc=y#v=onepage&q=a%20review%20paper%20on%20comparison%20of%20different%20algorithm%20used%20in%20text%20summarization&f=false https://books.google.co.id/books?hl=id&lr=&id=l1s9dwaaqbaj&oi=fnd&pg=pa114&dq=a+review+paper+on+comparison+of+different+algorithm+used+in+text+summarization&ots=wzmksmxhs8&sig=thhof7iithfbrwhiipumtfyif3c&redir_esc=y#v=onepage&q=a%20review%20paper%20on%20comparison%20of%20different%20algorithm%20used%20in%20text%20summarization&f=false https://books.google.co.id/books?hl=id&lr=&id=l1s9dwaaqbaj&oi=fnd&pg=pa114&dq=a+review+paper+on+comparison+of+different+algorithm+used+in+text+summarization&ots=wzmksmxhs8&sig=thhof7iithfbrwhiipumtfyif3c&redir_esc=y#v=onepage&q=a%20review%20paper%20on%20comparison%20of%20different%20algorithm%20used%20in%20text%20summarization&f=false https://doi.org/10.1155/2022/3411881 https://doi.org/10.1155/2022/3411881 
https://www.proquest.com/openview/2a0be9d183840be8ee57d6b14b2732af/1?pq-origsite=gscholar&cbl=18750 https://doi.org/10.1109/icassp.2008.4518777 https://doi.org/10.1109/icassp.2008.4518777 https://doi.org/10.1109/icassp.2008.4518777 https://www.researchgate.net/profile/novi-yusliani/publication/331133691_multiprocessing_stemming_a_case_study_of_indonesian_stemming/links/5ca2060ba6fdcc1ab5ba0004/multiprocessing-stemming-a-case-study-of-indonesian-stemming.pdf https://www.researchgate.net/profile/novi-yusliani/publication/331133691_multiprocessing_stemming_a_case_study_of_indonesian_stemming/links/5ca2060ba6fdcc1ab5ba0004/multiprocessing-stemming-a-case-study-of-indonesian-stemming.pdf https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=development+program+application+to+the+measurement+of+documents+resemblance+text+mining%2c+tfidf%2c+and+vector+space+model+algoritm&btng= https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=development+program+application+to+the+measurement+of+documents+resemblance+text+mining%2c+tfidf%2c+and+vector+space+model+algoritm&btng= https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=development+program+application+to+the+measurement+of+documents+resemblance+text+mining%2c+tfidf%2c+and+vector+space+model+algoritm&btng= gunawan et al. / knowledge engineering and data science 2023, 6 (1): 57–68 68 [33] g. patil and a. patil, “web information extraction and classification using vector space model algorithm,” int. j. emerg. technol. adv. eng, vol. 1, no. 2, 2011. [34] j. golstein, “genre oriented summarization,” unpublished doctoral thesis submitted to carnegie melon university. received in march, vol. 2, p. 2019, 2008. [35] i. r. musyaffanto, g. b. herwanto, and m. riasetiawan, “automatic extractive text summarization for indonesian news articles using maximal marginal relevance and non-negative matrix factorization,” in 2019 5th international conference on science and technology (icst), 2019, pp. 1–6. [36] j. d. kapoor and k. k. 
devadkar, “generating auto text summarization from document using clustering,” int. j. appl. eng. res. dev., vol. 4, no. 2, pp. 31–34, 2014. [37] m. chen and y. song, “summarization of text clustering based vector space model,” in 2009 ieee 10th international conference on computer-aided industrial design & conceptual design, 2009, pp. 2362–2365. [38] r. singh and s. singh, “text similarity measures in news articles by vector space model using nlp,” journal of the institution of engineers (india): series b, vol. 102, pp. 329–338, 2021. [39] t. xing, z. xiangxian, g. shunli, and z. liman, “automatic summarization of user-generated content in academic q&a community based on word2vec and mmr,” data analysis and knowledge discovery, vol. 4, no. 4, pp. 109– 118, 2020. [40] n. alami, m. el mallahi, h. amakdouf, and h. qjidaa, “hybrid method for text summarization based on statistical and semantic treatment,” multimed tools appl, vol. 80, pp. 19567–19600, 2021. [41] s. cahyawijaya et al., “nusacrowd: open source initiative for indonesian nlp resources,” arxiv preprint arxiv:2212.09648, 2022. 
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol 6, No 2, October 2023, pp.
114–128, eISSN 2597-4637, https://doi.org/10.17977/um018v6i22023p114-128
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Multivariate Analysis Approach to Factor-Affected Tuberculosis Disease

Zuli Agustina Gultom 1, Farid Akbar Siregar 2, Mahardika Abdi Prawira Tanjung 3*, Al-Hamidy Hazidar 4
Universitas Muhammadiyah Sumatera Utara, Jalan Kapten Muchtar Basri, Medan 20238, Indonesia
1 zuliagustina@umsu.ac.id; 2 faridakbar@umsu.ac.id; 3 dika.abdi@gmail.com*; 4 alhamidy@umsu.ac.id
* corresponding author

Article history: Received 14 June 2023; Revised 14 September 2023; Accepted 29 September 2023; Published online 19 October 2023

Abstract: Tuberculosis is a disease caused by infection with the Mycobacterium tuberculosis complex. Tuberculosis attacks organs besides the lungs, such as the pleura, the linings of the brain and heart, the lymph glands, bones, joints, skin, intestines, kidneys, urinary tract, and genitals. The disease is found in densely populated settlements with poor sanitation, a lack of ventilation and sunlight, and insufficient rest. The factors analyzed in this research are population density (X1), number of HIV/AIDS cases (X2), number of toddlers with malnutrition (X3), number of toddlers who received BCG immunization (X4), number of toddlers who received exclusive breastfeeding (X5), total families with PHBS (X6), number of residents with healthy homes (X7), number of families with clean water facilities (X8), number of families with latrine sanitation (X9), number of families with landfills (X10), number of families with waste management sites (X11), number of elementary education facilities (X12), number of junior high school education facilities (X13), number of senior high school education facilities (X14), number of institutions fostered by neighborhood health (X15), number of posyandu (X16), life expectancy (X17), literacy rate (X18), human development index (X19), and number of tuberculosis sufferers (X20). This research aims to analyze which variables influence the prevalence rate of tuberculosis in the city of Surabaya. The methods used are multivariate techniques: factor analysis, cluster analysis, biplot analysis, and discriminant analysis, where the discriminant analysis measures accuracy as (1 - APER). The research found that the number of HIV/AIDS cases, the number of residents with healthy homes, and the number of families with sanitation facilities (latrines, landfills, waste management) are highly correlated with the spread of tuberculosis in Surabaya, while the areas with a high rate of tuberculosis are Tambaksari, Wonokromo, Sawahan, and Semampir. The classification accuracy was 90.32%, so the resulting discriminant function is highly accurate and discriminant analysis can be used to predict tuberculosis prevalence rates.

Keywords: Tuberculosis; Multivariate analysis; Surabaya

I. Introduction

Tuberculosis is an infectious disease caused by Mycobacterium tuberculosis, which can also attack organs other than the lungs [1]. The disease is a problem for developing countries with declining socioeconomic conditions [2]. The prevalence rate of pulmonary tuberculosis in Indonesia is 130 per 100,000 [3]. Every year there are 539,000 new cases, and the number of deaths is around 101,000 people per year [4]. The incidence of AFB-positive (acid-fast bacilli) pulmonary tuberculosis is around 110 per 100,000 population [5]. TBC (tuberculosis) is the third leading cause of death, after heart disease and respiratory disease [6]. According to [5], Indonesia ranks fifth after India, China, South Africa, and Nigeria.

The leading causes of the growing tuberculosis problem are declining socio-economic conditions in developing countries [7], environmental conditions inside and outside the home that favor the occurrence of TB (tuberculosis) [8], demographic changes due to the increasing world population and changes in its age structure [9], and the impact of the HIV/AIDS pandemic [1]. The tuberculosis program has also not been optimally implemented: health infrastructure is poor in countries that experienced an economic crisis, and tuberculosis services are inadequately implemented (less accessible to the public, with no guaranteed provision of anti-tuberculosis drugs (OAT) and non-standard monitoring, recording, and reporting) [1]. The prevalence rate of tuberculosis is not only a medical problem; socio-economic conditions and environmental factors also have an influence [11].
For example, people with low socioeconomic status tend to live in slum areas, in unhealthy houses with poor air circulation, no sanitation, poor nutritional conditions, and a lack of clean water. According to research conducted by Sejati and Sofiana [12], people with family incomes below the minimum wage have a 1.123 times higher risk of being infected with TB (tuberculosis) than those with incomes above it. The factors that influence the prevalence rate of tuberculosis are population density, the number of HIV/AIDS cases, the number of toddlers with malnutrition, the number of toddlers who received BCG (Bacillus Calmette-Guerin) immunization, the number of toddlers who received exclusive breastfeeding, the number of families with PHBS (clean and healthy living behavior), the number of residents with healthy homes, the number of families with clean water facilities, the number of families with latrine sanitation, the number of families with landfills, the number of families with waste management sites, the numbers of elementary, junior high, and senior high school education facilities, the number of institutions fostered by environmental health, the number of posyandu (integrated healthcare centers), life expectancy, the literacy rate, the human development index, and the number of TB sufferers. Education level is one of the factors that influences the incidence of tuberculosis [13]: the higher a person's education level, the lower the incidence of tuberculosis [14], because a well-educated person obtains and absorbs information about tuberculosis more readily and can treat it properly. Apart from education, lighting or sunlight entering the house and the ventilation conditions of the house are also factors that influence the incidence of pulmonary tuberculosis [15].
Surabaya is the second largest city in Indonesia, with an area of approximately 326.37 km2; administratively, it is divided into 31 districts and 163 sub-districts, with a population of approximately 2,912,197 people [16]. Based on [17], the highest tuberculosis burden in East Java is in Surabaya: at least 4,493 residents of Surabaya have tuberculosis. The disease is found in densely populated settlements with poor sanitation, a lack of ventilation and sunlight, and insufficient rest [18]. TB cases in Surabaya are significant compared with other cities [19], so research or theoretical studies on the factors influencing the tuberculosis prevalence rate are needed. Different characteristics, such as economic conditions and sociocultural factors, in each region of Surabaya lead to different health quality [20], so it is necessary to group areas by the characteristics of tuberculosis incidence. The goal of this research is to find out which factors influence the prevalence rate of tuberculosis in the city of Surabaya and to group regions based on the characteristics of tuberculosis incidence, in the hope that this research can help the Surabaya city government handle tuberculosis quickly and accurately. The analysis technique used is multivariate analysis, which tests more than two variables simultaneously. Multivariate approaches are divided into two main families, dependency and interdependence methods [21]; this research uses an interdependence approach. The multivariate methods used are factor analysis, cluster analysis, biplot analysis, and discriminant analysis. Factor analysis is used to reduce the variables to a smaller number of new variables. Cluster analysis groups the observed areas based on the number of tuberculosis cases and the factors influencing them.
Biplot analysis shows the closeness between objects, the characteristics or variables that distinguish each object, and the relationships between variables. Discriminant analysis is conducted to determine the differentiating variables and the classification accuracy of the groupings obtained. All of the factors examined are independent variables. The variables are grouped into new variables, the sub-districts are grouped by their characteristics, and the mapping of the sub-district areas and the classification accuracy of each factor are determined. Multivariate analysis is therefore appropriate for this research; by knowing which factors influence the prevalence rate of tuberculosis and which areas have many tuberculosis sufferers, it is hoped that the government will respond more quickly in its handling of tuberculosis sufferers.

II. Methods

The analysis steps for this research are presented in Figure 1. The data used are secondary data from the health services, Badan Pusat Statistik (BPS), and Badan Perencanaan Pembangunan Kota Surabaya (Bappeko) [22]. The data concern the prevalence rate of tuberculosis in the city of Surabaya. The observation units studied were the 31 sub-districts of Surabaya: Krembangan, Gubeng, Tegalsari, Bubutan, Simokerto, Kenjeran, Tandes, Rungkut, Sukolilo, Mulyorejo, Sukomanunggal, Lakarsantri, Gayungan, Genteng, Tenggilis, Karang Pilang, Wonocolo, Gunung Anyar, Dukuh Pakis, Jambangan, Bulak, Wiyung, Asemrowo, Benowo, Pakal, Sambikerep, Pabean Cantikan, Tambaksari, Wonokromo, Sawahan, and Semampir.

Fig. 1. Analysis steps

The epidemiological factors for tuberculosis are BCG vaccination, inaccurate diagnosis, inadequate treatment, and control programs not implemented
appropriately; other factors are endemic HIV infection, resident migration, self-medication (self-treatment), increasing poverty, and inadequate health services [23]. A factor that is no less important in TB epidemiology is socioeconomic status: low income, overcrowded housing, unemployment, and low education [24]. The variables examined in this research are therefore population density (X1), number of HIV/AIDS cases (X2), number of toddlers with malnutrition (X3), number of toddlers who received BCG immunization (X4), number of toddlers who received exclusive breastfeeding (X5), total families with PHBS (clean and healthy living behavior) (X6), number of residents with healthy homes (X7), number of families with clean water facilities (X8), number of families with latrine sanitation (X9), number of families with landfills (X10), number of families with waste management sites (X11), number of elementary education facilities (X12), number of junior high school education facilities (X13), number of senior high school education facilities (X14), number of institutions fostered by neighborhood health (X15), number of posyandu (X16), life expectancy (X17), literacy rate (X18), human development index (X19), and number of tuberculosis sufferers (X20). The clustering methods used are single linkage, complete linkage, average linkage, and Ward's method. The single linkage method determines the distance between clusters from the distance between the two closest members of the existing clusters (the nearest-neighbor rule) [25]. The complete linkage (farthest-neighbor) method uses the largest distance between two objects in different clusters [26]. Ward's method aims to obtain clusters with the smallest possible within-cluster variance [27]. This method is very commonly used for determining clusters.
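The linkage rules above differ only in how the distance between two clusters is measured at each merge step. A minimal SciPy sketch, using synthetic data in place of the study's 31-district table (which is not reproduced here), builds the dendrogram for each criterion and cuts it into four groups as the paper does:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the 31 x 20 standardized factor table
# (four well-separated groups; NOT the study's actual data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(8, 20)) for c in (0, 3, 6, 9)])[:31]

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                    # merge tree (dendrogram)
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into 4 groups
    print(method, np.bincount(labels)[1:])           # cluster sizes per method
```

On well-separated synthetic groups all four criteria agree; on real, noisier data single linkage tends to chain, which is one reason a minimum-variance criterion such as Ward's is often preferred.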
This method is obtained by calculating the average value of each cluster and then calculating the Euclidean distance between the objects.

III. Result and Discussion

Figure 2 shows a map of the number of tuberculosis patients in Surabaya. Purple marks the sub-district group with a low number of tuberculosis cases, ranging from 61 to 114: Tandes, Sukomanunggal, Pabean Cantikan, Bubutan, Simokerto, Genteng, Tegalsari, Gubeng, Wonokromo, Wonocolo, Rungkut, and Sukolilo. Brown marks the sub-districts with the highest number of tuberculosis cases, ranging from 114 to 201: Sawahan, Krembangan, Semampir, and Kenjeran. The lowest group, ranging from 16 to 61, comprises Pakal, Benowo, Asemrowo, Sambikerep, Lakarsantri, Dukuh Pakis, Wiyung, Karang Pilang, Jambangan, Gayungan, Gunung Anyar, Mulyorejo, Bulak, and Tenggilis Mejoyo. White marks the sub-district with tuberculosis in the moderate category, Tambaksari.

Fig. 2. A map of the number of tuberculosis patients in Surabaya

Factor analysis is used to reduce the data dimensions, explaining as much of the diversity of the data as possible with a smaller set of variables than the initial ones, without losing the important information they contain. The inter-correlation test uses Bartlett's test, and data adequacy is checked with the KMO. The Kaiser-Meyer-Olkin (KMO) test is a statistical measure of how suited data are to factor analysis [28]. The test measures sampling adequacy for each variable in the model and for the complete model; the statistic measures the proportion of variance among the variables that might be common variance. The higher the proportion, the higher the KMO value and the more suited the data are to factor analysis [29]. The correlation testing hypotheses are as follows.
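As a computational aside before the formal hypothesis test, both adequacy checks, together with the eigenvalue-based factor retention reported later in Table 2, can be sketched as follows. This is an illustrative NumPy/SciPy implementation of the standard formulas, run on synthetic data rather than the study's 31 x 20 table:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is the identity (no correlation)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

def kmo(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv_R = np.linalg.inv(R)
    # Partial correlations follow from the inverse correlation matrix.
    d = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
    partial = -inv_R / d
    off = ~np.eye(R.shape[0], dtype=bool)
    r2, q2 = (R[off] ** 2).sum(), (partial[off] ** 2).sum()
    return r2 / (r2 + q2)

# Synthetic correlated data standing in for the study's factor table.
rng = np.random.default_rng(1)
common = rng.normal(size=(31, 1))
X = common + 0.5 * rng.normal(size=(31, 6))  # six variables sharing one factor

stat, p_value = bartlett_sphericity(X)
print(f"Bartlett approx. chi-square = {stat:.3f}, p = {p_value:.4f}")
print(f"KMO = {kmo(X):.3f}")

# Kaiser criterion: retain components whose eigenvalue exceeds 1 (cf. Table 2).
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
cum_var = 100 * np.cumsum(eigvals) / eigvals.sum()
print("eigenvalues > 1:", int((eigvals > 1).sum()),
      "| cumulative variance %:", cum_var.round(1))
```

A varimax rotation of the retained loadings, as used for Table 3, is available in, for example, scikit-learn's FactorAnalysis via its `rotation="varimax"` option.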
H0: ρ = I (the variables in the data on factors influencing tuberculosis are not correlated)
H1: ρ ≠ I (the variables in the data on factors influencing tuberculosis are correlated)

Table 1. Correlation test and data adequacy
  KMO (Kaiser-Meyer-Olkin): 0.771
  Bartlett's test: approx. chi-square = 649.145, df = 190, sig. = 0.000

Table 1 shows that the chi-square value for the factors that influence tuberculosis is 649.145 with a p-value of 0.000. Because the p-value (0.000) < α (0.05), H0 is rejected, so it can be concluded that the variables affecting tuberculosis are correlated. The KMO value of the data is 0.771; because KMO (0.771) > 0.5, the data on the factors that influence tuberculosis pass the adequacy test and can be analyzed further.

Table 2. Initial eigenvalues
  Component 1: total 10.856, variance 54.281%, cumulative 54.281%
  Component 2: total 2.473, variance 12.363%, cumulative 66.644%
  Component 3: total 1.310, variance 6.548%, cumulative 73.452%
  Component 4: total 1.033, variance 5.165%, cumulative 78.357%

From Table 2, there are four mutually independent factors, with a cumulative variance of 78.357%. Each variable is assigned to a factor group by selecting its largest loading among factors 1, 2, 3, and 4; the loadings used are varimax-rotated.

Table 3.
Loading factors
  Variable  Factor 1  Factor 2  Factor 3  Factor 4
  X2    0.829   0.117   0.061  -0.021
  X4    0.742   0.445   0.191   0.116
  X7    0.813   0.291   0.018   0.279
  X8    0.601   0.208   0.144   0.586
  X9    0.875   0.364   0.001   0.249
  X10   0.801   0.364   0.036   0.366
  X11   0.839   0.365   0.068   0.300
  X16   0.779   0.459   0.165   0.088
  X20   0.622   0.505  -0.115   0.455
  X1    0.126   0.663   0.070   0.396
  X5    0.444   0.583  -0.173  -0.191
  X6    0.307   0.813  -0.046   0.064
  X12   0.497   0.687   0.108   0.417
  X13   0.458   0.668   0.266   0.245
  X14   0.398   0.620   0.387  -0.085
  X15   0.561   0.582   0.128  -0.062
  X17  -0.082  -0.064   0.846  -0.088
  X18   0.054   0.060   0.823  -0.002
  X19   0.220   0.164   0.820  -0.005
  X3    0.189   0.027  -0.189   0.879

Table 3 shows that the variables grouped into factor 1 are HIV/AIDS, the number of children under five who received BCG immunization, the number of residents with healthy homes, the number of families with clean water facilities, the number of families with sanitation facilities (latrines, landfills, waste management sites), the number of posyandu, and the number of TB patients. Factor 1 reflects the quality of a person's health and is very prominent in the spread of tuberculosis in Surabaya. Factor 2 includes population density, exclusive breastfeeding, clean and healthy living behavior (PHBS), and educational facilities (elementary, junior high, and senior high school); it reflects demography and education. Factor 3 includes life expectancy, the literacy rate, and the human development index; it reflects human development. Factor 4 includes the number of toddlers with malnutrition.

The cluster analysis used in this study is Ward's linkage method with squared Euclidean distance. In Figure 3, the dendrogram is cut into four groups, and the 31 sub-districts of Surabaya are grouped as in Table 4.

Fig. 3.
A map of the number of tuberculosis patients in Surabaya

Table 4 shows that group 1 consists of 11 sub-districts, group 2 of 9 sub-districts, group 3 of 7 sub-districts, and group 4 of 4 sub-districts, namely Tambaksari, Wonokromo, Sawahan, and Semampir.

Table 4. Results of sub-district grouping in Surabaya city
  Group 1: Krembangan, Gubeng, Tegalsari, Bubutan, Simokerto, Kenjeran, Tandes, Rungkut, Sukolilo, Mulyorejo, Sukomanunggal
  Group 2: Lakarsantri, Gayungan, Genteng, Tenggilis, Karang Pilang, Wonocolo, Gunung Anyar, Dukuh Pakis, Jambangan
  Group 3: Bulak, Wiyung, Asemrowo, Benowo, Pakal, Sambikerep, Pabean Cantikan
  Group 4: Tambaksari, Wonokromo, Sawahan, Semampir

Figure 4 depicts health, demographics and education, HDI, and nutrition. It shows that the Krembangan, Semampir, Sukolilo, Kenjeran, Wonokromo, and Tambaksari districts have high health quality, high demography, and high education. The Tegalsari, Mulyorejo, Pabean Cantikan, Genteng, Simokerto, Gubeng, Wonocolo, and Sukomanunggal sub-districts have high demographics and education but low health quality. The Tenggilis, Sambikerep, Dukuh Pakis, Benowo, Pakal, Jambangan, Gayungan, Lakarsantri, Bulak, and Asemrowo sub-districts are characterized by low health quality, low demography, and low education. The Rungkut, Tandes, Gunung Anyar, Karang Pilang, and Sawahan sub-districts have high health quality and high demography but low education.

Fig. 4. Relationship between health and demographics and education

In Figure 5, the Tegalsari, Tandes, Tenggilis, Krembangan, Dukuh Pakis, and Gubeng sub-districts have the characteristics of high HDI and nutritional deficiencies. The Semampir, Kenjeran, Simokerto, Bubutan, and Asemrowo sub-districts have the characteristics of high HDI (human development index) and low malnutrition.
The Pabean Cantikan, Wiyung, Bulak, Karang Pilang, Mulyorejo, Jambangan, Gunung Anyar, Benowo, Pakal, Sukomanunggal, and Sambikerep districts have regional characteristics of HDI and low malnutrition. The Genteng, Gayungan, Lakasantri, Wonokromo, Tambaksari, and Sawahan sub-districts have a high HDI and low malnutrition. A biplot of the area was constructed to map the sub-districts according to the tendency of the variables that influence tuberculosis. Figure 6 shows that the waste management sites variable (x11) has the greatest diversity because its vector is the longest of all the variables, while the malnutrition variable (x3) has the smallest diversity, or tends to be homogeneous, because its vector is the shortest. The positively correlated variables are the number of toddlers who experience malnutrition (x3), literacy rate (x18), clean water facilities (x8), HIV/AIDS (x2), healthy homes (x7), latrine sanitation (x9), landfills (x10), waste management sites (x11), BCG immunization (x4), number of posyandu (x16), environmental health development institutions (x15), number of TB sufferers (x20), exclusive breastfeeding (x5), elementary education facilities (x12), junior high school education facilities (x13), senior high school education facilities (x14), population density (x1), clean and healthy living behavior (x6), and HDI (x19). The variable with a negative correlation is life expectancy (x17).

Fig. 5. Relationship between human development index (HDI) and nutrition

Fig. 6. Biplot between the variables of factor 1
The variables of waste management sites, landfills, latrine sanitation, healthy homes, BCG immunization, exclusive breastfeeding, clean and healthy living behavior, elementary and junior high school education facilities, population density, senior high school education facilities, clean water facilities, malnutrition, literacy rate, HDI, HIV/AIDS, number of posyandu, health development institutions, and number of TB sufferers contribute strongly to the sub-districts of Tambaksari, Wonokromo, Kenjeran, Semampir, and Krembengan. The life expectancy variable contributes to the sub-districts of Asemrowo, Bulak, Jambangan, Gayungan, Lakasantri, Pakal, Tenggilis, Benowo, Dukuh Pakis, and Sambikerep. Meanwhile, the sub-districts of Sawahan, Rungkut, Tandes, Gunung Anyar, Karang Pilang, Benowo, Tenggilis Mejoyo, Pabean Cantikan, Genteng, Bubutan, Tegalsari, Simokerto, Gubeng, and Sukomanunggal do not dominate the variables that affect tuberculosis. Figure 7 shows that life expectancy (x17) has the greatest diversity because its vector is the longest of all the variables, while the population density variable (x1) has the smallest diversity, or tends to be homogeneous, because its vector is the shortest. The positively correlated variables are the human development index (x19), junior high school education facilities (x13), senior high school education facilities (x14), population density (x1), health development institutions (x15), number of posyandu (integrated healthcare centers) (x16), HIV/AIDS (x2), BCG immunization (x4), waste management sites (x11), latrine sanitation (x9), healthy homes (x7), landfills (x10), clean water facilities (x8), exclusive breastfeeding (x5), elementary education facilities (x12), number of toddlers who experience malnutrition (x3), literacy rate (x18), number of TB sufferers (x20), and clean and healthy living behavior (x6). The variable with a negative correlation is life expectancy (x17). Fig. 7.
Biplot between the variables of factor 2

The variables of the human development index, junior and senior high school educational facilities, population density, health development institutions, number of posyandu, HIV/AIDS, number of babies immunized with BCG, waste management sites, latrine sanitation, healthy homes, landfills, number of families with clean water facilities, number of exclusively breastfed babies, elementary school education facilities, number of toddlers who experience malnutrition, literacy rate, number of TB sufferers, and clean and healthy living behavior contribute strongly to the sub-districts of Sawahan, Tandes, Krembengan, Wonokromo, Rungkut, Sukolilo, Tambaksari, Gunung Anyar, Kenjeran, Semampir, and Sukomanunggal. The life expectancy variable contributes significantly to the sub-districts of Gubeng, Tegalsari, Genteng, Lakasantri, Gayungan, Wonocolo, Dukuh Pakis, Jambangan, and Tenggilis Mejoyo. The districts of Bubutan, Asemrowo, Karang Pilang, Mulyorejo, Bulak, Simokerto, Pakal, Benowo, Pabean Cantikan, Wiyung, and Sambikerep do not dominate the variables that affect tuberculosis. Figure 8 shows that the malnutrition variable (x3) has the greatest diversity because its vector is the longest of all the variables, while the human development index has the smallest diversity because its vector is the shortest. Positively correlated are the number of toddlers who experience malnutrition (x3), clean water facilities (x8), healthy homes (x7), elementary education facilities (x12), junior high schools (x13), senior high schools (x14), latrine sanitation (x9), landfills (x10), waste management sites (x11), HIV/AIDS (x2), health development institutions (x15), exclusive breastfeeding (x5), number of TB sufferers (x20), population density (x1), BCG immunization (x4), clean and healthy living behavior (x6), HDI (x19), literacy rate (x18), and number of posyandu (x16).
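The "vector length equals diversity" reading used for Figures 6 through 8 can be reproduced with a small PCA sketch. The data are synthetic and the scaling is one common biplot convention among several, not necessarily the one used by the authors' software.

```python
import numpy as np

# In a PCA biplot, each variable is drawn as a vector whose coordinates are
# its loadings on the two displayed components; a longer vector indicates a
# variable that carries more variance ("greater diversity") in that plane.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 0] *= 5.0                     # make the first variable far more variable

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variable vectors for a 2-component biplot: right singular vectors scaled
# by the singular values (a common, but not the only, scaling choice).
coords = Vt[:2].T * s[:2]
lengths = np.linalg.norm(coords, axis=1)
print(int(lengths.argmax()))       # variable 0 has the longest vector
```

On the study's data, the analogous computation would single out x11, x17, and x3 as the longest vectors in the factor 1, 2, and 3 biplots, respectively.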
The life expectancy variable (x17) correlates negatively with the factors influencing tuberculosis. The sub-districts of Tandes, Semampir, Kenjeran, Rungkut, Krembengan, Sukolilo, Gunung Anyar, Sawahan, Tambaksari, and Wonokromo contribute to the variables of malnutrition, clean water facilities, healthy homes, educational facilities for elementary, junior high, and senior high school, latrine sanitation, landfills, waste management, number of people living with HIV/AIDS, health development institutions, exclusive breastfeeding, number of TB sufferers, population density, BCG immunization, clean and healthy living behavior, HDI, literacy rate, and number of posyandu. Karang Pilang, Jambangan, Sukomanunggal, Wiyung, Benowo, Sambikerep, Lakasantri, Pakal, Bulak, Pabean Cantikan, and Genteng contribute significantly to the life expectancy variable. The districts of Gubeng, Tegalsari, Asemrowo, Bubutan, Wonocolo, Mulyorejo, Gayungan, Simokerto, and Tenggilis Mejoyo do not contribute to the factors that influence tuberculosis.

Fig. 8. Biplot between the variables of factor 3

Before proceeding to discriminant analysis, the assumptions of multivariate normality and homogeneity of the variance-covariance matrix must be checked. The multivariate normality assumption is tested to determine whether the data are normally distributed [30]; multivariate normality is the primary requirement for conducting multivariate analysis. From Figure 9, it can be concluded that the data are normally distributed: the QQ plot tends to form a straight line, so the data can be assumed to follow a multivariate normal distribution and the assumption is accepted. The results of the variance-covariance matrix test can be seen in Table 5. Fig. 9.
Multivariate normal distribution test

The homogeneity of the variance-covariance matrix is tested using the Box's M statistic with the hypotheses:

H0: Σ1 = Σ2 = Σ3 = Σ4
H1: at least one Σj differs

Table 5. The results of the variance-covariance matrix homogeneity test

Statistic                Value
Box's M                  60.093
F approx. (chi-square)   2.211
df1                      190
df2                      1508.643
Sig.                     0.002

H0 is rejected if the p-value is less than 0.05 (this study uses a 95% confidence level). From the test results, it was concluded that the analyzed data have the same covariance matrix, so the discriminant analysis could be continued. The stepwise discriminant analysis produced the results shown in Table 6: of the 19 variables entered into the grouping, only four meet the criteria as differentiators, namely elementary education, landfills, exclusive breastfeeding, and literacy rate (AMH). It can therefore be concluded that the factors distinguishing the tuberculosis groups in Surabaya are education, exclusive breastfeeding, literacy, and sanitation.

Table 6. The results of the stepwise discriminant analysis

Variable                 Wilks' Lambda
Elementary education     0.179
Landfills                0.224
Exclusive breastfeeding  0.511
Literacy rate            0.696

Table 7. The discriminant equation functions

Variable                 Function 1   Function 2   Function 3
Exclusive breastfeeding   0.667       -0.021        0.827
Landfills                 0.714        0.376       -0.311
Elementary education      0.867       -0.337       -0.105
Literacy rate            -0.224        1.019        0.265

After obtaining the discriminant equation coefficients shown in Table 7, the discriminant functions can be written as follows.
Function 1 = 0.667 exclusive breastfeeding + 0.714 landfills + 0.867 elementary education - 0.224 literacy rate
Function 2 = -0.021 exclusive breastfeeding + 0.376 landfills - 0.337 elementary education + 1.019 literacy rate
Function 3 = 0.827 exclusive breastfeeding - 0.311 landfills - 0.105 elementary education + 0.265 literacy rate

The variable with the largest coefficient contributes most to differentiating the groups. Accordingly, Function 1 shows that elementary school education plays the main role in distinguishing the first and second groups; Function 2 shows that the literacy rate significantly differentiates the second and third groups; and Function 3 shows that exclusive breastfeeding differentiates the third and fourth groups. Figure 10 shows that the grouping based on the discriminant functions is largely correct, although not all group members lie around their group's centroid point: in Group 1, one member of Group 3 falls into Group 1; in Group 2, two members fall into Group 3; Groups 3 and 4 lie around their centroid points. In assessing the discriminant analysis, the total accuracy value (1 - APER) is needed, which is based on the classification table (Hosmer & Lemeshow, 2000).

Fig. 10. Plot of the discriminant function

Table 8 shows that the classification accuracy for the four groups formed is 0.9032, or 90.32%. The APER value is therefore 0.0968, meaning the error rate of the discriminant analysis on these data is 0.0968. A few observation units (sub-districts) are misclassified: one observation in Group 3 (Simokerto) should belong to the first row, actual Group 1.
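The 90.32% accuracy and the APER of 0.0968 quoted above follow directly from the classification counts in Table 8:

```python
# Classification accuracy (1 - APER) recomputed from Table 8.
confusion = [
    [10, 0, 1, 0],   # actual group 1
    [0, 7, 2, 0],    # actual group 2
    [0, 0, 7, 0],    # actual group 3
    [0, 0, 0, 4],    # actual group 4
]

correct = sum(confusion[i][i] for i in range(4))   # diagonal: 28 districts
total = sum(sum(row) for row in confusion)         # 31 districts overall

accuracy = correct / total       # 28 / 31
aper = 1 - accuracy
print(round(accuracy, 4), round(aper, 4))  # 0.9032 0.0968
```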
That is, based on the cluster analysis results (which group the number of TB patients by region), Simokerto belongs to Group 1. Based on the cluster analysis of the actual data, Gunung Anyar belongs to Group 2, while in the observations it appears in Group 3. Wiyung, based on the actual data in the cluster analysis, is in Group 3, while in the observations it is in row two.

Table 8. Accuracy of classification

Actual group   Predicted 1   Predicted 2   Predicted 3   Predicted 4
1              10            0             1             0
2              0             7             2             0
3              0             0             7             0
4              0             0             0             4
Accuracy: 0.9032

IV. Conclusion

The study's extensive analysis has yielded significant findings regarding the tuberculosis situation in Surabaya. The sub-districts carrying the greatest tuberculosis burden have been identified as Sawahan, Krembengan, Semampir, Kenjeran, and Tambaksari. This categorization plays a crucial role as an essential first step in developing focused intervention tactics in these areas. Significantly, the study has shed light on the complex relationship between an individual's health status and the spread of tuberculosis in Surabaya. This finding results from a rigorous factor analysis that considered multiple variables: the population of individuals living with HIV/AIDS, BCG immunization coverage among toddlers, the prevalence of households with adequate living conditions, access to clean water facilities, the availability of sanitation facilities such as latrines and waste disposal sites (TPS), the provision of posyandu services, and the incidence of tuberculosis cases. The convergence of these factors reveals the complex network of elements that contribute to the tuberculosis situation in the city. Furthermore, the study has identified regions exhibiting a pronounced susceptibility to the exacerbation of tuberculosis prevalence.
The Tambaksari, Wonokromo, Sawahan, and Semampir districts have been identified as areas of significant concern. This characterization of vulnerability enables proactive actions in various domains, which may encompass intensified healthcare provision, public health awareness campaigns, and improved accessibility of healthcare facilities. The discriminant analysis produced a noteworthy degree of precision, surpassing the criterion of 0.5: it demonstrated an accuracy of 90.32% on the tuberculosis prevalence data, which translates into a meager error rate of only 0.0968 and highlights the strong performance of the model used in this study. Although this study represents a substantial advancement in understanding the prevalence of tuberculosis in Surabaya, it also emphasizes the need for further investigation. To deepen this understanding and improve the effectiveness of intervention approaches, future analyses should consider supplementary health variables that were outside the purview of this study. In addition, comparisons with other analytical methodologies are essential to determine their effectiveness and accuracy, ensuring that the most efficient strategies are used to address tuberculosis in Surabaya. This study establishes the groundwork for a more comprehensive and efficient approach to addressing tuberculosis prevalence in urban areas.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering and Informatics, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] A. A. Chackerian, J. M. Alt, T. V. Perera, C. C. Dascher, and S. M. Behar, "Dissemination of Mycobacterium tuberculosis is influenced by host factors and precedes the initiation of T-cell immunity," Infect. Immun., vol. 70, no. 8, pp. 4501–4509, 2002.
[2] M. Buheji et al., "The extent of COVID-19 pandemic socio-economic impact on global poverty. A global integrative multidisciplinary review," Am. J. Econ., vol. 10, no. 4, pp. 213–224, 2020.
[3] E. A. Wikurendra, N. Herdiani, Y. G. Tarigan, and A. A. Kurnianto, "Risk factors of pulmonary tuberculosis and countermeasures: A literature review," Open Access Maced. J. Med. Sci., vol. 9, no. F, pp. 549–555, 2021.
[4] S. Andarmoyo, "The effect of home ventilation on the incidence of lung tuberculosis in Ponorogo Regency," Dev. Nurs. Curric. to Improv. Qual. Nurs. Educ. Islam. Values Int. Perspect., pp. 85–91, 2015.
[5] W. H. Organization, "Tuberculosis surveillance and monitoring in Europe 2021: 2019 data," 2021.
[6] A. L. Byrne, B. J. Marais, C. D. Mitnick, L. Lecca, and G. B. Marks, "Tuberculosis and chronic respiratory disease: a systematic review," Int. J. Infect. Dis., vol. 32, pp. 138–146, 2015.
[7] L. C. Rodrigues and P. G. Smith, "Tuberculosis in developing countries and methods for its control," Trans. R. Soc. Trop. Med. Hyg., vol. 84, no. 5, pp. 739–744, 1990.
[8] W. H.
Organization, "Intersectoral collaboration to end HIV, tuberculosis and viral hepatitis in Europe and Central Asia: a framework for action to implement the United Nations common position," 2020.
[9] J. B. Dowd et al., "Demographic science aids in understanding the spread and fatality rates of COVID-19," Proc. Natl. Acad. Sci., vol. 117, no. 18, pp. 9696–9698, 2020.
[10] T. Togun, B. Kampmann, N. G. Stoker, and M. Lipman, "Anticipating the impact of the COVID-19 pandemic on TB patients and TB control programmes," Ann. Clin. Microbiol. Antimicrob., vol. 19, no. 1, pp. 1–6, 2020.
[11] A. K. Kashyap and J. C. Stein, "Monetary policy when the central bank shapes financial-market sentiment," J. Econ. Perspect., vol. 37, no. 1, pp. 53–75, 2023.
[12] A. Sejati and L. Sofiana, "Faktor-faktor terjadinya tuberkulosis," KEMAS J. Kesehat. Masy., vol. 10, no. 2, pp. 122–128, 2015.
[13] A. Mollalo, L. Mao, P. Rashidi, and G. E. Glass, "A GIS-based artificial neural network model for spatial distribution of tuberculosis across the continental United States," Int. J. Environ. Res. Public Health, vol. 16, no. 1, p. 157, 2019.
[14] S. K. Singh, G. C. Kashyap, and P. Puri, "Potential effect of household environment on prevalence of tuberculosis in India: evidence from the recent round of a cross-sectional survey," BMC Pulm. Med., vol. 18, no. 1, p. 66, 2018.
[15] I. Yuniar, S. Rusmindarti, and S. Sarwono, "The level of lighting and ventilation on the incidence rate of pulmonary TB," 2021.
[16] S. C. N. Tang, M. Rusli, and P. Lestari, "Climate variability and dengue hemorrhagic fever in Surabaya, East Java, Indonesia," 2019.
[17] N. N. Juliasih, N. M. Mertaniasih, C. Hadi, Soedarsono, R. M. Sari, and I. N. Alfian, "Factors affecting tuberculosis patients' quality of life in Surabaya, Indonesia," J. Multidiscip. Healthc., pp. 1475–1480, 2020.
[18] P.
Pardeshi et al., "Association between architectural parameters and burden of tuberculosis in three resettlement colonies of M-East Ward, Mumbai, India," Cities Health, vol. 4, no. 3, pp. 303–320, 2020.
[19] D. S. Rachmawati, N. Nursalam, M. Amin, and R. Hargono, "Developing family resilience models: indicators and dimensions in the families of pulmonary TB patients in Surabaya," 2019.
[20] S. Hawken and R. Y. Sunindijo, "City of kampung: risk and resilience in the urban communities of Surabaya, Indonesia," Int. J. Build. Pathol. Adapt., vol. 36, no. 5, pp. 543–568, 2018.
[21] Y. Yulianto, N. Robihaningrum, and B. D. Elinda, "Management multivariate analysis methods for variables measurement in scientific papers," APTISI Trans. Manag., vol. 3, no. 1, pp. 65–72, 2019.
[22] B. P. Statistik, "Statistik lingkungan hidup Indonesia," Jakarta: BPS Indonesia, 2018.
[23] D. Sharma, J. Sharma, N. Deo, and D. Bisht, "Prevalence and risk factors of tuberculosis in developing countries through health care workers," Microb. Pathog., vol. 124, pp. 279–283, 2018.
[24] R. Duarte et al., "Tuberculosis, social determinants and co-morbidities (including HIV)," Pulmonology, vol. 24, no. 2, pp. 115–119, 2018.
[25] Ö. Akay and G. Yüksel, "Clustering the mixed panel dataset using Gower's distance and k-prototypes algorithms," Commun. Stat. Comput., vol. 47, no. 10, pp. 3031–3041, 2018.
[26] A. E. Ezugwu et al., "A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects," Eng. Appl. Artif. Intell., vol. 110, p. 104743, 2022.
[27] S. Sharma and N. Batra, "Comparative study of single linkage, complete linkage, and Ward method of agglomerative clustering," in 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), 2019, pp. 568–573.
[28] N. Shrestha, "Factor analysis as a tool for survey analysis," Am. J. Appl. Math. Stat., vol. 9, no. 1, pp. 4–11, 2021.
[29] V. Victor, J. Joy Thoppan, R. Jeyakumar Nathan, and F. Farkas Maria, "Factors influencing consumer behavior and prospective purchase decisions in a dynamic pricing environment—an exploratory factor analysis approach," Soc. Sci., vol. 7, no. 9, p. 153, 2018.
[30] J. W. Osborne and E. Waters, "Four assumptions of multiple regression that researchers should always test," Pract. Assessment Res. Eval., vol. 8, no. 1, p. 2, 2019.
Knowledge Engineering and Data Science (KEDS)  pISSN 2597-4602
Vol 1, No 2, September 2018, pp. 74–78  eISSN 2597-4637
https://doi.org/10.17977/um018v1i22018p74-78

©2018 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Climate Change Vulnerability Forecasting for Southeast Asia Using Deep Learning Algorithm

Amelia Ritahani Ismail a,1,*, Nur 'Atikah binti Mohd Ali a, Junaida Sulaiman b,2

a Department of Computer Science, Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, P.O. Box 10, 50728 Kuala Lumpur, Malaysia
b Soft Computing and Intelligent System (SPInT), Faculty of Computer Systems & Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Kuantan, Pahang, Malaysia
1 amelia@iium.edu.my*; 2 junaida@ump.edu.my
* corresponding author

I. Introduction

Southeast Asia is geographically located in a region highly affected by climate change, particularly droughts, floods, and tropical cyclones. Most areas located on the ocean coast are highly prone to flash floods, which are influenced by the monsoons [1][2]. The livelihoods of the poor in these areas are the most threatened by these problems due to their limited adaptive capacities [2].
On the other hand, some countries have already taken adaptive action by utilizing capital investment from both public and private sectors through cooperation between governments and communities [3]. Most countries in Southeast Asia depend heavily on the climate to support their livelihoods; a change in climate therefore means a change in their lives as well, whether for good or bad. For example, changes in rainfall occurrence and variability affect the economy and infrastructure in countries such as Indonesia and Malaysia [1][2]. Many studies with different approaches have been conducted to assess how much climate change vulnerability affects livelihoods [3]. The results show different outcomes in European countries based on readiness factors: the level of readiness toward climate change in developed countries is good to moderate [3], while developing countries have below-average readiness due to lack of access to climate information [2]. Communities and their supporting agencies are not fully aware of how climate change will affect their country in the future, so they are not well prepared in terms of food, water, habitation, income per capita, infrastructure, health, and governance. When climate change occurs, the most affected sector is agriculture because it depends entirely on the climate; in the worst case, other sectors that depend on agriculture cannot work properly either, and every aspect of that country's condition suffers. This study uses a machine learning approach to predict climate change vulnerability in Southeast Asia. Recently, researchers have applied deep learning to malware detection [4], green energy efficiency [5], marine ecosystem classification [6], and weather forecasting [7].
Article history: received 19 April 2018; revised 6 August 2018; accepted 14 August 2018; published online 31 August 2018.

Abstract: Climate change is expected to change people's livelihoods in significant ways. Several vulnerability factors and readiness factors are used to measure the prediction index of a particular country, indicating how vulnerable that country is to global change. Primary data were collected from the University of Notre Dame Global Adaptation Index (ND-GAIN). The data were trained for forecasting, supported by validated statistical analysis, and the summary of the predicted index was visualized using machine learning tools. The results establish the correlation between vulnerability and readiness factors and show the stability of each country toward climate change. The framework is applied to synthesize findings from prediction index studies in Southeast Asia dealing with vulnerability to climate change. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: deep learning; forecasting; climate change

Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [8]. This method improves the learning process by using the backpropagation algorithm with multiple hidden layers and iterations to discover how the internal parameters change by computing the previous layers. Deep learning is also able to classify unlabeled data, which is difficult to handle with supervised machine learning methods such as support vector machines, decision trees, and naïve Bayes [4]. By using a progressive learning algorithm [8], the data mining task is more effective at recognizing hidden patterns in large data.
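A multi-layer network of the kind described above can be illustrated with a minimal NumPy sketch. This is not the authors' RapidMiner model; the 45-input, 50-hidden-node shape mirrors the vulnerability network described later in the methods, and the weights here are random placeholders.

```python
import numpy as np

# Forward pass of a small feedforward network: 45 inputs, one hidden layer
# of 50 tanh units, and a single linear output node for the forecast value.
rng = np.random.default_rng(0)

W1, b1 = rng.normal(scale=0.1, size=(45, 50)), np.zeros(50)
W2, b2 = rng.normal(scale=0.1, size=(50, 1)), np.zeros(1)

def forward(x):
    """Compute hidden activations, then the linear output."""
    h = np.tanh(x @ W1 + b1)
    return float((h @ W2 + b2)[0])

x = rng.normal(size=45)   # one windowed example of 45 attributes
print(forward(x))
```

Training such a network would adjust `W1`, `b1`, `W2`, and `b2` by backpropagation over many epochs, as the study's tool does internally.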
In this paper, the prediction index measures the correlation between the vulnerability factors and the readiness factors. The elements of vulnerability are capacity, ecosystem, exposure, food, habitat, health, infrastructure, sensitivity, and water, while the elements of readiness are economic, governance, and social.

II. Methods

The data used in this study were collected from the University of Notre Dame Global Adaptation Index (ND-GAIN) [9]. The data comprise the readiness and vulnerability indexes for Southeast Asia over five years and are used to predict both indexes for a further three years (2015–2017). The generated data can also serve as a reference for countries to increase their resilience toward climate change. The description of the elements used in the experiments, based on the dataset, is shown in Table 1. To survive climate change, governments and their supporting agencies need to improve these factors by increasing the readiness level of their country, which can be achieved through good communication and cooperation between the community, the supporting agencies, and the government itself. This study also provides data that can be used as a reference for decision making when implementing development actions for a country. There are six procedures in acquiring the predictive index. The first step is data acquisition: data were collected from the University of Notre Dame Global Adaptation Index (ND-GAIN), which summarizes a country's vulnerability to climate change and other global challenges in combination with its readiness to improve resilience. The second step is data training. The collected data from 1995 until 2014 are transformed using windowing for time series forecasting with a window size of 5. Windowing is a mathematical function that is zero-valued outside its chosen interval [10]; this operator transforms a given example set containing series data into a new example set containing single-valued examples.
Windows with a specified window and step size are moved across the series, and the attribute value lying a fixed horizon after the window end is used as the label to be predicted. Then, the windowed data are trained using the deep learning model to obtain the desired output: with 15 input nodes for the readiness model, 45 input nodes for the vulnerability model, and 50 hidden nodes trained over 25 epochs, the output can be obtained.

Table 1. Description of the elements

Element         Description
Capacity        The society's capabilities, together with support from both public and private sectors, to mitigate potential disruption and counter the negative outcomes of climate events.
Ecosystem       The capability of the society and its supporting sectors to maintain the relationship between organisms and the environment.
Exposure        The awareness of the society and its supporting sectors toward the climate condition.
Food            The capability of the society and its supporting sectors to maintain food sustainability.
Habitat         The degree of habitat resilience to climate change.
Health          The readiness of the society and its supporting sectors for climate change in terms of health.
Infrastructure  The readiness of the infrastructure to respond to the threats of climate change.
Sensitivity     The dependency level of society on the areas affected by climate change.
Water           The capability of the society and its supporting sectors to maintain the water supply.
Economic        The investment from the supporting sectors to facilitate the mobilization of capital.
Governance      The government's level of assurance in securing the invested capital to grow with the help of public services, without any disturbance.
Social          The social conditions for the utilization of the investment by the government to help the society with efficient and equitable usage.
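The windowing transform described above can be sketched in plain Python. This is a rendering of the behavior described for the windowing operator, with the horizon assumed to be 1:

```python
def window(series, size, horizon=1):
    """Turn a series into (attributes, label) examples: each example takes
    `size` consecutive values as attributes and the value `horizon` steps
    after the window end as the label to be predicted."""
    examples = []
    for start in range(len(series) - size - horizon + 1):
        attrs = series[start:start + size]
        label = series[start + size + horizon - 1]
        examples.append((attrs, label))
    return examples

# Twenty annual values with a window size of 5, as used in the study
# (the years stand in for index values purely for illustration):
data = list(range(1995, 2015))
examples = window(data, size=5)
print(len(examples), examples[0])  # 15 ([1995, 1996, 1997, 1998, 1999], 2000)
```

Each tuple becomes one cross-sectional training example: five consecutive index values predict the next one.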
The output layer produced the forecasted values, which then went through a performance model that provided the accuracy. The next step is data testing: the trained model is tested for the next three years (2015–2017) using RapidMiner, a software application providing a visual programming environment for building complete predictive analytics workflows. It offers predefined data preparation and machine learning operators to support data science needs and can be used for various tasks in the data analysis process, such as data preparation, training, testing, validation, and visualization.

The fourth step is index prediction. The predicted index is obtained from the deep learning prediction method. First, the training data is processed using a windowing series with a size of 5; windowing is suitable for time series forecasting in RapidMiner, as it allows the user to take any time series data and transform it into a cross-sectional format. After that, the windowing results are validated using the deep learning algorithm with 29 epochs. Deep learning, an improvement on the backpropagation algorithm, is used for the multiple running processes. After running the algorithm, the performance of the results is calculated to obtain the percentage value of the validation. Finally, the model fitted on the training data is applied to the testing data to get the predicted vulnerability and readiness levels of each country. The predicted score for each country is then calculated using Equation (1):

Score = (Readiness − Vulnerability + 1) × 50 (1)

Index validation is the fifth step: the predicted data for 2015 until 2017 is checked for correctness, where a prediction is considered true if it matches the real data. In the last step, index visualization, the information is described through visual rendering.
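Equation (1) above can be computed directly; a minimal sketch follows, using the average predicted readiness and vulnerability reported in Table 2 (the per-country breakdown is not in this excerpt, so the result here is an illustration, not the paper's per-country index).

```python
# Equation (1): ND-GAIN-style score from readiness and vulnerability,
# both on a 0-1 scale, mapped to a 0-100 score.
def adaptation_score(readiness, vulnerability):
    return (readiness - vulnerability + 1) * 50

# average predicted values from Table 2 (0.476 and 0.493)
score = adaptation_score(0.476, 0.493)
```

Higher readiness and lower vulnerability both push the score up, which is why the equation is a convenient single number for ranking countries.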
The obtained results are visualized using Tableau to make them more understandable and presentable. Tableau is a software application widely used in business intelligence that helps organizations see and understand their data; it connects to the analyzed data and visualizes it in a meaningful way by dragging and dropping attributes, and the results can be shared and published publicly on the web. Fig. 1 and Fig. 2 show part of the visualizations of the predicted index.

Fig. 1. Visualizing the forecasted adaptation index using Tableau

III. Experimental Results

After performing the experiment using the deep learning approach, the predicted adaptation index for each Southeast Asian country is obtained. Each country has a different value for its predicted vulnerability index and readiness index, as well as a different performance accuracy. The predicted vulnerability and readiness indices then go through statistical analysis to validate whether the hypothesis is accepted or rejected. When validating the countries' vulnerability and readiness levels based on the five years' data to evaluate the deep learning study as in Table 1, it is found that 60% of the countries rejected the hypothesis: Indonesia, Laos, Malaysia, Myanmar, the Philippines, and Singapore. The values obtained from the Wilcoxon rank sum test are 0.3095, 0.1493, 0.3095, 0.1337, 0.1587, and 0.09524, respectively. This means that the predicted values have the smallest difference compared to the real data, and the results are promising, which leads to a high potential for acceptance. The other 40%, Brunei, Cambodia, Thailand, and Vietnam, accepted the hypothesis because their predicted values are significantly different from the real data.

Fig. 2. Deployment of the results on the web
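The statistical validation above can be sketched with SciPy's Wilcoxon rank-sum test. The sample arrays below are invented for illustration; the paper reports p-values such as 0.3095 for Indonesia from its own data.

```python
# Hedged sketch of validating predicted index values against real ones
# for a single country using the Wilcoxon rank-sum test.
from scipy.stats import ranksums

real      = [0.45, 0.46, 0.47]   # hypothetical observed indices, 2015-2017
predicted = [0.44, 0.47, 0.48]   # hypothetical model output for the same years

stat, p_value = ranksums(real, predicted)
# a large p-value indicates no significant difference between the two samples,
# i.e., the prediction tracks the real data closely
```

A rank-based test is a reasonable choice here because the yearly index samples are small and no normality assumption is needed.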
Table 2. Average prediction values

Experiment parameter        Predicted value
Readiness                   0.467
Vulnerability               0.46
ND-GAIN                     51.767
Readiness (prediction)      0.476
Vulnerability (prediction)  0.493
Prediction index            49.175
P value (readiness)         0.41
P value (vulnerability)     0.013
Accuracy (readiness)        0.68
Accuracy (vulnerability)    0.564

IV. Conclusions

The study has shown that there is only a slight difference between the results of the deep learning analysis and the original data. Based on the findings, the predicted vulnerability and readiness values obtained using deep learning do differ and affect the adaptation index of the countries. Therefore, the government and the private sector should take action to prepare for any critical aspects, in terms of life, the economy, and others, that can affect the population of the country.

References

[1] S. Hardwinarto and M. Aipassa, "Rainfall monthly prediction based on artificial neural network: a case study in Tenggarong station, East Kalimantan, Indonesia," Procedia Computer Science, vol. 59, pp. 142–151, 2015.
[2] J. B. Sulaiman, H. Darwis, and H. Hirose, "Monthly maximum accumulated precipitation forecasting using local precipitation data and global climate modes," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 18, no. 6, pp. 999–1006, 2014.
[3] C. A. Ezra, "Climate change vulnerability assessment in the agriculture sector: Typhoon Santi experience," Procedia - Social and Behavioral Sciences, vol. 216, pp. 440–451, 2016.
[4] D. Yuxin and Z. Siyi, "Malware detection based on deep learning algorithm," Neural Computing and Applications, pp. 1–12, 2017.
[5] H.-Z. Wang, G.-Q. Li, G.-B. Wang, J.-C. Peng, H. Jiang, and Y.-T. Liu, "Deep learning based ensemble approach for probabilistic wind power forecasting," Applied Energy, vol. 188, pp. 56–70, 2017.
[6] A. Grover, A. Kapoor, and E. Horvitz, "A deep hybrid model for weather forecasting," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), 2015.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[8] A. G. Salman, B. Kanigoro, and Y. Heryadi, "Weather forecasting using deep learning techniques," in 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2015.
[9] "ND-GAIN Index," ND-GAIN. [Online]. Available: http://index.gain.org/. [Accessed: 06-Oct-2016].
[10] B. Deshpande, "Time series forecasting: from windowing to predicting in RapidMiner," SimaFore blog, 5 Nov. 2012. [Online]. Available: http://www.simafore.com/blog/bid/110752/time-series-forecasting-from-windowing-to-predicting-in-rapidminer.

Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 6, No 1, April 2023, pp. 92–102, https://doi.org/10.17977/um018v6i12023p92-102. ©2023 Knowledge Engineering and Data Science. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Long-Term Traffic Prediction Based on Stacked GCN Model

Atkia Akila Karim 1,*, Naushin Nower 2
Institute of Information and Technology, University of Dhaka, Suhrawardi Udyan Rd, Dhaka 1200, Bangladesh
1 msse1760@iit.du.ac.bd*; 2 naushin@iit.du.ac.bd
* corresponding author

I. Introduction

Traffic flow prediction is a crucial research domain focused on anticipating forthcoming traffic patterns within a road network [1]. Recently, this field has garnered increasing interest due to the rapid advancement and adoption of intelligent transportation systems (ITS).
Article history: Received 20 August 2023; Revised 10 September 2023; Accepted 18 September 2023; Published online 24 September 2023.

Abstract: With the recent surge in road traffic within major cities, the need for both short- and long-term traffic flow forecasting has become paramount for city authorities. Previous research efforts have predominantly focused on short-term traffic flow estimation for specific road segments and paths. However, applications of paramount importance, such as traffic management and scheduled routing planning, demand a deep understanding of long-term traffic flow. Due to the intricate interplay of underlying factors, there is a scarcity of studies dedicated to long-term traffic prediction, and previous research has highlighted the challenge of lower accuracy in long-term predictions owing to error propagation within the model. This paper proposes a stacked GCN model that effectively combines the graph convolutional network's (GCN) capacity to extract spatial characteristics from the road network with the stacked architecture's aptitude for capturing temporal context. Our developed model is subsequently employed for traffic flow forecasting within urban road networks. We rigorously compare our method against baseline techniques using two real-world datasets. Our approach significantly reduces prediction errors by 40% to 60% compared to other methods. The experimental results underscore our model's ability to uncover spatiotemporal dependencies within traffic data and its superior predictive performance over baseline models on real-world traffic datasets. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: Traffic flow prediction; Long-term prediction; Graph convolutional network; Segment

Traffic flow prediction is a fundamental and pivotal component within the framework of ITS: it plays a critical role in traffic management and planning and aims to provide better transport management by avoiding congestion. In most megacities, traffic congestion is a significant issue that hinders residents' daily lives and the nation's economic progress [2]. The significant causes of traffic congestion include a rising population, urbanization, poor traffic management, and inadequate transportation infrastructure [3]. The economic burden of traffic congestion in urban centers is steadily increasing globally, affecting nearly every major city. For instance, in Dhaka, traffic congestion results in the loss of five million working hours daily, translating to an annual economic toll ranging from 200 to 550 billion takas [4]. Such severe traffic congestion can harm a nation's economy, hinder foreign investment, disrupt supply and demand dynamics, and contribute to heightened emotional stress among the population [5]. Consequently, timely and precise traffic flow forecasting is immensely valuable to urban residents. Travelers can make better trip arrangements with accurate traffic flow forecasting, reducing traffic congestion, fuel consumption, and carbon emissions [6]. However, because of its intricate spatial and temporal connections and abrupt accidents, traffic flow prediction has always been a complex problem. Numerous specialists and academics have dedicated their research to traffic flow prediction and have developed numerous prediction techniques, which can be categorized by model type: parametric or nonparametric.
Parametric models derive their parameters from the original data, and traffic forecasts are then executed based on predefined regression functions. Various traditional parametric models, such as the ARIMA model [7], the KF model [8][9], and different variations of ARIMA, have been utilized for traffic flow prediction. Nevertheless, due to traffic flow's nonlinear and stochastic nature, these models often struggle to provide accurate predictions. Consequently, nonparametric models have been introduced, including random forests [10], support vector machines [11][12], fuzzy logic models [13], Bayesian networks [14], k-nearest neighbors methods [15][16], neural network models [17][18], and hybrid combinations of these algorithms. These models can handle spatiotemporal data, although their effectiveness may vary depending on the application and dataset size, and despite their superior performance they encounter challenges when dealing with extensive traffic datasets. To address these challenges, deep learning networks have become increasingly prevalent, as they can handle large datasets and improve prediction accuracy by utilizing multiple layers to extract intricate traffic characteristics. For instance, Wu and Tan [19] introduced a model featuring a one-dimensional convolutional neural network (CNN) for capturing spatial features and incorporated two long short-term memory (LSTM) layers to capture temporal patterns. Duan et al. [20] adopted a CNN for spatial features and combined it with an LSTM for temporal feature extraction; additionally, they employed a greedy training policy to reduce training time and enhance accuracy, especially in deeper networks. However, CNNs have inherent limitations when dealing with complex topological structures: they were initially designed for Euclidean spaces such as images and regular grids, making them less suitable for characterizing the spatial intricacies and dependencies within road networks.
The graph convolutional network (GCN) [21] was introduced to address this limitation. GCN represents the traffic network as a graph and effectively captures spatial attributes from neighboring nodes. In another study [22], GCN was combined with LSTM and multitask learning for traffic flow prediction, capturing global and local traffic flow correlations along road segments. This model leveraged GCN within an undirected graph framework to depict the spatial distribution patterns of taxi trips and used LSTMs to capture temporal features; the use of multitask learning additionally enhanced the model's generalizability. In [23], an approach called hierarchical graph convolution networks (HGCN) was proposed, operating on both micro and macro traffic graphs. That study recognized the hierarchical structure of traffic systems, comprising micro layers (road networks) and macro layers (region networks). In [24], the authors emphasized the importance of learning node-specific patterns without relying on predefined graphs. To achieve this, they introduced two adaptive modules: the node adaptive parameter learning (NAPL) module, capturing node-specific patterns, and the data adaptive graph generation (DAGG) module, inferring interdependencies among traffic series automatically. These modules were integrated with recurrent networks to create the adaptive graph convolutional recurrent network (AGCRN), effectively capturing fine-grained spatial and temporal correlations in traffic data. However, these innovative methods predominantly focused on short-term traffic prediction, despite the increased complexity associated with long-term prediction. Long-term traffic prediction is particularly important due to its essential applications in traffic management and scheduled routing planning.
Consequently, research is scarce in this domain, primarily because predicting the distant future presents greater difficulties than short-term forecasting. Long-term traffic flow prediction is a less frequently explored research area, and achieving accurate long-term predictions is challenging because performance degrades over extended timeframes compared to short-term predictions. A previous study [25] employed a recurrent neural network (RNN) with GPU acceleration to forecast long-term traffic flow in Odense and Beijing; however, RNNs are susceptible to the vanishing gradient problem, which can impact their performance. In another study [26], a spatial-temporal graph attention network was introduced, designed to capture the data's dynamic graph structure and spatial-temporal dependencies; the model was tested using two public datasets gathered in California. Wang et al. [27] introduced a deep learning architecture comprising two main components: a bottom-up LSTM encoder-decoder structure and a top-down calibration layer. Li et al. [28] proposed a hybrid model for forecasting next-day traffic flow, incorporating wavelet decomposition, CNN, and LSTM techniques. In [29], a CNN and a BiLSTM are combined to predict long-term traffic flow; however, a CNN is unsuitable for capturing the complex structure of a traffic road network since it is based on Euclidean distance. Moreover, as these prediction techniques do not use separate models, errors can propagate quickly, and the models have difficulty handling sudden incidents. Accurately predicting traffic patterns beyond short time frames thus remains challenging due to the inherent complexity of error accumulation in existing models, which undermines long-term forecasting precision.
To solve these problems, we propose a stacked GCN that can handle sudden incidents and in which, because there is a GCN for every segment, the error does not propagate. Most models use an RNN or one of its variants to capture temporal features and a CNN or GCN to capture spatial features. However, using separate models has a drawback: it cannot capture the inherent interrelationship between temporal and spatial features. To overcome this, we use a stacked GCN in which the segmented module supplies the temporal context, helping the GCN capture spatial and temporal features simultaneously. In the proposed architecture, we design a segmented module that splits the input data to extract temporal features and then apply a GCN to every segment to produce day-long predictions. Thus, we use the stacked GCN to obtain the final prediction outcome per segment, and, because of the stacked architecture, the error from a previous outcome is not propagated into the next prediction. GCN is utilized in the proposed method since it improves on CNN by directly handling graphs and non-Euclidean distances and thus works better on road networks. Our contributions in this paper are briefly summarized below:

● We propose a stacked GCN model for predicting traffic flow over extended periods and apply segmentation to improve prediction performance without accumulating errors.

● We use two publicly available datasets to evaluate our model and perform whole-day prediction. We conduct a comparative analysis of our model against the baseline methods, and our model shows superiority in traffic forecasting.

II. Method

This section introduces the proposed stacked GCN model designed for long-term traffic flow prediction. Our architecture leverages GCN to extract intricate spatial relationships within the road network. The road network, represented as the graph G = (V, E), serves as the input to the GCN, encapsulating the topological structure of the road network.
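The graph encoding G = (V, E) described above can be sketched concretely. The four-road layout, adjacency entries, and feature length below are illustrative assumptions, not the paper's data.

```python
import numpy as np

# Toy road network with n = 4 roads. A is the n x n adjacency matrix
# (1 = roads connected, 0 = unrelated), and X is the n x f feature matrix
# holding each road's historical speed series (f past readings per road).
# Hypothetical road links: 0-1, 1-2, 1-3.
n, f = 4, 6
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1          # undirected road links

rng = np.random.default_rng(0)
X = rng.random((n, f))             # n roads x f historical speed readings
```

This pair (A, X) is exactly the input shape a GCN layer expects: the adjacency matrix tells the layer which neighbors to aggregate from, and the feature matrix supplies what is aggregated.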
Each road is treated as a node, as illustrated in Figure 1, and the edges denote connections between the roads.

Fig. 1. Real road structure transformed into a graph road network: (a) road map; (b) graph structure of the road map

Within the graph, individual roads are represented as nodes, with V being the set of road nodes V = {v1, v2, ..., vn}, n signifying the total number of nodes, and E representing the set of edges. The adjacency matrix A ∈ R^(n×n) characterizes road linkages, with entries being 0 for unrelated roads and 1 for connected ones. The feature matrix X ∈ R^(n×f) holds the historical traffic flow data, with f corresponding to the length of the historical series. Our primary objective is to predict traffic flow for the next T time steps, relying on historical data. The proposed stacked GCN model comprises two essential modules: i) a segmented module and ii) a graph convolutional network module. The GCN effectively captures the spatial characteristics of the traffic data, while the segmented module divides the historical data into S segments, enabling the model to learn temporal patterns. The primary goal of our model is to produce a more accurate forecast, so the divergence from the actual value should be minimized. As a result, our goal is to reduce the prediction error, which can be expressed as in (1):

min ||Yi − Ŷi|| (1)

where Yi represents the actual observed traffic flow and Ŷi signifies the predicted output. The modules are described in the following subsections.

A. Segmented Module

To capture the periodic information embedded within the historical data, we employ the segmented module, which transforms the full-length historical traffic data X into a collection of periodic segments denoted as S = {s1, s2, ..., sd}, where d represents the number of segments.
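The segmented module's transform can be sketched as an array reshape. The concrete numbers below (four days of hourly data, six segments of four hours) follow the paper's running example but are otherwise an assumed toy setup for one road.

```python
import numpy as np

# Segmented module sketch: four days of hourly data (4 x 24 readings for one
# road), cut into d = 6 segments of l = 4 hours each.
days, hours, d = 4, 24, 6
l = hours // d                                   # 4 hours per segment
data = np.arange(days * hours, dtype=float)      # stand-in hourly speed series

# reshape to (days, d, l): segment s of day k is data_by_segment[k, s]
data_by_segment = data.reshape(days, d, l)

# to predict segment s of day 5, collect that same segment from all 4 days
s = 2
history_for_s = data_by_segment[:, s, :]         # shape (4, 4)
```

Selecting the same segment index across days is what lets each per-segment GCN see only the recurring daily pattern relevant to its time window.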
Each of these segments encapsulates historical data from a distinct period, with si representing a sub-time series conveying information about a specific period; l signifies the length of each segment, and si is composed of the temporal features of the corresponding time interval. Figure 2 shows an illustrative example of this data segmentation process, where the previous four days' twenty-four-hour data is divided into six segments of four hours each; thus the value of d is six and the value of l is four hours. The fifth day's data is predicted using the previous four days' data segments.

Fig. 2. Segmentation mechanism of the input data in the proposed method

To predict a time stamp, we consider the same time segment from the historical data rather than the whole historical data. Typically, traffic behavior within a region exhibits a consistent pattern during the same periods across different days. As a result, historical daily patterns can be characterized as recurring weekly patterns within specific time windows. For instance, the traffic speed observed on a Wednesday at 8:00 am and 9:00 am will resemble the corresponding time slots on previous days. Consequently, the repetitive patterns in traffic data from preceding days within a specific time window can serve as a valuable reflection of the historical daily trends. Thus, we extract the temporal features from the historical data via the segmented module, and in the stacked GCN module we consider the traffic speed for that particular time segment.

B. Graph Convolutional Networks

The GCN model collects spatial features from the first-order neighborhood. As depicted in Figure 3, node A represents a central road, while nodes B and C signify the roads connected to this central road.
Spatial features are extracted by establishing the topological relationships between the central and neighboring roads. The GCN model generates a Fourier-domain filter using the adjacency matrix A and the feature matrix X. This filter, applied to the nodes of the graph, gathers spatial characteristics from the first-order neighborhood of each node. The GCN model is constructed by stacking multiple convolutional layers, allowing it to capture increasingly complex spatial relationships among the nodes, as in (2):

H(l+1) = σ( D̃^(−1/2) Ã D̃^(−1/2) H(l) W(l) ) (2)

where H(l) represents the node feature matrix at layer l, Ã = A + I is the adjacency matrix of the graph with self-connections added, D̃ is the degree matrix of Ã, W(l) denotes the learnable weight matrix at layer l, and σ represents a nonlinear activation function. The number of layers in the model determines the maximum distance over which node characteristics can propagate and interact within the graph structure. With a one-layer GCN, for instance, each node can only obtain information from its neighbors; each node's information-gathering operation runs simultaneously and independently. Stacking another layer on top repeats the information-gathering process over a wider neighborhood. However, GCN suffers from a vanishing gradient problem if more layers are added (specifically, more than four), limiting performance [30]. To avoid this problem, we use two layers in the GCN, which handles non-Euclidean road networks better than a CNN without suffering from the vanishing gradient problem. We utilize historical traffic data as input, segment the input data, and then apply a two-layer graph convolutional network (GCN) to every segment.

Fig. 3. Graph convolutional network (GCN)
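The propagation rule in Equation (2) can be sketched in NumPy. The graph, features, and weights below are illustrative, not the paper's trained model; ReLU is assumed for σ.

```python
import numpy as np

# One GCN layer per Equation (2):
# H(l+1) = sigma( D~^(-1/2) (A + I) D~^(-1/2) H(l) W(l) )
def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                        # add self-connections
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # D~^(-1/2)
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)  # ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)                    # 3-node toy road graph
H = rng.random((3, 4))                                    # node features
W1, W2 = rng.random((4, 8)), rng.random((8, 2))

# two stacked layers, matching the two-layer GCN used in the paper
out = gcn_layer(A, gcn_layer(A, H, W1), W2)
# out has one 2-dim representation per road
```

The symmetric normalization D̃^(−1/2) Ã D̃^(−1/2) keeps the aggregated neighbor features on a comparable scale regardless of how many roads connect to a node.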
C. The Proposed Stacked GCN Model

The architecture of our proposed model, as depicted in Figure 4, incorporates a segmented module responsible for preprocessing the input time series data X and converting it into periodic segments denoted as S. For day-long prediction, we segment the twenty-four hours into S segments and generate results for different numbers of segments.

Fig. 4. The proposed stacked GCN model

Table 1 demonstrates that an increase in the number of segments leads to reduced error: twenty-four segments give a lower error than the other settings (2, 3, 4, 6, 8, and 12). With twenty-four segments, each segment consists of one-hour timestamps. In the GCN model, the processed segments are utilized to generate the final predictions for the traffic speed data. As depicted in Figure 4, the raw historical data is first input into the system, and from there the segments are extracted for further processing. Figure 2 demonstrated the segmentation of the historical data for our model: a particular segment of the previous four days (say, 4:00 pm to 5:00 pm) is considered to predict the fifth day's 4:00 pm to 5:00 pm segment. Every day is divided into S segments. After that, in the stacked GCN model, a two-layer GCN is used to process each segment separately. The outputs of these modules are then merged to produce the final prediction sequence Y. The historical data in the segmented module helps the model inherit the temporal features, and the GCN helps capture the spatial features. The proposed method does not incorporate any other model to capture the temporal features separately, as using separate models cannot capture the inherent interrelationship between temporal and spatial features; by employing segmentation, the stacked GCN model can effectively capture both.
Table 1. Day-long (twenty-four-hour) prediction performance for different segment counts on the SZ-taxi dataset

Number of segments   RMSE     MAE     R2
24                   5.5843   4.4322  0.6995
12                   5.6228   4.2262  0.6940
8                    5.6179   4.2211  0.6930
6                    5.6421   4.2171  0.6876
4                    5.6465   4.2279  0.6955
3                    5.7155   4.3007  0.6897
2                    5.8001   4.3659  0.6808

III. Experimental Results

A. Dataset Description

In this section, we evaluate the predictive performance of our proposed model using two publicly available real-world datasets: the SZ-taxi dataset and the PeMSD7 dataset. These datasets have gained popularity in traffic forecasting research and have been employed for performance benchmarking in prior studies. Both datasets include the speed and connection data needed for the GCN.

SZ-taxi: The SZ-taxi dataset, covering taxi trajectories in Shenzhen from January 1 to January 31, 2015, is centered on the Luohu district's 156 highways. This dataset is structured into two essential components: a 156×156 adjacency matrix illustrating highway connections and a feature matrix capturing the time-varying traffic speeds for each road. Each row in the feature matrix corresponds to a unique route, while columns represent traffic speeds at fifteen-minute intervals. The dataset is split into two parts, allocating twenty days for training and ten days for testing, facilitating effective model development and evaluation.

PeMSD7: The PeMSD7 dataset provides traffic speed data collected from 228 sensors in California's District Seven during weekdays in May and June 2012. It includes two critical components: a 228×228 adjacency matrix representing sensor connections within the network and a feature matrix depicting the time-varying traffic speeds for each sensor. Each row in the feature matrix corresponds to an individual sensor, while columns represent five-minute intervals of traffic speed measurements.
The dataset is divided into a training set consisting of the first month's data, encompassing 6,336 timestamps, and a test set with an equal number of timestamps, enabling practical model training and evaluation for traffic flow prediction research.

Table 2 lists the learning parameters employed in our proposed model. We utilized the Adam optimizer during training to minimize the RMSE; the Adam optimizer dynamically adjusts the model's parameters in real time, enhancing its accuracy and computational efficiency. The L2 regularization technique is used to reduce model overfitting. Because of memory limitations, we used 1000 epochs, 64 hidden units, a batch size of 32, and a 0.001 learning rate. As Table 1 shows, twenty-four segments give better performance, so we used twenty-four segments for one-day prediction.

Table 2. Learning parameters

Parameter                 Value
Learning rate             0.001
Number of epochs          1000
Loss function             RMSE
Hidden units              64
Optimizer                 Adam
Regularization technique  L2 regularization

B. Evaluation Metrics

To assess prediction performance, we utilize three commonly used performance measurements: the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R2). In particular, the RMSE is an important metric for evaluating the effectiveness of the proposed model: its value indicates the average magnitude of the differences between actual and predicted data values, and a smaller RMSE suggests better predictions with reduced errors. The equation of RMSE is given in (3):

RMSE = sqrt( (1/n) Σ_{i=1..n} (yᵢ − ŷᵢ)² ) (3)

The absolute-value operation turns a negative number into a positive one.
Indeed, when calculating the MAE, the absolute difference between the actual value and the predicted value is always taken, ensuring that the result is positive regardless of whether the prediction overestimates or underestimates the actual value. The formula of MAE is given in (4):

MAE = (1/n) Σ_{i=1..n} |yᵢ − ŷᵢ| (4)

The coefficient of determination, often referred to as R-squared, quantifies the proportion of variation in the dependent variable that can be accounted for by the independent variable(s) in a regression model. It takes a value between 0 and 1, with higher values suggesting that the model fits the data more closely, as in (5):

R² = 1 − ( Σ_{i=1..n} (ŷᵢ − yᵢ)² ) / ( Σ_{i=1..n} (yᵢ − ȳ)² ) (5)

C. Compared Methods

We conducted a comparative analysis of our proposed model against several widely recognized models for traffic flow prediction. We selected four commonly employed approaches, encompassing both traditional time-series prediction methods and deep learning techniques. The first is the autoregressive integrated moving average (ARIMA). ARIMA is a conventional statistical method that captures temporal dependencies within data by employing autoregression, differencing, and moving average techniques; researchers have extensively used it for traffic flow estimation [31]. The second is support vector regression (SVR). SVR forecasts future traffic data by leveraging existing data to train the model and establish the relationship between input and output variables [32]; this model employs a linear kernel function. The next is k-nearest neighbors (KNN). KNN is a widely recognized supervised learning approach used for data classification based on the proximity of data points to their neighbors [15]; it retains all available instances and classifies new cases using a similarity score. The last is the graph convolutional network (GCN).
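The three metrics in Equations (3)–(5) can be written directly in NumPy; the sample values below are illustrative, not the paper's results.

```python
import numpy as np

# Equation (3): root mean square error
def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Equation (4): mean absolute error
def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

# Equation (5): coefficient of determination
def r2(y, y_hat):
    ss_res = np.sum((y_hat - y) ** 2)            # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares
    return 1 - ss_res / ss_tot

# hypothetical observed and predicted speeds for four time steps
y     = np.array([40.0, 42.0, 38.0, 45.0])
y_hat = np.array([41.0, 41.5, 39.0, 44.0])
```

RMSE penalizes large individual errors more heavily than MAE because of the squaring, which is why both are usually reported together.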
GCN is a semi-supervised deep learning method that captures the spatial characteristics of nodes within a graph. It operates effectively in non-Euclidean spaces, making it suitable for modeling road networks.

IV. Results and Discussion

Table 3 shows the performance of the four baseline approaches outlined above and our proposed model on two frequently used datasets. We first compute RMSE, MAE, and R^2 for a whole-day (twenty-four-hour) prediction. Table 3 reveals that our proposed approach surpasses the other four methods on both datasets in terms of RMSE, MAE, and R^2; lower error values imply higher accuracy, except for R^2, where higher values indicate superior performance. The errors are computed for predictions twenty-four hours ahead. On the SZ-taxi dataset, our proposed method demonstrates a 16.9% reduction in RMSE compared to ARIMA and a 9.17% reduction compared to GCN. On the PeMSD7 dataset, our proposed model achieves a substantial 60.4% reduction in RMSE compared to ARIMA, 55.5% compared to SVR, 45.7% compared to KNN, and 53% compared to GCN. Our model performs particularly well on the PeMSD7 dataset; this is attributed to its larger size, which allows our model to learn more effectively from historical data when predicting future traffic trends. In Table 3, * indicates negligible values, signifying poor prediction performance for the model in those cases.

Table 3. Prediction performance of the proposed model and the baseline models on the SZ-taxi and PeMSD7 datasets for one day (24 hours)

                    SZ-taxi                         PeMSD7
Model name          RMSE      MAE       R^2         RMSE      MAE      R^2
ARIMA               6.7963    4.6757    *           11.3038   9.1818   *
SVR                 6.56454   4.55313   0.6552      10.0653   4.9432   0.5032
KNN                 5.96454   4.25313   0.6752      8.2546    4.8241   0.5134
GCN                 6.2163    4.6581    0.6451      9.5362    6.8600   0.5162
Proposed model      5.6463    4.2197    0.6871      4.4808    3.2734   0.8105

The poor results of the baseline methods stem from the difficulty ARIMA, KNN, and SVR have with complex, irregular time-series data, which is why they perform poorly on long datasets such as PeMSD7. Although a GCN is used within our model, a plain GCN's predictive performance is subpar: it focuses primarily on spatial characteristics and neglects the temporal nature inherent in traffic data, which is fundamentally time-series data. Our proposed model addresses this limitation by segmenting the data, enhancing the GCN's ability to handle time series, and consequently exhibits superior day-long traffic flow speed prediction. ARIMA, a well-established traffic forecasting method, also suffers reduced prediction accuracy when confronted with extended and irregular data patterns: it computes predictions by calculating and averaging errors across individual nodes, so any anomalies in the data inflate the final total error. In our proposed long-term prediction, by contrast, error does not propagate, yielding better results than the others. Figure 5 visualizes predicted and actual traffic flow for an entire day on one road of the SZ-taxi dataset; the yellow line indicates actual traffic flow, and the blue dotted line indicates predicted traffic flow. The model captures the daily traffic flow trends. Applying a GCN to each segmented dataset allows temporal and spatial characteristics to be captured throughout the day.
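The percentage reductions quoted above follow directly from the RMSE values in Table 3. A minimal check, using the table's own numbers:

```python
def rmse_reduction(baseline, proposed):
    # Percentage reduction in RMSE of the proposed model relative to a baseline.
    return (1 - proposed / baseline) * 100

# RMSE values taken from Table 3.
print(round(rmse_reduction(6.7963, 5.6463), 1))   # SZ-taxi, vs ARIMA -> 16.9
print(round(rmse_reduction(11.3038, 4.4808), 1))  # PeMSD7, vs ARIMA -> 60.4
print(round(rmse_reduction(9.5362, 4.4808), 1))   # PeMSD7, vs GCN -> 53.0
```

These reproduce the 16.9%, 60.4%, and 53% figures reported in the discussion.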
Fig. 5. Visualization results for a prediction horizon of twenty-four hours on the SZ-taxi dataset

Our model has shortcomings: it does not account for external variables such as weather conditions, accidents, or holidays, which limits how accurately it can capture traffic flow dynamics. Our plans involve integrating attention mechanisms to detect abrupt incidents and adopting a dynamic adjacency matrix instead of a static one to enrich the information supplied to the GCN. In addition, we aim to integrate weather conditions and holiday data into our analysis alongside speed data.

V. Conclusion

In this paper, we introduced the stacked GCN, a deep learning methodology aimed at tackling the complexities of long-term traffic flow prediction. Accurate long-term prediction is essential for traffic management and sustainable urban planning, particularly as urbanization and population growth exacerbate traffic congestion. The proposed stacked GCN model overcomes traditional error-accumulation issues by employing a segmented module for temporal feature extraction and leveraging the capabilities of graph convolutional networks; incorporating historical data in the segmentation helps our model learn historical patterns. In a comparison against the ARIMA, SVR, KNN, and GCN models on two real-world traffic datasets, the stacked GCN outperforms the others and yields the most accurate prediction results, reducing error by 40% to 60% relative to the compared methods. It produces accurate day-long traffic forecasts, providing travelers with information for preemptive route planning. Moreover, unlike other long-term prediction models, our model does not rely on a hybrid architecture, ensuring faster results.
In the future, our strategy includes integrating attention mechanisms to detect unexpected events and employing a dynamic adjacency matrix instead of a fixed one to enhance the information available to the GCN. We also aim to integrate weather conditions and holiday data into our analysis alongside speed data.

Declarations

Author contribution: All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest: The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information: Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.

Publisher's note: Department of Electrical Engineering and Informatics, Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] m. m. rahman and n. nower, “attention based deep hybrid networks for traffic flow prediction using google maps data,” in proceedings of the 2023 8th international conference on machine learning technologies, mar. 2023, pp. 74–81. [2] m. m. rahman, a. r. m. jamil, and n. nower, “uncertainty-aware traffic prediction using attention-based deep hybrid network with bayesian inference,” int. j. adv. comput. sci. appl., vol. 14, no. 6, 2023. [3] d. rukmana, “rapid urbanization and the need for sustainable transportation policies in jakarta,” iop conf. ser. earth environ. sci., vol. 124, p. 012017, mar. 2018. [4] a. a. haider, “traffic jam: the ugly side of dhaka’s development,” dly. star, vol. 13, 2018. [5] m. sweet, “does traffic congestion slow the economy?,” j. plan. lit., vol.
26, no. 4, pp. 391–404, nov. 2011. [6] t. peng, x. yang, z. xu, and y. liang, “constructing an environmental friendly low-carbon-emission intelligent transportation system based on big data and machine learning methods,” sustainability, vol. 12, no. 19, p. 8118, oct. 2020. [7] t. alghamdi, k. elgazzar, m. bayoumi, t. sharaf, and s. shah, “forecasting traffic congestion using arima modeling,” in 2019 15th international wireless communications & mobile computing conference (iwcmc), jun. 2019, pp. 1227–1232. [8] c. p. i. j. van hinsbergen, t. schreiter, f. s. zuurbier, j. w. c. van lint, and h. j. van zuylen, “localized extended kalman filter for scalable real-time traffic state estimation,” ieee trans. intell. transp. syst., vol. 13, no. 1, pp. 385–394, mar. 2012. [9] j. guo, w. huang, and b. m. williams, “adaptive kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification,” transp. res. part c emerg. technol., vol. 43, pp. 50–64, jun. 2014. [10] y. liu and h. wu, “prediction of road traffic congestion based on random forest,” in 2017 10th international symposium on computational intelligence and design (iscid), dec. 2017, pp. 361–364. [11] x. feng, x. ling, h. zheng, z. chen, and y. xu, “adaptive multi-kernel svm with spatial–temporal correlation for short-term traffic flow prediction,” ieee trans. intell. transp. syst., vol. 20, no. 6, pp. 2001–2013, jun. 2019. [12] z. mingheng, z. yaobao, h. ganglong, and c. gang, “accurate multisteps traffic flow prediction based on svm,” math. probl. eng., vol. 2013, pp. 1–8, 2013. [13] b. sharma, v. kumar katiyar, and a. kumar gupta, “fuzzy logic model for the prediction of traffic volume in week days,” int. j. comput. appl., vol. 107, no. 17, pp. 1–6, 2014. [14] y. gu, w. lu, x. xu, l. qin, z. shao, and h. zhang, “an improved bayesian combination model for short -term traffic prediction with deep learning,” ieee trans. intell. transp. syst., vol. 21, no. 3, pp. 1332–1342, mar. 
2020. [15] l. zhang, q. liu, w. yang, n. wei, and d. dong, “an improved k-nearest neighbor model for short-term traffic flow prediction,” procedia soc. behav. sci., vol. 96, pp. 653–662, nov. 2013. [16] d. xu, y. wang, p. peng, s. beilun, z. deng, and h. guo, “real-time road traffic state prediction based on kernelknn,” transp. a transp. sci., vol. 16, no. 1, pp. 104–118, dec. 2020. [17] k. kumar, m. parida, and v. k. katiyar, “short term traffic flow prediction for a non urban highway using artificial neural network,” procedia soc. behav. sci., vol. 104, pp. 755–764, dec. 2013. [18] a. koesdwiady, r. soua, and f. karray, “improving traffic flow prediction with weather information in connected cars: a deep learning approach,” ieee trans. veh. technol., vol. 65, no. 12, pp. 9508–9517, dec. 2016. [19] y. wu and h. tan, “short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework,” pp. 1–14, 2016. [20] z. duan, y. yang, k. zhang, y. ni, and s. bajgain, “improved deep hybrid networks for urban traffic flow prediction using trajectory data,” ieee access, vol. 6, pp. 31820–31827, 2018. [21] t. n. kipf and m. welling, “semi-supervised classification with graph convolutional networks,” 5th int. conf. learn. represent. iclr 2017 conf. track proc., pp. 1–14, 2017. [22] z. chen, b. zhao, y. wang, z. duan, and x. zhao, “multitask learning and gcn-based taxi demand prediction for a traffic road network,” sensors, vol. 20, no. 13, p. 3776, jul. 2020. [23] k. guo, y. hu, y. sun, s. qian, j. gao, and b. yin, “hierarchical graph convolution network for traffic forecasting,” proc. aaai conf. artif. intell., vol. 35, no. 1, pp. 151–159, may 2021. [24] y. xu, y. lu, c. ji, and q. zhang, “adaptive graph fusion convolutional recurrent network for traffic forecasting,” ieee internet things j., no. neurips, pp. 1–12, 2023. [25] a. belhadi, y. djenouri, d. djenouri, and j. c.-w. 
lin, “a recurrent neural network for urban long-term traffic flow forecasting,” appl. intell., vol. 50, no. 10, pp. 3252–3265, oct. 2020. [26] x. kong, j. zhang, x. wei, w. xing, and w. lu, “adaptive spatial-temporal graph attention networks for traffic flow forecasting,” appl. intell., vol. 52, no. 4, pp. 4300–4316, mar. 2022. [27] z. wang, x. su, and z. ding, “long-term traffic prediction based on lstm encoder-decoder architecture,” ieee trans. intell. transp. syst., vol. 22, no. 10, pp. 6561–6571, oct. 2021. [28] y. li, s. chai, z. ma, and g. wang, “a hybrid deep learning framework for long-term traffic flow prediction,” ieee access, vol. 9, pp. 11264–11271, 2021. [29] m. méndez, m. g. merayo, and m. núñez, “long-term traffic flow forecasting using a hybrid cnn-bilstm model,” eng. appl. artif. intell., vol. 121, p. 106041, may 2023. [30] g. li, m. muller, a. thabet, and b. ghanem, “deepgcns: can gcns go as deep as cnns?,” in 2019 ieee/cvf international conference on computer vision (iccv), oct. 2019, pp. 9266–9275, doi: 10.1109/iccv.2019.00936. [31] x. lin and y. huang, “short-term high-speed traffic flow prediction based on arima-garch-m model,” wirel. pers. commun., vol. 117, no. 4, pp. 3421–3430, apr. 2021. [32] g. lin, a. lin, and d.
gu, “using support vector regression and k-nearest neighbors for short-term traffic flow prediction based on maximal information coefficient,” inf. sci. (ny), vol. 608, pp. 517–531, aug. 2022.

Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637
Vol 2, No 1, June 2019, pp. 41–46, https://doi.org/10.17977/um018v2i12019p41-46
©2019 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Adam Optimization Algorithm for Wide and Deep Neural Network

Imran Khan Mohd Jais, Amelia Ritahani Ismail*, Syed Qamrun Nisa
Department of Computer Science, Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, P.O. Box 10, 50728 Kuala Lumpur, Malaysia
amelia@iium.edu.my*
* corresponding author

I. Introduction

Neural networks are often used to solve classification and recommendation problems. This research classifies whether a tumor is malignant or benign; its main objective, however, is to examine the effects of Adam on the performance of the wide and deep network. A challenge in working with conventional neural networks is to achieve both memorization and generalization. According to [1]: “Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data.
Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past.” It has also been found that neural networks with a high number of features tend to overgeneralize and give irrelevant outputs [1], creating a high risk of false predictions or misleading results. The wide and deep network combines the benefits of memorization and generalization by adding a linear part to the neural network's architecture: it jointly trains a wide linear model and a deep neural network, and is highly useful for large-scale regression and classification problems.

In 2018, [2] investigated a convolutional neural network (CNN) based computer-aided diagnosis (CAD) framework for breast cancer classification. In general, deep learning may require extensive datasets to train a system, whereas transfer learning needs only small datasets of medical images; transfer learning thus optimizes the training of the CNNs. As a result, the CNN achieved the finest outcome, with 98.94% accuracy. In 2017, [3] proposed CNNs to classify hematoxylin and eosin stained breast biopsy images. The designed network architecture retrieved information at different scales, such as nuclei and overall tissue organization, and the design extends the proposed system to whole-slide histology images. Furthermore, the CNN-extracted features are also used to train an SVM-based classification engine.

Article Info — Article history: received 4 March 2019; revised 6 April 2019; accepted 19 May 2019; published online 23 June 2019.

Abstract — The objective of this research is to evaluate the effects of Adam when used together with a wide and deep neural network. The dataset used was a diagnostic breast cancer dataset taken from UCI Machine Learning. The dataset was first fed into a conventional neural network for a benchmark test.
Afterwards, the dataset was fed into the wide and deep neural network, with and without Adam. Improvements were found in the results of the wide and deep network with Adam; in conclusion, Adam is able to improve the performance of a wide and deep neural network. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: wide and deep network; neural network; Adam algorithm; breast cancer dataset

I.K.M. Jais et al. / Knowledge Engineering and Data Science 2019, 2(1): 41–46

The use of CAD systems increases diagnosis efficiency as well as the level of inter-observer agreement. On the other hand, [1] investigated how wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. Online experiment results show that the Wide & Deep model significantly increased app acquisitions compared with wide-only and deep-only models. Furthermore, [4] analysed the theoretical convergence properties of the Adam algorithm and delivered a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Stochastic gradient descent is an efficient and effective optimization approach that has been central to many machine learning success stories, one example being recent advances in deep learning. In 2016, [5] proposed a wide and deep neural network with strong induction ability to model the transformation, together with an efficient training strategy. This promising approach has potential application in image-based ophthalmologic disease diagnosis and may provide a fresh, general, high-performance computing framework for image segmentation.
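As context for the Adam algorithm [4]: each update combines bias-corrected exponential moving averages of the gradient and its square. The sketch below applies the update rule to a toy one-dimensional quadratic; the hyperparameter values are common defaults and the objective is purely illustrative, not part of this study's experiments.

```python
import math

def adam_minimize(grad, theta, steps=2000, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps moving averages of the gradient (m) and its square (v),
    # with bias correction to compensate for their zero initialization.
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2 with gradient 2*(theta - 3);
# the minimizer is theta = 3.
result = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(result)
```

The per-step size is bounded by the learning rate regardless of gradient magnitude, which is one reason Adam behaves robustly on poorly scaled features.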
Moreover, [6] introduced ReQuIK, a multi-perspective query suggestion system for children. The system drives the suggestion process by applying a wide and deep neural network ranking strategy that considers both raw text and traits generally associated with kid-related queries. By applying a multi-perspective approach based on deep learning, the proposed query suggestion module is able to learn distinctive characteristics that distinguish adults' and children's queries. The application of deep learning has in recent years led to a dramatic boost in performance in many areas such as computer vision, speech recognition, and natural language processing [7]. Despite this huge empirical success, the theoretical understanding of deep learning is still limited. The non-convex optimization problem of training a feedforward neural network turns out to be very difficult, as there can be exponentially many distinct local minima [8]; it has been shown that even training a network with a single neuron is NP-hard for a variety of activation functions.

II. Method

A. Data Collection

The dataset was taken from UCI Machine Learning and is titled "Breast Cancer Wisconsin" [9]. The dataset consists of 3 categories: mean, standard error, and worst. Each category contains 10 features: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension, for 30 features in total.

B. Data Preprocessing

Figure 1 shows the distribution of each feature in the original dataset. Because of the uneven distribution of the data, we use a normalization technique called the z-score to transform the distribution into a more uniform one. Figure 2 shows the distribution of the data after normalization. Comparing Figure 1 and Figure 2, there is a drastic change in the distribution of the data after normalization.
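The z-score transform applied here standardizes a feature to zero mean and unit variance. A minimal sketch (plain Python, hypothetical sample values, using the population standard deviation):

```python
import statistics

def z_score(values):
    # Standardize one feature column: (x - mean) / standard deviation.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / std for x in values]

# Hypothetical raw values for a single feature (e.g. a radius-like measurement).
raw = [10.0, 12.0, 14.0, 16.0]
print(z_score(raw))  # symmetric around 0, roughly [-1.34, -0.45, 0.45, 1.34]
```

After this transform every feature is on the same scale, which is what makes the distributions in Figure 2 look far more uniform than those in Figure 1.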
The distributions among features in each category are more even and have low variance, which helps the machine learning algorithm learn better.

C. Feature Correlation

This research was done in Python using the TensorFlow library, which was developed by Google [10]. Figure 3 shows a heatmap of the correlation strength between features; as the bar on the right indicates, a lighter color represents a stronger correlation. The figure shows a patch of strongly correlated features on the bottom left: radius_mean, perimeter_mean, and area_mean are highly correlated with area_worst, perimeter_worst, and radius_worst. Furthermore, the top left shows that radius_mean, perimeter_mean, and area_mean have strong correlations among themselves. A few weakly correlated features appear across the heatmap, but since each also has strong correlations with other features, we decided not to prune any feature.

D. Machine Learning

The wide and deep neural network requires the user to define which features are base features, crossed features, and deep features; this determines which features go into the wide and the deep parts of the network. In this case, since the features are already grouped into 3 parts, the process is simplified: the mean group forms the base features, the standard error group the crossed features, and the worst group the deep features. The other parameters that had to be defined are shown in Table 1. We also record the time taken for the model to complete training. Next, the model is run for the benchmark test, in which all features are fed into the deep part of the network with the same parameters as above.
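The grouping described above can be expressed as a simple mapping. The sketch below mirrors the paper's choice (mean → base, standard error → crossed, worst → deep); the column-name suffixes (`_mean`, `_se`, `_worst`) are an assumption based on the common naming of this dataset's columns, not taken from the paper.

```python
FEATURES = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave_points", "symmetry",
            "fractal_dimension"]

# Route each dataset category into the wide or deep part of the network.
feature_groups = {
    "base":    [f + "_mean" for f in FEATURES],    # wide part
    "crossed": [f + "_se" for f in FEATURES],      # wide part (cross-products)
    "deep":    [f + "_worst" for f in FEATURES],   # deep part
}
for name, cols in feature_groups.items():
    print(name, len(cols))  # each group holds 10 of the 30 features
```

Keeping the grouping explicit like this makes it easy to move a feature between the wide and deep parts when experimenting.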
Afterwards, the model was run using the wide and deep network, first without and then with Adam optimization.

Fig. 1. Distribution of ten features in the dataset

Fig. 2. The data distribution after normalization

III. Results and Discussion

Table 2 shows the results obtained after training completed. The first thing to notice is that the wide and deep network with Adam optimization obtained the highest accuracy. However, we cannot automatically assume it is the best performer, because a high accuracy could be due to overfitting; we therefore need to consider the next two metrics.

Fig. 3. Correlation strength between features

Table 1. Parameter detail

Parameter                       Description
Training set                    455 rows
Test set                        114 rows
Number of examples per batch    30
Number of training epochs       50
Number of hidden layers         6
Number of neurons per layer     100, 75, 50, 25, 10, 5

AUC refers to the area under the receiver operating characteristic (ROC) curve, while the next metric refers to the area under the precision-recall curve. The ROC curve plots the true positive rate against the false positive rate on the test set; the precision-recall graph, as its name suggests, plots precision against recall. Precision is the number of true positives over the total of true positives and false positives, while recall is the number of true positives over the total of true positives and false negatives. For both metrics, the area under the curve needs to be close to 1 to show that the result is good and not due to overfitting. For the wide and deep network with Adam optimization both values are close to 1, so we can be assured that the result obtained was not due to any error or anomaly.
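The precision and recall definitions above translate directly into code. This is a generic sketch over hypothetical binary labels (1 = malignant, 0 = benign is an assumed coding), not the paper's evaluation pipeline:

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives, and false negatives
    # over paired binary labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels      = [1, 0, 1, 1, 0, 1]
predictions = [1, 0, 1, 0, 1, 1]
print(precision_recall(labels, predictions))  # -> (0.75, 0.75)
```

Sweeping a decision threshold and recomputing these values at each point is what produces the precision-recall curve whose area the table reports.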
The average loss and loss metrics refer to the loss function, which is part of the learning process in a neural network. Both the plain neural network and the wide and deep network with Adam optimization resulted in a low average loss and loss, which is a good sign. The prediction/mean shows a value close to the label/mean of 0.333, which is also a sign of a good model, as it means the prediction rate is close to the truth. Next, the wide and deep network with Adam optimization finished training faster than the version without optimization. The conventional neural network showed the shortest time, but this could be because its architecture is simpler than that of the wide and deep network.

IV. Conclusion

In conclusion, it was expected that the wide and deep network with Adam optimization would perform best. However, the wide and deep network without Adam optimization trails not far behind; we may need to scale the model up, for example by feeding it a bigger dataset, to see a significant difference between their performances. Nonetheless, it was shown that Adam optimization is able to improve the performance of the wide and deep neural network.

Acknowledgment

This research was supported by the Research Initiative Grants Scheme (RIGS): RIGS16-346-0510.

References

[1] h. cheng et al., “wide & deep learning for recommender systems,” in proceedings of the 1st workshop on deep learning for recommender systems, 2016, pp. 7–10. [2] h. chougrad and h. zouaki, “deep convolutional neural networks for breast cancer screening,” comput. methods programs biomed., vol. 157, pp. 19–30, 2018. [3] t. araújo et al., “classification of breast cancer histology images using convolutional neural networks,” plos one, vol. 12, no. 6, 2017. [4] d. p. kingma and j. l.
ba, “adam: a method for stochastic optimization,” in proceedings of the 3rd international conference on learning representations (iclr 2015), 2015, pp. 1–15. [5] q. li, b. feng, l. xie, p. liang, h. zhang, and t. wang, “a cross-modality learning approach for vessel segmentation in retinal images,” ieee trans. med. imaging, vol. 35, no. 1, pp. 109–118.

Table 2. Training results

Parameter             Neural network   Wide and deep NN      Wide and deep NN
                                       without Adam          with Adam
Accuracy              0.965            0.947                 0.991
AUC                   0.993            0.995                 0.996
AUC precision-recall  0.989            0.990                 0.993
Average loss          0.123            0.165                 0.137
Loss                  3.704            4.705                 3.896
Prediction/mean       0.301            0.292                 0.431
Time taken            29.959           55.627                51.647

[6] i. m. azpiazu, n. dragovic, o. anuyah, and m. s. pera, “looking for the movie seven or sven from the movie frozen? a multi-perspective strategy for recommending queries for children,” in proceedings of the 2018 conference on human information interaction & retrieval, 2018, pp. 92–101. [7] w. liu, z. wang, x. liu, n. zeng, y. liu, and f. e. alsaadi, “a survey of deep neural network architectures and their applications,” neurocomputing, vol. 234, pp. 11–26, 2017. [8] i. safran and o. shamir, “on the quality of the initial basin in overspecified neural networks,” in international conference on machine learning, 2016, pp. 774–782. [9] w. h. wolberg, w. n. street, and o. l. mangasarian, “breast cancer wisconsin (diagnostic) data set,” 1995. [online]. available: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic). [10] google, “tensorflow,” 2018. [online]. available: https://github.com/tensorflow.

Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol 6, No 1, April 2023, pp.
79–91, eISSN 2597-4637, https://doi.org/10.17977/um018v6i12023p79-91
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Round-Robin Algorithm in Load Balancing for National Data Centers

I Kadek Wahyu Sudiatmika 1,*, Gede Indrawan 2, Sariyasa 3
Universitas Pendidikan Ganesha, Jl. Udayana No. 11, Buleleng 81116, Indonesia
1 wahyu.sudiatmika@undiksha.ac.id*; 2 gindrawan@undiksha.ac.id; 3 sariyasa@undiksha.ac.id
* corresponding author

I. Introduction

The Bali provincial government currently operates many public service applications integral to the lives of its residents, local villages, and regional apparatus. Prominent among these systems are the traditional village financial management system (in Indonesian, Sistem Informasi Keuangan Desa Adat, or SIKUAT), the civil service system (Sistem Manajemen Kepegawaian, or SIMPEG), the virtual office (e-office), and the electronic procurement system (Sistem Pengadaan Secara Elektronik, or SPSE). With these systems operating on single, on-premise servers, the challenge of resource limitation becomes increasingly apparent: a single server has finite CPU, RAM, storage, and bandwidth, and overloading it with multiple systems can lead to performance degradation or crashes. Furthermore, the security of all the systems is jeopardized if one system on the server is compromised. Robust server management techniques, such as load balancing and virtualization, become crucial to alleviate these issues. Load balancing, by definition, is the distribution of a workload across multiple servers, ensuring that no single server bears an overwhelming load [1]. This process optimizes and stabilizes system performance, ensuring maximum uptime and consistent service delivery.
among the strategies employed for load-balancing, the round robin algorithm stands out. this algorithm systematically assigns incoming server requests to the next server in line, ensuring an equitable article info a b s t r a c t article history: received 21 july 2023 revised 21 august 2023 accepted 18 september 2023 published online 22 september 2023 the provincial government of bali assumes a crucial role in administering various public service applications to meet the requirements of its community, traditional villages, and regional apparatus. nevertheless, the escalating magnitude of traffic and uneven distribution of requests have resulted in substantial server burdens, which may jeopardize the operation of applications and heighten the likelihood of downtime. ensuring efficient load distribution is of utmost importance in tackling these difficulties, and the round robin algorithm is often utilized for this purpose. however, the current body of research has not extensively examined the distinct circumstances surrounding on-premise servers in the bali provincial government. the primary objective of this study is to address the significant gap in knowledge by conducting a comprehensive evaluation of the round robin algorithm's effectiveness in load-balancing on-premise servers inside the bali provincial government. the primary objective of our study is to assess the appropriateness of the algorithm within the given context, with the ultimate goal of providing practical and implementable suggestions. the observations above can optimize system efficiency and minimize periods of inactivity, thereby enhancing the provision of vital public services across bali. this study provides essential insights for enhancing server infrastructure and load-balancing strategies through empirical evaluation and comprehensive analysis. 
its findings are valuable for the bali provincial government and serve as a reference for other organizations facing challenges managing server loads. this study signifies a notable advancement in establishing reliable and practical public service applications within bali. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/). keywords: load-balancing round-robin algorithm server on-premise performance public service application http://u.lipi.go.id/1502081730 http://u.lipi.go.id/1502081046 http://journal2.um.ac.id/index.php/keds mailto:keds.journal@um.ac.id https://creativecommons.org/licenses/by-sa/4.0/ https://creativecommons.org/licenses/by-sa/4.0/ sudiatmika et al. / knowledge engineering and data science 2023, 6 (1): 79–91 80 distribution [2]. however, a significant gap exists: no existing research evaluating the performance of the round robin algorithm specifically within the bali provincial government's on-premise server context exists. this work undertakes a novel and groundbreaking investigation to address a significant gap in load balancing. the primary aim of this study is to examine the efficacy of the round robin algorithm within the specific context of on-premise servers used by the province government of bali. this particular domain has been noticeably underrepresented in previous research efforts. our research aims to offer the bali provincial government carefully crafted recommendations based on rigorous information and specifically customized to their distinct server environment. this study gives particular attention to assessing and examining the round-robin methodology. this study aims to precisely construct a framework that maximizes the operational efficiency of the national data centers managed by the bali provincial government. 
this work distinguishes itself through its innovative approach, exploring hitherto unexplored domains to tackle the urgent requirement for server optimization within a specific and intricate real-world context. through a thorough examination of the round robin algorithm's appropriateness for this unique context, our objective is to offer fresh perspectives and remedies that can be used not only by the bali provincial government but also as a point of reference for comparable entities grappling with similar obstacles in managing their server infrastructure. this research has the potential to substantially impact the domain of load balancing and server optimization, facilitating the development of more efficient and robust server environments in the coming years.

ii. method

the research design commences with the collection of data, followed by the formulation of test cases, the execution of those test cases, and the analysis of the obtained results [3][4][5][6]. the initial phase entails identifying the system environment and infrastructure to be tested, collecting pertinent information about the application under evaluation, and establishing the test's objectives and requirements. the second phase involves formulating test cases for every test scenario, with the objective of covering all crucial facets of the application and system environment. the third phase entails executing the test cases according to the specified scenarios, documenting the outcomes, and verifying their alignment with the anticipated results. the concluding phase involves examining the test outcomes and comparing them with the objectives and requirements of the test; any issues or flaws in the application or system environment are identified, followed by the implementation of the requisite enhancements or optimizations.
in general, the study design is implemented to ensure the systematic and rigorous execution of tests, generating dependable outcomes that can be used to advance system development. the processes outlined in figure 1 provide a more comprehensive understanding. from figure 1, in more detail, the research steps are described as follows. first is the data collection stage, one of the initial stages in conducting research. data collection gathers information and data relevant to the research problem to be solved; in this stage, the data collection methods are a literature study and interviews. second is the preparation of test cases. the test case preparation stage is essential in load-balancing research using the round robin algorithm on an on-premise server in the bali provincial government; it aims to create a series of test cases used to test the performance of the round robin algorithm under various conditions. the next step is test case execution: at this stage, the researcher tests the test cases prepared previously in the test case preparation stage. from the results of the test cases, information is expected about the performance and suitability of the round robin algorithm in an on-premise server environment at the bali provincial government, so that recommendations can be given to the bali provincial government regarding the most suitable load-balancing algorithm to use.

fig. 1. research steps

in the last stage, the results of the test cases carried out before are analyzed in depth to determine the performance of the round robin algorithm in load-balancing on the bali provincial government's on-premise server.
this analysis will include an evaluation of load testing, failover testing, robustness testing, and security testing. based on the results of this analysis, the researcher will conclude the advantages and disadvantages of the round robin algorithm in load-balancing on the bali provincial government's on-premise server and provide recommendations regarding the most suitable load-balancing algorithm to use in an on-premise server environment.

iii. result and discussion

this study aims to analyze the performance of the round-robin algorithm in a load balancer. in this study, we tested the performance of the round-robin algorithm in selecting the destination server for each incoming request. the test cases use standard testing from grafana labs k6 [7][8][9][10][11][12][13]. based on the grafana labs documentation, k6 is an open-source load testing tool that simplifies and increases performance testing productivity for cloud technicians and engineers. the following tests are carried out based on the grafana labs k6 standard.

the assessment of a load balancer's performance under conditions that roughly resemble its regular workday load is a crucial benchmark, also referred to as average-load testing. this testing method offers significant insights into the load balancer's ability to continuously achieve its performance targets during regular operations [14][15][16][17][18]. figure 2 visually depicts the outcomes derived from the average-load testing, illustrating our findings. the presented testing scenario portrays an environment that exhibits a typical workload, closely resembling the load balancer's actual use during regular weekdays. moreover, it illustrates a moderate labor duration, providing insight into the time required to handle and allocate incoming requests effectively.

fig. 2.
average-load test

the insights obtained from average-load testing provide a practical understanding of the load balancer's capacity to manage the routine demands it faces. by simulating common use patterns, researchers can gain a more comprehensive knowledge of the load balancer's performance within a context that closely aligns with its practical reality. this understanding is crucial in guaranteeing that the load balancer can continuously and effectively fulfill the requirements of the systems it assists throughout regular operations, improving the system's overall stability and user satisfaction.

the stress testing process, as seen in figure 3, is a crucial stage in assessing the resilience and performance of a load balancer under extreme loads that exceed standard usage patterns. this testing methodology comprehensively evaluates the load balancer's capacity to uphold system stability and consistent reliability under intense stress levels [19][20][21][22][23][24][25][26][27].

fig. 3. stress test

figure 3 provides a visual representation of the stress testing results, effectively illustrating the responsiveness of the load balancer under extreme conditions. this scenario deliberately imposes excessive demands on the system, replicating instances of high usage or unanticipated surges in traffic to identify vulnerabilities, bottlenecks, or possible failure points. by subjecting the load balancer to these heightened conditions, researchers can obtain vital insights regarding its resilience and ability to manage unexpected increases in user activity, ensuring the continuous provision of services. the stress testing process is of utmost importance in characterizing the load balancer's performance, ensuring its ability to withstand and remain robust in highly demanding use scenarios. these insights are crucial for enterprises aiming to uphold high
availability and ensure smooth user experiences, particularly in times of increased demand or unforeseen swings in traffic.

the process of breakpoint testing, as illustrated in figure 4, is a critical undertaking aimed at precisely identifying the underlying constraints of a system. the justification for breakpoint testing involves a range of compelling factors, all of which contribute considerably to the overall durability and strength of the system [28][29][30][31][32][33][34][35][36].

fig. 4. breakpoint test

primarily, breakpoint testing plays a crucial role in proactive planning. organizations can obtain valuable insights into the operating boundaries of the system by intentionally subjecting the load balancer to progressively larger loads until it approaches its breakpoint. this information is the basis for developing thorough remediation techniques for load balancer failures or catastrophic system overloads. with this understanding, companies can establish predetermined measures for mitigating risks, resulting in reduced downtime, limited service disruptions, and the assurance of a prompt and efficient reaction to obstacles. moreover, breakpoint testing is of utmost importance in protocol creation: it enables businesses to optimize response protocols by refining the methods and procedures necessary to address prospective challenges. this proactive strategy is crucial in identifying and preventatively resolving vulnerabilities, enhancing the overall dependability and stability of the system. it is essential to acknowledge that breakpoint testing is a methodical and regulated procedure: the demand is gradually augmented until the load balancer nears its breakpoint, at which juncture the test is manually terminated to mitigate any potential server harm.
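the controlled ramp-up just described can be sketched as a simple loop. this is a simulation sketch, not the study's tooling; the failure model, step size, and error budget are invented for illustration:

```python
def find_breakpoint(capacity, step=100, error_budget=0.05):
    """Raise the simulated load step by step and stop (the manual
    termination in a real breakpoint test) once the observed error
    rate exceeds the budget."""
    load = 0
    while True:
        load += step
        # toy failure model: requests beyond server capacity fail
        error_rate = max(0, load - capacity) / load
        if error_rate > error_budget:
            return load  # first load level over the error budget

bp = find_breakpoint(capacity=1000)
```

with a capacity of 1000 requests, the loop stops at the 1100-request step, where the toy error rate (about 9%) first exceeds the 5% budget.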
this cautious strategy guarantees the system's reliability while allowing enterprises to collect vital data about system performance and constraints. breakpoint testing is fundamentally a strategic endeavor that enhances system resilience and optimizes performance. it enables enterprises to effectively manage the intricacies of load balancing, giving them assurance in their ability to address obstacles proactively, mitigate interruptions, and provide uninterrupted service quality to their consumers.

before conducting the tests, a scenario is created for each test case. the test scenarios are average-load testing, stress testing, and breakpoint testing. in the context of load testing, our objective is to accurately simulate the dynamic patterns of user interactions with the load balancer through a meticulously constructed average-load testing scenario, conducted with a high degree of control and methodical precision. the initial stage, which involves the progressive addition of users one by one over 5 minutes, closely resembles the natural accumulation of user engagement during typical usage. this phase enables a detailed observation of the load balancer's response to incremental requests, allowing an assessment of its capacity to effectively distribute resources and sustain minimal delay as the number of users progressively increases. it also provides valuable information regarding the load balancer's handling of the initial surge of connections, a critical factor in guaranteeing a smooth user experience during times of increased demand. the succeeding step involves the simultaneous engagement of 100 users with the load balancer for 10 minutes, which acts as a critically significant stress test.
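the two stages just described (a 5-minute ramp to 100 virtual users, then a 10-minute hold at 100) can be expressed as a staged profile. the stage list mirrors the style of a k6 `stages` option, but the interpolation helper below is our own illustrative sketch, not the study's script:

```python
# average-load scenario as stages (durations in seconds)
average_load_stages = [
    {"duration": 300, "target": 100},  # ramp up, one user at a time, over 5 min
    {"duration": 600, "target": 100},  # hold 100 concurrent users for 10 min
]

def vus_at(stages, t):
    """Virtual users at time t, linearly interpolated from 0."""
    start_vus, elapsed = 0, 0
    for s in stages:
        if t <= elapsed + s["duration"]:
            frac = (t - elapsed) / s["duration"]
            return round(start_vus + frac * (s["target"] - start_vus))
        start_vus, elapsed = s["target"], elapsed + s["duration"]
    return stages[-1]["target"]
```

halfway through the ramp (t = 150 s) the profile carries 50 users, and anywhere in the hold phase it carries the full 100, for a 15-minute scenario overall.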
this rigorous phase simulates situations in which the system becomes overwhelmed due to abrupt increases in traffic, such as the dissemination of viral content or the execution of marketing campaigns. by subjecting the load balancer to a period of high demand, we can evaluate its capacity to effectively manage substantial workloads while ensuring optimal performance, uptime, and resource allocation. this phase assesses not only the technical capabilities of the load balancer but also its ability to maintain service quality under challenging circumstances, mitigating the risk of service interruptions during periods of high demand [37]. figure 5 provides a comprehensive visual depiction of the dynamic scenario, effectively illustrating the entire testing procedure, facilitating comprehension of the stages of testing, and serving as a framework for interpreting and analyzing results. by employing carefully designed testing scenarios and utilizing visual aids, businesses can obtain an in-depth understanding of the load balancer's functionality, enabling them to make informed, data-driven decisions aimed at improving system performance and resilience.

fig. 5. average-load testing scenario

figure 6 represents a pivotal juncture when intentional and significant pressure is applied to the load balancer, resulting in a massive surge of incoming traffic. the simulation provided in this study aims to recreate real-world scenarios in which sudden and rapid surges in user activity place substantial pressure on the system's resources and capabilities. the stress test begins with users consecutively accessing the load balancer for an extended duration of 10 minutes, progressively augmenting the user count to a substantial aggregate of 200 individuals. this progressive incorporation of users underscores the load balancer's capacity to adjust to an ever-expanding user load while maintaining consistent performance metrics [38].

fig. 6.
stress testing scenario

once the user count exceeds the critical threshold of 200, the situation transitions into a phase marked by a prolonged duration of heightened demand, wherein intensive demands persist for 10 minutes. this phase replicates high-stress scenarios in which system resources are fully utilized. during this phase, a comprehensive analysis is performed on critical performance indicators, encompassing response times, resource utilization, and error rates. the data produced presents valuable information on the load balancer's capacity to effectively manage heavy workloads while maintaining satisfactory service quality [39]. the culmination of the stress testing scenario occurs when users systematically complete their requests within a 5-minute timeframe, resulting in a gradual decrease in user burden. this demonstrates the decline in user involvement that naturally occurs after increased demand and offers valuable insights into the load balancer's capacity to manage the reduction in incoming requests efficiently. organizations can enhance their comprehension of the load balancer's performance in high-stress conditions by employing visual representations of stress testing scenarios. these insights are of paramount importance for companies seeking to enhance the resilience of their systems against unexpected surges in user traffic and ensure uninterrupted service delivery, especially at peak demand.

the breakpoint test scenario, as illustrated in figure 7, is a critical stage within our extensive testing protocol. this scenario aims to methodically evaluate the capabilities and thresholds of the load balancer, especially when confronted with a continuous and substantial increase in user traffic.
the process commences with a notable influx of 20,000 users consistently visiting the load balancer, persistently exerting pressure on its capacities until a threshold is reached. this phase aims to determine the specific threshold at which the load balancer's performance begins to deteriorate or is compromised under high-load situations [40].

fig. 7. breakpoint testing scenario

to conduct thorough examinations and verify the results, we utilize the grafana labs k6 testing tool. this tool facilitates the precise execution of tests, ensuring adherence to specified scenarios that faithfully replicate real-world usage patterns. the use of grafana labs k6 guarantees that our testing methodology is characterized by both control and representation of genuine user behavior, allowing us to extract significant insights into the load balancer's performance in diverse scenarios. in addition, the testing procedure on the server side is closely monitored using the kibana data analytics tools. this dual-monitoring technique functions as a reliable validation mechanism, enabling cross-referencing and verification of the outcomes derived from grafana labs k6. by utilizing the sophisticated analytics features of kibana, a comprehensive understanding of the load balancer's performance can be obtained, including evaluation of resource use, response times, and error rates [41]. the testing protocol we employ is rigorous, yielding substantial data and valuable insights. these findings are meticulously arranged and communicated through a collection of tables, which encompass the results of average-load testing (table 1), stress testing (table 2), and breakpoint testing (table 3).
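as a small sanity check on the k6 counters reported in the tables that follow (our own arithmetic, using the paper's round-robin average-load figures of 51290 requests at 56.925729 requests per second):

```python
# derive the effective test duration from a k6 request counter and its rate
http_reqs = 51_290      # round robin, average-load test (table 1)
rate_per_s = 56.925729  # requests per second reported by k6

duration_s = http_reqs / rate_per_s
# ≈ 901 s, i.e. roughly the 15-minute scenario (5-min ramp + 10-min hold)
```

the recovered duration matching the declared scenario length is a quick way to confirm that a k6 summary and its scenario definition agree.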
utilizing a tabular format facilitates the seamless comparison and analysis of crucial performance parameters, enabling well-informed decisions and optimization of the load balancer and its connected systems. table 1 provides a comprehensive analysis demonstrating the constant and reliable performance of the round robin algorithm across several vital parameters. notably, this algorithm exhibits exceptional proficiency in processing http requests and establishing secure connections. its consistent capacity to produce expected outcomes highlights its appropriateness for the server environment of the bali provincial government, where the utmost importance is placed on stability and dependability.

table 1. test results in average-load testing

metric                   | round robin                                                                   | ip hash
data_received            | 243 mb (270 kb/s)                                                             | 218 mb (330 kb/s)
data_sent                | 32 mb (35 kb/s)                                                               | 108 mb (43 kb/s)
http_req_blocked         | avg=131.88ms min=3µs med=115.75ms max=2.35s p(90)=165.48ms p(95)=190.69ms     | avg=140.85ms min=1µs med=117.38ms max=57.57s p(90)=185.54ms p(95)=223.84ms
http_req_connecting      | avg=9.12ms min=0s med=0s max=1.03s p(90)=17.99ms p(95)=24.65ms                | avg=7.76ms min=0s med=0s max=3.03s p(90)=14.91ms p(95)=22.45ms
http_req_duration        | avg=109.32ms min=21.92ms med=114.79ms max=2.39s p(90)=199.14ms p(95)=226.37ms | avg=154.45ms min=20.54ms med=121.04ms max=1m0s p(90)=216.97ms p(95)=262.81ms
{expected_response:true} | avg=109.32ms min=21.92ms med=114.79ms max=2.39s p(90)=199.14ms p(95)=226.37ms | avg=153.07ms min=20.54ms med=121.04ms max=57.6s p(90)=216.96ms p(95)=262.77ms
http_req_failed          | 0.00% ✓ 0 ✗ 51290                                                             | 0.00% ✓ 4 ✗ 172777
http_req_receiving       | avg=44.16µs min=5µs med=25µs max=6.13ms p(90)=100µs p(95)=123µs               | avg=37.75µs min=0s med=18µs max=21.62ms p(90)=78µs p(95)=126µs
http_req_sending         | avg=72.87ms min=3µs med=86.91ms max=2.36s p(90)=159.74ms p(95)=182.22ms       | avg=116.84ms min=3µs med=92.41ms max=57.58s p(90)=177.1ms p(95)=214.64ms
http_req_tls_handshaking | avg=54.33ms min=0s med=0s max=1.49s p(90)=126.68ms p(95)=141.39ms             | avg=53.93ms min=0s med=0s max=45.91s p(90)=133.34ms p(95)=157.43ms
http_req_waiting         | avg=36.41ms min=21.8ms med=31.65ms max=2s p(90)=49.24ms p(95)=59.91ms         | avg=37.57ms min=20.5ms med=30.21ms max=1m0s p(90)=49.2ms p(95)=63.94ms
http_reqs                | 51290 (56.925729/s)                                                           | 172781 (69.609188/s)
iteration_duration       | avg=1.17s min=1.07s med=1.15s max=3.39s p(90)=1.21s p(95)=1.25s               | avg=1.21s min=1.04s med=1.15s max=1m1s p(90)=1.24s p(95)=1.3s
iterations               | 51290 (56.925729/s)                                                           | 172781 (69.609188/s)
vus                      | 1 (min=1, max=100)                                                            | 1 (min=1, max=100)
vus_max                  | 100 (min=100, max=100)                                                        | 100 (min=100, max=100)

on the other hand, the ip hash algorithm demonstrates its advantages in data transmission rates and its ability to handle a larger number of requests per second. these characteristics make it appealing when the primary focus is on swift data delivery. nevertheless, it is essential to acknowledge that these benefits are accompanied by compromises in other aspects of performance. the data obtained from these experiments provides a comprehensive understanding of the performance of both methods, demonstrating distinct strengths in various load-balancing aspects. although both round robin and ip hash have their advantages, the predominant data indicates that round-robin's consistent and dependable performance establishes it as the preferable option within the server ecosystem of the bali provincial government.
nevertheless, it is essential to consider the individual deployment and use-case needs, as they may necessitate a more nuanced conclusion. therefore, more research should be conducted to examine these aspects and offer more customized advice for the government's servers. the findings reported in table 2 demonstrate the superior performance of the round-robin algorithm compared to the ip hash technique across all critical performance criteria. significantly, the round robin algorithm demonstrates exceptional performance in connection times, request durations, and overall efficiency in handling http requests. consistently superior outcomes across all crucial factors establish round robin as the optimum solution for enhancing performance in the tested setting.

table 2. test results of stress testing

metric                   | round robin                                                                   | ip hash
data_received            | 787 mb (525 kb/s)                                                             | 1.1 gb (162 kb/s)
data_sent                | 102 mb (68 kb/s)                                                              | 151 mb (22 kb/s)
http_req_blocked         | avg=194.77ms min=0s med=166.53ms max=5.68s p(90)=246.5ms p(95)=315.45ms       | avg=920.09ms min=0s med=589.27ms max=16m48s p(90)=1.06s p(95)=1.12s
http_req_connecting      | avg=16.67ms min=0s med=0s max=1.1s p(90)=36.26ms p(95)=49.9ms                 | avg=277.56ms min=0s med=0s max=16m48s p(90)=68.84ms p(95)=118.92ms
http_req_duration        | avg=171.3ms min=0s med=141.69ms max=59.97s p(90)=302.22ms p(95)=377.21ms      | avg=4.72s min=0s med=228.01ms max=56m54s p(90)=1.22s p(95)=1.38s
{expected_response:true} | avg=167.37ms min=21.42ms med=141.66ms max=5.75s p(90)=302.16ms p(95)=377.06ms | avg=2.16s min=20.51ms med=225.44ms max=40m7s p(90)=1.22s p(95)=1.32s
http_req_failed          | 0.00% ✓ 14 ✗ 166217                                                           | 0.66% ✓ 1592 ✗ 236692
http_req_receiving       | avg=38.17µs min=0s med=18µs max=579.45ms p(90)=68µs p(95)=94µs                | avg=25.75µs min=0s med=19µs max=11.5ms p(90)=38µs p(95)=53µs
http_req_sending         | avg=114.06ms min=0s med=93.2ms max=5.7s p(90)=241.33ms p(95)=298.96ms         | avg=2.04s min=0s med=136.15ms max=40m7s p(90)=1.12s p(95)=1.17s
http_req_tls_handshaking | avg=75.22ms min=0s med=0s max=4.77s p(90)=172.42ms p(95)=198.03ms             | avg=183.01ms min=0s med=0s max=10m54s p(90)=863.72ms p(95)=963.06ms
http_req_waiting         | avg=57.2ms min=0s med=43.39ms max=59.52s p(90)=80.37ms p(95)=100.45ms         | avg=2.68s min=0s med=60.85ms max=56m54s p(90)=181.39ms p(95)=325.93ms
http_reqs                | 166231 (110.778429/s)                                                         | 238284 (34.517602/s)
iteration_duration       | avg=1.26s min=1.06s med=1.22s max=1m1s p(90)=1.34s p(95)=1.44s                | avg=2.2s min=1.03s med=1.87s max=1m1s p(90)=2.24s p(95)=2.55s
iterations               | 166231 (110.778429/s)                                                         | 238282 (34.517312/s)
vus                      | 1 (min=1, max=200)                                                            | 1 (min=1, max=200)
vus_max                  | 200 (min=200, max=200)                                                        | 200 (min=200, max=200)

in sharp contrast, despite its higher overall data processing and iteration counts, the ip hash algorithm has a significantly distinct profile: it is characterized by significantly reduced speeds, prolonged waiting periods, and an increased frequency of request failures. these deficiencies demonstrate its constraints in providing expeditious and prompt service, a crucial factor for consumers in a rapidly evolving digital environment. the information presented in table 3 is explicit and without significant ambiguity. due to its demonstrated dependability and effectiveness in managing key activities, the round-robin scheduling algorithm is unequivocally favored for optimizing performance within the specific context under examination. nevertheless, it is advisable to consider the precise operational demands and use circumstances, since these factors may require a more intricate decision-making process when deploying load-balancing solutions. further investigation of these intricate situations is needed to offer complete and customized suggestions for selecting an ideal load balancer.

table 3. breakpoint testing results

metric                   | round robin                                                                     | ip hash
data_received            | 20 mb (37 kb/s)                                                                 | 20 mb (27 kb/s)
data_sent                | 2.4 mb (4.4 kb/s)                                                               | 2.4 mb (3.1 kb/s)
http_req_blocked         | avg=114.64ms min=75.82ms med=101.49ms max=1.13s p(90)=148.8ms p(95)=212.63ms    | avg=106.5ms min=0s med=92.05ms max=3.46s p(90)=106.75ms p(95)=117.64ms
http_req_connecting      | avg=30.68ms min=20.57ms med=25.37ms max=328.1ms p(90)=41.57ms p(95)=56.4ms      | avg=13.77ms min=0s med=5.45ms max=2.01s p(90)=9.76ms p(95)=13.83ms
http_req_duration        | avg=32.07ms min=21.58ms med=26.52ms max=441.87ms p(90)=42.18ms p(95)=55.14ms    | avg=72.18ms min=0s med=27.29ms max=18.38s p(90)=31.97ms p(95)=36.02ms
{expected_response:true} | avg=32.07ms min=21.58ms med=26.52ms max=441.87ms p(90)=42.18ms p(95)=55.14ms    | avg=72.19ms min=21.89ms med=27.29ms max=18.38s p(90)=31.97ms p(95)=36.02ms
http_req_failed          | 0.00% ✓ 0 ✗ 4259                                                                | 0.02% ✓ 1 ✗ 4258
http_req_receiving       | avg=156.23µs min=31µs med=139µs max=4.04ms p(90)=198µs p(95)=242µs              | avg=159.25µs min=0s med=139µs max=8.04ms p(90)=214µs p(95)=266.09µs
http_req_sending         | avg=140.53µs min=23µs med=125µs max=7.45ms p(90)=190µs p(95)=229µs              | avg=43.14ms min=0s med=132µs max=18.35s p(90)=209µs p(95)=294.19µs
http_req_tls_handshaking | avg=83.66ms min=53.25ms med=72.74ms max=1.07s p(90)=101.44ms p(95)=144.09ms     | avg=92.13ms min=0s med=85.78ms max=3.45s p(90)=97.23ms p(95)=105.03ms
http_req_waiting         | avg=31.77ms min=21.36ms med=26.2ms max=441.6ms p(90)=41.79ms p(95)=54.88ms      | avg=28.87ms min=0s med=26.96ms max=839.28ms p(90)=31.53ms p(95)=35.32ms
http_reqs                | 4259 (7.884463/s)                                                               | 4259 (5.59725/s)
iteration_duration       | avg=1.14s min=1.09s med=1.13s max=2.18s p(90)=1.19s p(95)=1.26s                 | avg=1.19s min=1.09s med=1.12s max=1m1s p(90)=1.14s p(95)=1.15s
iterations               | 4259 (7.884463/s)                                                               | 4259 (5.59725/s)
vus                      | 1 (min=1, max=13)                                                               | 1 (min=1, max=25)
vus_max                  | 50 (min=50, max=50)                                                             | 50 (min=50, max=50)

the results shown in table 3 highlight the round-robin method's superior performance compared to the ip hash alternative across several essential criteria.
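for contrast with round robin's strict rotation, the ip hash strategy compared in the tables pins each client to a backend derived from its address. a minimal sketch of that idea (the hash function and backend names are our own illustrative assumptions, not the study's configuration):

```python
import hashlib

backends = ["backend-1", "backend-2", "backend-3"]

def ip_hash_pick(client_ip, pool):
    """Stable mapping: the same client IP always lands on the same backend."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

# stickiness: repeated requests from one IP always hit one backend
a = ip_hash_pick("10.0.0.7", backends)
b = ip_hash_pick("10.0.0.7", backends)
```

the design trade-off follows directly: ip hash gives session stickiness, but its evenness depends on the mix of client addresses, whereas round robin guarantees an even rotation regardless of who is connecting.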
significantly, the round-robin algorithm demonstrates exceptional proficiency in data transmission speed, the duration of requests, and the efficient execution of iterations. the consistent and excellent performance of round-robin in these crucial areas makes it a compelling option for optimizing load balancing within the dataset under evaluation. the ip hash method demonstrates notable strengths in specific measures such as request blocking and connection delays; however, the round-robin algorithm emerges as the most advantageous option when examining the overall performance profile. the selection between the two algorithms is contingent upon the particular priorities and exigencies of the given use case, since each method possesses distinct strengths and trade-offs. based on the extensive data provided, it can be concluded that the round-robin algorithm has superior efficiency across all dimensions, rendering it a highly appealing alternative for enterprises aiming to optimize their load-balancing techniques. nonetheless, it is crucial to ensure that the choice of algorithm follows the unique performance goals and operational limitations of the given context, emphasizing the significance of customized approaches in the load-balancing domain. additional inquiry and contextual analysis can potentially enhance the precision of this judgment significantly.

iv. conclusion

through our extensive examination of the ip hash and round robin algorithms, we have garnered significant insights that can contribute to advancing future research endeavors and provide practical guidance for their implementation. concerning data transfer rates, the ip hash method demonstrated a marginal superiority based on the average outcomes of the conducted tests. nevertheless, round-robin has shown itself to be a more reliable option, especially in managing http requests and secure connections.
the stability and dependability of round robin were further emphasized during stress testing, as it continually surpassed ip hash across several performance parameters. significantly, round robin exhibited better connection times, request durations, and overall efficiency in managing http requests. in the breakpoint test, the competition between the two algorithms was more evenly balanced: both the ip hash and round-robin algorithms handled comparable data amounts, but round robin exhibited superior data transmission rates. although ip hash showed superior performance in request blocking and connection delays, round robin once again showcased its proficiency in the crucial realm of http request handling and iteration processing rates.

upon examining the collective results obtained from the three tests, it becomes apparent that the round-robin algorithm exhibits superior performance, consistency, and reliability compared to the ip hash method. although ip hash showed capabilities in certain areas, round robin consistently beat it across a broader range of performance criteria. when businesses or other entities are confronted with the choice between these two algorithms in prospective research, round robin deserves significant consideration owing to its equitable and efficient performance, which holds particular significance when consistency and reliability are paramount. nevertheless, individual use cases and distinct requirements may influence the final selection. hence, future research should investigate these particular cases more comprehensively to offer additional insights and recommendations for the selection and implementation of algorithms.

declarations

author contribution
all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest
the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information
reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
publisher's note: department of electrical engineering and informatics universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.
knowledge engineering and data science (keds) pissn 2597-4602
vol 6, no 1, april 2023, pp. 24–40 eissn 2597-4637
https://doi.org/10.17977/um018v6i12023p24-40
©2023 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

exploring the impact of students demographic attributes on performance prediction through binary classification in the kdp model

issah iddrisu a,1,*, peter appiahene a,2, obed appiah a,3, inusah fuseini b,4
a university of energy and natural resources, post office box 214, sunyani, ghana
b university for development studies, unnamed road, tamale, ghana
1 issah.iddrisu.stu@uenr.edu.gh*; 2 peter.appiahene@uenr.edu.gh; 3 obed.appiah@uenr.edu.gh; 4 obed.appiah@uenr.edu.gh
* corresponding author
i. introduction

learner assessment is central to determining students' progress in every educational establishment. evaluating students' performance, however, has become a daunting task as more factors are now involved in determining student achievement, owing to the paradigm shift taking place in the educational sector: the use of learning management systems (lms), student information systems (sis), and educational management information systems (emis). the data produced by these systems tend to overwhelm educational decision-makers because of the diversity and massive volume of data housed by these sources. however, recent research advances have made powerful computational prediction methods and techniques, such as machine learning, a realistic alternative for various applications, including educational decision support systems (edss). machine learning (ml) is one way to help decipher the intricate relationship between students' data and their performance. when implemented correctly in learning environments, machine learning can improve our knowledge of fundamental processes by simplifying the identification, extraction, and evaluation of the underlying factors affecting student learning and achievement levels. much progress has been made in applying machine learning in other fields such as medicine, commerce, the transport industry, bioinformatics, road traffic detection and control, and diverse other fields where decision-making is crucial [1]. ml involves searching through many possible hypotheses to ascertain the most appropriate and relevant one for the data, and then comparing it with existing data generated by the learner. the idea of machine learning is derived from various disciplines, such as probability and statistics, computational complexity, information theory, neurology, evolutionary theories, and models [2].
article info

article history:
received 9 march 2023
revised 20 march 2023
accepted 21 april 2023
published online 30 april 2023

keywords:
student demographic
performance prediction
classification
kdp model

abstract

during the course of this research, binary classification and the knowledge discovery process (kdp) were used. the experimental and analytical capabilities of rapid miner's 9.10.010 instructional environment are supported by five different classifiers. included in the analysis were 2334 entries, 17 characteristics, and one class variable containing the students' average score for the semester. twenty experiments were carried out. during the studies, 10-fold cross-validation and ratio split validation, together with bootstrap sampling, were used. the random forest (rf), rule induction (ri), naive bayes (nb), logistic regression (lr), and deep learning (dl) methods were evaluated. rf outperformed the other four methods in all six selection measures, with an accuracy of 93.96%. according to the rf classifier model, the level of education of a child's parents is a major factor in that child's academic performance before entering higher education.

this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

the ml design approach leans on several criteria: identifying the natural experiences acquired from training, the exact function to learn, a representation for the said function, and the optimal algorithm for learning it from the training examples.
ml algorithms commonly used include decision trees (dt), support vector machines (svm), artificial neural networks (ann), logistic regression (lr), naïve bayes (nb), and rule induction (ri) algorithms. similar to the other fields where ml has been successfully employed, its application to educational data is a promising research area identified as educational data mining (edm). it involves creating processes to extract patterns embedded in datasets within educational settings [3]. this concept has been implemented to improve and assess educational activities and decision-making. prediction, which encompasses the subcategories of classification, regression, and density estimation, is one paradigm in edm [4]; clustering is another. association mining, correlation mining, sequential pattern mining, and causative data mining are all types of relationship mining [5]. in addition, edm incorporates data distillation to aid human judgment and model discovery. edm has proven to be a primary source of solid and dependable data analysis for educational decision-making at the country's educational institutions [6][7]. it carefully identifies education challenges to determine appropriate solutions that address them. the inclusion of an expert system in managing primary education through edm has been described in [8] and [6]. educational data mining has been used to track the academic welfare of students and the general administrative procedures of educational institutions worldwide [9][10]. it is essential to be aware of the factors (also known as the predictor variables) that influence students' academic performance in order to comprehend and enhance the current state of the educational system [11]. therefore, determining the characteristics associated with students' academic accomplishment has always aroused the interest of academics working in edm. many earlier studies dissected this phenomenon by isolating one variable at a time.
they attempted to investigate the relationship between a single element and its impact on academic accomplishment by collecting data, the majority of which was obtained using survey-type instruments. previous research has been published to determine the primary elements or characteristics that influence learner achievement, including the algorithms that produce the best prediction results. students' apparent poor performance in numerous educational establishments has been influenced by various predictors [12][13]. these include personal characteristics, intellectual ability, gender and aptitude tests, academic achievement, previous college accomplishments, and demographic characteristics [14]. in modeling students' academic performance based on their cognitive and non-cognitive characteristics [11], seven heterogeneous ml classifiers were employed, including dt, knn, ann, lr, rf, adaboost, and svm. the authors used the 10-fold and leave-one-out cross-validation techniques to evaluate the selected classifiers' predictive performance. the students' absent days (sad) were the dominant feature for predicting students' academic success, and it was also concluded that rf, lr, and ann were viable for predicting students' performance. an implementation of ml to determine students' academic achievement based on internal assessment data constructed an ann-based prediction model [15]. the best classification accuracy attained by the model was 95.34%, through the ann. furthermore, precision, recall, f-score, accuracy, and the kappa statistic were derived as rule-based decision specifications to discover the most practical classification methods. however, the study presented inconsistent observations on which specific machine learning model is most accurate in predicting students' performance.
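the 10-fold and leave-one-out schemes mentioned above differ only in how the index set is partitioned. a minimal sketch of the bookkeeping (no shuffling or stratification, which real tools such as rapid miner or scikit-learn typically add on top):

```python
def k_fold_splits(n_samples, k):
    """partition indices 0..n_samples-1 into k contiguous folds; each fold
    serves once as the test set while the remaining indices form the
    training set."""
    base, extra = divmod(n_samples, k)
    splits, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits

ten_fold = k_fold_splits(100, k=10)        # 10-fold cross-validation
leave_one_out = k_fold_splits(100, k=100)  # leave-one-out is the k = n case
```

each sample appears in exactly one test fold, so the k per-fold scores average into one estimate of generalization performance.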
another study investigated factors affecting students' performance at the postgraduate level by using an ann to construct the model [16]. the study presented a deep-learning model for performance prediction based on 395 postgraduate students and 30 records within the r data mining environment. a comparison of the accuracy of lr, the rf technique, and the ann revealed that lr performed with 12.339% accuracy, rf gave an accuracy of 28.101%, and the ann achieved an accuracy of 97.429% on the given dataset. with this prediction accuracy, it was concluded that the ann is more reliable and demonstrates better classification results than the other traditional classifiers. the dataset used in the study was based on attributes from institutions of higher learning; it would be interesting to apply the same model to datasets of pre-tertiary institutions to validate the model's generalization.

another work investigated the prediction of students' learning outputs and explored the likelihood of recognizing the critical features in the data to be used in creating the prediction model, using visualization and clustering algorithm techniques [17]. the outcome demonstrated the capability of the clustering algorithm in classifying significant indicators within the datasets. in addition, the study showed the efficiency of svm and linear discriminant analysis (lda) algorithms in training educational datasets while giving satisfactory classification accuracy and reliability test rates. however, the small dataset means the result cannot be generalized to prove the model's efficacy on all educational datasets. three different ml techniques were used to forecast student performance [18]: dt, nb, and lr classification were employed. feature engineering criteria and the modification and selection of dataset characteristics were applied to enhance the predictions made by the ml algorithms.
the dataset used was put into two separate categories. the research findings suggest that using ml to anticipate student performance may be helpful. the most successful method on the first dataset was nb classification, with 98% accuracy; dt did better for the second batch of data, with 78% accuracy. in that study, the specific attributes and techniques capable of determining future learning outcomes could not be identified, presenting a conceptual vacuum that warrants further investigation. studies on the relationships between the instructional strategies employed by instructors and educators and how they impact students' academic performance have recently attracted more attention. most research focuses on achievement due to the use of assessment techniques such as class tests, homework, class exercises, project work, and semester examinations [19]. when predicting a student's future academic success, past grades from an academic institution are seen to carry the appropriate amount of weight, as enumerated by [20], mainly when those grades come from continuous assessment, which shows a student's early mastery of a topic and progress of study. one study explored the efficacy of assessments using examination techniques, class tests, assignments, and mid-semester quizzes, including the influence of lecturer response on students' performance [21]. the study's outcome revealed a correlation between the assessments students took and, eventually, the students' final grades. another investigation, exploring the relevance of formative assessment for improving the prediction of learner grades in examinations, suggested the possibility of identifying students who may perform poorly in their final examinations, and thus of forecasting, with a degree of accuracy, how a student will perform at an end-of-course examination [22]. giving assessment feedback to students on time often results in a small enhancement of the final grades [23].
one study examined the validity of previous achievements in determining students' performance in higher education [24]. the high school scholastic assessment test (sat) score marks and the early years' university grades were considered possible predictors of future performance. the impact of subjects on students' advanced placements was also investigated. their findings clearly connected these three characteristics with students' university accomplishments. among the factors that influence students' performance are school effects, socio-economic background, and personal traits hindering students' performance [12]. student background characteristics such as education levels, the profession of parents/guardians, and place of residence all play an essential part in defining students' success (tinto, 1975) [24]. this is further corroborated by work referring to these phenomena of students' academic success as "a one-hundred-factor problem," as many researchers focused on different aspects of students' performance in different periods and came to diverse conclusions [25]. an examination of the impact of socio-economic influences on the upbringing of students and the final results of their education found that students from privileged backgrounds attained higher grades or had necessary skills that proved valuable within the academic setting [26]. this suggests that the level of poverty and even the area students come from can affect a student's academic output. furthermore, it suggests that a student's home environment is a contributory factor in his or her performance. in serbia, some demographic features, including gender, ethnicity, and the students' school background, were investigated to determine which among them had more influence on students' academic performance in mathematics and the serbian language [27]. the results indicated that student affluence contributed the most to poor mathematics performance, whereas the serbian language
grades were less affected. gender had a relatively minimal effect on the grades, suggesting that gender has less effect on students' performance at the university level. integrating demographic data alongside school results is recommended because learner achievement is otherwise based almost entirely on students' past exam results, mostly without consideration of the setting in which those performances were accomplished [28]. again, research on student achievement and its associations with context-specific background variables and attainment in broader terms has been largely limited [12][13], hence the need to delve into the correlation between students' performance and their demographic variables. moreover, the literature in this regard has failed to provide further remedies or intervention strategies based on traits identifiable early in a student's program of study. as a result, the goal of this research is to apply ml to students' demographic characteristics to track their achievements, as well as to design a classification model capable of mapping student features to performance in order to effectively implement the ministry of education's (moe) flagship early intervention scheme to improve underperforming students' academic achievements in schools. the paper aims to identify and apply ml algorithms to uncover the key demographic factors that influence newly admitted students' academic achievement, as well as to identify students who should receive appropriate academic intervention so that overall school performance can be scaled up in the west africa senior secondary certificate examination (wassce). the research aims to examine and address the following questions:
1. which machine learning classification algorithms are more viable in predicting students' academic attainment based on their demographic attributes?
2.
what primary demographic attributes influence students' academic performance at ghana's senior high school (shs) level?

ii. methods

this study employed the experimental research approach using binary classification techniques based on the six-step kdp model. the classification technique was used to sort the students into those in need of intensive intervention and those in need of low intervention. we employed secondary data from two sources. from the placement forms of students in the computerized school selection and placement system (cssps), the demographic data, basic education certificate examination (bece) average score, and previous school data were extracted. the semester average score and the grades for english language, mathematics, and integrated science representing their senior high school (shs) performance were extracted from the student information system (sis). also, on the suggestion of the domain expert (the ict coordinator of tamale islamic science senior high school (tissec)), the following student attributes were considered helpful for the task at hand: mother's education level, father's education level, sponsor of the student's education, the birth position of the student in the family, and the parental status of students. this study used 1854 records and 17 common attributes (including the class attribute) for training and evaluating the various models. the students' features used in the study are summarized in table 1.

a. dataset optimization and feature extraction

primary and real-world data will invariably contain imbalanced-data challenges [29]. for example, whenever the number of instances of one class (the minority class) is significantly lower than the number of instances of the other classes (the majority class), learning tends to favour the majority class, so the minority class, often the class of most interest, carries the highest error cost in learning [30].
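the smote idea used below to counter this imbalance can be sketched in a few lines. this is a minimal illustration with made-up 2-d minority points, not the rapidminer operator or the original smote implementation; the function name and data are hypothetical:

```python
import random

def smote_upsample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating a chosen
    minority sample toward one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# hypothetical minority-class points; doubling them mirrors the upscaling idea
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_points = smote_upsample(minority, n_new=4)
```

because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region instead of being mere duplicates.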
the synthetic minority oversampling technique (smote) with default settings was used as a sampling technique to upscale the minority class of the students' demographic records to manage class imbalance within the features. the upscaling synthetically increased the number of minority-class instances by 79% within the local repository of rapidminer after the smote up-sampling application.

table 1. attributes extracted from the database
no. | attribute name | data value | data type
1 | gender | [1]: male = m, [2]: female = f | nominal
2 | student position in the family | [1]: 1st born, [2]: last born, [3]: others, [4]: only child | numeric
3 | parents' marital status | [1]: married, [2]: single, [3]: widowed | nominal
4 | father's edu. | [1]: primary school, [2]: junior high school (jhs), [3]: secondary school (shs), [4]: tertiary, [5]: none | nominal
5 | mother's edu. | [1]: primary school, [2]: junior high school (jhs), [3]: secondary school (shs), [4]: tertiary, [5]: none | nominal
6 | father's occ. | [1]: retired, [2]: government, [3]: private sector employee, [4]: self-employment, [5] | nominal
7 | mother's occ. | [1]: retired, [2]: government, [3]: private sector employee, [4]: self-employment, [5] | nominal
8 | sponsor | [1]: self, [2]: parent, [3]: scholarship, [4]: others | nominal
9 | residential status | [1]: boarding, [2]: day | nominal
10 | type of jhs attended | [1]: private, [2]: public | nominal
11 | bece aggregate | [1]: 6-8, [2]: 9-11, [3]: 12-15, [4]: 16-19, [5]: 20-24, [6]: 25-30, [7]: above 30 | numeric
12 | bece accumulated raw score | [1]: 50-100, [2]: 101-200, [3]: 201-300, [4]: 301-400 | numeric
13 | first semester avg score | [1]: 00-45, [2]: 46-50, [3]: 51-55, [4]: 56-60, [5]: 61-65, [6]: 66-70, [7]: [8]: 71-79, [9]: 80 and above | numeric
14 | region of residence | [1]: n/r, [2]: a/r, [3]: g/r, [4]: c/r, [5]: u/w, [6]: u/e, [7]: s/r, [8]: ne/r, [9]: w/r, [10]: e/r, [11]: v/r, [12]: o/r, [13]: ahafo/r, [14]: bono east, [15]: bono, [16]: wn/r | nominal
15 | integrated science | [1]: a1, [2]: b2, [3]: b3, [4]: c4, [5]: c5, [6]: c6, [7]: d7, [8]: e8, [9]: f9 | nominal
16 | english language | [1]: a1, [2]: b2, [3]: b3, [4]: c4, [5]: c5, [6]: c6, [7]: d7, [8]: e8, [9]: f9 | nominal
17 | mathematics | [1]: a1, [2]: b2, [3]: b3, [4]: c4, [5]: c5, [6]: c6, [7]: d7, [8]: e8, [9]: f9 | nominal

since not all attributes have equal significance for prediction within a defined dataset, feature extraction and ordering are critical. given this, the attributes were sorted by information-gain weight, as seen in table 2. the operator "weight by information gain" was used in rapidminer to determine the order of the attributes. figure 1 depicts the attributes in descending order of information gain with respect to the class attribute.

table 2. attribute weights by information gain
no. | attribute | information gain
1 | mother education | 0.157
2 | father education | 0.139
3 | bece raw score | 0.131
4 | bece aggregate | 0.123
5 | father occupation | 0.083
6 | mother occupation | 0.030
7 | jhs type | 0.026
8 | residential status | 0.016
9 | region of residence | 0.016
10 | age | 0.013
11 | jhs location | 0.012
12 | sponsor | 0.003
13 | student birth position | 0.001
14 | parent marital status | 0.000
15 | student maturity | 0.000
16 | gender | 0.000

b. modeling technique and model building

experiments were conducted in this study to build models by incorporating specified classifiers for predicting the performance of pre-tertiary students based on demographic information. five classification approaches were used for model construction to meet the study's aims, and rapidminer studio was used to conduct the analysis. the rf algorithm from dt, the ri algorithm from rule-based classifiers, the nb algorithm from bayesian networks, the lr algorithm from regression, and the dl algorithm from nn were chosen for the experiments among the various classification algorithms available in rapidminer. the grounds for selecting these algorithms are their capacity to handle polynomial attributes effectively, the ease of understanding and interpretation of the models' outcomes, and their popularity in recent years in education-related classification problems.

fig. 1. a line graph of information gain of attributes

c. description of the selected algorithms

first, a classification method is used to construct a decision tree (dt). the classification process is described in this instance via a hierarchical array of decisions on feature variables that manifest in the shape of a tree [31]. a dt is made of nodes joined to constitute a rooted tree; it is thus a directed graph with a node known as the root that has no incoming edges (figure 2). the other nodes that determine the class of objects are known as the leaves or terminal nodes [32].
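the information-gain criterion used for the table 2 weights (and as the dt splitting criterion) can be sketched as follows; the attribute values below are hypothetical toy data, not the study's records:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the class labels minus the weighted entropy of the
    partitions induced by the feature's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# hypothetical toy data: mother's education vs. intervention class
mother_edu = ["primary", "primary", "tertiary", "tertiary", "jhs", "tertiary"]
status     = ["intensive", "intensive", "low", "low", "intensive", "low"]
gain = information_gain(mother_edu, status)
```

in this toy example the feature separates the classes perfectly, so the gain equals the full class entropy; a feature whose values are unrelated to the class (like gender in table 2) would score near 0.000.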
every leaf is attributed to a class representing the most appropriate target value [33]. nodes holding a blend of diverse classes are split further. a stopping criterion determines when the decision tree algorithm should terminate; it is said to be reached when all training instances in a terminal/leaf node fall within a single class [34]. figure 2 illustrates a typical dt structure [35].

fig. 2. concept of a decision tree

every node matches a characteristic, while the branches link to an array of values. all nodes are labeled with the attributes they test, and every branch has its corresponding values [36]. the ranges of values are mutually exclusive and complete. the properties of a tree being disjoint and complete are vital as they ensure every instance maps to one case (figure 2). averaging ensemble approaches include the random forest (rf) algorithm. rf can represent huge feature spaces and is more resilient than a single dt. rf is a bagged classifier that connects a group of dt classifiers to form a forest of trees [37]. a diverse collection of classifiers is formed by integrating randomization into the classifier-building process, and the ensemble prediction is presented as the average prediction of the discrete classifiers [2]. in rf, every tree in the ensemble is created using a unique bootstrap sample, which includes a random selection of instances with replacement from the entire training dataset [38]. random feature selection is also used in an rf [39]: m features are chosen randomly from the M available features at every node of a dt t, and the optimal split is taken from among those m. therefore, the split determined when splitting a node during tree formation is no longer the best among all features; instead, the chosen split is the best among a randomly picked subset of features.
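the two sources of randomness just described, bootstrap sampling of instances and a per-node random subset of m of the M features, can be sketched as follows. the 20 trees echo table 3, but the data and feature counts are stand-ins for illustration:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) instances with replacement: one tree's training set."""
    return [rng.choice(data) for _ in data]

def random_feature_subset(n_features, m, rng):
    """Pick m of the M available features to consider at a single node split."""
    return rng.sample(range(n_features), m)

rng = random.Random(42)
data = list(range(100))       # stand-in for 100 training instances
n_trees, M, m = 20, 16, 4     # 20 trees (table 3); 16 features with m << M

forest_plan = [
    (bootstrap_sample(data, rng), random_feature_subset(M, m, rng))
    for _ in range(n_trees)
]
```

each bootstrap sample contains duplicates and omits roughly a third of the instances, which is what makes the individual trees diverse before their predictions are averaged.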
as a result, the bias of the forest often grows with respect to the bias of a single non-random tree [40]; however, averaging generally compensates for the overall model's increase in bias. table 3 shows a description of the rf parameters optimized within rapidminer and their data types.

table 3. some random forest algorithm parameters with their values in rapidminer
parameter | value | description | type
criterion | information gain | determines the criterion on which the attributes will be split; the split is optimized with respect to the selected criterion. | nominal
apply pruning | true | upon development, the random forest model's random trees can be trimmed; depending on the confidence parameter, some branches are replaced by leaves. | boolean
random split | false | rather than being balanced, this setting divides numerical attributes randomly. | boolean
number of trees | 20 | indicates how many random trees will be produced. | numeric
maximal depth | 10 | a tree's depth varies according to the supplied example set's size and properties. | numeric
confidence | 0.1 | specifies the confidence level used in pruning's pessimistic error computation. | numeric
voting strategy | majority vote | outlines the prediction plan when the tree predictions disagree. | nominal

the second is bayesian classification. the bayesian classifiers, also called naïve bayes (nb), are statistical classifiers derived from the bayesian theorem [41]. the accuracy and speed of bayesian classifiers have proven to be high on large databases [14]. bayesian classification offers a pictorial view of the underlying associations on which to perform learning, and trained bayesian networks can be helpful in classification [42]. a bayesian classification graphical model is indicated in figure 3.

fig. 3. bayesian graphical model
let X denote a data tuple described by measurements on n attributes, and let H denote a hypothesis. then P(H|X) denotes the probability that H holds given X; that is, P(H|X) is the posterior probability of H conditioned on X. P(H) denotes the prior probability of hypothesis H. correspondingly, P(X|H) denotes the posterior probability of X conditioned on H, while P(X) is the prior probability of X. the bayesian theorem offers a criterion for computing the posterior probability P(H|X) from P(H), P(X|H), and P(X). the equation is denoted in (1).

P(H|X) = P(X|H) P(H) / P(X) (1)

for classification problems, X represents an observed data tuple, and H is the hypothesis that X belongs to class c. these are used to establish the probability P(H|X) that tuple X belongs to class c, given the attribute description of X [43]. the nb algorithm makes learning simple by assuming that variables are independent given a specific class while offering a probabilistic interpretation of classification [11]. though independence is generally a wrong assumption, the nb classifier frequently outperforms more advanced classifiers in practice. for example, while employing nb to analyze university and primary school students' performance, [44] found that the nb algorithm had superior accuracy in predicting the performance of primary school students. third is rule-based classifiers. a typical rule is described as follows: IF a condition exists, THEN the result [32]. the antecedent condition is on the rule's left side and consists of a variety of logical operators, comprising >, <, =, and, and or, mainly employed on feature variables. the consequent that generates the class variable is on the rule's right side. a rule ri is presented as qi → c, with qi the antecedent and c a class variable. the symbol → represents the condition "THEN", and qi denotes a condition applied to the feature set [43].
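equation (1) and the naïve independence assumption can be checked numerically with a toy two-class example; the priors and per-attribute conditionals below are invented for illustration, not taken from the study:

```python
# with class-conditional independence, P(X|H) factorises into a product
# over the attributes of X; hypothetical priors and conditionals follow
priors = {"low": 0.5, "intensive": 0.5}
likelihood = {
    "low":       {"mother_edu=tertiary": 0.6, "jhs=private": 0.5},
    "intensive": {"mother_edu=tertiary": 0.2, "jhs=private": 0.3},
}
x = ["mother_edu=tertiary", "jhs=private"]  # the observed tuple X

# numerator of equation (1) per class: P(X|H) * P(H)
score = {c: priors[c] * likelihood[c][x[0]] * likelihood[c][x[1]] for c in priors}
p_x = sum(score.values())                        # the evidence P(X)
posterior = {c: score[c] / p_x for c in score}   # equation (1) per class
```

dividing by the shared evidence P(X) makes the posteriors sum to one, so the predicted class is simply the one with the larger numerator.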
a rule is of the form: IF (attribute 1, value 1) and (attribute 2, value 2) and ... (attribute n, value n) THEN (decision, value). rule induction (ri), a widely applied rule-based classification technique, is experimented with in this study. as stated in [33], rules are a good way of representing information. ri generates rules by dividing and conquering the training set, taking out all instances covered by a rule. rule induction uses the divide-and-conquer and separate-and-conquer rule-learning approaches. the rule algorithms generate a decision list, an ordered set of rules. through j48, rule induction discovers rules based on partial decision trees: it develops a partial c4.5 decision tree and translates the "best" leaf into a rule [41]. typically, an if-then rule has the form: if mother education = primary and mother occupation = government and jhs location = urban then status = low intervention. fourth is support vector machines (svm). svm is a learning algorithm for studying and understanding classification and regression rules. support vector machines can, for example, be used to train radial basis function (rbf), polynomial, and multilayer perceptron (mlp) classifiers [14]. the svms are derived from statistical learning theory, which aims at solving the problem of interest directly rather than solving a more general problem as an intermediate step [45]. the svm belongs to the supervised learning family of algorithms capable of generating learning rules from a given training dataset. the svm has a comprehensive theoretical basis and requires comparatively few data samples for training; investigations indicate that svm is not sensitive to sample dimensions [46]. fifth is neural networks (nn), which simulate the human neural system. an nn comprises an interrelated cluster of artificial neurons processing information based on a connectionist technique for calculation [3]. the nn framework is made up of nodes interconnected through directional links.
every node presents itself as a processing unit, and each link depicts a causal association among the nodes. the nodes are adaptive (the outputs of the nodes depend on modifiable parameters attached to these nodes) [46]. when an artificial neural network (ann) is first constructed, every node in the input layer matches a predictor. the input nodes are then connected to other nodes contained within the hidden layer; every input node is connected to the hidden-layer nodes within the network. the inner-layer nodes are linked to further inner layers or directly to an output layer, and one or several response variables constitute the output layer [32]. beyond the input layer, each node takes in inputs, multiplies each input by a connection weight Wxy (e.g., the weight from node 1 to node 3 is written W13), sums them, applies a function (known as the activation or squashing function) to the sum, and then transfers the result to the next layer. for instance, the value passed on by node 4 is the activation function applied to ([W14 * value of node 1] + [W24 * value of node 2]). figure 4 depicts an nn structure.

fig. 4. a neural network with one hidden layer

the most basic deep networks are feed-forward deep networks, commonly known as multilayer perceptrons (mlp) [46]. the mlp is the most implemented nn architecture in predictive data mining. the mlp is a feed-forward deep network with one or more possibly hidden layers between the connected input and output layers [46]. the feed-forward neural network has no interconnections between nodes within a given layer; instead, outputs from one layer are used as inputs to nodes in subsequent layers. this ensures modularity within the network, i.e., nodes are coherent in functionality or provide an equivalent level of abstraction on input vectors [33].
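the weighted-sum-then-squash computation just described can be sketched as a one-hidden-layer forward pass; the weights below are hypothetical, and the sigmoid is just one common choice of squashing function:

```python
from math import exp

def sigmoid(z):
    """A common activation/squashing function."""
    return 1.0 / (1.0 + exp(-z))

def forward(x, w_hidden, w_out):
    """One hidden layer: each hidden node squashes the weighted sum of all
    inputs (e.g. W13*x1 + W23*x2), and the output node does the same over
    the hidden activations."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(col, x))) for col in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output
w_hidden = [[0.5, -0.4], [0.3, 0.8]]  # row i holds the weights into hidden node i
w_out = [1.0, -1.0]
y = forward([1.0, 2.0], w_hidden, w_out)
```

training an mlp consists of adjusting these Wxy values so that the final squashed output approaches the desired response; the forward pass itself stays exactly this simple.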
the last is regression, commonly employed in predictive model building and the analytical processes of data mining. regression predictions are primarily centered on historical data using functions and formulas [47]. it is mainly a statistical approach to data mining and is implemented to derive a model between dependent and independent variables [47]. regression is also used to build a model that analyzes existing datasets to forecast trends using linear or logistic regression (lr) techniques derived from statistical methods, where functions are driven from an existing dataset; the data is subsequently mapped to the functions to assist in predicting [48]. the lr algorithm is applied to build a regression model with a categorical dependent variable. lr falls into three categories: (1) binary, in the case of binary response variables; (2) multinomial, for more than two non-ordered categories of the dependent variable; and (3) ordinal, for ordered categories [33]. researchers and data analysts generally use lr to analyze and classify proportional and binary response data [49]. the lr can effortlessly handle probability and multi-class issues in classification.

d. research design and evaluation metrics

this study is based on experimental research that employs binary classification techniques. the data comprised numerical values (e.g., age, test scores) and nominal (textual) values, e.g., gender, residential status, and former school. the experimental approach was chosen because it is the basic approach to studying cause-and-effect connections and the relations between two variables [33]. experimental research is also used by researchers to make comparisons between two or more groups on one or more metrics. the research further employed a hybrid data mining model development approach based on the kdp model to carry out the study. this approach gives the researcher a deeper understanding of the problem than deploying only one approach.
this design methodology was employed to obtain a much more broad-minded, research-oriented explanation of the phases; it represents a data mining process rather than just a modeling step, and it has numerous novel, clear, and specific feedback loops [33]. figure 5, adopted from [50], indicates the six-step kdp modeling approach comprising understanding the research problem, understanding the data, preparing the data, mining the data, evaluating the discovered knowledge, and using the discovered knowledge [50].

fig. 5. the six-step kdp model

evaluation of model performance is essential for rating a model's effectiveness, improving parameters during the iterative learning process, and choosing an acceptable model from an assortment of models [51]. the following six widely known performance metrics were used to compare and select algorithms for the classification task, so as to construct a robust model: accuracy, precision, sensitivity, specificity, auc, and f-measure. the most prevalent metric for measuring the feasibility of a model is its accuracy. a data mining classifier's accuracy is measured by how well its predictions match the actual true or false values; the equation for accuracy can be seen in (2). precision for a class is equivalent to the count of true positives (i.e., the count of instances rightly considered as positive) divided by the total count of instances considered as the positive class (i.e., the sum of true positives and false positives), as in (3). recall can be explained as the ratio of the number of true positives to the overall count of instances belonging to the positive class (i.e., the sum of true positives and false negatives, the latter being instances not considered to belong to the positive class even though they do). recall carries the same value as sensitivity in model performance and is denoted in (4).
accuracy = (TP + TN) / (TP + TN + FP + FN) (2)

precision = TP / (TP + FP) (3)

recall = TP / (TP + FN) (4)

similarly, precision and recall are defined for the negative class. its precision is determined by the proportion of instances categorized as negative that are truly negative, while the ratio of true negatives to the total number of instances of the negative class provides its recall. the f-measure is a metric for evaluating the performance of classifiers using confusion matrices; it captures the trade-off between precision and recall and is defined as their harmonic mean, as in (5). it is essential to determine whether a model's precision and recall are reasonably well balanced [52]. specificity is also known as the "true negative rate". it provides the percentage of actual negative instances that a given model has correctly predicted as negative, as denoted in (6); it measures the proportion of true negatives to all negatives.

f-measure = (2 x precision x recall) / (precision + recall) (5)

specificity = TN / (TN + FP) (6)

the area under the roc curve (auc) calculates the area under the roc curve from (0,0) to (1,1) in two dimensions. the auc gives an overall assessment of performance across all potential classification thresholds. auc may be seen as the likelihood that a random positive instance will be ranked higher than a random negative instance by a given model.

e. experimental settings and experimentation with selected algorithms

the models were developed and simulated in the design view of rapidminer's modeling environment using a fujitsu laptop computer with the windows 10 pro (version 21h2) 64-bit operating system, an x64-based processor (intel(r) core(tm) i7-4702mq cpu @ 2.20 ghz), and 8 gigabytes of random access memory (ram).
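equations (2) to (6) translate directly into a small helper; the confusion-matrix counts below are toy numbers for illustration, not the study's results:

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations (2)-(6): accuracy, precision, recall/sensitivity,
    f-measure and specificity from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                               # = sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)                          # true negative rate
    return accuracy, precision, recall, f_measure, specificity

# toy counts: 90 TP, 80 TN, 10 FP, 20 FN
acc, prec, rec, f1, spec = classification_metrics(90, 80, 10, 20)
```

note how the same four counts feed every metric, which is why reporting accuracy alone can hide an imbalance between precision and recall.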
the k-fold cross-validation and the split validation were employed in each experiment as metric evaluation techniques. the default relative ratios of 0.7 for training and 0.3 for testing were adopted in split validation. in 10-fold cross-validation, the data is arbitrarily subdivided into ten mutually exclusive, equal subgroups (one to ten); training and testing are repeated ten times, with each subgroup reserved in turn as the test set. the exploration method was used to identify the most suitable algorithm during the experimentation process. four different experiments were conducted for each of the five algorithms used in the study (random forest, rule induction, naïve bayes, regression, and deep learning), as follows:
experiment 1: experimenting with the algorithm in split (ratio split) validation test mode.
experiment 2: experimenting with the algorithm by employing bootstrap resampling with a split (ratio split) validation test mode.
experiment 3: experimenting with the algorithm in 10-fold cross-validation test mode.
experiment 4: experimenting by employing bootstrap resampling with 10-fold cross-validation test mode.
a pictorial representation of the study method is illustrated in figure 6.

iii. result and discussion

this section presents the results of the random forest model on the dataset to discover the student demographic variables influencing their performance.

a. determination and evaluation of the best classification model for predicting students' achievements

rq1: which machine learning classification algorithms are more viable in predicting students' academic attainment based on their demographic attributes? one of the primary goals of this study is to identify a suitable ml classifier capable of predicting students' academic success based on demographic characteristics. five algorithms were explored to implement the classification modeling: rf, ri, nb, lr, and dl. the results of the experiments are presented in table 4.
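the two validation schemes employed above (a 0.7/0.3 ratio split and ten mutually exclusive folds) can be sketched as index bookkeeping over the study's 1854 records; this is an illustrative sketch, not rapidminer's internal implementation:

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle the record indices and cut them into ten mutually exclusive,
    near-equal folds; fold i serves once as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(1854)       # the study's 1854 records

# split validation: a single 0.7 / 0.3 ratio split instead of folds
cut = int(0.7 * 1854)
train, test = list(range(1854))[:cut], list(range(1854))[cut:]
```

with folds, every record is tested exactly once and trained on nine times, which is why cross-validated scores are usually more stable than a single ratio split.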
fig. 6. a pictorial depiction of the study framework

table 4. summary of best-performing models from the five algorithms
algorithm | test mode | accuracy | precision | sensitivity | specificity | f-measure | auc
rf | pruning, using bootstrap resampling with 10-fold cross-validation | 93.96% | 93.19% | 94.97% | 92.94% | 94.04% | 0.980
ri | with ratio split validation | 83.00% | 83.48% | 82.29% | 83.71% | 82.88% | 0.879
nb | using split validation | 79.43% | 78.77% | 80.57% | 78.29% | 79.66% | 0.879
lr | using split validation | 81.57% | 80.44% | 83.43% | 79.71% | 81.91% | 0.892
dl | using bootstrap resampling with 10-fold cross-validation | 84.45% | 82.15% | 88.49% | 80.35% | 85.11% | 0.924

comparing the six performance metrics in table 4, rf (pruned) implementing bootstrap resampling with 10-fold cross-validation had the most outstanding performance among the five classifiers for predicting the students' characteristics influencing their academic performance. the rf had an accuracy of 93.96%, a precision of 93.19%, a sensitivity of 94.97%, a specificity of 92.94%, an f-measure of 94.04%, and an auc of 0.980. as a result, the rf with 10-fold cross-validation and bootstrap resampling was selected as the proposed model for the study.

b. analysis of attributes of importance in the random forest classifier model

rq2: what primary demographic attributes influence students' academic performance at the shs level in ghana? the weights of the respective attributes by information gain were determined using the model simulator operator to find the attributes that had a significant impact on the decisions made by the rf classifier. these weights were ordered in descending order, and the list's top two attributes were considered the most relevant in the model choice process.
according to the rf classifier model simulator, the mother's and father's education levels (with the highest weights of 0.358 and 0.168, respectively) are the two discovered demographic factors that most significantly support the classification model in this study. figure 7 depicts the attributes in support of the prediction, arranged according to their weights.

fig. 7. order of attributes according to weights of importance

the two demographic attributes contributing most, based on the weight of their contributions to the decisions made by the model, are the mother's and father's education levels. the bece attributes belong to the academic features and were hence excluded. this section explains the evaluation technique for the model developed to evaluate the demographic factors impacting student performance in pre-tertiary institutions. the study included twenty specific tests with the various classifiers. the following evaluation instruments were used: the confusion matrix, the number of trees in the forest, and a comparison of the roc of the random forest with the rocs of the rule induction, nb, lr, and dl classifiers, so as to construct a robust model. table 5 displays the confusion matrix of the chosen model, created using the rf algorithm and the bootstrap resampling approach with 10-fold cross-validation.

table 5. confusion matrix evaluation for the random forest model
actual \ predicted | classified positive | classified negative
a: positive (low intervention) | tp = 1108 | fn = 65
b: negative (intensive intervention) | fp = 91 | tn = 1070

much may be learned by meticulously scrutinizing the errors generated by any classification model. the errors show discrepancies between the model's predictions and the tangible outcome in the actual business situation. when an appropriate model is discovered, the next step is determining why classification inaccuracies happened in the testing data.
for instance, when predicting an attribute for a certain class label, the predicted and actual results may differ; because comparable features reside within the same class limit, the classifier predicts the data into a particular class. table 5 displays the confusion matrix of the final model for the study. it indicates that 1108 of the 2,334 instances were accurately labeled as low intervention, whereas 1070 instances were correctly labeled as intensive intervention. the classifier identified 91 instances as low intervention when they should have been classified as intensive intervention, and 65 cases were wrongly labeled as intensive intervention when they should have been classified as low intervention. the misclassification between the two groups might arise because when low-intervention status occurs there is also a potential for intensive-intervention status to occur, and vice versa. roc curves with averaged thresholds were generated for all five classifiers, and their areas under the curve (aucs) were evaluated using 10-fold cross-validation. the roc graph is constructed and shown in figure 8.

fig. 8. roc curves to compare the performance of random forest and the other classifiers in the study

from the roc graph, it can be deduced that random forest achieves superior classification metrics compared to the other four classifiers (i.e., ri, nb, lr, and dl). the thick red line represents the curve for the random forest, with an auc of 0.980.

c. determining students' intervention type

eventually, figure 8 illustrates the classifier's conclusive results after considering all features. the study's primary purpose was to discover the demographic determinants of learners' academic success. these determinants help educational administrators classify learners as needing intensive or low intervention.
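the class totals implied by the table 5 counts can be derived directly; this sketch only re-does the arithmetic on the published counts:

```python
tp, fn, fp, tn = 1108, 65, 91, 1070   # confusion-matrix counts from table 5

low_intervention = tp + fn            # actual "low intervention" students
intensive = fp + tn                   # actual "intensive intervention" students
total = low_intervention + intensive  # the upscaled sample size

low_share = round(100 * low_intervention / total, 2)
intensive_share = round(100 * intensive / total, 2)
```

the row sums reproduce the 2334-record upscaled sample and the near-even split between the two intervention groups.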
per the confusion-matrix class prediction of the random forest model in table 5, out of the 2334 upscaled samples under study, 1173 (50.26%) were labeled as needing low intervention, while 1161 (49.74%) of the second-year students whose data were used were classed as needing intensive academic intervention to enhance their performance. as a result, it is possible to conclude that the model effectively categorized the 2334 students according to the type of intervention they needed to boost their performance. the students' classification by intervention type is illustrated in figure 9.

fig. 9. number of students classified as in need of low or intensive intervention

overall, the rf classifier emerged from this study as the best classification technique for the task. the rf classifier correctly classified 2193 (93.96%) instances, while 141 (6.04%) instances were incorrectly classified. according to [53], in "estimates of highly accurate models", the rf model is highly viable for predicting performance determinants since its accuracy extends beyond the 75% lower-bound benchmark. again, the mother's and father's education levels (with information gains of 0.358 and 0.168, respectively) are the demographic factors recognized in this study as significantly influencing pre-tertiary students' academic achievement. this finding is confirmed by [54], whose study found that well-educated parents prioritize a text-rich home environment, enhancing their children's academic achievement.

iv. conclusion

the proposed demographic-based predictive model offers an innovative approach to predicting learner performance accurately and recommending appropriate intervention schemes. by leveraging demographic information, educational institutions can provide targeted support to students, ultimately enhancing their educational experience and improving academic outcomes.
this study has significantly reduced the gap in practical knowledge observed in the literature by introducing, in its prediction procedure, an intervention scheme for the respective students requiring intensive or minimal academic intervention.

declarations

author contribution. all authors contributed equally as main contributors of this paper. all authors read and approved the final paper.

funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest. the authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering and informatics, universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references

[1] r. hasan, s. palaniappan, s. mahmood, k. u. sarker, and a. abbas, "modelling and predicting student's academic performance using classification data mining techniques," int. j. bus. inf. syst., vol. 34, no. 3, pp. 403–422, 2020.
[2] m. n. yakubu and a. m. abubakar, "applying machine learning approach to predict students' performance in higher educational institutions," kybernetes, vol. 51, no. 2, pp. 916–934, 2022.
[3] m. arashpour et al., "predicting individual learning performance using machine-learning hybridized with the teaching-learning-based optimization," comput. appl. eng. educ., vol. 31, no. 1, pp. 83–99, 2023.
[4] d. m. ahmed, a. m. abdulazeez, d. q. zeebaree, and f. y. h. ahmed, "predicting university's students performance based on machine learning techniques," 2021 ieee int. conf. autom. control intell. syst. (i2cacis 2021) proc., pp. 276–281, 2021.
[5] f. de galiza barbosa et al., "genitourinary imaging," clinical pet/mri, pp. 289–312, 2022.
[6] f. inusah, y. m. missah, n. ussiph, and f. twum, "expert system in enhancing efficiency in basic educational management using data mining techniques," int. j. adv. comput. sci. appl., vol. 12, no. 11, pp. 427–434, 2021.
[7] f. inusah, y. m. missah, u. najim, and f. twum, "data mining and visualisation of basic educational resources for quality education," int. j. eng. trends technol., vol. 70, no. 12, pp. 296–307, dec. 2022.
[8] f. inusah, y. m. missah, u. najim, and f. twum, "integrating expert system in managing basic education: a survey in ghana," int. j. inf. manag. data insights, vol. 3, no. 1, p. 100166, 2023.
[9] f. inusah, y. m. missah, u. najim, and f. twum, "agile neural expert system for managing basic education," intell. syst. with appl., vol. 17, p. 200178, 2023.
[10] h. drachsler and w. greller, "privacy and analytics: it's a delicate issue. a checklist for trusted learning analytics," acm int. conf. proceeding ser., pp. 89–98, 2016.
[11] b. owusu-boadu, i. k. nti, o. nyarko-boateng, j. aning, and v. boafo, "academic performance modelling with machine learning based on cognitive and non-cognitive features," appl. comput. syst., vol. 26, no. 2, pp. 122–131, 2021.
[12] i. issah, o. appiah, p. appiahene, and f. inusah, "a systematic review of the literature on machine learning application of determining the attributes influencing academic performance," decis. anal. j., vol. 7, p. 100204, 2023.
[13] m. tadese, a. yeshaneh, and g. b. mulu, "determinants of good academic performance among university students in ethiopia: a cross-sectional study," bmc med. educ., vol. 22, no. 1, pp. 1–9, 2022.
[14] f. ouatik, m. erritali, f. ouatik, and m. jourhmane, "predicting student success using big data and machine learning algorithms," int. j. emerg. technol. learn., vol. 17, no. 12, pp. 236–251, 2022.
[15] s. hussain and m. q. khan, "student-performulator: predicting students' academic performance at secondary and intermediate level using machine learning," ann. data sci., 2021.
[16] v. k. pal and v. k. k. bhatt, "performance prediction for post graduate students using artificial neural network," int. j. innov. technol. explor. eng., vol. 8, no. 7, pp. 446–454, 2019.
[17] l. m. abu zohair, "prediction of student's performance by modelling small dataset size," int. j. educ. technol. high. educ., vol. 16, no. 1, 2019.
[18] m. pojon, "using machine learning to predict student performance," univ. tampere, pp. 1–28, 2017.
[19] b. sekeroglu, k. dimililer, and k. tuncal, "student performance prediction and classification using machine learning algorithms," in proceedings of the 2019 8th international conference on educational and information technology, mar. 2019, pp. 7–11.
[20] m. n. yakubu and a. m. abubakar, "applying machine learning approach to predict students' performance in higher educational institutions," kybernetes, 2021.
[21] f. aman, a. rauf, r. ali, f. iqbal, and a. m. khattak, "a predictive model for predicting students academic performance," in 2019 10th international conference on information, intelligence, systems and applications (iisa), jul. 2019, pp. 1–4.
[22] j. lópez-zambrano, j. a. l. torralbo, and c. romero, "early prediction of student learning performance through data mining: a systematic review," psicothema, vol. 33, no. 3, pp. 456–465, 2021.
[23] a. i. adekitan and e. noma-osaghae, "data mining approach to predicting the performance of first year student in a university using the admission requirements," educ. inf. technol., vol. 24, no. 2, pp. 1527–1543, 2019.
[24] y. altujjar, w. altamimi, i. al-turaiki, and m. al-razgan, "predicting critical courses affecting students performance: a case study," procedia comput. sci., vol. 82, pp. 65–71, 2016.
[25] a. ahadi, r. lister, h. haapala, and a. vihavainen, "exploring machine learning methods to automatically identify students in need of assistance," icer 2015 proc. 2015 acm conf. int. comput. educ. res., pp. 121–130, 2015.
[26] d. t. ha, c. n. giap, p. t. t. loan, and t. l. h. huong, "an empirical study for student academic performance prediction using machine learning techniques," int. j. comput. sci. inf. secur., vol. 18, no. 3, pp. 21–28, 2020.
[27] j. david and g. anastasija, "predicting academic performance based on students' family environment: evidence for colombia using classification trees," vol. 11, no. 3, pp. 299–311, 2019.
[28] m. i. al-twijri and a. y. noaman, "a new data mining model adopted for higher institutions," procedia comput. sci., vol. 65, pp. 836–844, 2015.
[29] a. fernández, s. garcía, f. herrera, and n. v. chawla, "smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," j. artif. intell. res., vol. 61, pp. 863–905, 2018.
[30] p. d. jenssen, t. krogstad, and k. halvorsen, "community wastewater infiltration at 69° northern latitude – 25 years of experience," soil sci. soc. am. onsite wastewater conf., albuquerque nm, 7–8 april 2014, pp. 7–8, 2014.
[31] c. anuradha and t. velmurugan, "fast boost decision tree algorithm: a novel classifier for the assessment of student performance in educational data," vol. 31, pp. 254–0223, 2016.
[32] p. bhatia, "introduction to data mining," data min. data warehous., pp. 17–27, 2019.
[33] s. samson, "use of data mining for determining higher education students' performance," st. mary's university, 2019.
[34] j. feng, "predicting students' academic performance with decision tree and neural network," university of central florida, 2019.
[35] k. david kolo, s. a. adepoju, and j. kolo alhassan, "a decision tree approach for predicting students academic performance," int. j. educ. manag. eng., vol. 5, no. 5, pp. 12–19, 2015.
[36] y. liu, s. fan, s. xu, a. sajjanhar, s. yeom, and y. wei, "predicting student performance using clickstream data and machine learning," educ. sci., vol. 13, no. 1, 2023.
[37] m. m. z. eddin, n. a. khodeir, and h. a. elnemr, "a comparative study of educational data mining techniques for skill-based predicting student performance," int. j. comput. sci. inf. secur., vol. 16, no. 3, pp. 56–62, 2018.
[38] p. sokkhey and t. okazaki, "hybrid machine learning algorithms for predicting academic performance," int. j. adv. comput. sci. appl., vol. 11, no. 1, pp. 32–41, 2020.
[39] m. n. yakubu, "applying machine learning approach to predict students' performance in higher educational institutions," 2021.
[40] y. denny, h. leslie, h. spits, and w. budiharto, "systematic literature review on abstractive text summarization," 2021.
[41] k. blackmore and t. r. j. bossomaier, "comparison of see5 and j48.part algorithms for missing persons profiling," 2016.
[42] f. ofori, e. maina, and r. gitonga, "using machine learning algorithms to predict students' performance and improve learning outcome: a literature based review," j. inf. technol., vol. 4, no. 1, pp. 2616–3573, 2020.
[43] s. agrawal, s. k., and a. k., "using data mining classifier for predicting student's performance in ug level," int. j. comput. appl., vol. 172, no. 8, pp. 39–44, 2017.
[44] d. gašević, v. kovanović, and s. joksimović, "piecing the learning analytics puzzle: a consolidated model of a field of research and practice," learn. res. pract., vol. 3, no. 1, pp. 63–78, 2017.
[45] p. g. sameer and s. r. barahate, "educational data mining – a new approach to the education systems," pp. 18–20, 2016.
[46] a. s. hashim, w. a. awadh, and a. k. hamoud, "student performance prediction model based on supervised machine learning algorithms," iop conf. ser. mater. sci. eng., vol. 928, no. 3, 2020.
[47] c. a. palacios, j. a. reyes-suárez, l. a. bearzotti, v. leiva, and c. marchant, "knowledge discovery for higher education student retention based on data mining: machine learning algorithms and case study in chile," entropy, vol. 23, no. 4, pp. 1–23, 2021.
[48] d. t. larose and c. d. larose, data mining and predictive analytics (wiley series on methods and applications in data mining). wiley, 2015.
[49] m. maalouf, "logistic regression in data analysis: an overview," int. j. data anal. tech. strateg., vol. 3, no. 3, p. 281, 2011.
[50] k. j. cios, w. pedrycz, r. w. swiniarski, and l. a. kurgan, data mining: a knowledge discovery approach. 2007.
[51] y. chen et al., "evaluation efficiency of hybrid deep learning algorithms with neural network decision tree and boosting methods for predicting groundwater potential," geocarto int., vol. 37, no. 19, pp. 5564–5584, 2022.
[52] d. m. w. powers, "evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation," pp. 37–63, 2020.
[53] i. h. witten, e. frank, m. a. hall, and c. j. pal, data mining: practical machine learning tools and techniques. 2016.
[54] m. idris, s. hussain, and n. ahmad, "relationship between parents' education and their children's academic achievement," j. arts soc. sci., vol. 7, no. 2, pp. 82–92, dec. 2020.
knowledge engineering and data science (keds) pissn 2597-4602
vol 6, no 1, april 2023, pp. 41–56 eissn 2597-4637
https://doi.org/10.17977/um018v6i12023p41-56
©2023 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

deep learning for multi-structured javanese gamelan note generator

arik kurniawati a,1, eko mulyanto yuniarno a,b,2,*, yoyon kusnendar suprapto a,b,3
a department of electrical engineering, institut teknologi sepuluh nopember, surabaya, indonesia
b department of computer engineering, institut teknologi sepuluh nopember, surabaya, indonesia
1 arikkurniawati.19071@mhs.its.ac.id; 2 ekomulyanto@ee.its.ac.id*; 3 yoyonsuprapto@ee.its.ac.id
* corresponding author

i. introduction

javanese gamelan, one of the musical arts of indonesia, is known for its diverse playing patterns. the technique for playing it is usually called karawitan.
a song in javanese gamelan has different patterns of presentation, as in the examples of the songs sampak nem slendro nem and srepeg nem slendro nem. what distinguishes the two songs is the type of song structure: the first is sampak and the second is srepeg. the song structure is like a genre in general music and is played by the ricikan struktural instruments. this means that if the song structure pattern played is not appropriate, the song has lost its composition, because in javanese gamelan a song rests not only on the strength of the main melody but also on the other instruments that accompany it, since these instruments compose the song as a whole. in javanese gamelan, the song title reflects how the song composition is played [1][2][3][4][5][6][7][8]. one of the variations of karawitan patterns in javanese gamelan is the surakarta style, which has several forms of song structure [1][2]. a song is composed of various elements that contribute to the overall composition, including dynamics, rhythm, laya, laras, and pathet. tempo plays a crucial role in controlling the rhythm of the gendhing, while laya describes the speed at which it is performed. pathet expresses the specific emotion or feeling the song is trying to convey, and laras refers to the scales used in the song. dynamics, on the other hand, emphasizes the variety, balance, and dynamic nature of a song's musical components [2][3].

article info

article history: received 20 june 2023; revised 07 july 2023; accepted 15 july 2023; published online 18 july 2023

abstract: javanese gamelan, a traditional indonesian musical style, has several song structures called gendhing. gendhing (songs) are written in conventional notation and require gamelan musicians to recognize patterns in the structure of each song. previous research on gendhing usually focuses on artistic and ethnomusicological perspectives; this study instead explores the correlation between gendhing as traditional indonesian music and deep learning technology that takes over the task of the gamelan composer. this research proposes a cnn-lstm to generate the notation of ricikan struktural instruments as an accompaniment to javanese gamelan music compositions, based on balungan notation, rhythm, song structure, and gatra information. the proposed method (cnn-lstm) is compared with lstm and cnn. the musical data in this study are represented using numerical notation for the main melody in balungan notation. the experimental results showed that the cnn-lstm model performed better than the lstm and cnn models, with accuracy values of 91.9%, 91.5%, and 91.2% for cnn-lstm, lstm, and cnn, respectively. the note distance for the sampak song structure is 4 for the cnn-lstm model, 8 for the lstm model, and 12 for the cnn model; the smaller the note distance, the closer the result is to the original notation provided by the gamelan composer. this study is relevant for novice gamelan musicians interested in learning karawitan, especially in understanding ricikan struktural music notation and the gamelan art of composing the musical composition of a song. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

keywords: javanese gamelan; notation; cnn-lstm; multi-instrument; ricikan struktural

song structure, a particular karawitan art form, uses music as a symbolic medium to represent various aspects [4][6].
the goals of a song are to be complex, to entertain the audience, and to convey a range of social, moral, cultural, and spiritual values [5]. incorrect performance of musical techniques in a song composition can lead to the loss of its aesthetic value and unique characteristics. in order to perform javanese gamelan well, it is necessary to understand both the rules of gamelan and the emotional atmosphere conveyed by the piece of music being performed. however, playing javanese gamelan presents several challenges, especially in determining the playing pattern [3]. as a result, assistance is required to facilitate the learning process of this cultural practice for future generations [3][4]. the aim of this study is to use technology to simplify the process of playing javanese gamelan.

the size of a gendhing (song) can be determined by calculating the number of gatra in each gongan and the total number of gongan in the song [1][2]. gendhing is further divided into three subtypes: ageng (big), sedheng (middle), and alit (small). gendhing alit, consisting of sampak, srepeg, ayak-ayakan, lancaran, bubaran, ketawang, and ladrang, is the focus of this study [2]. this categorization is based on the design of the ricikan struktural instrument groupings, which include the kenong, kethuk, kempyang, kempul, and gong. the arrangement of the ricikan struktural instruments is an important factor in notation that determines the composition of the musical piece [2][6]. the instruments known as kenong, kethuk, and kempul serve as breaks in the song, while the gong indicates the end of the song.
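the size calculation just described is simple arithmetic; a minimal sketch, using the lancaran figures stated elsewhere in this section (4 gatra of 4 notes per gongan, typically four gongan per composition):

```python
# illustrative only: gendhing size from gatra-per-gongan and gongan count,
# using the lancaran form's figures (4 gatra x 4 balungan notes, 4 gongan)
NOTES_PER_GATRA = 4
gatra_per_gongan = 4
gongan_count = 4

notes_per_gongan = gatra_per_gongan * NOTES_PER_GATRA  # 16 balungan notes
song_size = notes_per_gongan * gongan_count            # 64 notes in the whole song

print(notes_per_gongan, song_size)  # 16 64
```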
there are two additional groups of javanese gamelan instruments besides the ricikan struktural instruments: a) ricikan balungan, a group of instruments that play the basic melody of a song, such as slenthem, demung, saron, and peking; and b) ricikan garap, a group of instruments that, like the ricikan struktural, serve as musical accompaniment and handle variations in song decoration, such as rebab, gender barung, gender penerus, bonang barung, bonang penerus, gambang, siter, and suling [2].

the configuration of musical pieces in javanese gamelan does not depend only on the composer's artistic expression but also follows standard notational conventions. consequently, in order to perform a piece in javanese gamelan, it is necessary to memorize the song structure patterns of each composition, as complete notation for all gamelan instruments is not always provided. javanese gamelan notation generally consists of only the primary melody, thereby demanding a high level of expertise from gamelan musicians to execute all the instruments. this presents a difficulty for inexperienced musicians, who require comprehensive notation for every instrument to perform gamelan music. figure 1 illustrates the structure of a javanese gamelan composition. gamelan sheet music, as depicted in figure 1, only displays the balungan notation and omits the notation of the other two groups of instruments, ricikan struktural and ricikan garap. this notation is typically used by gamelan players to perform karawitan, along with other information about the piece, such as the song structure type, the rhythm type, and information about the laras and pathet. laras and pathet refer to the musical scales and modes of the song.

fig. 1. part of song in javanese gamelan: (a) song structure, (b) title of song, (c) laras and pathet, (d) melody

figure 2 illustrates the ricikan struktural instruments used in the composition of a song [1][2][8]. these instruments include gong ageng, gong suwuk, kenong, kempul, kethuk, and kempyang. the position of these instruments within a song distinguishes the different types of song structure. the gong ageng denotes the longest cycle of a song, while the gong suwuk is used in all song structures except the ketawang and ladrang forms, where it is replaced by the kempul. the kenong divides the flow of the gendhing into musical phrases of equal length. the kempul, a smaller gong, often interlocks with the kenong in forms such as lancaran, ketawang, and ladrang. the balungan represents the melody notes of each song, which are divided into several lines, each line containing several gatra, each of which is made up of several notes.

fig. 2. ricikan struktural in javanese gamelan

figure 3 is an example of the detailed structure of the gendhing lancaran form. lancaran is a form of gendhing that has 4 gatra, or 16 balungan notations, in each gongan; there are usually four gongan in a lancaran composition. the pattern rules for lancaran are as follows: (1) kenong occurs on the last note of each gatra (also known as dhong gedhe), and its note always matches that of the dhong gedhe; (2) kempul occurs on the second note of each gatra (also known as dhong cilik), and there are only three kempul notes, since the first gatra has no kempul note; (3) kethuk (+) is played on the odd notes of each gatra; (4) gong suwuk is played at the end of the fourth gatra.

fig. 3. song structure of lancaran

rhythm (irama) refers to the tempo and rhythm in gamelan music. there are five types of rhythm: irama lancar, irama tanggung, irama dadi, irama wilet, and irama rangkep.
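the four lancaran pattern rules can be expressed as a small rule-based generator. the sketch below is a hypothetical illustration (the letters N, P, and G for kenong, kempul, and gong suwuk are our own shorthand, not standard kepatihan notation, and the example notes are invented):

```python
# hypothetical sketch of the lancaran rules: given one gongan of balungan
# notation (4 gatra x 4 notes), mark where each ricikan struktural
# instrument is struck. markers: "N" = kenong, "P" = kempul,
# "+" = kethuk, "G" = gong suwuk.
def lancaran_markers(gongan):
    assert len(gongan) == 4 and all(len(g) == 4 for g in gongan)
    rows = []
    for i, gatra in enumerate(gongan):
        marks = []
        for j, _note in enumerate(gatra):
            m = ""
            if j in (0, 2):          # rule 3: kethuk on the odd notes (1st and 3rd)
                m += "+"
            if j == 1 and i > 0:     # rule 2: kempul on the 2nd note; none in gatra 1
                m += "P"
            if j == 3:               # rule 1: kenong on the last note (dhong gedhe)
                m += "N"
                if i == 3:           # rule 4: gong suwuk closes the 4th gatra
                    m += "G"
            marks.append(m)
        rows.append(marks)
    return rows

# example gongan: 16 invented balungan notes (4 gatra of 4 notes)
gongan = [[3, 2, 3, 1], [3, 2, 3, 1], [3, 5, 6, 5], [1, 6, 5, 3]]
for gatra, marks in zip(gongan, lancaran_markers(gongan)):
    print(gatra, marks)
```

as required by the rules, the generated gongan contains four kenong strokes, exactly three kempul strokes (none in the first gatra), kethuk on every odd note, and a single gong suwuk on the final note.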
a song is typically presented in different rhythms [5]; for example, the lancaran manyar sewu song can be presented in both the irama lancar and irama tanggung forms. in this case, the rhythm has a significant impact on the way the song is performed.

currently, discussions of the types of gendhing patterns focus mainly on artistic and ethnomusicological perspectives. for example, studies have examined the kempul pattern in gendhing alit in klenengan music [6], the kenong instrument pattern in karawitan style aesthetics [7], and the role of the ricikan struktural as one of the indicators in gendhing formation [8]. however, the relationship between gamelan music and technology, especially deep learning (dl), has received little attention. the purpose of this study is to use dl to assist novice gamelan musicians in understanding the ricikan struktural components; the study belongs to the field of music generation. the integration of dl technology with the art of music has contributed to the development of music generators capable of creating new and unique musical compositions [9]. in recent years, the field of music composition has seen significant progress due to the development of advanced deep learning techniques such as the convolutional neural network (cnn) and long short-term memory (lstm).

the cnn is a type of deep learning model that has been used in the field of music composition, for example to create new music from audio-based representations such as midi [10] or from symbolically represented music [11] in alternative formats. the cnn has been widely implemented in the field of image classification [12]; such networks are purposefully constructed to detect and extract identifiable patterns and features from visual data [13].
similar methods are used to train these networks to recognize patterns and features in musical sequences. in previous research on music generation, cnns were reliable at obtaining the semantic features of music [14] and at multiple feature extraction [15]. the cnn is often integrated with other deep learning techniques, such as the lstm, to generate complex and sophisticated musical compositions [13]. the lstm network is a variant of the recurrent neural network (rnn) that can effectively capture long-term temporal dependencies in time-series data, including musical sequences. in previous studies, the lstm has been widely used for music generation because it is suitable for learning patterns from sequential music data [16][17].

the combination of cnn and lstm networks captures both short-term and long-term musical patterns, resulting in more authentic and rationally structured music [13][18]. the cnn-lstm has several advantages, including the ability to perform temporal analysis while extracting abstract features [19], and it outperforms standard machine learning algorithms in terms of stability, accuracy, and prediction [20][21][22]. in music generation, the convolutional lstm outperforms the lstm, with more pronounced waveforms and clearer melodies [18]. it combines the advantages of the cnn, which can extract effective features from music data sequences, and the lstm, which can not only discover the interdependence of time-series data but also automatically detect the ideal mode for the relevant data in order to build new sequences [23]. many music-related studies use the combined cnn and lstm methods, such as music classification or music genre recognition [28][29][30][31][32], music recommendation [33], chord recognition [34][35], and music emotion recognition [36][37][38]. cnns are used to extract audio or sheet music features, while lstms are used to learn temporal dependencies in music data for recognition, prediction, recommendation, and classification.
however, previous research on music generation using the cnn-lstm combination is limited to the generation of new melodies in turkish pop music with a certain style [13] and in modern music [18] from midi files. in this study, the same approach is used to generate music notation for several instruments based on variations in the structure of javanese gamelan songs, using notation-based music datasets. the difference from the previous research is that this study uses a dataset with more readable notation, represented as numerical notes in text format, and focuses on generating musical accompaniment for multiple instruments.

in the context of gamelan music, the cnn and lstm are used here to create musical compositions that follow the rules and conventions of traditional gamelan music. the cnn network is used to extract important features from the input parameters fed into the network, such as balungan notation, rhythm, and gatra information. the lstm network is then used to generate the notation of several ricikan struktural instruments as a musical accompaniment to the melodic notation of the balungan instrument, exploiting the ability of the lstm to model temporal dependencies. accordingly, the issues covered in this study are:
• writing complete notation, especially for the ricikan struktural instruments, is very helpful for novice gamelan players.
• the notation patterns of the ricikan struktural instruments have different variations, so it is more convenient for novice gamelan players to play a gamelan song based on the structure of the song, where the notation pattern of the ricikan struktural instruments serves as the structure of the song.
this study aims to automatically generate notation for several instrument groups, including kenong, kethuk, kempyang, kempul, and gong, using the cnn-lstm.
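the convolutional feature-extraction step that feeds the lstm can be illustrated in miniature. the sketch below shows only a single 1-d convolution over a numeric note sequence; the note values and kernel are invented for illustration and are not taken from the paper's trained model, and the lstm half of the pipeline is omitted:

```python
# minimal sketch of 1-d convolutional feature extraction over a balungan
# note sequence; notes and kernel are invented for illustration only.
def conv1d(seq, kernel):
    """valid-mode 1-d convolution (cross-correlation, as in cnn layers)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

balungan = [3, 5, 3, 2, 6, 5, 3, 2]  # two gatra of numerical notation
kernel = [1, -2, 1]                  # responds to local melodic turning points

features = conv1d(balungan, kernel)
print(features)  # [-4, 1, 5, -5, -1, 1]
```

in the actual model, many such learned kernels would run over the balungan, rhythm, and gatra inputs, and their feature maps would feed the lstm that emits the ricikan struktural notation.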
The features used in this study include the main melodic notation of the balungan instrument, rhythm, and gatra information. The main contributions of this study are presented below:
• A dataset of Javanese gamelan music was created based on symbol notation.
• Numerical notes are used as a simplified method of representing musical data as input.
• This study effectively generates musical accompaniment for various musical instruments, including kenong, kethuk, kempyang, kempul, and gong, by incorporating song characteristics such as song structure, gatra, and rhythm.
• The study helps the general public understand the various patterns of song structures and their notation for the ricikan struktural instrument groups.

A. Kurniawati et al. / Knowledge Engineering and Data Science 2023, 6 (1): 41–56

The remaining sections of this paper are organized as follows: Section I presents the introduction and related work. Section II describes the methodology, including the details of the dataset and the proposed model. Section III presents the experiments and results. Finally, Section IV provides the conclusion of the paper.

II. Method

The objective of this study is to use CNN-LSTM to create an automatic notation generator for the ricikan struktural instruments. The technique in this study uses CNN for feature extraction and LSTM as the notation generator. The detailed steps for implementing the proposed method are discussed in this section.

A. Dataset

The present study employed symbol-based data, specifically numerical notes, sourced from a collection of songs available at http://www.gamelanbvg.com, for the music dataset. The data extracted from the musical compositions includes each song's notation as well as its distinctive features, such as gatra details, rhythmic patterns, and song structure composition.
Furthermore, annotations of certain ricikan struktural instruments, designed by a gamelan specialist from Soewidiatmaka gamelan, were incorporated into the dataset. A total of 35 songs were used in this study, divided into seven song structures with five songs in each structure. The various ricikan struktural instruments and the notation for the balungan were arranged according to the gatra of each song. The balungan is usually represented by four notations in one gatra. As a result, the dataset used in the current study contains approximately 600 gatra distributed across the 35 songs, as shown in Figure 4. In this dataset, 28 songs were used for training (80% of the data) and validation (20% of the data), and 7 songs were used for testing. The songs used in this study are listed in Table 1, which gives the song titles used as the dataset together with the type of song structure, the laras (scale) of the song, the pathet (mode) of the song, and the rhythms contained in the song.

Fig. 4. Dataset representation

Table 1.
List of songs for the dataset in this study

No | Song | Rhythm | Data
1 | Sampak Tlutur, slendro manyura | tanggung | test
2 | Sampak Manyura, slendro manyura | tanggung | training, validation
3 | Sampak Nem, slendro nem | tanggung | training, validation
4 | Sampak Sanga, slendro sanga | tanggung | training, validation
5 | Sampak Tlutur, slendro sanga | tanggung | training, validation
6 | Srepeg Manyura, slendro manyura | tanggung | test
7 | Srepeg Nem, slendro nem | tanggung | training, validation
8 | Srepeg Sanga, slendro sanga | tanggung | training, validation
9 | Srepeg Tlutur, slendro manyura | tanggung | training, validation
10 | Srepeg Tlutur, slendro sanga | tanggung | training, validation
11 | Ayak-ayakan Nem, slendro nem | lancar, tanggung, dadi | test
12 | Ayak-ayakan Manyura, slendro manyura | lancar, tanggung, dadi | training, validation
13 | Ayak-ayakan Pamungkas, slendro manyura | lancar, tanggung, dadi | training, validation
14 | Ayak-ayakan Sanga, slendro sanga | lancar, tanggung, dadi | training, validation
15 | Ayak-ayakan Umbul Donga, slendro manyura | lancar, tanggung, dadi | training, validation
16 | Lancaran Manyar Sewu, slendro manyura | lancar | test
17 | Lancaran Kuda Nyongklang, pelog barang | lancar, tanggung | training, validation
18 | Lancaran Maesa Kurda, slendro sanga | lancar, tanggung | training, validation
19 | Lancaran Rena Rena, slendro manyura | lancar | training, validation
20 | Lancaran Sarung Jagung, pelog barang | tanggung | training, validation
21 | Bubaran Arum Arum, pelog barang | tanggung | test
22 | Bubaran Kembang Pacar, pelog nem | tanggung | training, validation
23 | Bubaran Purwaka, pelog nem | tanggung | training, validation
24 | Bubaran Sembunggilang, slendro sanga | tanggung | training, validation
25 | Bubaran Udan Mas, pelog barang | tanggung | training, validation
26 | Ketawang Ibu Pretiwi, pelog nem | tanggung, dadi | test
27 | Ketawang Kinanthi Pawukir, slendro manyura | tanggung, dadi | training, validation
28 | Ketawang Kinanthi Sandhung, slendro manyura | tanggung, dadi | training, validation
29 | Ketawang Langen Gita, pelog barang | tanggung, dadi | training, validation
30 | Ketawang Subakastawa, slendro sanga | tanggung, dadi | training, validation
31 | Ladrang Kalongking, pelog nem | tanggung | test
32 | Ladrang Mugi Rahayu, slendro manyura | tanggung, dadi | training, validation
33 | Ladrang Pariwisata, slendro sanga | tanggung, dadi, wiled | training, validation
34 | Ladrang Santi Mulya, pelog lima | tanggung, dadi | training, validation
35 | Ladrang Sumyar, pelog barang | tanggung, dadi, wiled | training, validation

Laras (scale of the song): slendro/pelog; pathet (mode of the song): manyura, nem, sanga, barang, lima.

B. Preprocessing Data

The input data of this study consist of balungan notation, rhythm type, song structure type, and gatra information, while the output data consist of ricikan struktural music notation, i.e., kenong, kethuk, kempyang, kempul, gong ageng, and gong suwuk. Both input and output data are preprocessed using one-hot encoding [39], which converts them into binary form with careful consideration of the respective data. Figure 5 shows the result of one-hot encoding.

Fig. 5. One-hot encoding for note, rhythm, song structure, and gatra

Before the input is fed into the CNN-LSTM network, one-hot encoding is applied to each input, which consists of the balungan notation arranged in each gatra, the rhythm, the song structure, and the gatra information of each note. After the encoded input vectors are combined into an input sequence, the sequence is ready to be fed into the CNN-LSTM network.

C. CNN-LSTM

The following section provides a detailed description of the CNN-LSTM architecture model. The diagram in Figure 6 shows the steps of this study. The proposed CNN-LSTM model consists of three main components: a convolutional neural network (CNN), a long short-term memory (LSTM) network, and a fully connected layer.
• The CNN is used to obtain a feature representation of the input music sequence, which consists of the balungan notation divided into gatra, the rhythm, the song structure, and the gatra information of each note. The CNN consists of a 1D convolutional layer with 32 filters and a kernel size of 2, with "same" padding, followed by a ReLU activation layer and a 1D max-pooling layer.
• The LSTM component is responsible for modeling the temporal dependencies between the extracted features and generating the musical accompaniment sequences. It consists of a single-layer LSTM with 128 hidden units and a dropout layer with a rate of 0.2 to avoid overfitting.
• The fully connected layer and the output layer use a sigmoid activation function for each ricikan struktural instrument to predict the musical accompaniment.

Fig. 6. Proposed CNN-LSTM method for the multi-instrument note generator

In addition, the model was trained for up to 100 epochs with a batch size of 5, using the Adam optimizer and binary cross-entropy as the loss function. After training, the CNN-LSTM network can generate a sequence of musical notes suitable for accompanying the ricikan struktural instruments by decoding the encoded vector sequence back into ricikan struktural instrument notation. The model uses this data to automatically predict kenong, kethuk, kempyang, kempul, gong suwuk, and gong ageng notes from test data containing balungan notes, rhythm, and gatra information. To provide a comparative analysis, we compared the performance of the CNN-LSTM model with that of the CNN and LSTM models. The architectural details of each model are shown in Figure 7.

Fig. 7. Architecture of (a) LSTM and (b) CNN for the multi-instrument note generator

D.
Evaluation

As the first evaluation for this study, we investigated the effectiveness of our proposed CNN-LSTM model in predicting musical accompaniment notes for various ricikan struktural instruments, comparing its performance with that of the CNN and LSTM models. To evaluate the performance of the CNN-LSTM model, we compared its predictions with the ground-truth labels, or desired outputs (the original notation from the gamelan composer). By applying the model to a specific dataset and comparing its predictions with the actual results, we determined the exact values of accuracy, precision, and recall [40].
• Accuracy measures the overall prediction accuracy of a model by counting the number of correctly predicted examples. Higher accuracy indicates better performance.
• Precision is the ratio of true positives to the sum of true positives and false positives. Higher precision means fewer false positives; a false positive occurs when the model predicts a positive outcome but the actual outcome is negative.
• Recall evaluates a model's ability to reliably detect all positive cases. A lower false-negative rate yields a higher recall score; a false negative occurs when the model predicts a negative outcome but the actual outcome is positive.
The second evaluation applies the second scenario, with different song structures, by selecting for each song structure a single song that is not included in the training data. The notation produced by the generator is then compared to the original version using music analysis methods such as note distance. This evaluation phase is expected to give a detailed assessment of the proposed model's ability to predict musical accompaniment.

III.
Results and Discussion

This section focuses on evaluating the performance of the proposed CNN-LSTM model and assessing the generated results, with the ultimate goal of providing accompaniment music notation for different types of ricikan struktural instruments. The evaluation was divided into two scenarios: intensive experiments with the same song structure and experiments with different song structures. In the first scenario, several intensive experiments were conducted to evaluate the overall performance of the model on datasets of the same type; the goal is to see how well the model performs when the song structure remains consistent throughout the test period. In the second scenario, the experiment evaluated the model's performance on datasets with different types of song structure; the goal is to evaluate the adaptability and generalizability of the model across different forms of song structure. This was intended to assess the model's ability to accurately generate musical accompaniment notes across a range of ricikan struktural instruments.

A. Quantitative Analysis

The results of the quantitative analysis of the performance of each model in the two scenarios are summarized in Table 2. The results show that the CNN-LSTM framework exhibits superior performance compared to the LSTM and CNN models in all evaluated scenarios, regardless of whether the song structures used are the same or different, as seen from the accuracy, precision, and recall values.

Table 2.
Performance values of accuracy, precision, and recall for CNN-LSTM, LSTM, and CNN

Method | Scenario | Song structure | Accuracy (%) | Precision (%) | Recall (%)
CNN-LSTM (proposed) | 1 | all (various) | 91.9 | 92.3 | 91.8
CNN-LSTM (proposed) | 2 | sampak | 96.6 | 96.6 | 96.6
CNN-LSTM (proposed) | 2 | srepeg | 96.6 | 96.6 | 96.6
CNN-LSTM (proposed) | 2 | ayak-ayakan | 99.1 | 99.1 | 99.0
CNN-LSTM (proposed) | 2 | lancaran | 97.4 | 97.8 | 97.0
CNN-LSTM (proposed) | 2 | bubaran | 98.9 | 99.1 | 98.4
CNN-LSTM (proposed) | 2 | ketawang | 99.0 | 100 | 98.5
CNN-LSTM (proposed) | 2 | ladrang | 97.6 | 98.3 | 96.1
CNN | 1 | all (various) | 91.2 | 91.9 | 91.0
CNN | 2 | sampak | 96.3 | 96.4 | 96.3
CNN | 2 | srepeg | 96.3 | 96.5 | 96.3
CNN | 2 | ayak-ayakan | 99.1 | 99.1 | 98.9
CNN | 2 | lancaran | 96.8 | 97.3 | 96.6
CNN | 2 | bubaran | 98.7 | 98.8 | 98.2
CNN | 2 | ketawang | 98.8 | 99.6 | 98.1
CNN | 2 | ladrang | 97.4 | 97.2 | 95.3
LSTM | 1 | all (various) | 91.5 | 92.0 | 91.3
LSTM | 2 | sampak | 96.6 | 96.6 | 96.6
LSTM | 2 | srepeg | 96.6 | 96.6 | 96.6
LSTM | 2 | ayak-ayakan | 99.1 | 99.1 | 98.9
LSTM | 2 | lancaran | 97.0 | 97.4 | 96.8
LSTM | 2 | bubaran | 98.7 | 99.1 | 98.4
LSTM | 2 | ketawang | 99.0 | 99.6 | 98.5
LSTM | 2 | ladrang | 96.4 | 98.2 | 95.8

The CNN-LSTM model has higher accuracy, precision, and recall values than the CNN and LSTM models. A high accuracy score indicates better model performance, a high precision value indicates fewer false positives, and a high recall value indicates fewer false negatives. The model's strong performance in the first scenario in Table 2 (accuracy = 91.9; precision = 92.3; recall = 91.8) affects the generated ricikan struktural instrument notation: the output of the CNN-LSTM generator is more similar to the original than the outputs of the CNN and LSTM generators. This is discussed in more detail in the music generation results section. The difference in accuracy between the three models is comparatively small, ranging from 0.2 to 1.2 percentage points: in the first scenario, the CNN-LSTM model achieved 91.9, while the CNN and LSTM models achieved 91.2 and 91.5, respectively. Furthermore, the second scenario tends to produce better performance results than the first, because the data used in each of its experiments are more homogeneous.
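As a concrete illustration of how the three metrics above are computed from predictions and ground-truth labels, the following minimal Python sketch (not the authors' code) evaluates binary label sequences such as the components of the one-hot note vectors used here:

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # fewer false positives -> higher
    recall = tp / (tp + fn) if tp + fn else 0.0     # fewer false negatives -> higher
    return accuracy, precision, recall

# Made-up example: ground truth vs. predicted binary components
acc, prec, rec = evaluate([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```

In practice the same three quantities would be accumulated over every predicted instrument notation in the test set.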
The CNN-LSTM model offers a remarkable advantage by integrating the strengths of both the CNN architecture, which excels at feature extraction, and the LSTM architecture, which excels at modeling temporal dependencies. This integration enables the model to handle both micro- and macro-level musical patterns proficiently, leading to more precise and expressive musical accompaniment.

B. Music Generation Results

This section evaluates the notation generators using music analysis tools, on the test data from each song structure in the second scenario, which has different song structures. The goal of this evaluation is to assess how closely the output of the generator resembles the composition provided by the gamelan composer. The evaluation criterion used in this phase is note distance, a metric that quantifies the similarity between the generator's output notation (Note2) and the original notation (Note1) of a gamelan composer's creation. This distance, also referred to as the exact distance, is a binary measure defined in (1):

N0(Note1, Note2) = 0 if Note1 = Note2; 1 if Note1 ≠ Note2. (1)

The proposed CNN-LSTM approach was evaluated in a comparative analysis against CNN and LSTM by calculating the note distance for each instrument in each song structure. Furthermore, an in-depth analysis was conducted to investigate the relationship between input parameters, such as balungan notation, song structure, rhythm, and gatra information, and the output notation generated for the various ricikan struktural instruments: kenong, kethuk, kempyang, kempul, gong suwuk, and gong ageng. Table 3 shows the note distance values for each ricikan struktural instrument across the various song structures.
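The note distance of (1), summed note by note over an instrument's sequence, can be sketched in a few lines of Python (an illustrative implementation, not the authors' code; the example note values are hypothetical):

```python
def note_distance(note1, note2):
    # Exact (binary) distance from Eq. (1): 0 if the notes match, 1 otherwise.
    return 0 if note1 == note2 else 1

def total_note_distance(original, generated):
    # Total distance between the composer's notation and the generated one,
    # accumulated note by note over an instrument's sequence.
    return sum(note_distance(o, g) for o, g in zip(original, generated))

# Hypothetical kempul sequences: two of the four generated notes differ.
print(total_note_distance(["2", "3", "5", "6"], ["2", "5", "5", "3"]))  # -> 2
```

A total of 0 therefore means the generated notation is identical to the composer's, which is how the per-instrument entries of Table 3 can be read.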
The results indicate that the CNN-LSTM approach produced notation with the lowest note distance values compared to LSTM and CNN. A lower note distance value indicates a greater degree of similarity to the notation of the gamelan composer's musical composition. The results of this study indicate that the CNN-LSTM model outperforms both the LSTM and CNN models in overall performance, as it effectively exploits the strengths of both CNN and LSTM.

Table 3. Note distance values for the three models CNN-LSTM, CNN, and LSTM

Song structure | Method | Kenong | Kethuk | Kempyang | Kempul | Gong suwuk | Gong ageng | Total
Sampak | CNN-LSTM | 0 | 0 | – | 4 | 0 | 0 | 4
Sampak | LSTM | 0 | 0 | – | 8 | 0 | 0 | 8
Sampak | CNN | 0 | 0 | – | 10 | 2 | 0 | 12
Srepeg | CNN-LSTM | 0 | 0 | – | 0 | 0 | 0 | 0
Srepeg | LSTM | 3 | 0 | – | 3 | 0 | 0 | 6
Srepeg | CNN | 3 | 0 | – | 3 | 0 | 0 | 6
Ayak-ayakan | CNN-LSTM | 1 | 0 | – | 1 | 2 | 0 | 4
Ayak-ayakan | LSTM | 1 | 0 | – | 2 | 2 | 0 | 5
Ayak-ayakan | CNN | 2 | 0 | – | 1 | 2 | 0 | 5
Lancaran | CNN-LSTM | 0 | 0 | – | 0 | 1 | 0 | 1
Lancaran | LSTM | 0 | 0 | – | 2 | 1 | 0 | 3
Lancaran | CNN | 1 | 0 | – | 2 | 1 | 0 | 4
Bubaran | CNN-LSTM | 0 | 0 | – | 0 | 0 | 0 | 0
Bubaran | LSTM | 0 | 0 | – | 0 | 0 | 0 | 0
Bubaran | CNN | 0 | 0 | – | 0 | 0 | 0 | 0
Ketawang | CNN-LSTM | 0 | 0 | 0 | 0 | 0 | 0 | 0
Ketawang | LSTM | 0 | 0 | 0 | 1 | 0 | 0 | 1
Ketawang | CNN | 0 | 0 | 0 | 2 | 0 | 0 | 2
Ladrang | CNN-LSTM | 0 | 0 | 0 | 4 | 0 | 0 | 4
Ladrang | LSTM | 1 | 0 | 0 | 4 | 0 | 0 | 5
Ladrang | CNN | 2 | 0 | 0 | 3 | 0 | 0 | 5

The kempyang instrument is present only in the ketawang and ladrang song structures and has no notation in the other song structures. Table 3 shows that the kethuk, kempyang, and gong ageng instruments have a note distance value of 0; the notation generated by all three models across the different song structures is thus very similar to the gamelan composer's original notation. The fixed notation patterns of these instruments within each song structure contribute to this similarity. Specifically, the kethuk instrument has a consistent notation pattern of (+), which represents a hit, while the kempyang instrument has a consistent notation pattern of (-), which also represents a hit. These instruments have no variations in tone.
In addition, the gong ageng instrument serves as an indicator of the end of the song, so its notation pattern remains constant without any variation. Figure 8 shows visual representations of the notation patterns for kethuk and kempyang in each song structure.

Fig. 8. Pattern of kethuk and kempyang notation for each song structure

In Table 3, both the kenong and kempul instruments show variations in note distance. The kenong instrument tends to have note distances close to 0 for the CNN-LSTM model, indicating a close resemblance between the generated notation and the original; the notation pattern of the kenong instrument appears to be more consistent across the different song structures than that of the kempul instrument. The note distance values for the kempul instrument, on the other hand, vary more widely. A value of 0 means that the generated notation is very close to the original. It should be noted, however, that sampak tends to have higher note distance values than the other song structures. This is due to the notation pattern in sampak, where the notation for the kempul instrument does not always match the balungan notation; such variations in the notation pattern are intentional and are often introduced by gamelan composers to add diversity and variation to the music. Figure 9 and Figure 10 show the output of the notation generators using the three models, CNN-LSTM, LSTM, and CNN, for multiple ricikan struktural instruments within the sampak and bubaran song structures. From these figures, we can examine the relationship between the input components, including balungan notation, song structure, rhythm, and gatra information, and the output notation of multiple ricikan struktural instruments. The following observations are possible:
• The notation for instruments such as the kenong, kempul, gong suwuk, and gong ageng is derived from the balungan notation within each gatra.
However, the order in which the notes are taken differs for each instrument. For example, in srepeg the notes for the kenong are taken from the 4th tone of each gatra, whereas in ketawang the last note of each even gatra is chosen.
• Song structure and rhythm determine the notation pattern for all instruments, including kenong, kethuk, kempyang, kempul, gong suwuk, and gong ageng, within each song form.
• Gatra information is used to determine the position of the notation for instruments such as gong suwuk, gong ageng, kenong, and kempul.

Fig. 9. Notation of Bubaran Arum-arum pelog barang

Figure 9 shows the notation of the Bubaran Arum-arum pelog barang test data. Because the generator outputs of the three models, CNN-LSTM, LSTM, and CNN, show no differences from the original notation of the gamelan composer for any instrument, only the original song notation is shown. As can be observed, the notation pattern in Figure 9 for the kenong and kempul instruments follows a structural pattern consistent with the balungan notation. The situation is different in Figure 10, where the generator outputs of the CNN-LSTM, LSTM, and CNN models do not match the original kempul notation in many places. Notation identical to the original is not marked in Figure 10, while differing notation is highlighted in yellow for the CNN-LSTM generator output, green for LSTM, and blue for CNN. In the Sampak Tlutur slendro manyura test data, there are differences in the notation generated for the kempul and gong suwuk instruments. The notation for the kempul and gong suwuk instruments is usually derived from the balungan notation, but the composer sometimes substitutes variations that differ from the balungan notation. For example, in the 3rd gatra of the first line, the 5th note of the balungan becomes the 2nd note of the kempul.
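Derivation rules like the first observation above can be expressed compactly in code. The sketch below is purely illustrative (not the authors' model, which learns such patterns from data rather than hard-coding them): it derives hypothetical kenong notes from a balungan sequence using the srepeg rule described in the text, taking the 4th tone of each four-note gatra.

```python
def kenong_from_balungan(balungan, gatra_len=4):
    """Illustrative srepeg-style rule: the kenong takes the last (4th)
    tone of each gatra of the balungan melody."""
    gatras = [balungan[i:i + gatra_len]
              for i in range(0, len(balungan), gatra_len)]
    return [gatra[-1] for gatra in gatras]

# Hypothetical balungan of two gatra, written as numerical notes:
print(kenong_from_balungan([2, 3, 2, 1, 3, 5, 3, 2]))  # -> [1, 2]
```

The actual rule varies with song structure and rhythm (e.g., in ketawang only the even gatra contribute), which is precisely the variability the CNN-LSTM model is trained to capture.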
The generator outputs of CNN and LSTM differ from the original here, while the proposed CNN-LSTM method produces the same notation as the original. This is consistent with the results shown in Table 3, where the note distance value for the kempul instrument is smaller for CNN-LSTM than for the CNN and LSTM models in the sampak song structure. The music notation generation results shown in Table 3, Figure 9, and Figure 10 indicate that the CNN-LSTM model produces notation that is more similar to the original (the notation created by gamelan experts), thanks to the ability of the CNN to extract important features from the input fed into the model, supported by the ability of the LSTM to predict music notation from previously learned patterns. However, Table 3 and Figure 10 still show some notation that differs from the original; this may reflect a rule of gamelan notation, particularly for the kempul instrument, that has not yet been used as a feature in the proposed and comparison models.

Fig. 10. Notation of Sampak Tlutur slendro manyura; the colored notation marks generator output that differs from the original notation of the gamelan composer (yellow generated by CNN-LSTM, green by LSTM, and blue by CNN).

The results of this study can be useful in the field of education, especially for novice gamelan players playing ricikan struktural instruments, because gamelan songs provide only melody notation. The notation pattern of the ricikan struktural instruments can be identified from the title of a Javanese gamelan song, because the title contains the song structure, which determines the notation pattern of the ricikan struktural instruments.
In addition, this study is also useful in the field of gamelan art: with an automatic generator of ricikan struktural instrument notation, automatic musical compositions can be created for a Javanese gamelan song as an accompaniment to the melody notation. A limitation of this research is that it only generates the notation of the ricikan struktural instruments; it still needs to be combined with the notation of other instruments, such as the ricikan garap instruments, which ornament the song, and the kendang instruments, which control the rhythm. To further improve the results, more investigation is needed, especially into the rules for the kempul and gong suwuk instruments and their correlation with a song in Javanese gamelan, because some of the generated notation patterns still do not match the original, especially for the kempul and gong suwuk instruments.

IV. Conclusion

This study concludes that CNN-LSTM, LSTM, and CNN models can effectively predict musical note generation for multi-instrument ricikan struktural in Javanese gamelan. Experimental results show that CNN-LSTM outperforms LSTM and CNN in terms of accuracy, recall, precision, and quality of the generated notation. This superiority can be attributed to the combination of the strengths of both models, resulting in improved performance. The more homogeneous data scenario yields higher accuracy scores due to the consistent distribution of the same data, resulting in more consistent pattern generation. Note distance, which measures the difference between the generated notation and the composer's gamelan notation, shows that all three generator models (CNN-LSTM, LSTM, and CNN) produce notation similar to the original for instruments such as kethuk, kempyang, and gong ageng, while instruments such as kenong, kempul, and gong suwuk show larger differences.
A small note distance value indicates a consistent notation pattern in the ricikan struktural instrument, one that follows the balungan notation, whereas a large note distance value indicates pattern variation in the ricikan struktural instrument that sometimes departs from the balungan notation. This illustrates that standardized pattern rules are not always followed consistently in Javanese gamelan; gamelan composers sometimes change the notation of these instruments as a variation in playing gamelan music. Although not all the generated notation exactly matches the original, this method of music generation can still be used to supplement the notation in Javanese gamelan songs based on song characteristics such as the type of song structure, rhythm, melody (balungan) notation, and gatra information. This study benefits novice gamelan players, especially in playing ricikan struktural, by providing an automatic ricikan struktural instrument notation generator. It can be used to create automatic musical compositions for Javanese gamelan songs, complementing the melody notation, and can also be applied in gamelan art. This study focuses on the ricikan struktural generators in Javanese gamelan; future work will explore the ricikan garap and kendang instruments. Future studies should also examine the rules of the kempul and gong suwuk instruments and how they relate to the songs, as some notation patterns in this study still differ from the original, especially for the kempul and gong suwuk instruments. In addition, the wide variety of Javanese gamelan styles provides opportunities for further study.

Acknowledgment

We are grateful to Soewidiatmaka gamelan for their invaluable contributions of both knowledge and data to our study, and we express our deep appreciation to them.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper.
All authors read and approved the final paper.

Funding statement. This research received funding from the Indonesian Endowment Fund for Education (LPDP) under the BUDI DN doctoral scholarship programme.

Conflict of interest. The authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering and Informatics, Universitas Negeri Malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

References

[1] Martongprawit, "Catatan Pengetahuan Karawitan," Surakarta: Akademi Seni Karawitan Indonesia (ASKI), 1975.
[2] R. Supanggah, "Bothèkan Karawitan I," Jakarta: Masyarakat Seni Pertunjukan Indonesia, 2002.
[3] A. Setyoko and Z. W. Pratama, "Faktor-faktor kesulitan pembelajaran praktik karawitan Jawa program studi etnomusikologi Fakultas Ilmu Budaya Universitas Mulawarman," Jurnal Mebang: Kajian Budaya Musik dan Pendidikan Musik, vol. 1, no. 2, pp. 81–92, 2021.
[4] S. Ananda and N. Scorviana Herminasari, "Minat generasi muda kepada pelestarian gamelan Jawa di komunitas gamelan muda Samurti Andaru Laras," Jurnal Studi Budaya Nusantara, 2022.
[5] D. P. Prasetyo, "Ragam garap kendhang kalih ladrang dalam karawitan gaya Surakarta," Skripsi, Institut Seni Indonesia Surakarta, 2016.
[6] V. Melinda, "Garap tabuhan kempul pada gendhing alit dalam klenèngan," Skripsi, Fakultas Seni Pertunjukan ISI Yogyakarta, 2019.
[7] D. Purwanto, "Permainan ricikan kenong dalam karawitan Jawa gaya Surakarta," Gelar: Jurnal Seni Budaya, 11(2), 2013.
[8] Supardi, "Ricikan struktural salah satu indikator pada pembentukan gending dalam karawitan Jawa," Keteg, vol. 13, no. 1, 2013.
[9] J. P. Briot, G. Hadjeres, and F.
D. Pachet, "Deep Learning Techniques for Music Generation," Heidelberg: Springer, 2020.
[10] R. Madhok, S. Goel, and S. Garg, "SentiMozart: Music generation based on emotions," International Conference on Agents and Artificial Intelligence, vol. 2, pp. 501–506, 2018.
[11] L. C. Yang, S. Y. Chou, and Y. H. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," arXiv preprint arXiv:1703.10847, 2017.
[12] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, "Medical image classification with convolutional neural network," International Conference on Control Automation Robotics & Vision (ICARCV), pp. 844–848, IEEE, 2014.
[13] S. Tanberk and D. B. Tükel, "Style-specific Turkish pop music composition with CNN and LSTM network," World Symposium on Applied Machine Intelligence and Informatics (SAMI), pp. 181–185, IEEE, 2021.
[14] J. Chen, "Construction of music intelligent creation model based on convolutional neural network," Computational Intelligence and Neuroscience, 2022.
[15] F. Minglei, "Application of music industry based on the deep neural network," Scientific Programming, pp. 1–6, 2022.
[16] F. Shah, T. Naik, and N. Vyas, "LSTM based music generation," International Conference on Machine Learning and Data Engineering (ICMLDE), pp. 48–53, 2019.
[17] S. Mangal, R. Modak, and P. Joshi, "LSTM based music generation system," International Advanced Research Journal in Science, Engineering and Technology, vol. 6, issue 5, 2019.
[18] Y. Huang, X. Huang, and Q. Cai, "Music generation based on convolution-LSTM," Computer and Information Science, 11(3), 50–56, 2018.
[19] S. Liang, B. Zhu, Y. Zhang, S. Cheng, and J. Jin, "A double channel CNN-LSTM model for text classification," IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems, pp. 1316–1321, 2020.
[20] A. Agga, A. Abbou, M. Labbadi, Y.
El Houm, and I. H. O. Ali, "CNN-LSTM: An efficient hybrid deep learning architecture for predicting short-term photovoltaic power production," Electric Power Systems Research, 208, 2022.
[21] T. Liu, J. Bao, J. Wang, and Y. Zhang, "A hybrid CNN-LSTM algorithm for online defect recognition of CO₂ welding," Sensors (Basel), 18(12):4369, 2018.
[22] W. Tan, J. Zhang, J. Wu, H. Lan, X. Liu, K. Xiao, and P. Guo, "Application of CNN and long short-term memory network in water quality predicting," Intelligent Automation & Soft Computing, 34(3), 1943–1958, 2022.
[23] W. Lu, J. Li, Y. Li, A. Sun, and J. Wang, "A CNN-LSTM-based model to forecast stock prices," Complexity, pp. 110, 2020.
[24] A. M. Syarif, A. Azhari, S. Suprapto, and K. Hastuti, "Human and computation-based music representation for gamelan music," Malaysian Journal of Music, 9, 82–100, 2020.
[25] K. Hastuti and K. Mustafa, "A method for automatic gamelan music composition," International Journal of Advances in Intelligent Informatics, 2(1), 26–37, 2016.
http://journal2.um.ac.id/index.php/keds https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=catatan+pengetahuan+karawitan&btng= https://scholar.google.com/scholar?hl=id&as_sdt=0%2c5&q=r+supanggah%2c+both%c3%a8kan+karawitan+&btng= https://doi.org/10.30872/mebang.v1i2.13 https://doi.org/10.30872/mebang.v1i2.13 https://doi.org/10.30872/mebang.v1i2.13 https://jsbn.ub.ac.id/index.php/sbn/article/view/168 https://jsbn.ub.ac.id/index.php/sbn/article/view/168 http://repository.isi-ska.ac.id/1357/ http://repository.isi-ska.ac.id/1357/ http://digilib.isi.ac.id/4437/ http://digilib.isi.ac.id/4437/ https://jurnal.isi-ska.ac.id/index.php/gelar/article/view/1449 https://jurnal.isi-ska.ac.id/index.php/gelar/article/view/1449 https://jurnal.isi-ska.ac.id/index.php/keteg/article/viewfile/635/631 https://jurnal.isi-ska.ac.id/index.php/keteg/article/viewfile/635/631 https://link.springer.com/content/pdf/10.1007/978-3-319-70163-9.pdf https://link.springer.com/content/pdf/10.1007/978-3-319-70163-9.pdf https://doi.org/10.5220/0006597705010506 https://doi.org/10.5220/0006597705010506 https://arxiv.org/abs/1703.10847 https://arxiv.org/abs/1703.10847 https://doi.org/10.1109/icarcv.2014.7064414 https://doi.org/10.1109/icarcv.2014.7064414 https://doi.org/10.1109/sami50585.2021.9378654 https://doi.org/10.1109/sami50585.2021.9378654 https://doi.org/10.1155/2022/2854066 https://doi.org/10.1155/2022/2854066 https://doi.org/10.1155/2022/4068207 https://doi.org/10.1155/2022/4068207 https://doi.org/10.1109/icmlde49015.2019.00020 https://doi.org/10.1109/icmlde49015.2019.00020 https://arxiv.org/abs/1908.01080 https://arxiv.org/abs/1908.01080 https://arxiv.org/abs/1908.01080 https://arxiv.org/abs/1908.01080 https://doi.org/10.1109/hpcc-smartcity-dss50907.2020.00169 https://doi.org/10.1109/hpcc-smartcity-dss50907.2020.00169 https://doi.org/10.1109/hpcc-smartcity-dss50907.2020.00169 https://doi.org/10.1016/j.epsr.2022.107908 https://doi.org/10.1016/j.epsr.2022.107908 
https://doi.org/10.1016/j.epsr.2022.107908 https://doi.org/10.3390/s18124369 https://doi.org/10.3390/s18124369 https://doi.org/10.32604/iasc.2022.029660 https://doi.org/10.32604/iasc.2022.029660 https://doi.org/10.1155/2020/6622927 https://doi.org/10.1155/2020/6622927 https://doi.org/10.37134/mjm.vol9.7.2020 https://doi.org/10.37134/mjm.vol9.7.2020 https://doi.org/10.26555/ijain.v2i1.57 https://doi.org/10.26555/ijain.v2i1.57 a. kurniawati et al. / knowledge engineering and data science 2023, 6 (1): 41–56 56 [26] k. hastuti, a. azhari, a. musdholifah, & r. supanggah, “rule-based and genetic algorithm for automatic gamelan music composition,” international review on modelling and simulations, 10(3), pp 202-212, 2017. [27] a. kurniawati, e. m. yuniarno, y. k. suprapto, & a. n. i. soewidiatmaka, “automatic note generator for javanese gamelan music accompaniment using deep learning,” international journal of advances in intelligent informatics, 9(2), pp 231-248, 2023. [28] m. ashraf, f. abid, m. atif, and s. bashir, “the role of cnn and rnn in the classification of audio music genres,” vfast transactions on software engineering, 2022. [29] m. ashraf, f. abid, i. u. din, j. rasheed, m. yesiltepe, s. f. yeo, & m. t. ersoy, “a hybrid cnn and rnn variant model for music classification,” applied sciences, 13(3), 1476, 2023. [30] x. luo, “automatic music genre classification based on cnn and lstm”, highlights in science, engineering and technology, pp39, 61-66, 2023. [31] r. gupta, s. ashish, h. shekhar, and m. d. s. dominic, “music genre classification using cnn and rnn-lstm,” micro-electronics and telecommunication engineering: proceedings of 5th icmete 2021 (pp. 729-745). singapore: springer nature singapore, 2021. [32] d. kostrzewa, p. kaminski, & r. brzeski, “music genre classification: looking for the perfect network,” international conference on computational science, pp.55-67, cham: springer international publishing, 2021. [33] r. t. irene, c. borrelli, m. zanoni, m. 
buccoli, & a. sarti, “automatic playlist generation using convolutional neural networks and recurrent neural networks,” 27th european signal processing conference (eusipco), pp. 15, 2019. [34] t. ito and s. arai, “harmonic representation for cnn-lstm automatic chord recognition,” 3rd international conference on cybernetics and intelligent system (icoris), pp. 1-5, 2021. [35] s. b. puri , s. p. mahajan, “automatic note and chord recognition for harmonium music: a deep learning approach,” journal of critical reviews, 7(15), 2020. [36] s. hizlisoy, s.yildirim, & z. tufekci, “music emotion recognition using convolutional long short term memory deep neural networks,” engineering science and technology, 24(3), pp760-767, 2021. [37] s. sheykhivand, z. mousavi, t. y. rezaii, & a. farzamnia, “recognizing emotions evoked by music using cnnlstm networks on eeg signals,’’ ieee access, 8, 139332-139345, 2020. [38] s. ayadi and z. lachiri, “a combined cnn-lstm network for audio emotion recognition using speech and song attributs,” 6th international conference on advanced technologies for signal and image processing (atsip), pp. 16, 2022. [39] a. ranjan, v. n. j. behera, and m. reza, “using a bi-directional lstm model with attention mechanism trained on midi data for generating unique music,” artificial intelligence for data science in theory and practice, nov. 2022. [40] j. gareth, w. daniela, h. trevor, and t. robert, j. gareth, w. daniela, h. trevor, & t. robert, “an introduction to statistical learning: with applications in r,” spinger, 2013. 
knowledge engineering and data science (keds) pissn 2597-4602 vol 6, no 2, october 2023, pp.
188–198 eissn 2597-4637 https://doi.org/10.17977/um018v6i22023p188-198 ©2023 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) the effect of the number of hidden layers on the performance of deep q-network for traveling salesman problem benzfica hanif a,1, aisyah larasati a,2,*, rudi nurdiansyah a,3, trung le b,4 a department of mechanical and industrial engineering, faculty of engineering, universitas negeri malang jl. semarang no. 5, malang 65145, indonesia b department of industrial and management systems engineering, university of south florida 4202 e fowler ave, tampa-fl 33620, usa 1 benzfica@gmail.com; 2 aisyah.larasati.ft@um.ac.id*; 3 rudi.nurdiansyah.ft@um.ac.id; 4 tqle@usf.edu * corresponding author i. introduction consumer behavior has been changing due to the desire for fast, safe, and efficient fulfillment of their needs, driven by the digital era. meeting these consumer expectations requires the intervention of delivery services. during the delivery process, problems often arise in route determination. these problems occur because couriers rely on their knowledge to deliver items to customer addresses, which can lead to further complications when dealing with larger quantities of items and diverse customer addresses. the impacts of such issues include wasted delivery time, increased operational costs, and unmet delivery targets. the traveling salesman problem (tsp) involves a salesman and a set of n cities. this issue aims for the salesman to visit each city exactly once while covering the shortest possible total tour distance [1]. the solution to the tsp has been widely addressed using optimization algorithms to optimize the resources available in the distribution process. 
the essence of tsp is to find the shortest route through a set of points, including the return to the starting point. as a complex mathematical problem, various heuristic methods have been developed over time to find approximate solutions [2]. the research by [3] utilized the harris hawk optimization algorithm, which employs random-key encoding to generate a tour. the research conducted by [4] uses a new ant colony optimization for solving tsp, achieving high accuracy and fast computational times. the research conducted by [5] used ant colony optimization to determine tsp routes and showed that the execution time of ant colony optimization was faster in obtaining results than that of exact methods.

article info — article history: received 14 september 2023; revised 18 september 2023; accepted 3 october 2023; published online 21 october 2023. keywords: deep q-network; traveling salesman problem; hidden layer; episode; epoch.

abstract — the traveling salesman problem (tsp) effectively represents the complex distribution issues encountered by couriers, who must plan a route that includes all customer addresses while minimizing the distance traveled. as the volume of deliveries and the range of destinations expand, the courier's task becomes progressively more challenging. in this context, the objective of our research is to expand the existing knowledge and explore the full capabilities of deep q-network (dqn) models for efficient route determination, an endeavor that can potentially bring significant changes to the courier and delivery service sector. our methodology rests on an empirical inquiry using a dataset of 178 observations obtained from motorcycle-based package delivery agents. the research is planned and executed using a factorial experimental design incorporating three factors: the number of hidden layers, episodes, and epochs. the hidden layer parameter is fixed at a single level, the episode parameter is explored at five levels, and the epoch parameter at four levels. the performance of the dqn models is evaluated using the mse metric, assessed at every iterative cycle. the central focus of our research is the connection between episodes and epochs and their influence on mse. the findings reveal that the association between episodes, epochs, and errors is not statistically significant, although different levels of episodes and epochs produce slightly different levels of error. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

with the growing popularity of machine learning (ml) and deep learning, numerous research teams have embraced ml for combinatorial optimization challenges, including the widely recognized tsp. new models and architectures for solving tsp have been progressively created using deep (reinforcement) learning, improving performance [6]. one such ml algorithm is the deep q-network (dqn). reference [7] utilized the dqn algorithm to address shipping and route issues for autonomous robots. research conducted by [8] used dqn to solve the truck routing problem between terminals to minimize the total cost incurred. thus, in recent years, machine learning advancements have increasingly been applied to tsp-related problems.
the deep neural network method provides significantly more robust capabilities in pattern recognition and feature representation. these algorithms can provide solutions based on performance comparisons in determining the best routes. the tsp has a long history and finds numerous real-world applications. it aims to discover the most efficient route that visits each city exactly once and ends at the starting city [9]. equation (1) is the objective function of tsp, denoted by z, which minimizes the total distance traveled along the route. in the tsp formulation, the distance traveled from point i to point j is $C_{ij}$, and the decision variable $X_{ij}$ indicates whether the route travels from point i to point j.

$\min Z = \sum_{i=1}^{n} \sum_{j=1}^{n} C_{ij} X_{ij}$ (1)

subject to the constraints (2) and (3), which ensure that the selected route arrives at and leaves each destination exactly once, with the travel value from point i to point j restricted as in (4):

$\sum_{i=1}^{n} X_{ij} = 1, \quad j = 1, 2, 3, \dots, N$ (2)

$\sum_{j=1}^{n} X_{ij} = 1, \quad i = 1, 2, 3, \dots, N$ (3)

$X_{ij} \in \{0, 1\}$ (4)

reference [10] researched the tsp using genetic algorithms, assessing model performance by the total distance traveled. in the research conducted by [11], algorithms were compared for solving the tsp to obtain an optimal route that visits each destination once and returns to the starting point. this study applies constraints in determining the optimal route based on the loss value of the model. the performance evaluation of the dqn algorithm in determining the optimal route is based on the loss value. the loss function typically employed is the mean square error (mse). mse is the expected value of the squared difference between the estimated parameter and the true parameter; a lower mse value indicates greater accuracy in describing the experimental data [12]. dqn is a multi-layered neural network that maps a given state to a vector of action values [13].
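to make the objective in (1) concrete, a short sketch: for a fixed visiting order, the tour cost is the sum of $C_{ij}$ over consecutive cities, closing the loop back to the start. the 4-city distance matrix below is invented for illustration, not taken from the study's data.

```python
# illustrative sketch of the tsp objective: total distance of a tour that
# visits every city exactly once and returns to the starting city.

def tour_length(dist, tour):
    """sum c[i][j] over consecutive cities in the tour, closing the loop."""
    total = 0.0
    for k in range(len(tour)):
        i = tour[k]
        j = tour[(k + 1) % len(tour)]  # wrap around to the starting city
        total += dist[i][j]
    return total

# symmetric 4-city example distance matrix (invented values)
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(tour_length(dist, [0, 1, 3, 2]))  # tour 0 -> 1 -> 3 -> 2 -> 0
```

in the binary-variable form of (1)–(4), this corresponds to setting $X_{ij}=1$ exactly along the edges of the chosen tour.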
the essential components of dqn are the target network and experience replay. dqn combines q-learning with deep neural networks to estimate the state-action value function. the advantage of using dqn is its ability to represent high-dimensional observations and calculate q-function values using a deep neural network. the target used in the dqn algorithm is defined as in (5).

$Y_t^{DQN} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_k^{-})$ (5)

the q-values are updated using the parameters of the neural network, which are in turn obtained by backpropagation from the loss function. the loss function of dqn is defined as the squared error between the target q-value and the estimated q-value, as in (6).

$Loss_{DQN} = \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_k^{-}) - Q(S_t, a_t; \theta) \right]^2$ (6)

figure 1 illustrates the dqn training process [14]. the dqn training process improves on the q-learning algorithm by addressing the instability of representing the value function with a non-linear network. dqn uses experience replay to process transition samples: at each time step t, the transition obtained by the agent interacting with the environment is stored in the replay buffer. during training, a batch of transitions is randomly selected, and the stochastic gradient descent algorithm is used to update the network parameters $\theta$.

fig. 1. overview of the dqn training

within artificial intelligence and optimization, there is significant research emphasis on enhancing the effectiveness of dqns in tackling complex combinatorial problems such as the tsp. the primary aim of this study is to examine the impact of the number of hidden layers in a dqn on its efficacy in addressing the tsp.
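a minimal numpy sketch of the target (5) and loss (6): the scalar reward, the target-network q-vector for the next state, and the single online q-estimate below are illustrative values, not the paper's data.

```python
import numpy as np

def td_target(reward, next_q_target, gamma=0.99, terminal=False):
    """eq. (5): y = r_{t+1} + gamma * max_a q(s_{t+1}, a; theta^-).
    at terminal states the bootstrap term is dropped."""
    if terminal:
        return reward
    return reward + gamma * float(np.max(next_q_target))

def dqn_loss(target, q_estimate):
    """eq. (6): squared error between the target and the online q-estimate."""
    return (target - q_estimate) ** 2

y = td_target(1.0, np.array([0.5, 2.0, -1.0]), gamma=0.5)  # 1 + 0.5 * 2 = 2.0
print(y, dqn_loss(y, 1.5))
```

in training, this squared error is averaged over a randomly sampled replay batch and minimized by stochastic gradient descent, while the target parameters $\theta_k^-$ are only refreshed periodically from the online network.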
the objective of this research endeavor is to discover novel insights and advancements that can significantly improve the effectiveness and precision of dqn-based solutions for this well-established topic.

ii. methods

the study procedure, as seen in figure 2, undertakes a thorough exploration of the task of identifying the most favorable pathway. fundamentally, this undertaking is supported by the deep q-network (dqn) algorithm, which has the potential to significantly transform the field of route optimization. the research begins by carefully selecting a comprehensive compilation of literature reviews, chosen to cover essential findings regarding the challenges of route determination and references explaining the deep learning techniques utilized in this study. once the core knowledge base has been established, the succeeding step thoroughly examines real-world challenges in determining routes and the various strategies to address them. this comprehensive inquiry forms the foundation for making well-informed decisions, allowing the research team to develop a methodology grounded in empirical evidence and relevant in practical terms. the data acquisition phase is important, as it involves carefully collecting a comprehensive dataset that includes crucial features such as order id, origin, postal codes, addresses, and geographical coordinates (latitude and longitude). after the data-gathering procedure, a rigorous preprocessing protocol is implemented to examine the dataset carefully. this step removes duplicate entries and extracts the fundamental attribute variables that serve as the foundation for constructing the dqn algorithm model in the subsequent steps.

fig. 2. research flow (start → literature review → problem identification → collecting data by observation: order id, origin, postal, address, latitude, longitude → data preprocessing: cleaning and reduction → modeling: deep q-network, parameter setting → evaluation → conclusion → finish)

the culmination of the research process is a rigorous three-factor factorial experiment that examines the crucial factors of hidden layers, episode configuration, and epoch settings. this study systematically investigates the most practical combination of these parameters, leading to the optimization of the dqn model and paving the way for advancements in route optimization approaches. within the complex interplay of theory and practice, this study provides significant advancements in deep learning-driven route determination, offering novel approaches to address urgent practical obstacles. the dataset consists of 178 data points collected on a single day of the delivery procedure. the data is acquired during the observation process and documented within the application possessed by each courier. data selection based on attribute variables and the subsequent cleaning procedure are necessary steps to ensure the data's integrity by removing duplicates and incomplete entries. the variables to be utilized are location, latitude, and longitude. the study incorporates several criteria, namely the quantity of hidden layers, episodes, and epochs. adding hidden layers has been shown to enhance precision; nevertheless, it necessitates a lengthier training period and heightens the potential for overfitting [14]. an epoch denotes one complete iteration of the machine learning process, during which the model acquires knowledge from the entirety of the training dataset.
in the context of neural network methodologies, the iterative nature of learning processes plays a crucial role in achieving the convergence of weight values. given the lack of knowledge regarding the ideal number of episodes and epochs, it becomes imperative to conduct experiments using various values to attain the lowest possible loss. consequently, this research investigates the impact of manipulating the parameters of hidden layers, episodes, and epochs. the number of hidden layers to be tested is limited to one. the episode parameter is set to 50, 100, 150, 200, and 250 iterations, and the epoch parameter to 1, 50, 100, and 500. this experiment yields 20 unique combinations of hidden layers, episodes, and epochs. the construction of the deep q-network model commences with establishing the environment, wherein the initial state is determined by referencing the historical delivery data. the initial delivery location is at jl raya sawojajar, namely at ruko wow no.11a. the courier, functioning as the agent, traverses the state space, which encompasses the latitude and longitude coordinates of addresses within the given environment. the mobility of the agent is constrained to the provided location data. the agent continues its movement until it reaches the final delivery destination, located at jl danau towuti raya blok g4 a17. the deep q-network configuration is produced using the keras toolkit. the model is constructed using dense (fully connected) layers, each consisting of 32 neurons and employing the rectified linear unit (relu) activation function. the optimization function, adam (adaptive moment estimation), is a widely used algorithm in machine learning that efficiently updates the parameters of a model; it combines the benefits of momentum-based updates with the per-parameter adaptive learning rates of root mean square propagation (rmsprop).
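as a hedged sketch, the network configuration described above (dense layers of 32 relu units, adam optimizer, mse loss) could be written in keras roughly as follows; the state dimension, the number of actions, and the exact layer count are illustrative assumptions rather than the authors' code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dqn(state_dim, n_actions):
    """single hidden dense layer of 32 relu units (the one-hidden-layer
    setting used in this study), a linear head with one q-value per action,
    compiled with the adam optimizer and mse loss."""
    model = keras.Sequential([
        layers.Input(shape=(state_dim,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# assumed dimensions: a (latitude, longitude) state and 10 candidate addresses
model = build_dqn(state_dim=2, n_actions=10)
```

the linear output head is standard for dqn, since q-values are unbounded regression targets rather than probabilities.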
adam utilizes adaptive learning rates for each parameter, computed from the first and second moments of the gradients, allowing for effective optimization of the model's parameters. the loss function, mean square error (mse), is a commonly employed metric in regression tasks. it measures the average squared difference between the predicted and actual values. mse is widely used due to its simplicity and its tendency to penalize larger errors more heavily. minimizing the loss of the deep q-network yields various outputs, including the trajectory followed, the overall distance covered, and the computed value of the loss function.

iii. result and discussion

the research design comprises a comprehensive framework considering three essential criteria: the number of hidden layers, episodes, and the epochs utilized. hidden layers present a trade-off: adding layers has exhibited a distinct inclination to enhance accuracy, but this is accompanied by extended training periods and an increased likelihood of overfitting, as supported by previous research [14]. in table 1, epochs serve as a guiding principle for a comprehensive exploration of machine learning: over each epoch the model acquires knowledge from the entirety of the training dataset. within the domain of neural network techniques, the iterative nature of these learning processes facilitates the convergence of weight values. however, since the ideal number of episodes and epochs cannot be determined in advance, we undertake an empirical investigation, conducting experiments with various values to minimize the loss.
consistent with the empirical approach, our research aims to investigate the interaction between hidden layers, episodes, and epochs. the hidden layer parameter is deliberately limited to a single layer, enabling us to isolate and examine the influence of the other variables. episodes span 50, 100, 150, 200, and 250 occurrences, while the epoch parameter is set to values of 1, 50, 100, and 500. this experimental approach yields 20 unique combinations, providing a comprehensive understanding of the interplay and impact of these variables on the performance of our model. in this investigation, our objective is to find the most effective arrangement that improves accuracy while minimizing the potential drawbacks of overfitting, ultimately leading to the development of more efficient and precise approaches for determining routes. the loss values obtained from the different parameter levels are shown in table 1.

table 1. loss value using different parameter levels

episode   epoch   loss
50        1       7.005321025848389
50        50      3.2833027944434434e-05
50        100     0.00012886642070952803
50        500     1.189620525110513e-05
100       1       3.721908797160722e-05
100       50      1.953684477484785e-05
100       100     1.5766545402584597e-05
100       500     9.782820598047692e-06
150       1       0.0010764036560431123
150       50      1.7226924683200195e-05
150       100     1.4794331036682706e-05
150       500     0.0002136115072062239
200       1       0.009234399534761906
200       50      1.270834582101088e-05
200       100     1.1694006389006972e-05
200       500     5.954136940999888e-05
250       1       0.0018385164439678192
250       50      4.7960747906472534e-05
250       100     2.5381839805049822e-05
250       500     3.104201823589392e-05

episode and epoch are two crucial elements that interact strategically in constructing the dqn model.
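the 1 × 5 × 4 factorial design described above can be enumerated directly; a small sketch of how the 20 experimental runs arise:

```python
from itertools import product

hidden_layers = [1]                   # fixed at a single level
episodes = [50, 100, 150, 200, 250]   # five levels
epochs = [1, 50, 100, 500]            # four levels

# every (hidden layers, episodes, epochs) combination in the design
runs = list(product(hidden_layers, episodes, epochs))
print(len(runs))  # 1 * 5 * 4 = 20 unique combinations
```

each tuple in `runs` corresponds to one row of table 1 once its final mse loss is recorded.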
in order to minimize the loss function, a key performance statistic, these parameters work together to shape the model's architecture. we use the analysis of variance (anova) approach to determine the specific influence of each parameter on the final loss value. the findings of the anova test between episode and loss are given in table 2. the computed p-value of 0.438 exceeds the 0.05 significance threshold, so careful interpretation is required.

table 2. anova between episode and loss

                  sum of squares   df   mean square   f       sig.
between groups    9.807            4    2.452         0.999   0.438
within groups     36.805           15   2.454
total             46.612           19

table 3 provides the results of our anova study between epoch and loss. a p-value of 0.416 emerges from this analysis; compared against the 0.05 confidence level (the 95% threshold), the actual significance of this finding becomes clear.

table 3. anova between epoch and loss

                  sum of squares   df   mean square   f       sig.
between groups    7.386            3    2.462         1.004   0.416
within groups     39.226           16   2.452
total             46.612           19

the null hypothesis, which maintains that there is no significant epoch-driven influence on the loss value, is retained, as the p-value exceeds the confidence standard. based on the results of the anova tests, it can be concluded that the parameters of episode and epoch have negligible or no significant impact on the loss value.
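for reference, the f statistic behind tables 2 and 3 can be computed from per-level loss groups with a few lines of numpy; the groups below are synthetic examples, not the paper's data.

```python
import numpy as np

def one_way_anova(groups):
    """return (f, df_between, df_within) for a one-way anova,
    mirroring the sum-of-squares layout of tables 2 and 3."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    # between-group and within-group sums of squares
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b = len(groups) - 1
    df_w = all_vals.size - len(groups)
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w

# synthetic example: two groups with identical means give f = 0
print(one_way_anova([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]))
```

with 20 runs grouped by the five episode levels this yields degrees of freedom (4, 15), matching table 2; grouping by the four epoch levels yields (3, 16), matching table 3.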
however, it is essential to note that this discovery is closely linked to the sample size used in our research efforts, highlighting the need for a nuanced comprehension. as expounded upon in the literature by [14], attaining statistical significance becomes challenging when conducted within a restricted sample size. furthermore, the significance of replications in the study's design, as highlighted by [17], should not be ignored, as they can uncover or disguise specific effects within the data. the significance of degrees of freedom, as highlighted by [18], is another crucial aspect closely connected to sample size and replication data. although a degree of freedom value of 15 is generally considered acceptable, it is crucial to recognize that contextual limitations may hinder the specific selection of this number in some studies. the inquiry undertaken by [19] on weather classification, which employed the backpropagation method with different numbers of hidden layers (1, 2, and 3), is a noteworthy reference point when considering past research. the results of their study suggest that modifying these parameters did not result in statistically significant improvements in accuracy values. the significance of hidden layers, which function as intermediates within the neural network, becomes prominent since they are provided with activation functions that enable the transfer and training of data across different layers of the network [20]. the selection of the ideal number of hidden layers is still a subject of debate, as evidenced by other research that has used hidden layers as parameters and obtained different accurate results. several factors contribute to determining the appropriate number of hidden layers in a neural network. these factors include the complexity of the network design, the number of input and output units, the volume of training samples, the presence of noise in the dataset, and the intricacy of the training process [21]. 
additional knowledge can be acquired from the study conducted by [22], wherein artificial neural networks were employed to model air pressure resulting from overpressure, with the epoch parameter considered in the analysis. interestingly, their model demonstrated no substantial dependence on the epoch parameter, establishing a connection between epochs and the notion of weight convergence in machine learning algorithms. the interconnection between episodes and epochs becomes evident because epochs effectively serve as a higher-level loop that encompasses the episode loop. the absence of a clearly defined deterministic rule for choosing the number of episodes is emphasized by the findings of [23]. nevertheless, it is essential to acknowledge that an unexpected pattern emerged during the experimental procedure: a positive correlation was observed between the increase in the number of episodes and epochs and a noticeable reduction in the loss value. the observed inconsistency with the anova-based statistical analysis, which showed no statistically significant variation in the loss value with respect to episode and epoch, adds a level of intricacy to our comprehension. this incongruity indicates the potential influence of random elements, such as inadequate data for conducting anova testing and the lack of data replication. the complexity of quantifying the exact relationship between parameters and the loss value in the anova framework presents a challenge, highlighting the significance of recognizing the interaction between statistical analysis and real-world data in pursuing comprehensive insights. figure 3 and figure 4 provide visual representations of the research outcomes, revealing patterns in the dynamics of the loss values. figure 3 illustrates a substantial decline in loss value as the number of episodes increases up to 100.
however, once the episode count reaches 125 or more, the loss value fluctuates up and down depending on the number of epochs. the visual representation matches the underlying data, with a noticeable decline in each experimental run across the different episode values. likewise, figure 4 shows a clear pattern: the loss value decreases with each successive increase in the number of epochs as long as the epoch count stays below 100. once the epoch count reaches 100 or more, however, the decrease gives way to fluctuations that depend on the number of episodes. taken together, the plots illustrate a steady yet dynamic reduction in the loss value, mirroring the interplay between the epoch setting and the model's performance.

fig. 3. episode vs loss

upon further examination of the modeling process, it becomes apparent that the algorithm can construct a model that accurately reflects the real-world situation being addressed. a closer look at table 1 reveals a notable result: the hyperparameter optimization carried out for 500 epochs within 100 episodes produced a loss value of only 0.000010. the small size of this figure is a strong indication of the algorithm's effectiveness, supporting the idea that lower loss values indicate the attainment of an ideal model. plotted, this result corresponds to a gradual decrease in loss over 500 epochs within 100 episodes, demonstrating the algorithm's ability to improve its performance consistently.
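the anova reasoning discussed above can be sketched as follows. the loss values here are hypothetical placeholders (the paper's raw per-run losses are not reproduced in this section), and `one_way_anova` is a textbook one-way f statistic, not the authors' analysis pipeline:

```python
def one_way_anova(groups):
    """one-way anova: ratio of between-group to within-group variance.
    groups is a list of lists of loss values, one list per level of the
    factor under test (e.g. per episode setting)."""
    k = len(groups)                       # number of factor levels
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# hypothetical per-run losses grouped by episode count (placeholders,
# not the paper's data); three replications per level
losses_by_episode = [
    [0.0120, 0.0110, 0.0130],   # 50 episodes
    [0.0004, 0.0005, 0.0006],   # 100 episodes
    [0.0007, 0.0009, 0.0008],   # 150 episodes
]
f_stat, df_b, df_w = one_way_anova(losses_by_episode)
# f_stat is compared against the critical f(df_b, df_w) value
```

note how df_within = n − k ties the degrees of freedom directly to the number of replications: with few replications per level, df_within stays small and statistical significance is hard to reach, which is exactly the limitation the text describes.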
this study makes a significant contribution to the domain of route optimisation through the utilisation of dqn models. nevertheless, several limitations can be addressed in future research: expanding the dataset, exploring a broader range of hyperparameters, incorporating data replication techniques, adopting additional evaluation metrics, transitioning towards real-world deployment, and leveraging greater computational resources. these improvements would deepen our comprehension of dqn-based route determination and its practical applications in the courier and delivery sector.

fig. 4. epoch vs loss

iv. conclusions

in summary, our research has thoroughly investigated the factors that affect the efficacy of a dqn model in addressing the traveling salesman problem. the study focused on the influence of hidden layers, episodes, and epochs to elucidate their importance in optimizing the loss value. a rigorous analysis of variance (anova) determined that neither episode nor epoch had a statistically significant impact on the loss value. nevertheless, these results must be interpreted in light of the limitations inherent in our sample size, the availability of replication data, and the degrees of freedom, as these factors can significantly influence the conclusions of the statistical analysis. interestingly, although episode and epoch are statistically neutral, figures 3 and 4 show a steady decline in the loss value as the number of episodes and epochs increases. this observation highlights the algorithm's proficiency in generating models from the processed data, as demonstrated by the minimal loss value of 0.000010 attained during hyperparameter tuning.
our research highlights the complex relationship between statistical analysis and empirical observation in practical contexts. although statistical tests offer valuable insights, they may occasionally fail to capture the intricacies of complicated models. to fully comprehend the problem, statistical rigour must therefore be combined with empirical observation, allowing us to navigate the ever-changing field of deep reinforcement learning and route optimization with greater clarity and accuracy.

declarations

author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.

funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest. the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.

additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering and informatics universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references

[1] a. jaradat, b. matalkeh, and w. diabat, "solving traveling salesman problem using firefly algorithm and k-means clustering," 2019 ieee jordan int. jt. conf. electr. eng. inf. technol. (jeeit), pp. 586–589, 2019.
[2] j. n. macgregor and t. ormerod, "human performance on the traveling salesman problem," percept. psychophys., vol. 58, no. 4, pp. 527–539, 1996.
[3] f. s. gharehchopogh and b. abdollahzadeh, "an efficient harris hawk optimization algorithm for solving the travelling salesman problem," cluster comput., vol. 25, no. 3, pp. 1981–2005, 2022.
[4] w.
gao, "new ant colony optimization algorithm for the traveling salesman problem," int. j. comput. intell. syst., vol. 13, no. 1, pp. 44–55, 2020.
[5] b. p. silalahi, n. fathiah, and p. t. supriyo, "use of ant colony optimization algorithm for determining traveling salesman problem routes," j. mat. "mantik," vol. 5, no. 2, pp. 100–111, 2019.
[6] a. françois, q. cappart, and l.-m. rousseau, "how to evaluate machine learning approaches for combinatorial optimization: application to the travelling salesman problem," 2019.
[7] m. p. li, p. sankaran, m. e. kuhl, r. ptucha, a. ganguly, and a. kwasinski, "task selection by autonomous mobile robots in a warehouse using deep reinforcement learning," proc. winter simul. conf., pp. 680–689, 2019.
[8] t. n. adi, h. bae, and y. a. iskandar, "interterminal truck routing optimization using cooperative multiagent deep reinforcement learning," processes, vol. 9, no. 10, 2021.
[9] s. singh and a. lodhi, "study of variation in tsp using genetic algorithm and its operator comparison," int. j. soft comput. eng., no. 3, p. 264, 2013.
[10] h. a. abdulkarim and i. f. alshammari, "comparison of algorithms for solving traveling salesman problem," int. j. eng. adv. technol., vol. 4, no. 6, pp. 76–79, 2015.
[11] g. ding and l. qin, "study on the prediction of stock price based on the associated network model of lstm," int. j. mach. learn. cybern., vol. 11, no. 6, pp. 1307–1317, 2020.
[12] e. xing and b. cai, "delivery route optimization based on deep reinforcement learning," proc. 2020 2nd int. conf. mach. learn. big data bus. intell. (mlbdbi), pp. 334–338, 2020.
[13] h. van hasselt, a. guez, and d. silver, "deep reinforcement learning with double q-learning," proc. thirtieth aaai conf. artif. intell., pp. 2094–2100, 2016.
[14] z. hu, r. beuran, and y. tan, "automated penetration testing using deep reinforcement learning," proc. 5th ieee eur. symp. secur. priv. work. (euro s&pw), pp. 2–10, 2020.
[15] y. shen, n. zhao, m.
xia, and x. du, "a deep q-learning network for ship stowage planning problem," polish marit. res., vol. 24, no. s3, pp. 102–109, 2017.
[16] s. yigit and m. mendes, "which effect size measure is appropriate for one-way and two-way anova models? a monte carlo simulation study," revstat stat. j., vol. 16, no. 3, pp. 295–313, 2018.
[17] r. a. armstrong, f. eperjesi, and b. gilmartin, "the application of analysis of variance (anova) to different experimental designs in optometry," ophthalmic physiol. opt., vol. 22, no. 3, pp. 248–256, 2002.
[18] w. j. ridgman, experimentation in biology, vol. 32, no. 4. london: blackie, 1975.
[19] a. e. verawati and a. n. s. kiswanto, "the effect of the number of hidden layers in the backpropagation in case study weather classification," proxies j. inform., vol. 2, no. 2, p. 58, 2021.
[20] m. uzair and n. jamil, "effects of hidden layers on the efficiency of neural networks," proc. 2020 23rd ieee int. multi-topic conf. (inmic), pp. 1–6, 2020.
[21] k. g. sheela and s. n. deepa, "review on methods to fix number of hidden neurons in neural networks," math. probl. eng., vol. 2013, 2013.
[22] e. tonnizam mohamad, m. hajihassani, d. jahed armaghani, and a. marto, "simulation of blasting-induced air overpressure by means of artificial neural networks," int. rev. model. simulations, vol. 5, no. 6, pp. 2501–2506, 2012.
[23] m. van otterlo and m. wiering, reinforcement learning and markov decision processes, vol. 12, 2012.
knowledge engineering and data science (keds) pissn 2597-4602 vol 2, no 2, december 2019, pp. 72–81 eissn 2597-4637 https://doi.org/10.17977/um018v2i22019p72-81 ©2019 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

optimisation of rice fertiliser composition using genetic algorithms

retno dewi anissa a, 1, *, wayan firdaus mahmudy a, 2, agus wahyu widodo a, 3
a faculty of computer science, universitas brawijaya, jl. veteran, malang 65145, indonesia
1 retnodewianissa@yahoo.com *; 2 wayanfm@ub.ac.id; 3 a_wahyu_w@ub.ac.id
* corresponding author

i. introduction

fertiliser is very important for rice growth and development. it contains one or more elements and is used to replace the elements taken up by plants; fertilisation can thus be understood as soil nutrient enhancement.
fertilisers are divided into several types based on their origin [1]: inorganic, organic, and biological. inorganic fertilisers fall into two groups based on nutrient content: single and compound fertilisers [2][3][4]. a single fertiliser contains only one main nutrient, such as nitrogen (n), phosphorus (p), potassium (k), or magnesium (mg); examples are nitrogen, phosphorus, potassium, and magnesium fertilisers. a compound fertiliser, on the other hand, is a mixture of two or more nutrients; examples are np and npk fertilisers. in rice planting, the right fertilizer composition is essential. if the composition given is not correct, the growth and quality of the rice produced is inadequate, or the costs incurred become wasteful and even unaffordable [5][6]. farmers need an appropriate combination of nitrogen (n), phosphorus (p), and potassium (k) in order to obtain good yields at minimum fertiliser expense. often, farmers resort to trial and error or follow others' experience to find the right composition. a problem arises when the raw material for the fertilizer changes: the farmer must then revise the composition he knew beforehand and try harder to find the right one. many farmers fail at this and lose a lot of money. accordingly, the right fertiliser composition is needed to obtain high yields within an affordable budget. the plant fertiliser combination problem has been tackled with several methods, such as goal programming [2] and meta-heuristic algorithms [7]. this study uses a genetic algorithm to optimise the fertiliser composition, since it is known as a powerful optimization method [8][9][10][11].
the method produces optimal results through reproduction processes that involve determining the population size, crossover rate, chromosomes, reproduction operators, and fitness values. the results of the genetic algorithm process are expected to help determine the optimal combination of fertiliser.

article info: article history: received 17 march 2019; revised 22 september 2019; accepted 23 september 2019; published online 23 december 2019. keywords: genetic algorithm, optimisation, fertilizer composition, rice.

abstract: there are many problems related to food scarcity; one of them is poor rice quality, so rice production needs to be enhanced through an optimal fertiliser composition. a genetic algorithm is used to optimise the composition at a more affordable price. the genetic algorithm uses a real-coded chromosome representation; reproduction uses a one-cut-point crossover and random mutation, while selection uses a binary tournament selection process for each chromosome. the test results show that the optimum results are obtained with a population size of 10, a crossover rate of 0.9, and a mutation rate of 0.1; the number of generations is 10, with a best fitness value of 1.603. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

ii. materials and methods

a genetic algorithm, a kind of evolutionary algorithm, is an optimisation technique that mimics biological evolution. a population contains several individuals.
over several generations, individuals act as parents, carrying out reproduction that produces offspring. individuals evolve, and new individuals have a better chance of surviving natural selection. better offspring tend to be produced by good parents, though not always; from generation to generation, the population as a whole tends to improve [7]. the stages of a genetic algorithm include initialization, reproduction, evaluation, and selection [12][13][14], as shown in figure 1. the genetic cycle shows how the problem of optimizing the composition of rice fertiliser is solved; the steps are given in figure 2. the main processes are initialization, reproduction, evaluation, and selection. initialization begins by randomly creating chromosomes that represent candidate solutions to the problem; the structure of a chromosome is illustrated in table 1. initialization also generates the new solution set: a number of randomly composed chromosome strings is used as the population. this requires a population size (popsize), which determines the number of individuals/chromosomes a population can accommodate. the number of chromosome representations is determined by the available fertilizer choices.

fig. 1. cycle diagram of genetic algorithm (initialization → reproduction: crossover and mutation → evaluation: fitness of parents and offspring → selection of the next individuals)

fig. 2. flowchart of genetic algorithm (begin → parameter initialization → initial population → crossover → mutation → fitness calculation → selection → stop condition? → best chromosome selected → end)
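the paper's application is written in c#, so the initialization step is sketched here in python. the text describes a real-coded chromosome, but the chromosomes shown in tables 1–3 are permutations of the nine gene values, so a permutation encoding is assumed; the function names and seed are our own illustrative choices:

```python
import random

def init_population(popsize, n_genes=9, seed=42):
    """create popsize random chromosomes; each is a permutation of the
    nine fertiliser-choice genes, matching the chromosomes of table 1."""
    rng = random.Random(seed)
    population = []
    for _ in range(popsize):
        chromosome = list(range(1, n_genes + 1))
        rng.shuffle(chromosome)          # random but duplicate-free genes
        population.append(chromosome)
    return population

population = init_population(popsize=10)  # popsize individuals
```

seeding the generator is optional; it is used here only to make the sketch reproducible.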
nine chromosomes are formed; the five initial chromosomes are used to calculate the amount of fertilizer, its content, the total price needed, and the fitness value. reproduction is used to produce offspring from the individuals in the population. two genetic operators are used: crossover and mutation. this study uses one-cut-point crossover. the crossover rate (cr) must be determined, since it expresses the ratio of offspring produced by the crossover process: crossover produces cr × popsize offspring. the crossover process is illustrated in table 2 and its steps in figure 3. reciprocal exchange mutation is also used. the mutation rate (mr) must be set in advance as the ratio between the number of offspring generated and the parents involved in the process; the number of offspring produced is mr × popsize. the mutation process is illustrated in table 3 and its steps in figure 4. evaluation calculates the fitness of each chromosome; a greater fitness means the chromosome is a better potential solution. a chromosome always

table 1. implementation of chromosome representation
individual | chromosome (genes 1–9) | total price (rp) | fitness
p1 | 1 5 8 6 2 4 7 9 3 | 623,800 | 1.603

table 2. illustration of crossover
individual | chromosome (genes 1–9) | notes
p1 | 1 5 8 6 2 4 7 9 7 |
p3 | 7 6 3 2 4 5 1 8 9 |
c1 | 1 5 8 7 6 3 2 4 9 | random parents: 1.5; cutpoint: 3
p2 | 1 2 9 4 5 6 7 3 8 |
p5 | 2 5 8 4 9 3 1 6 7 |
c2 | 1 2 9 5 8 4 3 6 7 | random parents: 2.3; cutpoint: 3

table 3. illustration of mutation
individual | chromosome (genes 1–9) | notes
p1 | 1 5 8 6 2 4 7 9 3 |
c3 | 4 5 8 6 2 1 7 9 3 | random parent: 1; random points: 1 and 6
p4 | 7 4 9 5 6 2 1 3 8 |
c4 | 7 2 9 5 6 4 1 3 8 | random parent: 4; random points: 2 and 6
p5 | 2 5 8 4 9 3 1 6 7 |
c5 | 2 5 8 4 1 3 9 6 7 | random parent: 2; random points: 3 and 6
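assuming the permutation encoding implied by tables 2 and 3, the two reproduction operators can be sketched as below. the duplicate-skipping fill after the cut point is inferred from the c1 row of table 2 and is our reading, not code from the paper:

```python
def one_cut_point_crossover(p1, p2, cut=3):
    """genes before the cut come from the first parent; the rest are the
    second parent's genes in order, skipping any gene already taken
    (this keeps the offspring a valid permutation, as in table 2)."""
    head = p1[:cut]
    return head + [g for g in p2 if g not in head]

def reciprocal_exchange_mutation(parent, i, j):
    """random mutation by swapping the genes at two positions (table 3)."""
    child = parent[:]
    child[i], child[j] = child[j], child[i]
    return child

# reproducing offspring c1 of table 2 from parents p1 and p3
c1 = one_cut_point_crossover([1, 5, 8, 6, 2, 4, 7, 9, 3],
                             [7, 6, 3, 2, 4, 5, 1, 8, 9])
# → [1, 5, 8, 7, 6, 3, 2, 4, 9]

# reproducing offspring c3 of table 3 (points 1 and 6, 0-indexed 0 and 5)
c3 = reciprocal_exchange_mutation([1, 5, 8, 6, 2, 4, 7, 9, 3], 0, 5)
# → [4, 5, 8, 6, 2, 1, 7, 9, 3]
```

both calls reproduce the offspring rows of tables 2 and 3 exactly, which is a useful sanity check on the operator definitions.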
contains a fitness value and other properties of the individual, such as the genes that form it and attributes like name, age, and address. the fitness value is expressed in (1).

fitness = 1000 / ((total price / 1000) + (1000 × penalty)) (1)

a fertilizer composition has ideal requirements for the nutrients it contains (the needed nitrogen (n), phosphorus (p), and potassium (k) content). every violation of these ideal values is counted as a penalty, written in (2).

penalty = need of n + need of p + need of k (2)

selection chooses the individuals that survive into the next generation from the combined set of the population and its offspring. a probability function is used to select the individuals to be retained: individuals with higher fitness values have a greater chance of being chosen to represent the next generation. this study uses binary tournament selection; the selection process is shown as a flowchart in figure 5.

iii. results and discussions

this study is implemented as a desktop application in the c# programming language. the interface is made as simple as possible, so users only enter some parameter values to obtain the optimization results. four interfaces are accessible to the user: fertiliser data, fertiliser recommendation reference, input and calculation results of the genetic algorithm, and the genetic algorithm computation.

fig. 3. flowchart of crossover process
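equations (1) and (2) can be written directly in python; plugging in the case 1 price of idr 623,800 with all nutrient needs met reproduces the reported best fitness of 1.603 (the function names are ours):

```python
def penalty(need_n, need_p, need_k):
    """equation (2): total unmet requirement for n, p, and k."""
    return need_n + need_p + need_k

def fitness(total_price, need_n=0.0, need_p=0.0, need_k=0.0):
    """equation (1): fitness favours cheap compositions; any unmet
    nutrient need is weighted heavily by the 1000 * penalty term."""
    return 1000.0 / (total_price / 1000.0
                     + 1000.0 * penalty(need_n, need_p, need_k))

best = fitness(623_800)   # case 1 best price, all nutrient needs met
# round(best, 3) gives 1.603, the best fitness reported in table 1
```

the 1000 × penalty weighting means even a small nutrient violation dominates the price term, so feasible compositions always outrank cheaper but nutrient-deficient ones.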
the user enters the parameter values in the main interface on the menu for input and calculation results of the genetic algorithm. the input interface and the display of the genetic algorithm results are shown in figure 6. afterwards, the user can follow the genetic algorithm calculation produced from that input; the calculation interface is shown in figure 7. the population size test is conducted to determine the population size that produces the best fitness, as well as the effect of population size on the fitness value. for this test, the number of generations was 10 and the population size was varied in steps of 10, from 10 to 100, with a crossover rate cr = 0.4 and a mutation rate mr = 0.6. each experiment was performed 10 times and the average fitness was calculated. figure 8 shows that the fitness value increases from population size 10 to 20 and stays at the same fitness value from size 20 to 100 over the 10 attempts. in this experiment, the best average fitness value was 1.603. the higher the fitness produced, the better the result of the genetic algorithm. it can be concluded that the fitness value is influenced by the population size, but beyond a certain size no further increase in fitness occurs, while the computation time of the software grows. these experiments produced the optimal solution at population sizes from 20 to 100, with an average fitness of 1.603. generation testing is conducted to determine the number of generations that produces the best fitness, as well as the influence of the generation number on the fitness value. for this test, the population size was 10 and the number of generations was varied in steps of 10, from 10 to 100.
the crossover rate and mutation rate were again cr = 0.4 and mr = 0.6; each experiment was performed 10 times and the average fitness was calculated.

fig. 4. mutation flowchart

figure 9 shows that the fitness value increases at generation sizes 10 to 30 and 70 to 100, with the lowest fitness value at generation size 10. the generation-size trial produced an average fitness of 1.603, with 1.603 as the best value. the higher the fitness value produced, the better the result of the genetic algorithm. it can be concluded that the fitness value is influenced by the number of generations: the larger the number of generations, the better the fitness values obtained. these experiments produced the optimal solution at generation sizes 30 to 50 and 70 to 100, with an average fitness of 1.603.

fig. 5. flowchart of selection process

fig. 6. graphical user interface of genetic algorithm input

the crossover rate and mutation rate test is conducted to determine the combination of crossover rate and mutation rate that produces the best fitness, and the influence of these rates on the fitness value. for this test, the population size was 10 and the number of generations was 10, while the crossover rate and mutation rate were varied in steps of 0.1 from 0 to 1.
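binary tournament selection, as described above and shown in figure 5, can be sketched as follows (our illustration; the paper's own implementation is in c#):

```python
import random

def binary_tournament(population, fitnesses, n_select, seed=0):
    """binary tournament selection: repeatedly draw two individuals at
    random from the combined pool and keep the fitter one, so higher
    fitness means a greater chance of entering the next generation."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(n_select):
        i = rng.randrange(len(population))
        j = rng.randrange(len(population))
        winner = i if fitnesses[i] >= fitnesses[j] else j
        chosen.append(population[winner])
    return chosen

# keep popsize survivors from the combined parents + offspring pool
survivors = binary_tournament([[1, 2], [2, 1]], [1.603, 1.525], n_select=2)
```

because only pairwise comparisons are used, tournament selection needs no fitness scaling: it depends only on the ranking of the two drawn individuals, not on the magnitude of their fitness values.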
each experiment was performed 10 times and the average fitness calculated. figure 10 shows that the combination of crossover rate and mutation rate affects the fitness. the highest fitness value occurs at a crossover rate of 0.1 and a mutation rate of 0.9, while the lowest occurs at a crossover rate of 0.9 and a mutation rate of 0.1. the crossover-rate/mutation-rate trial produced an average fitness of 1.603, with 1.603 as the best value. the higher the fitness value produced, the better the result of the genetic algorithm. it can be concluded that the fitness value is influenced by the crossover rate and mutation rate. these experiments produced the optimal solution with an average fitness of 1.603. the best-parameters test has 3 scenarios using 3 different cases. the genetic algorithm parameter values used are the best parameters found in the population, generation, and crossover/mutation-rate trials: a population size of 20, a generation size of 30, a crossover rate of 0.1, and a mutation rate of 0.9. each experiment was carried out 3 times.

fig. 7. graphical user interface of genetic algorithm calculation

fig. 8. graph of population size testing results (fitness rises from 1.564 at population size 10 to 1.603 at sizes 20–100)

• trial case 1: target of 7 tons, land area of 1 ha, malang regency, ampel gading district. the best solution found is: kcl = 49 kg, za = 300 kg, sp-36 = 48 kg, total price idr 623,800.00. the best chromosome obtained in the case 1 trial is shown in table 4.
• trial case 2:
target is 8 tons, land area is 1 ha, malang regency, ampel gading district. the best solution found is: kcl = 58 kg, za = 300 kg, sp-36 = 54 kg, total price idr 655,600.00. the best chromosome obtained in the case 2 trial is shown in table 5.
• trial case 3: target is 9 tons, land area is 1 ha, malang regency, ampel gading district. the best solution found is: kcl = 66 kg, za = 300 kg, sp-36 = 60 kg, total price idr 685,200.00. the best chromosome obtained in the case 3 trial is shown in table 6.

in case 1 (target 7 tons, 1 ha of land, malang regency, ampel gading district) the best fitness shown is 1.603, with a best total price of rp 623,800. in case 2 (target 8 tons, 1 ha, same location) the best fitness shown is 1.525, with a best total price of rp 655,600. in case 3 (target 9 tons, 1 ha, same location) the best fitness shown is 1.459, with a best total price of rp 685,200. the best solution to the rice fertilizer composition problem is measured by the highest fitness value, which is obtained from the smallest penalty value at minimum price. from the testing that has been done, a genetic algorithm can be used to solve the fertiliser composition optimisation problem for a good rice crop.

fig. 9. generation testing results (fitness per generation size 10–100: 1.538 at 10, 1.569 at 20, 1.603 at most larger sizes, dipping to 1.563 at 60)

fig. 10. graph of crossover rate and mutation rate testing results (fitness per cr/mr combination from 1/0 to 0/1, ranging from 1.403 at cr 0.9/mr 0.1 to 1.603 at cr 0.1/mr 0.9)

iv.
conclusions

from the experiments conducted in this study, it can be concluded that a genetic algorithm can be used to compose rice plant fertiliser. the problem is solved in the genetic algorithm using a real-coded chromosome representation, one-cut-point crossover and random mutation for reproduction, and binary tournament selection. the quality of the best solution was measured by the highest fitness value, which corresponds to the smallest penalty value and the minimum price. the tests show that the best solution has a fitness value of 1.603 and a price of rp 623,800. furthermore, tests were conducted to examine the influence of the genetic parameters on the fitness value: the population test, the generation-number test, and the test of combinations of crossover rate (cr) and mutation rate (mr). the population trial produced its optimal value at a population size of 20, with a best fitness of 1.603. the generation test produced its optimal value at a generation size of 30, with a best fitness of 1.603. the trials combining crossover rate (cr) and mutation rate (mr) found the best combination at cr 0.1 and mr 0.9, with a fitness value of 1.603. the conclusion is that a larger population can produce a better fitness value; a larger number of generations does not have much impact on the best fitness value; and the crossover rate (cr) and mutation rate (mr) affect the fitness value but do not necessarily yield optimal results. higher parameter values also lengthen the computation time of the algorithm.

acknowledgement

without the outstanding support of the computer science faculty, universitas brawijaya, this paper and the work behind it would not have been possible.
We are also grateful to the anonymous peer reviewers for their insightful comments; the kindness and knowledge of one and all strengthened this analysis in countless ways and rescued us from many errors.
Declarations
Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.
Table 4. Chromosome obtained from the case 1 trial
Individual  Chromosome (genes 1–9)  Total price (Rp)  Fitness
P1          2 4 8 1 9 3 6 7 5       623,800           1.603
Table 5. Chromosome obtained from the case 2 trial
Individual  Chromosome (genes 1–9)  Total price (Rp)  Fitness
P1          2 8 4 1 3 9 7 5 6       655,600           1.525
Table 6. Chromosome obtained from the case 3 trial
Individual  Chromosome (genes 1–9)  Total price (Rp)  Fitness
P1          1 2 5 8 4 6 3 7 9       685,200           1.459
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol 2, No 1, June 2019, pp.
1–9, eISSN 2597-4637, https://doi.org/10.17977/um018v2i12019p1-9.
Crude Palm Oil Prediction Based on Backpropagation Neural Network Approach
Hijratul Aini 1 and Haviluddin 2,*
Faculty of Computer Science and Information Technology, Mulawarman University, Jl. Kuaro No.1, Samarinda 75123, Indonesia
1 hijratulaini10@gmail.com; 2 haviluddin@unmul.ac.id*
* corresponding author
I. Introduction
Indonesia is the largest crude palm oil (CPO) producer in the world. In 2018, Indonesia produced 43 million tons of CPO from 14.03 million hectares of plantation; consequently, CPO makes a significant contribution to the national economy [1][2][3]. CPO production management is therefore very necessary, and it should be supported by precise estimation based on production data from previous years. Numerous methods are used to obtain accurate prediction results, including statistical methods (e.g., ARMA, ARIMA, SARIMA, and ES) and intelligent computing methods (e.g., fuzzy logic and neural networks) [4][5][6][7]. A study [8] used the SARIMA method to predict crude palm oil production in Terengganu, Malaysia, on a dataset of CPO and palm kernel production from June 2001 until May 2011; the results showed that the SARIMA method was able to predict quite well. Furthermore, [6] used the ANFIS and ARFIMA methods to predict the CPO price in Malaysia on price data from January 2004 until December 2011, showing that both models can be used to predict CPO prices. On the other hand, [9] implemented intelligent methods such as support vector machines (SVM) and neural networks (NN) to predict the prices of crude oil, palm oil, rubber, and gold.
The researchers confirmed that the intelligent algorithms were able to predict more accurately than a statistical method (random forest), and the prediction results showed that the four commodities greatly affect Malaysia's income. Moreover, [10] implemented an intelligent method, namely nonlinear autoregressive with exogenous input (NARX), with three training algorithms: Levenberg-Marquardt, Bayesian regularization, and scaled conjugate gradient. The research demonstrated that the NARX method was able to predict CPO prices accurately. This paper aims to apply an artificial intelligence method, namely the backpropagation neural network (BPNN), to predict CPO production. This article consists of four sections: Section 1 gives the motivation for the work and the related research; Section 2 describes the method used for prediction; Section 3 presents the experiments; and Section 4 gives the results, discussion, and a summary of the study. The analysis results are expected to support management in planning CPO production.
Article history: Received 22 April 2019; Revised 12 May 2019; Accepted 22 May 2019; Published online 23 June 2019
Abstract: Crude palm oil (CPO) production data at PT. Perkebunan Nusantara (PTPN) XIII from January 2015 to January 2018 have been studied. This paper aims to predict CPO production using an intelligent algorithm called the backpropagation neural network (BPNN). The accuracy of the prediction has been measured by mean square error (MSE). The experiment showed that the best hidden layer architecture (HLA) is 5-10-11-12-13-1 with a learning function (LF) of trainlm, activation functions (AF) of logsig and purelin, and a learning rate (LR) of 0.5. This architecture has a good accuracy, with an MSE of 0.0643. The results showed that this model can predict CPO production in 2019.
Keywords: CPO; machine learning; BPNN; parameters; MSE; prediction
II. Methods
Prediction is an art and a science that anticipates future events; in other words, prediction requires historical data with the aim of estimating the future. The field of prediction research is increasingly important, especially in economics, and prediction of production is required when market conditions are complex and dynamic. Accurate predictions are therefore necessary to assist management decision making. Numerous algorithms, from traditional to intelligent ones, have been developed in the prediction area. In this paper, historical data on crude palm oil production are analysed using an intelligent algorithm [11][12][13]. This section briefly explains prediction, the BPNN algorithm, and the historical data used.
The backpropagation neural network (BPNN) algorithm is an intelligent method that aims to reduce the error rate in prediction. The method adjusts its weights based on the difference between the desired target and the actual output. The BPNN principle is a multilayer training method using three layers, namely an input layer, a hidden layer, and an output layer, together with a weight update process [14]. The BPNN is a development of the single-layer network, which has only an input and an output layer. By using a hidden layer, the error value of the network is smaller than that of the single-layer network: the hidden layer is where the weights are updated and adjusted, so that the new weight values are directed towards the desired output target [14][15]. The BPNN architecture and flowchart can be seen in Figure 1 and Figure 2. The BPNN steps are listed as follows [13][14][16][17].
• Step 1: initialize the weights with small random values
• Step 2: while the stop condition is not fulfilled, do steps 3 to 8
• Step 3: each input unit receives an input signal x_i and forwards it to the hidden units
• Step 4: each hidden unit sums its weighted input signals
• Step 5: each output unit sums its weighted hidden-layer inputs
• Step 6: each output unit calculates the error of its layer
• Step 7: each hidden unit sums the error contributions from the units in the layer above
• Step 8: each unit updates its weights and bias
• Step 9: stop if the condition is met
A. Feed-forward steps
The feed-forward phase comprises steps 3 to 5. In step 3, each input unit (x_i, i = 1, ..., n) receives an input signal x_i and forwards it to the hidden units. In step 4, each hidden unit (z_j, j = 1, ..., p) sums its weighted input signals with (1):

z_in_j = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}    (1)

where z_j is hidden neuron j, v_{0j} is the bias weight of hidden neuron j, x_i is input neuron i, and v_{ij} is the weight from input neuron i to hidden neuron j. Applying the activation function gives (2):

z_j = f(z_in_j)    (2)

where z_j is unit j in the hidden layer and z_in_j is its net input. For example, the activation function used is the sigmoid in (3):

f(x) = 1 / (1 + e^{-x})    (3)

The next process sends all hidden activations to the output units: in step 5, each output unit (y_k, k = 1, ..., m) sums its weighted inputs using (4):

y_in_k = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}    (4)

where y_in_k is the net input of output unit y_k, w_{0k} is the bias weight of output neuron k, z_j is unit j in the hidden layer, and w_{jk} is the weight from hidden neuron j to output neuron k. Finally, the activation function is applied using (5):

y_k = f(y_in_k)    (5)

B. Backward steps
The backward phase comprises steps 6 and 7. In step 6, each output unit (y_k, k = 1, ..., m) calculates the error of its layer using (6).
δ_k = (t_k − y_k) f'(y_in_k)    (6)

where δ_k is the weight correction factor for w_{jk}, t_k is the target, y_k is the output of neuron k, and y_in_k is the net input of output unit y_k. The weight and bias corrections are then calculated using (7):

Δw_{jk} = α δ_k z_j,    Δw_{0k} = α δ_k    (7)

where Δw_{jk} is the correction of the weight w_{jk} between steps t and t+1, Δw_{0k} is the correction of the bias weight of output neuron k, α is the learning rate, and z_j is the activation of hidden unit j. In step 7, each hidden unit (z_j, j = 1, ..., p) sums the error contributions from the units in the layer above using (8):

δ_in_j = \sum_{k=1}^{m} δ_k w_{jk}    (8)

where δ_k is the weight correction factor for w_{jk} and w_{jk} is the weight from hidden neuron j to output neuron k.
Fig. 1. BPNN architecture
Fig. 2. BPNN flowchart
In the next process, the error of each hidden unit is calculated using (9):

δ_j = δ_in_j f'(z_in_j)    (9)

where δ_j is the weight correction factor for v_{ij}. The corrections of the input-to-hidden weights and biases are calculated using (10):

Δv_{ij} = α δ_j x_i,    Δv_{0j} = α δ_j    (10)

where Δv_{ij} is the correction of the weight from input neuron i to hidden neuron j, α is the learning rate, δ_j is the weight correction factor for v_{ij}, and x_i is input neuron i.
C. Update weight and bias steps
The update phase comprises step 8, where each output unit (y_k, k = 1, ..., m) updates its weights and bias (j = 0, 1, ..., p) using (11):

w_{jk}(new) = w_{jk}(old) + Δw_{jk}    (11)

where w_{jk} is the weight from hidden neuron j to output neuron k and Δw_{jk} is its correction. Each hidden unit (z_j, j = 1, ..., p) updates its weights and bias (i = 0, 1, ..., n) using (12):

v_{ij}(new) = v_{ij}(old) + Δv_{ij}    (12)

where v_{ij} is the weight from input neuron i to hidden neuron j and Δv_{ij} is its correction.
In this study, the historical data were obtained from PT. Perkebunan Nusantara XIII, Long Pinang village, Paser, East Kalimantan, Indonesia: the fresh fruit bunch (FFB) harvest data from 2015 to 2018.
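The feed-forward, backward, and update equations (1)–(12) can be sketched in plain Python. This is only an illustration with a single hidden layer and plain gradient descent; the paper's best model is deeper (5-10-11-12-13-1) and is trained with the trainlm (Levenberg-Marquardt) function, so the 5-10-1 shape, the random initialization, and the single training pattern below (values taken loosely from Table 1) are illustrative assumptions.

```python
import math
import random

random.seed(1)

def sigmoid(x):                                # eq. (3)
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, t, v, v0, w, w0, alpha=0.5):
    """One plain gradient-descent backpropagation step (eqs. 1-12)."""
    n, p, m = len(x), len(v0), len(w0)
    # Feed-forward
    z_in = [v0[j] + sum(x[i] * v[i][j] for i in range(n)) for j in range(p)]   # eq. (1)
    z = [sigmoid(s) for s in z_in]                                             # eq. (2)
    y_in = [w0[k] + sum(z[j] * w[j][k] for j in range(p)) for k in range(m)]   # eq. (4)
    y = [sigmoid(s) for s in y_in]                                             # eq. (5)
    # Backward: for the sigmoid, f'(y_in) = y * (1 - y)
    dk = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(m)]               # eq. (6)
    d_in = [sum(dk[k] * w[j][k] for k in range(m)) for j in range(p)]          # eq. (8)
    dj = [d_in[j] * z[j] * (1.0 - z[j]) for j in range(p)]                     # eq. (9)
    # Weight and bias updates
    for j in range(p):                                                         # eqs. (7), (11)
        for k in range(m):
            w[j][k] += alpha * dk[k] * z[j]
    for k in range(m):
        w0[k] += alpha * dk[k]
    for i in range(n):                                                         # eqs. (10), (12)
        for j in range(p):
            v[i][j] += alpha * dj[j] * x[i]
    for j in range(p):
        v0[j] += alpha * dj[j]
    return y

# Tiny 5-10-1 network (the paper's first layer sizes), random small weights
n_in, n_hid, n_out = 5, 10, 1
v = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
v0 = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
w = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hid)]
w0 = [random.uniform(-0.5, 0.5) for _ in range(n_out)]

x, t = [0.4460, 0.5143, 0.5052, 0.5010, 0.3510], [0.3315]  # one normalized pattern
errs = [(t[0] - train_step(x, t, v, v0, w, w0)[0]) ** 2 for _ in range(200)]
```

Repeated steps drive the squared error of the pattern towards zero, which is the behaviour the weight-update phase is designed to achieve.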
The normalized FFB data can be seen in Table 1. The algorithm's performance must be measured; statistical measures (e.g., SSE, R, R2, MAPE, and MSE) are usually used for this purpose [7]. In this paper, the mean squared error (MSE) is used to evaluate the BPNN algorithm's predictions. The MSE is the sum of the squares of all prediction errors in each period, divided by the number of prediction periods [18][19][20]. The MSE is calculated using (13):

MSE = (1/n) \sum_{i=1}^{n} (Y_i − Ŷ_i)^2    (13)

where Y_i is the real value and Ŷ_i is the predicted value.
Table 1. Harvest data of TBS Inti Tajati (2015–2018) after normalization
Month      2015    2016    2017    2018
January    0.4460  0.5143  0.5052  0.5010
February   0.3510  0.3315  0.5960  0.3398
March      0.3695  0.2659  0.4005  0.1143
April      0.4249  0.1523  0.3348  0.1000
May        0.5156  0.1256  0.3207  0.1501
June       0.5862  0.2659  0.3163  0.1000
July       0.6061  0.2724  0.5103  0.1000
August     0.5214  0.4077  0.1000  0.1930
September  0.7185  0.6322  0.5582  0.6115
October    0.9000  0.7779  0.7613  0.7360
November   0.8830  0.8298  0.7613  0.1000
December   0.7977  0.7015  0.6097  0.1000
III. Results and Discussion
In this experiment, crude palm oil prediction was tested to obtain a good BPNN model, so a trial-and-error approach was implemented. Several variables were explored, including the hidden layer architecture (HLA), learning function (LF), activation function (AF), and learning rate (LR); the BPNN variables can be seen in Table 2. Following neural network practice, the 48 data points were divided in two: 36 for training and 12 for testing. Meanwhile, five inputs (the four years 2015–2018 plus a bias) and one output were utilized.
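Table 1's values span exactly 0.1000–0.9000, which suggests min-max scaling into that range; the exact normalization formula is not stated in the paper, so the [0.1, 0.9] range in the sketch below is an assumption inferred from the table. The MSE of (13) is included as well.

```python
def minmax_normalize(series, lo=0.1, hi=0.9):
    """Scale raw harvest figures into [lo, hi].

    The [0.1, 0.9] range is an assumption inferred from Table 1,
    whose normalized values span exactly 0.1000-0.9000.
    """
    mn, mx = min(series), max(series)
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in series]

def mse(actual, predicted):
    """Mean squared error, eq. (13)."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

# Raw monthly production figures (the 2019 actuals from Table 5), used here
# only as sample input for the normalization step
raw = [1983190, 1185890, 70830, 0, 247820, 0, 0, 459980, 2529800, 3145660, 0, 0]
norm = minmax_normalize(raw)
```

After scaling, the smallest raw value maps to 0.1 and the largest to 0.9, matching the span seen in Table 1.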
Based on the experiment, the hidden layer architectures (HLA) were 5-10-11-1 (2); 5-10-11-12-13-1 (4); 5-10-11-11-12-12-13-1 (6); and 5-10-11-11-12-12-12-13-13-1 (8), where the number in parentheses is the number of hidden layers. The learning functions (LF) were trainlm, traingd, and traingdx. The activation function (AF) on the input and hidden layers was logsig, and on the output layer purelin. The learning rates (LR) were 0.1, 0.3, 0.5, and 0.7; the other variable values, a maximum epoch of 1,000 and an error limit of 0.01, were also evaluated. Afterwards, the mean square error (MSE) was used to statistically measure the forecasting accuracy; in principle, the best BPNN architecture and variables are those with the lowest MSE value. Table 3 shows the results of BPNN training and testing. After many experiments, the BPNN architecture using the hidden layer architecture 5-10-11-12-13-1 (4), the learning function trainlm, the activation function logsig on the input and hidden layers and purelin on the output layer, learning rates of 0.5 and 0.7, a maximum epoch of 1,000, and an error limit of 0.01 produced a good model with an MSE of 0.0643. The result can be seen in Table 4. Based on the best BPNN parameters (Table 4), forecasting for the following year was carried out. In this test, the first BPNN model (with LR = 0.5) was used to predict the next year's (2019) production. As shown in Figure 3 and Figure 4, the training and testing outputs of the first BPNN model have almost the same values as the target.
Table 2. BPNN parameters
Variable                         Values
Hidden layer architecture (HLA)  5-10-11-1 (2); 5-10-11-12-13-1 (4); 5-10-11-11-12-12-13-1 (6); 5-10-11-11-12-12-12-13-13-1 (8)
Learning function (LF)           trainlm; traingd; traingdx
Activation function (AF)         logsig; purelin
Learning rate (LR)               0.1; 0.3; 0.5; 0.7
Fig. 3. Plot of BPNN training results with LR 0.5
Table 3.
BPNN training and testing results (farm: TBS Inti Tajati)
No  Hidden layers  Training function (TF)  Learning rate (LR)  MSE training  MSE testing
1   2   trainlm    0.1  0.0075  0.0962
2   2   traingd    0.1  0.0272  0.0173
3   2   traingdx   0.1  0.0311  0.1079
4   2   trainlm    0.3  0.0030  0.1291
5   2   traingd    0.3  0.0231  0.3381
6   2   traingdx   0.3  1.7055  0.0873
7   2   trainlm    0.5  0.0029  0.1088
8   2   traingd    0.5  1.0789  0.1290
9   2   traingdx   0.5  0.0349  0.1596
10  2   trainlm    0.7  0.0093  0.0431
11  2   traingd    0.7  0.7285  0.0706
12  2   traingdx   0.7  0.0182  0.1100
1   4   trainlm    0.1  0.0087  0.0091
2   4   traingd    0.1  0.0119  0.0446
3   4   traingdx   0.1  0.0081  0.0240
4   4   trainlm    0.3  0.0035  0.1237
5   4   traingd    0.3  0.0089  0.0952
6   4   traingdx   0.3  0.0174  0.0806
7   4   trainlm    0.5  0.0147  0.0033
8   4   traingd    0.5  0.0099  0.0382
9   4   traingdx   0.5  0.0194  0.2411
10  4   trainlm    0.7  0.0072  0.0015
11  4   traingd    0.7  0.0158  0.1128
12  4   traingdx   0.7  0.0380  0.1350
1   6   trainlm    0.1  0.0071  0.0211
2   6   traingd    0.1  0.0148  0.0601
3   6   traingdx   0.1  0.0431  0.0673
4   6   trainlm    0.3  0.0184  0.1616
5   6   traingd    0.3  0.0283  0.0580
6   6   traingdx   0.3  0.0420  0.0943
7   6   trainlm    0.5  0.0118  0.1586
8   6   traingd    0.5  0.0195  0.0809
9   6   traingdx   0.5  0.0990  0.0521
10  6   trainlm    0.7  0.0127  0.1101
11  6   traingd    0.7  0.0455  0.1083
12  6   traingdx   0.7  0.0103  0.1288
1   8   trainlm    0.1  0.0368  0.0467
2   8   traingd    0.1  0.0455  0.1045
3   8   traingdx   0.1  0.0658  0.0331
4   8   trainlm    0.3  0.0078  0.6451
5   8   traingd    0.3  0.6687  1.6575
6   8   traingdx   0.3  0.0103  0.0463
7   8   trainlm    0.5  0.0191  0.1633
8   8   traingd    0.5  0.4928  0.8608
9   8   traingdx   0.5  0.0455  0.1114
10  8   trainlm    0.7  0.0192  0.0425
11  8   traingd    0.7  0.8493  0.3114
12  8   traingdx   0.7  0.0473  0.0883
Furthermore, the CPO prediction for January–December 2019 was carried out with the first BPNN architecture. Figure 5 shows the CPO prediction: it slowly increases until April, then drops in May and June; afterwards, it increases until October and decreases again until December. Table 5 shows the prediction results based on the first BPNN model.
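The selection principle stated above, that the best BPNN variables are those with the lowest MSE, can be sketched in a few lines. The tuples below are a hand-copied subset of the Table 3 rows, for illustration only, and the selection is by lowest testing MSE.

```python
# A few (hidden layers, training function, learning rate, MSE train, MSE test)
# rows transcribed from Table 3 -- a subset, for illustration only.
results = [
    (2, "trainlm", 0.5, 0.0029, 0.1088),
    (4, "trainlm", 0.5, 0.0147, 0.0033),
    (4, "trainlm", 0.7, 0.0072, 0.0015),
    (6, "trainlm", 0.1, 0.0071, 0.0211),
    (8, "traingdx", 0.1, 0.0658, 0.0331),
]

# The best variables are those with the lowest testing MSE
best = min(results, key=lambda row: row[4])
```

On this subset the selection lands on the four-hidden-layer trainlm configuration, in line with the architecture family the paper reports as best.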
The average prediction is 1,668,008.93 with an error of -0.175. It can be concluded from the table that there are movements, i.e., increases and decreases, in monthly CPO production.
Fig. 4. Plot of BPNN testing with LR 0.5
Fig. 5. CPO prediction for the year 2019
Table 4. The best parameters (farm: TBS Inti Tajati; hidden layers: 4; architecture: 5-10-11-12-13-1; functions: trainlm and purelin)
LR   MSE training  MSE testing
0.5  0.0049        0.0643
0.7  0.0214        0.0652
IV. Conclusion
The implementation of the backpropagation neural network (BPNN) method has been presented. In this study, several variable values (i.e., the hidden layer architecture (HLA), learning function (LF), activation function (AF), and learning rate (LR), together with other parameter values such as the maximum epoch and error limit) were investigated. Based on the experiment, the BPNN architecture 5-10-11-12-13-1 with a learning rate of 0.5, learning function trainlm, and activation functions logsig and purelin has a very good accuracy, with a mean square error (MSE) of 0.064249; therefore, this model can be used to predict crude palm oil production in 2019. BPNN with metaheuristic optimization will be investigated in future experiments.
Acknowledgment
This research was partially supported by the Artificial Intelligence Research Center, Faculty of Computer Science and Information Technology (CSIT), Universitas Mulawarman. We thank our colleagues from PT. Perkebunan Nusantara (PTPN) XIII, Long Pinang village, Paser regency, East Kalimantan, who provided insight and expertise that greatly assisted the research.
References
[1] Kementan, "Kementan: industri kelapa sawit berkontribusi besar terhadap ekonomi," Kompas.com, 2018.
[2] Badan Pusat Statistik, 2018. [Online]. Available: https://bps.go.id/subject/6/tenagakerja.html#subjekviewtab3. [Accessed: 27-Aug-2018].
[3] A. Norhidayu, M. Nur-Syazwani, R. Radzil, I. Amin, and N. Balu, "The production of crude palm oil in Malaysia," Int. J. Econ. Manag., vol. 11, no. 3 special issue, pp. 591–606, 2017.
[4] Haviluddin and N. Dengen, "Comparison of SARIMA, NARX and BPNN models in forecasting time series data of network traffic," in Proc. 2016 2nd International Conference on Science in Information Technology (ICSITech): Information Science for Green Society and Environment, 2017.
[5] H. Haviluddin and A. Jawahir, "Comparing of ARIMA and RBFNN for short-term forecasting," Int. J. Adv. Intell. Informatics, vol. 1, no. 1, pp. 15–22, 2015.
[6] A. A. Karia, I. Bujang, and I. Ahmad, "Forecasting on crude palm oil prices using artificial intelligence approaches," Am. J. Oper. Res., 2013.
[7] Purnawansyah, Haviluddin, R. Alfred, and A. F. O. Gaffar, "Network traffic time series performance analysis using statistical methods," Knowl. Eng. Data Sci., vol. 1, no. 1, pp. 1–7, 2018.
[8] S. Ahmad and H. A. Latif, "Forecasting on the crude palm oil and kernel palm production: seasonal ARIMA approach," in 2011 IEEE Colloquium on Humanities, Science and Engineering Research (CHUSER 2011), Dec. 5–6, 2011, Penang, 2011, pp. 939–944.
[9] S. Ramakrishnan, S. Butt, M. A. Chohan, and H. Ahmad, "Forecasting Malaysian exchange rate using machine learning techniques based on commodities prices," in International Conference on Research and Innovation in Information Systems (ICRIIS), 2017.
Table 5.
Prediction results, TBS Inti Tajati, 2019
Month      Prediction     Actual       Difference    Error
January    2,232,026.331  1,983,190    -248,836.3    -0.0503
February   1,468,812.350  1,185,890    -282,922.3    -0.0572
March      1,285,529.301  70,830       -1,214,699    -0.2456
April      1,312,788.635  0            -1,312,789    -0.2654
May        1,237,399.359  247,820      -989,579.4    -0.2001
June       1,312,788.635  0            -1,312,789    -0.2654
July       1,312,788.635  0            -1,312,789    -0.2654
August     1,217,143.764  459,980      -757,163.8    -0.1531
September  2,771,911.583  2,529,800    -242,111.6    -0.0489
October    3,239,341.295  3,145,660    -93,681.29    -0.0189
November   1,312,788.635  0            -1,312,789    -0.2654
December   1,312,788.635  0            -1,312,789    -0.2654
Total      20,016,107.16  9,623,170    -10,392,937   -2.1012
Average    1,668,008.93   801,930.833  -866,078.1    -0.1751
[10] D. H. Arasim and A. A. Karia, "Identifying and forecasting the factors that derive CPO prices in Malaysia using NARX model," Int. J. of Case Studies, vol. 4, no. 2, pp. 04–14, 2015.
[11] M. Geurts, G. E. P. Box, and G. M. Jenkins, "Time series analysis: forecasting and control," J. Mark. Res., 2006.
[12] L. Seymour, P. J. Brockwell, and R. A. Davis, "Introduction to time series and forecasting," J. Am. Stat. Assoc., 2006.
[13] Z.-Y. Wang, Y.-C. Lin, S.-J. Lee, and C.-C. Lai, "A time series forecasting method," ITM Web Conf., 2017.
[14] R. Rojas, "The backpropagation algorithm," in Neural Networks, 2011.
[15] Purnawansyah and H. Haviluddin, "Comparing performance of backpropagation and RBF neural network models for predicting daily network traffic," in 2014 Makassar International Conference on Electrical Engineering and Informatics (MICEEI), 2014, pp. 166–169.
[16] Haviluddin and R. Alfred, "A genetic-based backpropagation neural network for forecasting in time-series data," in Proc. 2015 International Conference on Science in Information Technology: Big Data Spectrum for Future Information Economy (ICSITech 2015), 2016.
[17] M. Lehtokangas, "Modelling with constructive backpropagation," Neural Networks, 1999.
[18] B. K.
Nelson, "Statistical methodology: V. Time series analysis using autoregressive integrated moving average (ARIMA) models," Acad. Emerg. Med., 1998.
[19] A. Kaya, "Statistic modelling for outlier factors," 2010.
[20] J. Fürnkranz et al., "Mean squared error," in Encyclopedia of Machine Learning, 2010.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, Vol 6, No 2, October 2023, pp. 199–214, eISSN 2597-4637, https://doi.org/10.17977/um018v6i22023p199-214.
Recurrent Session Approach to Generative Association Rule Based Recommendation
Tubagus Arief Armanda a,1, Ire Puspa Wardhani a,2, Tubagus M. Akhriza b,3,*, Tubagus M. Adrie Admira a,4
a STMIK Jakarta STI&K, Jl. BRI Radio Dalam No.17, Jakarta Selatan 12140, Indonesia
b STMIK Pradnya Paramita (STIMATA), Jl. Laksda Adi Sucipto 249A, Malang 65126, Indonesia
1 tb_armanda@yahoo.com; 2 irepuspa@gmail.com; 3 akhriza@stimata.ac.id*; 4 adrie.admira@jak-stik.ac.id
* corresponding author
I. Introduction
The recommendation system (RS) has become a mandatory feature in e-commerce [1][2][3]. This system principally filters large-scale transaction data to produce a list of items that e-commerce application users might like or even buy. An RS generates personalized recommendations for individual users, and this is effective if the user is logged in, because the data regarding items that the user has personally purchased or rated has been recorded, so the resulting recommendations can be relevant to the user's preferences. For personalized recommendations, an RS can be built with a collaborative approach by measuring the similarity between the item features that a user u likes and those liked by other users [4][5]; items that have never been rated by u but have been rated by other users will be offered to u.
The preferences of u are represented by an item vector I_u, which contains the rating value given by u to each item. The similarity of I_u to I_p, the item vector belonging to another user p, is calculated according to a distance formula d(I_u, I_p). If there is no rating data, then the system utilizes the features of items that u once liked or bought, for example, the descriptions of films or books [6][7], or the categories or ingredients of food menus [8][9]. When u is looking for an item x with description d_x, the system will look for other items, for example y, with a description d_y that is similar to d_x; the similarity is measured by a distance formula d(d_y, d_x), where d_x and d_y are the feature vectors of items x and y, respectively. Popular distance measures include the cosine, Euclidean, Manhattan, and Jaccard coefficients. These collaborative and content filtering approaches are practical if the user has logged into the system, where the RS then scans the database of transactions the user has made with items in the store.
Article history: Received 09 July 2023; Revised 29 July 2023; Accepted 30 October 2023; Published online 02 November 2023
Abstract: This article introduces a generative association rule (AR)-based recommendation system (RS) using a recurrent neural network approach, implemented when a user searches for an item in a browsing session. It is proposed to overcome the limitations of the traditional AR-based RS, which implements query-based sessions that are not adaptive to the input series and thus fail to generate recommendations. The dataset used is real retail transaction data from online stores in Europe. The contribution of the proposed method is a next-item prediction model using LSTM, but what is trained to develop the model is a string of associative rules, not a string of items in a purchase transaction. The proposed model predicts the next item generatively, while the traditional method does so discriminatively.
As a result, for a series of items that the user has viewed in a browsing session, the model can always recommend the following items, whereas the traditional methods cannot. In addition, the results of user-centered validation on several metrics show that although the accuracy (similarity) between the recommended products and the products seen by users is only 20%, other metrics, such as novelty, diversity, attractiveness, and enjoyability, reach above 70%.
Keywords: association rules; recommendation system; recurrent neural network; long short-term memory; session-based recommendation
In case the application user is not logged in, an association rule (AR)-based RS can be applied, in which recommendations are generated from rules X→Y mined from transactional data T [6][10]. X is called the antecedent and Y the consequent of the rule; in practice, X and Y are presented as bags of item IDs (itemId) or itemId vectors. In this article, itemId refers to an item with a unique identity code. Items X and Y are associated not because of the similarity of their descriptions or user-given ratings, but because they fulfil two main interestingness metrics: support and confidence. Consequently, an AR-based RS provides a variety of item recommendations. The support of X, written sup(X) as in (1), represents the fraction of transactions t_i in T that contain the itemset X; X can contain one or more items, and X ⊆ t_i ∈ T indicates that the itemset X is a subset of a record in T, which in principle is also an itemset.
sup(X) = |{ti ∈ T : X ⊆ ti}| / |T| (1)

the confidence of x→y, written conf(x→y) as in (2), represents the probability that if x appears in a transaction, then y also appears:

conf(X→Y) = sup(XY) / sup(X), with X, XY ⊆ t ∈ T (2)

rules are mined from t if they meet the minimum support (minsup) and minimum confidence (minconf) thresholds set by the data miner. an itemset that satisfies minsup is called a frequent itemset; from the explanation above, the itemsets x and y that make up a rule must be frequent itemsets. the problem with ar-based rs is that it does not personalize recommendations to users, so recommendations are general and monotonous, and thus look unrelated to the item being browsed by u. to overcome this limitation, session-based rs has been proposed, where a session is a virtual time-space created when a user browses a web portal url [11]–[16]. within this time-space, the items that the user is or has been looking for, and which are thus assumed to be his/her preferences, can be temporarily recorded locally [14]–[16]. some methods use markov chains [17]–[19], artificial neural networks [11], [20]–[22], and association rule learning approaches [23]–[28] to develop session-based rs. implementing a session approach on ar-based rs produces several approaches, as explained in the following. in the first approach, the rules database is generated from t. items the user has seen/purchased in recent sessions, for example qU = {x1, x2, x3}, are used as a query to the rules database to find rules x→y, where X = {x1, x2, x3}. the items y obtained are the recommended items if xy satisfies the minsup and minconf thresholds [23], [29]. in the second approach, sequence itemsets are mined not from t but from q, i.e., a set of sessions qi created by the user while browsing items over some period, thus Q = {q1, q2, ..., q|Q|} [27], [28]. from the mining, a set of sequence itemsets is obtained and stored in SI = {p1, p2, ..., p|SI|}.
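as an illustration, the support and confidence metrics in (1) and (2) can be computed with a short python sketch over a toy transaction database; the transactions and item ids below are hypothetical examples, not the paper's dataset.

```python
# a minimal sketch of the support (eq. 1) and confidence (eq. 2) metrics
# over a toy transaction database T; item ids here are hypothetical.

def sup(itemset, transactions):
    """fraction of transactions that contain the whole itemset (eq. 1)."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def conf(x, y, transactions):
    """probability that y appears given x appears (eq. 2): sup(xy)/sup(x)."""
    return sup(set(x) | set(y), transactions) / sup(x, transactions)

T = [
    ["x1", "x2", "x3", "x4"],
    ["x1", "x3", "x4", "x5"],
    ["x1", "x2", "x4"],
    ["x2", "x5"],
]

print(sup(["x1"], T))           # 3 of 4 transactions contain x1 -> 0.75
print(conf(["x1"], ["x4"], T))  # sup(x1,x4)/sup(x1) = 0.75/0.75 = 1.0
```

on this toy database, the rule x1→x4 would satisfy, for example, minsup = 50% and minconf = 50%.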
assume u is currently browsing items xi, thus creating a session qU = {x1, x2, ..., xR}, where xR is the item u saw most recently. if xR ∈ qU and xR ∈ p in SI, so that p = {..., xR, xS, xS+1, ...}, then p contains an order of items relevant to u's preference. all items that appear after xR, namely xS, xS+1 and so on, are candidate items to be recommended. however, the traditional query-based session approach for ar-based rs still suffers from several problems. a large number of long frequent itemsets is required, since some subsets of these itemsets are expected to match qU. consequently, a large amount of memory is required to store long itemsets, because their number is quite significant if the minsup threshold is small [30]. on the other hand, if minsup is large, the resulting itemsets tend to be short, which can leave no following items to recommend. another problem, especially in the second approach, is that itemset sequences are mined only from q, which does not cover all items contained in t; consequently, many items in t are never explored by u. in business, this situation is detrimental to e-commerce owners. to sum up, traditional methods are not adaptive to the series of items the user visits, so recommendations look monotonous. traditional methods also cannot generate recommendations from a series of input items that is not frequent, because they refer to the rule database, while rules are composed of frequent itemsets only. this study was conducted with the objective of building a generative model based on recurrent neural networks (rnn) and association rules, which can predict the next item generatively from a series of items that the user has visited in a browsing session even though this series of items is not a frequent itemset. applying rnn to session- and ar-based rs, this method is called the recurrent-session approach to ar-based rs, or rs-arrs.
the model is built using long short-term memory (lstm), a type of rnn layer, and dropout layers. the novelty of this model is that the trained dataset is not a series of items that customers have purchased, but a series of rules arranged according to the support and confidence of the rules. the series of items visited by the user in a browsing session is considered an input prompt for the model, and the model responds by generatively predicting the items that will appear next. the rest of the paper is structured as follows. in the methods section, the proposed approach is explained, followed by a discussion of generating a training set for the model. after that, the flow of the model development cycle is explained, including the proposed model design. experiments on model benchmarking were organized with the aim of testing and comparing the performance of the proposed model with traditional models. after that, the experimental results are discussed in the results and discussion section. the article closes with conclusions and recommendations for future research.

ii. methods

a. research framework

the framework of the proposed method is explained using figure 1, which is divided into five main activities: a) generating the training dataset (trainds), b) developing the proposed model, c) determining the top-k recommendations, d) benchmarking the model, and e) validating the recommendation. before explaining the steps for creating a training set, the basic idea of the proposed approach is explained first.

fig. 1. research framework

rnn is usually used to estimate a next value in the future by learning time series of data in the past and present [17], [31], [32], so how does an rnn predict the next term of a current sentence?
intuitively, a sentence or phrase is made up of terms, and a term is made up of letters, which are written or typed one letter at a time. as such, written text can be treated as time series data as well. for example, large-scale textual paragraphs, such as a collection of scholarly publications on deep learning, are used as a training set for model building. given an input prompt such as "recurrent neural net", the model predicts the appearance of the following letter or term, referring to all the text in the deep learning publication dataset. the nature of the prediction is generative because the sentences formed are composed of new terms [33], [34]. while generative predictions are formed by modeling the probability distribution of the entire input data domain, a discriminative prediction aims to differentiate or classify input data into specific categories or labels [35]–[38]; examples include sentiment analysis and text classification. several studies explain that an rnn can predict the next item in a market basket. an rnn that uses time series data can predict the next item under the assumption that the user picks up item after item and puts it in the shopping cart following a particular time series [32], [33], [39], [40]. from another perspective, items viewed sequentially within a browsing session can also be considered time-series data [5], [17], [20], [41], [42]. however, a next-item prediction model that learns from purchased items still has a weakness: the process of recording items by the cashier (in both offline and online stores) is carried out randomly and ignores the order in which the customer picked up the items. as a result, the time-series nature of the items picked up by customers is lost. in this study, as explained via figure 2, a solution to this limitation is also included in the proposed model generation.
time-series training data is not created from item purchase transaction data but from associative rules mined from transaction data. the rules also form a predictive relationship via the confidence metric: if an item x1 is purchased, then x2 is purchased; if x2 is purchased, then so is x3, and so on. if the rules are sorted by the highest support and confidence, this confidence relationship also forms an item series, namely x1 → x2 → x3. similarly, a model can be built to predict the next item if this rule series is trained into the rnn. there is no percentage division between training and testing sets, because the model has to learn the probability distribution of the entire input data domain to form a generative model. the model then produces the probability of every existing item being the next item, with the probabilities summing to one. by ranking these probabilities, top-k item recommendations are obtained.

fig. 2. illustration of model development from series of rules

for illustration, as in figure 2, the series of items visited by the user in a browsing session, e.g. [x1, x2, x4], is considered an input prompt for the model, and the model responds by generatively predicting the items that will appear next, similar to how generative text generation works. all items have a certain probability of being the next item, and a computer program sorts these probabilities to get, for example, the top 3 items recommended to the user.

b. generating training dataset

training dataset generation is described in figure 3. process #1 is pre-processing of the raw transaction dataset t, including feature (column) selection, which produces a dataset t1 consisting of two columns: invoice number (invno) and the itemids purchased under that number.
fig. 3. training dataset generation flow

process #1 also generates a data dictionary containing each itemid and the item's description. examples of records in t1 are as follows:

invno; itemids
000001; x1, x2, x3, x4
000002; x1, x3, x4, x5
etc.

process #2 mines the association rules from the itemids column in t1 using apriori principles, with the mining parameters minsup, minconf and maximum rule length. the found rules are sorted by the highest support and confidence and then stored in the rule database (ruledb). process #3 forms a training set from ruledb. suppose the rules obtained, sorted by the highest support and confidence, are as follows:

{x1→x2, x1→x3, x2→x4, x3→x2, x4→x5, x5→x6}

after sorting, series of rules are created with the following notes: 1) the consequent of the rule in the i-th term becomes the antecedent of the (i+1)-th term; 2) a rule can only be used once to construct a series; 3) the i-th series is made as long as possible by using as many rules as possible; after no more rules can extend the i-th series, the (i+1)-th series is built the same way from the remaining rules. from the ordered rule example above, the resulting rule series are as in (3) and (4).

S1 = x1 → x2 → x4 → x5 → x6, or simplified S1 = [x1, x2, x4, x5, x6] (3)
S2 = x1 → x3 → x2, or simplified S2 = [x1, x3, x2] (4)

an illustration of the rule series pattern handled by the lstm in the learning phase is given in figure 4. the model learns the itemid flow pattern as arranged in the rule series in two parts: x and the label of x, namely y. x has a dimension, also called the series or sequence length (slen), while the y dimension is one.
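the construction of rule series (3) and (4) from the sorted rule list can be sketched as follows; this is a plain-python reading of the notes above, not the authors' original code.

```python
# a sketch of building rule series from support/confidence-sorted rules:
# each rule is used once, the consequent of term i becomes the antecedent
# of term i+1, and each series is grown as long as possible before the
# next one starts from the leftover rules.

def build_series(rules):
    rules = list(rules)            # assumed already sorted by sup/conf
    used = [False] * len(rules)
    series_list = []
    while not all(used):
        i = used.index(False)      # first unused rule starts a new series
        x, y = rules[i]
        used[i] = True
        series = [x, y]
        extended = True
        while extended:            # chain rules whose antecedent matches
            extended = False       # the last item of the current series
            for j, (a, b) in enumerate(rules):
                if not used[j] and a == series[-1]:
                    used[j] = True
                    series.append(b)
                    extended = True
                    break
        series_list.append(series)
    return series_list

rules = [("x1", "x2"), ("x1", "x3"), ("x2", "x4"),
         ("x3", "x2"), ("x4", "x5"), ("x5", "x6")]
print(build_series(rules))
# [['x1', 'x2', 'x4', 'x5', 'x6'], ['x1', 'x3', 'x2']]
```

on the example rule set above, this reproduces exactly the series S1 and S2 of (3) and (4).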
slen represents the duration of a session that the neurons can remember; in the example above, if slen = 3, then in the first session [x1, x2, x4] is x and x5 is y, the next item of x. as the session moves forward, x becomes [x2, x4, x5] and y is x6, while x1 is already out of the session and will be forgotten by the neurons. in the illustration, the l box represents the lstm layer, and the f box represents the output layer, which is fully connected to the total number of available next items (labels), i.e., all itemids in ruledb. if x is stored in an array or a list in python, then the explanation above also implies that shifting the session forward (towards the right) algorithmically pops the itemid on the leftmost x, pushes itemid y onto the rightmost x, and assigns a new next-item y as the label for the new x. this algorithm also describes the mechanism for forming the training dataset (trainds) from the series of rules that have been built.

fig. 4. rule series patterns learned by the model

given a series of rules s, session duration slen = 3, and ruledb, do the following stages:
• initialization: aims to create an initial record in the form x:y, with x's length = slen and y's length = 1. the following steps are performed:
1. x = s[0 : slen] # python's way to take s[0] to s[slen-1] as x
2. y = s[slen] # set s[slen] as y
3. idx = slen # last accessed index of s
• shifting: aims to generate the next record from the previous x by shifting the session forward:
1. x.pop(0) # pop the leftmost x value
2. x.append(y) # push y into the rightmost x
• labelling: aims to label the new record x with y:
1. idx += 1 # increase index of s
2. y = s[idx] # set s[idx] as y
3. s = s[idx:] # trim s starting from index 0 to idx

s is trimmed so that the shifting step can be repeated.
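the initialization, shifting and labelling stages above amount to sliding a window of length slen over a rule series; a minimal python sketch (with the padding stage omitted) is:

```python
# a sketch of the initialization/shifting/labelling stages: slide a
# window of length slen over a rule series s, emitting records x
# (length slen) with next-item label y. padding of short series is
# omitted here for brevity.

def make_records(s, slen=3):
    records = []
    for i in range(len(s) - slen):
        x = s[i : i + slen]   # current session window
        y = s[i + slen]       # next item = label
        records.append((x, y))
    return records

s1 = ["x1", "x2", "x4", "x5", "x6"]
print(make_records(s1))
# [(['x1', 'x2', 'x4'], 'x5'), (['x2', 'x4', 'x5'], 'x6')]
```

for S1 with slen = 3 this yields the two records described in the text: [x1, x2, x4]:x5 and [x2, x4, x5]:x6.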
however, if all entries in s have been used so that s becomes empty, then the formation of training data from the rule series s also ends. the training data of a rule series is complete once all xs of length slen and their labels y have been developed. in practice, however, because the value of slen can vary (depending on the needs of model development), all entries of s may have been accessed even though the length of x has not yet reached slen. an example of this case is S2 = [x1, x3, x2], where all entries in S2 can only form x, which does not yet have a label y. when this case arises, a fourth stage must be done as follows:
• padding: aims to complete the entry x so that it has length slen and a label y. the steps are as follows:
1. while the length of x < slen:
   1. search for a rule x′ → y′ in ruledb, where x′ = x[-1]
   2. if found: x.append(y′)
2. if the length of x == slen, then the formation of x is complete, and
   i. continue searching from the last position for a rule x′ → y′; if x′ = x[-1], then set label y = y′

the result of the padding step for S2 is x = [x1, x3, x2] and y = x4.

c. developing proposed model

like a text-generator model, the proposed next-item prediction model is also generative, i.e. a model that can generate next-item predictions for several sessions in the future. the flow of model development in this study is given in figure 5, which forms a cycle as described in [43]. the trainds and existing reference models are the materials for designing and tuning models. models that meet the requirements regarding loss and accuracy are deployed to an implementable recommendation system. if a model does not meet the requirements, it is redesigned, which includes revising the composition of the layers and neuron cells, as well as the number of epochs and batches in the training process.
the requirement is a model with a loss < 0.5 and an accuracy > 80%.

fig. 5. model development lifecycle flow

the neural network layers that make up the model are divided into three parts, where the terminology follows the keras library for python:
• input layer with dimension (slen, 1), with slen = 3, where slen is the dimension of x and 1 is the dimension of y.
• hidden layers: for observation purposes, one to three lstm layers are used in the experiments, where loss and accuracy are observed with each additional layer. each lstm layer is followed by a dropout layer, which removes cells that contribute to overfitting. the number of neurons is set to 256. the activation function applied is tanh.
• output layer, which uses a dense layer after the lstm layers. this layer is fully connected to the output, whose dimension in this case is the number of itemids, as all of them are potential next items. the activation function used is softmax.

lstm is an rnn-type layer designed to handle time series data. lstm has four main components: a cell, an input gate, an output gate and a forget gate [31], [32]. the cell has the function of remembering past patterns in a series or sequence; it is useful for remembering contexts that appeared in the past, to be combined with current information in order to forecast patterns that will occur in the future. the memory duration that the lstm layer will remember is specified by the sequence or series length. lstm can produce generative predictions, where the model can generate new samples from the same data distribution [36], [37].
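under the stated design (input of dimension (slen, 1), an lstm layer with 256 tanh neurons, a dropout layer, and a dense softmax output, compiled with categorical cross-entropy and adam), a keras functional-api sketch could look like the following; the number of items n_items and the dropout rate are illustrative assumptions, as they are not fixed in the text.

```python
# sketch of the described layer stack with the keras functional api.
# n_items (output width) and the dropout rate are assumptions; slen = 3,
# 256 neurons, tanh, softmax, categorical cross-entropy and adam follow
# the text above.
from tensorflow import keras
from tensorflow.keras import layers

slen, n_items = 3, 100                       # n_items: illustrative value

inputs = keras.Input(shape=(slen, 1))        # input layer, dimension (slen, 1)
h = layers.LSTM(256, activation="tanh")(inputs)
h = layers.Dropout(0.2)(h)                   # rate 0.2 is an assumption
outputs = layers.Dense(n_items, activation="softmax")(h)

model = keras.Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
# fitting, reading "1000 epochs in 8 batches" as batch_size = 8:
# model.fit(xs, y, epochs=1000, batch_size=8)
```

the two- and three-lstm variants would simply stack additional LSTM(256, return_sequences=True) + dropout pairs before the last lstm layer.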
for example, given a reading book as training data, a generative text-generator model can generate several terms that will appear after a series of trigger terms is given, so that the composed sentences look new. to do this, the model requires the entire training data to be studied [34], [44]. the proposed model design is depicted in figure 6, while a summary of the model using one lstm layer + dropout + dense is given in figure 7. summaries of the models using two and three lstm layers are not given, but they can be understood intuitively from that figure. all itemid series in trainds are used as training data, without a test dataset, because the model built is a generative model that must learn the probability of each itemid in a series of itemids over the whole trainds. the model design is implemented with the keras library using functional modeling. the layer composition in each model is compiled with categorical cross-entropy loss and the adam optimizer. after compilation, the model is fitted to all vectors x and y, with 1000 epochs in 8 batches.

fig. 6. proposed model design

fig. 7. summary of model with one lstm layer

d. determining top-k recommendation

briefly, the procedure carried out to generate next-item predictions is shown in pseudocode 1.

pseudocode 1. generate next-items predictions
1. prepare the inputs:
   a) matrix xs of all arrays x in trainds, in which x has shape (slen, 1).
      if the number of records in trainds is n, then the xs dimension is (n, slen, 1)
   b) matrix y, i.e., the labels of x, with shape (n, 1)
2. compile the arrangement of the layers into a model m
3. perform training by fitting data xs to data y with m, usually written m.fit(xs, y, number of epochs, number of batches); save the fit with the lowest loss (along with its accuracy) into m_best or an external file, such as "m.hdf5" # m_best is now ready to predict any series of itemids as input
4. input "enter a series of itemids as input"
5. prediction = m_best.predict(input)
6. prediction is obtained in the form of a matrix containing the probability prediction for each itemid to become the next item

the process of determining the top-k recommendations from prediction is given in pseudocode 2.

pseudocode 2. determine top-k recommendations
1. determine the value of k
2. sort the probabilities in the prediction array from the highest value, noting that each element represents an itemid index in the itemid-description data dictionary
3. get the first k itemid indices in the array
4. print the item descriptions in itemid order
5. the k recommended items are obtained

e. benchmarking the model

the activity flow of benchmarking the model is given in figure 8, which shows that the proposed model is compared with the query-based session method. the aspect being compared is the model's ability to always obtain probability predictions of all itemids becoming the next item with respect to the items the user is looking for in a session. two test scenarios were run to examine both methods in terms of their adaptability in generating next-item recommendations.

fig. 8.
model benchmarking flow

test #1: the rules that produce next items in the query-based method are tested on the proposed method. the steps are as follows:
1. generate all rules x→y with |x| = 3 and |y| = 1,
2. each x in the rules has at least one next-item y,
3. enter all the xs as input for the proposed method and get the top-10 recommendations,
4. count the number of xs that have top-10 recommendations,
5. if all xs have top-10 recommendations, then the proposed model is adaptive to all query-based method inputs.

test #2: several item combinations that produce recommendations through the proposed method are used as queries to find recommendations in the traditional method:
1. simulate one 3-item series as input for the query-based method to find rules x→y, where x is equal to the respective 3-item series,
2. if the traditional query-based method cannot produce recommendations, then it is not an adaptive method.

f. validating the recommendation

the validity of the recommendation list can be confirmed through two approaches: system- or user-centered validity [45]–[50]. in the first approach, the recommendation results are matched with a set of items generated by the system, and the validation results are objective. in the second approach, which is the one used in this study, recommendations are validated from the user's perspective, because in the end users are expected to take action after seeing the contents of the recommendations. these perspective metrics include accuracy, familiarity, attractiveness, enjoyability, novelty, diversity, and context compatibility. 25 users were asked to evaluate the seven metrics, each through one related question, as follows [49], [50]:
1. accuracy: the recommended items match my interests, and vice versa
2. familiarity: some recommended items are familiar to me, and vice versa
3. attractiveness: some items recommended to me are attractive, and vice versa
4. enjoyability: i enjoy the recommended items, and vice versa
5. novelty: the rs helps me discover new items, and vice versa
6. diversity: the items recommended to me are varied, and vice versa
7. context compatibility: recommended items take my personal context into account, and vice versa

for each question, users give a rating of 1 to 3, where 1 means the user strongly agrees with the statement, 2 means a neutral perception, and 3 means the user strongly disagrees. users are uniformly asked to rate the ten recommendations generated for 3-item series taken from the rules, which are assumed to have been viewed by the user. the user validation table is described in table 1.

table 1. the user validation table
items seen by the user previously (with item illustrations): jumbo bag red retro spot, jumbo bag woodland animals, jumbo storage bag suki
top-10 recommendations by the proposed method, each rated 1, 2 or 3 by the user on the seven metrics (with item illustrations), including: 3 birds canvas screen; 36 doilies vintage christmas; advent calendar gingham sack; antique glass heart decoration; assorted color t-light holder

iii. results and discussions

the dataset used is the online retail data available on the uci web portal. the number of records was initially 541,909 lines, but after grouping by invoice number, the number of records became 22,106, consisting of 4059 unique items. as explained on the uci web portal, this transnational dataset contains all transactions between 01/12/2010 and 09/12/2011 (almost one year) for a uk-based and registered non-store online retailer. the company mainly sells unique all-occasion gifts. in order for the proposed method to be compared fairly with the query-based session method, the rules are mined with minsup = 1% and minconf = 50%.
using a lower minsup and minconf, such as 0.1% and 10% respectively, results in an explosion of the number of rules to more than 2 million, which is not adequate for demonstrating the features and functionality of the proposed method or of the compared traditional method. the difference from traditional ar-based rs methods is that for the proposed approach only rules with x and y lengths of exactly one item were mined, i.e. |x| = 1 and |y| = 1, whereas for the traditional method 0 < |x| ≤ 3 and |y| = 1 were applied. these settings were chosen with the following consideration: with short rules, the number of rules that must be maintained in memory is smaller than with long rules [51], [52]. in the proposed approach, 194 rules were generated, which were then arranged into the series of rules used as the training dataset. the size of the training dataset becomes 824 records. for the traditional method, the resulting rules also number 194, of which 40 rules have |x| = 3 and |y| = 1 and are used for test #1. the mining results for the traditional method are stored in ruledb-trad. the results of applying 1 to 3 lstm layers show no significant difference in loss and accuracy. the lowest loss values for each configuration are 0.2234, 0.2163 and 0.3118, respectively, with accuracies of 84.2%, 83.8% and 84.4%. charts of the changes in loss and accuracy for each epoch for the treatment with 1 lstm layer are given in figure 9.

fig. 9. loss and accuracy of model with 1 lstm + dropout + dense layers

an essential note from the experiments is that if the dropout layer is not applied, the loss improves to an average of 0.07, with an average accuracy of 93.7%. it has been explained that dropout can avoid overfitting by deleting cells randomly [39]. however, in some literature on text generators, no comparison was found between the results of dropout and non-dropout models [31], [44], [53].
in addition, because of its generative nature, the text-generator method results in the formation of new sentences from new term arrangements, so that the 'accuracy' of the terms that should appear after the previous term intuitively results not only from applying the dropout layer but also from the richness of the vocabulary and sentences available in the training set. the results of test #1 show that the proposed method can predict next items and produce top-10 recommendations for all 40 three-item x series for which the query-based method can generate next items. in contrast, the query-based method cannot generate top-10 recommendations for all x, but only for 2 items, as shown in table 2 (left side); this is because not every x that is the antecedent of a rule has 10 consequent items y. this is an advantage offered by the proposed method. for test #2, a manual inspection found several item combinations not in the ruledb-trad database. these items are fed to the developed model to seek recommendations. one of the results is given in table 2 (right side), where the proposed method produces top-10 recommendations and the traditional method does not find any items, which means traditional query-based methods are not adaptive in generating recommendations for arbitrary input itemids. table 2.
top-10 recommendations produced for items seen by the user, which are a frequent itemset (left side) and not a frequent itemset (right side)

items seen by user (frequent itemset): 85099b jumbo bag red retrospot; 20712 jumbo bag woodland animals; 21931 jumbo storage bag suki
items seen by user (not a frequent itemset): 85099b jumbo bag red retrospot; 20711 jumbo bag toys; 20712 jumbo bag woodland animals

top-10 recommendations by the proposed method (frequent itemset): 84731 3 birds canvas screen; 22950 36 doilies vintage christmas; 90199b 5 strand glass necklace amethyst; 22580 advent calendar gingham sack; 21143 antique glass heart decoration; 17164b ass col small sand gecko weight; 47421 assorted color lizard suction hook; 20749 assorted color mini cases; 47420 assorted color suction cup hook; 84950 assorted color t-light holder
top-10 recommendations by the proposed method (not a frequent itemset): 22282 12 egg house painted wood; 84559b 3d sheet of cat stickers; 72801c 4 rose pink dinner candles; 22371 airline bag vintage tokyo 78; 23068 aluminum stamped heart; 90183a amber drop earrings w long beads; 84879 assorted color bird ornament; 20749 assorted color mini cases; 47420 assorted color suction cup hook; 84950 assorted color t-light holder

top recommendations by the traditional method (frequent itemset): 22386 jumbo bag pink polka dot; 22411 jumbo shopper vintage red paisley
top recommendations by the traditional method (not a frequent itemset): no result

next, the proposed method's ability to generatively find recommendations for each input given in a session is demonstrated with the following steps: 1) get the top-k recommendations for the itemid series, called x1; 2) the itemid in the first position of the recommendations is assumed to be clicked by the user, so it goes into x1 and simultaneously pushes a product out of x1, and this series becomes x2; 3) the second step is repeated until x5 is obtained, then the results are analyzed.
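the three steps above can be sketched as follows; predict_proba is a hypothetical stand-in for the trained model's m_best.predict, and the item catalogue is invented for illustration.

```python
# sketch of the session-shift simulation: take the top-k, assume the
# top-1 item is clicked, push it into the session, pop the oldest item.
# predict_proba is a stand-in for the trained model; items are invented.
import random

items = [f"i{n}" for n in range(20)]        # hypothetical catalogue

def predict_proba(session):
    """stand-in for m_best.predict: a pseudo-random distribution."""
    random.seed(sum(ord(c) for it in session for c in it))
    w = [random.random() for _ in items]
    total = sum(w)
    return {it: v / total for it, v in zip(items, w)}

def top_k(session, k=3):
    probs = predict_proba(session)
    return sorted(probs, key=probs.get, reverse=True)[:k]

x = ["i1", "i2", "i4"]                      # x1: items seen in the session
for step in range(1, 6):
    recs = top_k(x)
    print(f"x{step}: {x} -> top-3: {recs}")
    x = x[1:] + [recs[0]]                   # pop oldest, push clicked item
```

whatever the session content, the model side always returns a full probability distribution, so a fresh top-k list exists at every step; that is the adaptivity being demonstrated.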
using k = 3, the result is shown as follows:
• x1: {85099b: jumbo bag red retrospot, 20711: jumbo bag toys, 20712: jumbo bag woodland animals}
  top-3 recommendations: 23697: a pretty thank you card (clicked), 85161: acrylic geometric lamp, 22915: assorted bottle top magnets
• x2: {23697: a pretty thank you card, 85099b: jumbo bag red retrospot, 20712: jumbo bag woodland animals}
  top-3 recommendations: 22282: 12 egg house painted wood (clicked), 22374: airline bag vintage jet set red, 22915: assorted bottle top magnets
• x3: {22282: 12 egg house painted wood, 23697: a pretty thank you card, 20712: jumbo bag woodland animals}
  top-3 recommendations: 84558a: 3d dog picture playing cards (clicked), 22915: assorted bottle top magnets, 84879: assorted color bird ornament
• x4: {22282: 12 egg house painted wood, 84558a: 3d dog picture playing cards, 23697: a pretty thank you card}
  top-3 recommendations: 21448: 12 daisy pegs in wood box (clicked), 23442: 12 hanging eggs hand painted, 22906: 12 message cards with envelopes
• x5: {21448: 12 daisy pegs in wood box, 22282: 12 egg house painted wood, 84558a: 3d dog picture playing cards}
  top-3 recommendations: 22436: 12 colored party balloons, 22150: 3 stripey mice felt craft, 84559a: 3d sheet of dog stickers

from this simulation, it can be understood that whatever order of items the user sees, the system can always generate a new list of recommendations; with this ability, the recommendation system is said to be generative in generating recommendations. the results of the user-centric validity test on the list of recommendations produced by the proposed model are shown in figure 10, with the measured metrics being accuracy, familiarity, attractiveness, enjoyability, novelty, diversity, and context compatibility, captured from the user's perspective.
as seen, users feel that the recommended items are less accurate with respect to those they have seen. however, on the other metrics, users give the opposite response. in terms of familiarity, even though the recommendations are seen as inaccurate, as many as 56% of users feel familiar with the recommended items. furthermore, 72% of users agree that the recommended items are attractive, 76% of users enjoy the list of recommended items, and users also feel that they have just discovered that the recommended items are related to items they previously viewed. 80% of users agree that the list of recommended items is diverse, and 56% of users agree that the items are related to the context of the items they have seen. on the other hand, although many users appear to have a neutral opinion, it can be said that few users disagree with the questions asked regarding the measured metrics.

fig. 10. user validation of measured metrics

an interesting point to note is that the 20% of users who have a neutral perception of accuracy think the recommended products still have something to do with the products they have seen, namely elements of animal shapes or something related to christmas, such as the color red and ornaments for decorating christmas or new year celebrations. this result is in line with previous studies, which show that accuracy on the one hand and novelty and diversity on the other are inverse metrics [54]–[58]. if accuracy is essential, recommendation results tend to be uniform, because accuracy is associated with the degree of similarity between the recommended products and those the user has seen or purchased. diversity, on the other hand, brings a list of recommended products that are not similar to any products the user has ever seen. novelty is closely related to diversity, because the user's new understanding of a product usually arises when they are presented with products that are not similar to those previously visited.
another important note is that the ar-based rs does not produce a recommended item y with high similarity to the series of items x that the user has visited or purchased. the pair (x, y) is formed from the support and confidence metrics, so if the results of the traditional method show that y and x look similar, it is because (x, y) were purchased together, not because the product descriptions are similar.
iv. conclusions
the ability of the proposed rnn-based session method to generatively and adaptively produce recommendation after recommendation from a series of items viewed by a user in a session has been demonstrated. traditional query-based methods are incapable of this because their next-item recommendations are not generated from a learning process but instead rely on rules. as a result, when the item sequence a user is browsing in a session is not a frequent itemset, the traditional method fails to find the next item and hence also fails to produce recommendations. the results of user-centered validation of several metrics of the proposed method show that although agreement on the accuracy of the recommended products with respect to the products seen by users is only 20%, other metrics, such as novelty, diversity, attractiveness, and enjoyability, reach above 70%. as a suggestion for future development, the model can be extended with several attention layers to remember a longer input sequence, as in the transformer model.
declarations
author contribution
all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest
the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information
reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
publisher’s note: department of electrical engineering and informatics universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.
references
[1] t. silveira, m. zhang, x. lin, y. liu, and s. ma, “how good your recommender system is? a survey on evaluations in recommendation,” int. j. mach. learn. cybern., vol. 10, no. 5, 2019.
[2] j. ben schafer, j. a. konstan, and j. riedl, “e-commerce recommendation applications,” in applications of data mining to electronic commerce, 2011.
[3] f. o. isinkaye, y. o. folajimi, and b. a. ojokoh, “recommendation systems: principles, methods and evaluation,” egypt. informatics j., vol. 16, no. 3, pp. 261–273, 2015.
[4] q. y. shambour, m. m. abu-alhaj, and m. m. al-tahrawi, “a hybrid collaborative filtering recommendation algorithm for requirements elicitation,” int. j. comput. appl. technol., vol. 63, no. 1–2, 2020.
[5] w. jiang et al., “a new time-aware collaborative filtering intelligent recommendation system,” comput. mater. contin., vol. 61, no. 2, pp. 849–859, 2019.
[6] k. yi, t. chen, and g. cong, “library personalized recommendation service method based on improved association rules,” libr. hi tech, vol. 36, no. 3, pp. 443–457, 2018.
[7] y. tian, b. zheng, y. wang, y. zhang, and q. wu, “college library personalized recommendation system based on hybrid recommendation algorithm,” in procedia cirp, 2019.
[8] m. ge, f. ricci, and d. massimo, “health-aware food recommender system,” in recsys 2015 - proceedings of the 9th acm conference on recommender systems, 2015.
[9] x. li et al., “application of intelligent recommendation techniques for consumers’ food choices in restaurants,” front. psychiatry, 2018.
[10] l. d. adistia, t. m. akhriza, and s. jatmiko, “sistem rekomendasi buku untuk perpustakaan perguruan tinggi berbasis association rule,” j. resti (rekayasa sist. dan teknol. informasi), vol. 3, no. 2, 2019.
[11] t. k. dang, q. p. nguyen, and v. s. nguyen, “a study of deep learning-based approaches for session-based recommendation systems,” sn computer science, vol. 1, no. 4, 2020.
[12] s. wang, l. cao, y. wang, q. z. sheng, m. a. orgun, and d. lian, “a survey on session-based recommender systems,” acm comput. surv., vol. 54, no. 7, 2022.
[13] m. ludewig and d. jannach, “evaluation of session-based recommendation algorithms,” user model. user-adapt. interact., vol. 28, no. 4–5, 2018.
[14] s. latifi, n. mauro, and d. jannach, “session-aware recommendation: a surprising quest for the state-of-the-art,” inf. sci. (ny), vol. 573, 2021.
[15] m. ludewig, n. mauro, s. latifi, and d. jannach, “empirical analysis of session-based recommendation algorithms,” user model. user-adapt. interact., vol. 31, no. 1, 2021.
[16] m. maher et al., “comprehensive empirical evaluation of deep learning approaches for session-based recommendation in e-commerce,” entropy, vol. 24, no. 11, 2022.
[17] d. wang, d. xu, d. yu, and g. xu, “time-aware sequence model for next-item recommendation,” appl. intell., vol. 51, no. 2, 2021.
[18] g. m. harshvardhan, m. k. gourisaria, s. s. rautaray, and m. pandey, “ubmtr: unsupervised boltzmann machine-based time-aware recommendation system,” j. king saud univ. comput. inf. sci., 2021.
[19] j. li, y. wang, and j. mcauley, “time interval aware self-attention for sequential recommendation,” in wsdm 2020 - proceedings of the 13th international conference on web search and data mining, 2020.
[20] y. guo, y. ling, and h. chen, “a time-aware graph neural network for session-based recommendation,” ieee access, vol. 8, 2020.
[21] t. m. phuong, t. c. thanh, and n. x. bach, “neural session-aware recommendation,” ieee access, vol. 7, 2019.
[22] j. zhao et al., “dcfgan: an adversarial deep reinforcement learning framework with improved negative sampling for session-based recommender systems,” inf. sci. (ny), vol. 596, 2022.
[23] x. huang, y. he, b. yan, and w. zeng, “fusing frequent sub-sequences in the session-based recommender system,” expert syst. appl., vol. 206, 2022.
[24] l. van maasakkers, d. fok, and b. donkers, “next-basket prediction in a high-dimensional setting using gated recurrent units,” expert syst. appl., vol. 212, 2023.
[25] t. liu, x. yin, and w. ni, “next basket recommendation model based on attribute-aware multi-level attention,” ieee access, vol. 8, 2020.
[26] b. peng, z. ren, s. parthasarathy, and x. ning, “m2: mixed models with preferences, popularities and transitions for next-basket recommendation,” ieee trans. knowl. data eng., vol. 35, no. 4, 2023.
[27] u. niranjan, r. b. v. subramanyam, and v. khanaa, “developing a web recommendation system based on closed sequential patterns,” in communications in computer and information science, 2010.
[28] g. e. yap, x. l. li, and p. s. yu, “effective next-items recommendation via personalized sequential pattern mining,” in lecture notes in computer science, 2012.
[29] b. mobasher, h. dai, t. luo, and m. nakagawa, “effective personalization based on association rule discovery from web usage data,” in proceedings of the third international workshop on web information and data management (widm), 2001.
[30] t. m. akhriza, y. ma, and j. li, “revealing the gap between skills of students and the evolving skills required by the industry of information and communication technology,” int. j. softw. eng. knowl. eng., vol. 27, no. 05, pp. 675–698, 2017.
[31] m. lippi, m. a. montemurro, m. degli esposti, and g. cristadoro, “natural language statistical features of lstm-generated texts,” ieee trans. neural networks learn. syst., vol. 30, no. 11, 2019.
[32] r. dolphin, “lstm networks | a detailed explanation,” web page, 2020 (accessed jan. 06, 2022).
[33] p. p. barman and a. boruah, “a rnn based approach for next word prediction in assamese phonetic transcription,” in procedia computer science, 2018.
[34] j. brownlee, “text generation with lstm recurrent neural networks in python with keras,” machine learning mastery, 2018.
[35] j. gordon and j. m. hernández-lobato, “combining deep generative and discriminative models for bayesian semi-supervised learning,” pattern recognit., vol. 100, 2020.
[36] a. fujino, n. ueda, and k. saito, “a hybrid generative/discriminative approach to text classification with additional information,” inf. process. manag., vol. 43, no. 2, 2007.
[37] c. l. p. chen and s. feng, “generative and discriminative fuzzy restricted boltzmann machine learning for text and image classification,” ieee trans. cybern., vol. 50, no. 5, 2020.
[38] n. c. dvornek, x. li, j. zhuang, and j. s. duncan, “jointly discriminative and generative recurrent neural networks for learning from fmri,” in lecture notes in computer science, 2019.
[39] a. carta, “building a rnn recommendation engine with tensorflow,” medium.com, 2021 (accessed jan. 06, 2022).
[40] s. ambulgekar, s. malewadikar, r. garande, and b. joshi, “next words prediction using recurrent neural networks,” itm web conf., vol. 40, 2021.
[41] s. g. vadlamudi, s. kumar, s. sahu, a. malviya, and p. choudhury, “multi-window time-aware personalized recommendation system,” in201741023671, 2017.
[42] q. zhang, l. cao, c. shi, and z. niu, “neural time-aware sequential recommendation by jointly modeling preference dynamics and explicit feature couplings,” ieee trans. neural networks learn. syst., 2021.
[43] h. miao, a. li, l. s. davis, and a. deshpande, “towards unified data and lifecycle management for deep learning,” in proceedings - international conference on data engineering, 2017.
[44] t. iqbal and s. qureshi, “the survey: text generation models in deep learning,” journal of king saud university - computer and information sciences, vol. 34, no. 6, 2022.
[45] m. kaminskas and d. bridge, “diversity, serendipity, novelty, and coverage: a survey and empirical analysis of beyond-accuracy objectives in recommender systems,” acm transactions on interactive intelligent systems, vol. 7, no. 1, 2016.
[46] p. gravino, b. monechi, and v. loreto, “towards novelty-driven recommender systems,” comptes rendus physique, vol. 20, no. 4, 2019.
[47] k. kapoor, v. kumar, l. terveen, j. a. konstan, and p. schrater, “‘i like to explore sometimes’: adapting to dynamic user novelty preferences,” in recsys 2015 - proceedings of the 9th acm conference on recommender systems, 2015.
[48] z. zolaktaf, r. babanezhad, and r. pottinger, “a generic top-n recommendation framework for trading-off accuracy, novelty, and coverage,” proc. ieee 34th int. conf. data eng., icde 2018, pp. 149–160, 2018.
[49] k. nanath and m. ahmed, “user-centric evaluation of recommender systems: a literature review,” int. j. bus. inf. syst., vol. 1, no. 1, 2020.
[50] d. y. a. waykar, “human-ai collaboration in explainable recommender systems: an exploration of user-centric explanations and evaluation frameworks,” international j. sci. res. eng. manag., vol. 07, no. 07, 2023.
[51] t. m. akhriza and i. d. mumpuni, “quantitative class association rule-based approach to lecturer career promotion recommendation,” int. j. inf. decis. sci., vol. 13, no. 2, 2021.
[52] t. m. akhriza, y. ma, and j. li, “novel push-front fibonacci windows model for finding emerging patterns with better completeness and accuracy,” etri j., vol. 40, no. 1, 2018.
[53] r. patel and s. patel, “deep learning for natural language processing,” in lecture notes in networks and systems, 2021, pp. 523–533.
[54] k. bradley and b. smyth, “improving recommendation diversity,” in proceedings of the 12th irish conference on artificial intelligence and cognitive science, 2001.
[55] t. yu, j. guo, w. li, h. j. wang, and l. fan, “recommendation with diversity: an adaptive trust-aware model,” decis. support syst., vol. 123, 2019.
[56] r. xie et al., “improving accuracy and diversity in matching of recommendation with diversified preference network,” ieee trans. big data, vol. 8, no. 4, 2022.
[57] y. lin, c. huang, w. yao, and y. shao, “personalised attraction recommendation for enhancing topic diversity and accuracy,” j. inf. sci., vol. 49, no. 2, 2023.
[58] c. matt, a. benlian, t. hess, and c. weiß, “escaping from the filter bubble? the effects of novelty and serendipity on users’ evaluations of online recommendations,” in 35th international conference on information systems “building a better world through information systems”, icis 2014, 2014.
knowledge engineering and data science (keds) pissn 2597-4602 vol 1, no 2, september 2018, pp. 64–73 eissn 2597-4637 https://doi.org/10.17977/um018v1i22018p64-73
©2018 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)
energy efficiency metrics of university data centers
leonel hernandez a, 1, *, genett jimenez a, 2, piedad marchena a, 3
a institución universitaria itsa, carrera 45 # 48 – 31, barranquilla, colombia
1 lhernandezc@itsa.edu.co*; 2 gjimenez@itsa.edu.co; 3 pmarchena@itsa.edu.co
* corresponding author
i. introduction
in a short time, the internet changed the way we live, work, have fun, and learn, making data more valuable. likewise, emerging technologies such as the internet of things (iot), cloud computing and virtualization, network programming, big data, and digital transformation have promoted the existence of large data centers responsible for the secure storage of data.
as baccour says [1], data centers use the network between 5 % and 25 % of the time, and therefore the energy consumed by inactive devices is wasted. data centers, composed of many servers, can absorb as much power as a small city; in fact, numerous studies have shown that average server utilization is usually less than 30% of maximum utilization [2]. on the other hand, the efficient use of electric power is gaining relevance due to growing demand, shrinking resources, and concern about the environmental impact generated by the increase in data centers; it is therefore necessary to promote the efficient use of electrical energy, based on the quantification of different energy efficiency metrics. a few years ago, the concept of energy efficiency in data centers was very subjective, since it was often not clear how to measure energy, where it should be estimated, or what units to use [3]. for this reason, energy efficiency metrics were developed to convert data centers into "green data centers," friendly to the environment. the metrics applied to the data centers of the university institution itsa and analyzed in this study include data center infrastructure efficiency (dcie), power usage effectiveness (pue), hvac system effectiveness, and space, watts and performance (swap), which verify whether the data centers are friendly to the environment.
article info
article history:
received 15 july 2018
revised 5 august 2018
accepted 8 august 2018
published online 31 august 2018
abstract
data centers are fundamental pieces of the network and computing infrastructure, and evidently today more than ever they are relevant, since they support the processing, analysis, and assurance of the data generated in the network and by applications in the cloud, whose volume increases every day thanks to technologies such as the internet of things, virtualization, and cloud computing, among others.
precisely the management of this large volume of information makes the data centers consume a lot of energy, generating great concern for owners and administrators. green data centers offer a solution to this problem, reducing the environmental impact of data centers through their monitoring and control. metrics are the tools that allow us to measure, in our case, the energy efficiency of the data center and to evaluate whether it is friendly to the environment. these metrics will be applied to the data centers of the itsa university institution, barranquilla and soledad campuses, and their analysis will be carried out. in previous research, the most common metric (pue) was analyzed to measure the efficiency of the data centers, to verify whether the university's data center is friendly to the environment. it is planned to extend that study by analyzing several metrics to conclude which is the most efficient and which allows defining the guidelines to update or convert the data center into an environmentally friendly one.
keywords: energy; green data center; performance metrics; hvac system; pue; dcie; environment
l. hernandez et al. / knowledge engineering and data science 2018, 1 (2): 64–73 65
a green data center is defined as one in which the mechanical, electrical, lighting, and computing systems are designed for maximum efficiency and minimum environmental impact [3]. the core of the internet, or of any technological infrastructure of an entity or company, is the data center. the amount of data they process increases as users' needs demand more storage capacity, transmission speed, and information processing. this trend will continue to grow, which is why data centers can generate a high impact on the environment due to the advanced cooling, lighting, and temperature systems they use.
due to this, metrics have been defined that provide guidelines for the design and implementation of green data centers, friendly to the environment. many factors contribute to the increase of energy consumption in a data center, but there are also some areas in which energy consumption can be reduced, as shown in fig. 1 [4]. emerging technologies such as virtualization and cloud and fog computing have contributed to the design of efficient data centers, reducing energy consumption, the use of space, and environmental requirements; even so, some metrics allow measuring the energy consumption of data centers and thus verifying whether they are green data centers. for some years the increase in the number of data centers and their size has become evident, and with it the increase in the use of energy. according to [5], in its report on energy use in data centers in the united states, in 2006 data centers consumed 61 billion kwh (equivalent to the consumption of 5.8 million american households), while in 2008 that figure increased to 69 billion kwh. according to the same report, between 2014 and 2020 the increase in the use of energy is expected to be 4 %, equivalent to 73 billion kwh in 2020, which shows the excessive energy consumption required for their operation. this excessive energy consumption affects not only the supplier's economy but also becomes a social and environmental problem, to the extent that resources are consumed indiscriminately. the infrastructure of the data center building currently accounts for almost half of the total energy consumption, as shown in fig. 2.
fig. 1. areas where the power consumption of data centers can be reduced [4]
fig. 2. energy consumption in the data center building
in previous research at the university, reference [6] designed and implemented a prototype of sensors for the detection of environmental variables in a data center (temperature, humidity, current). the prototype was implemented in the data center of the barranquilla campus, and its objective is to control, monitor, and notify when events go beyond a defined threshold. the sensors are connected to the wifi network, which in turn connects to the institutional network. the results can be seen in an application that is currently in the testing phase. in our environment, some research has been carried out on the environmental impact that the devices and elements that make up a data center can produce. however, companies are not yet fully aware of this impact, and when designing and building a data center, the environmental element is not considered. this study seeks to create awareness of this situation, beginning by analyzing the state of the two data centers of the university, to contribute to research in this field and to draw up an improvement plan, given that the analysis of the various metrics has shown that our data centers are not friendly to the environment. this study can be replicated in other educational institutions and in companies of the productive sector, so that each one begins by analyzing its infrastructure and carrying out strategic planning that contributes to the improvement of the environment. the paper is organized as follows. after the introduction has contextualized the reader with the theme of the project, a basic review of the literature about green data centers is made. it continues with the research method used, making way for the calculations of the most relevant metrics in our data centers. it concludes with a concise analysis of the results and possible future work that can be developed from this work.
ii. methods
for the elaboration of this project, a descriptive, non-experimental research methodology was used [7].
descriptive, since all the documentation related to green data centers has been reviewed; non-experimental, because it focuses on the study of the reality of the university's data centers in their natural dynamics. the study does not create situations in order to observe what changes in the environment from a created situation, but seeks to describe, explain, and predict reality from one approach. the research design developed in the project corresponds to a qualitative, transactional design. qualitative because it is based on a working hypothesis, defined as: the data centers of the institution are not friendly to the environment, which runs counter to the institutional guideline of care for the environment. transactional because the measurements are taken at a single moment in time. for each metric, the tabulated results will be shown, after the application of the respective formulas, for each location. it is expected in the future to perform a multivariate statistical analysis to compare the gap between the real values and the standard values of each metric reviewed in this study.
iii. results and discussions
as noted above, data centers consume a significant amount of energy. according to wang [8], the taxonomy of performance metrics of "green computing" shown in fig. 3 is defined, which verifies whether or not a data center is green.
fig. 3. taxonomy of “green computing” performance metrics [8] (basic metrics: greenhouse gas, humidity, thermal metrics, power/energy metrics; extended metrics: multiple indicators, total cost of ownership)
for this study, an emphasis was placed on the basic metrics, precisely the power and energy metrics, considering the characteristics of the data centers that were evaluated.
a. power/energy metrics
a data center is a special building used to house computer systems and associated components, such as telecommunications and storage systems, backup power sources, redundant communication connections, environmental controls, and security devices. energy consumption varies greatly depending on the building [9]. for this reason, it is essential to identify the components and devices that make up the data centers of the itsa university institution. the itsa university institution has two data centers, one on the soledad campus and the other on the barranquilla campus. table 1 lists the devices that make up the data centers of the soledad and barranquilla campuses, correspondingly. the power/energy metrics defined are those shown in fig. 4. considering the data collected in the data centers of the university institution itsa, an analysis is made of their energy efficiency based on the metrics: data center infrastructure efficiency (dcie), power usage effectiveness (pue), hvac system effectiveness, and swap.
b. data center infrastructure efficiency metric (dcie)
the data center infrastructure efficiency (dcie) metric is widely accepted by the green grid to help it professionals determine the energy efficiency of data centers and monitor the impact of their efficiency efforts [10]. it is given by (1):
dcie = (total power of it equipment / total power of data center) × 100 % (watts) (1)
the higher the value of the dcie, the more efficient the data center infrastructure; best practice is a value above 70 % [8], and it should never be greater than 100 % [11]. according to the green grid, the efficiency of the use of power is measured by levels of efficiency, as shown in table 2. table 1.
devices of the data center of the university institution itsa – campus soledad and barranquilla
no | device | description | quantity (soledad) | quantity (barranquilla)
1 | firewall | sophos wg450 | 2 | –
2 | switch | cisco 2960 | 3 | 2
3 | switch | quidway s3300 | 1 | –
4 | switch | tplink tl-sg3424 | 1 | –
5 | storage | hp storageworks p2000 | 2 | –
6 | router | cisco 2900 | 1 | –
7 | servers | hp proliant dl 380p gen8 | 3 | 1
8 | conditioned air | comfortstar | 1 | –
9 | ups | galleon x9b 6k | 1 | –
10 | switch | huawei s2300 | – | 1
11 | switch | catalyst 3560 | – | 1
12 | transceiver | raisecom rc001 | – | 1
13 | router | cisco 3925 | – | 1
fig. 4. taxonomy of power consumption metrics [8] (dcie, pue, hvac system effectiveness, dcip, swap)
the measurements were taken in the data center of the barranquilla campus of the university institution itsa, and the data necessary to calculate the dcie are listed in table 3. the measured infrastructure efficiency of the barranquilla campus is 33.78 %, which indicates that it is inefficient according to the levels established in table 2, and suggests that measures must be taken in this regard. the measurements were taken in the data center of the soledad campus of the university institution itsa, and the data necessary to calculate the dcie are listed in table 4. the measured infrastructure efficiency of the soledad campus is 42.26 %, which indicates that the use of power on that campus is average according to table 2; it is advisable to follow up.
c. power usage effectiveness (pue)
the power usage effectiveness (pue) metric allows measuring the efficiency of the energy use of a data center. it was created by the green grid organization and is calculated using eq. (2):
pue = (total power of data center / total power of it equipment) (watts) (2)
table 2.
Table 2. Levels of efficiency according to the Green Grid

PUE   DCiE   Level of efficiency
3.0   33 %   Very inefficient
2.5   40 %   Inefficient
2.0   50 %   Average
1.5   67 %   Efficient
1.2   83 %   Very efficient

Table 3. Calculation of the DCiE in the data center of the Barranquilla campus

Qty  Device           Description      Unit consumption      Total consumption
1    Switch           Huawei S2300     12.8 W – 38 W         38 W
2    Switch           Catalyst 2960    464 W – 870 W (PoE)   1,740 W
1    Switch           Catalyst 3560    449 W                 449 W
1    Transceiver      Raisecom RC001   36 W – 72 W DC        72 W
1    Router           Cisco 3925       85 W – 400 W          400 W
     Total consumption of IT                                 2,699 W
1    Conditioned air  ComfortStar      5,290 W               5,290 W
     Total consumption of data center                        7,989 W
     DCiE Barranquilla                                       33.78 %

Table 4. Calculation of the DCiE in the data center of the Soledad campus

Qty  Device           Description               Unit consumption        Total consumption
2    Firewall         Sophos WG450              66 W – 180 W            360 W
2    Switch           Cisco 2960                464 W – 870 W (PoE)     1,740 W
1    Switch           Quidway S3300             100 V – 264 V           264 W
1    Switch           TP-Link TL-SG3424         100 V – 240 V           240 W
1    Switch           Cisco 2960                464 W – 870 W (PoE)     870 W
2    Storage          HP StorageWorks P2000     374 W – 432 W           390 W
1    Router           Cisco 2900                80 W – 360 W            360 W
3    Servers          HP ProLiant DL380p Gen8   460 W, 750 W – 1200 W   3,600 W
     Total consumption of IT                                            7,824 W
1    Conditioned air  ComfortStar               5,290 W                 5,290 W
1    UPS              Galleon X9B 6K            5,400 W                 5,400 W
     Total consumption of data center                                   18,514 W
     DCiE Soledad                                                       42.26 %

The PUE value indicates the relationship between the energy consumed by all data center equipment and the energy used by the IT equipment; the difference covers, for example, the air conditioners needed to maintain the temperature at which the IT equipment can work safely.
The PUE is an instantaneous snapshot of electrical energy consumption, which conveys an understanding of the minimum possible use of energy; therefore, averaging this metric over a significant period, for example one year, has been proposed to understand the energy efficiency of the data center better and to develop energy rating systems [12]. The higher the PUE value, the lower the efficiency of the installation, as more "overhead" energy is consumed to feed the electrical load. The ideal PUE value is 1, which indicates the maximum achievable efficiency with no overhead energy [9]. This metric has one disadvantage: it only measures the efficiency of the building infrastructure that supports a given data center and indicates nothing about the efficiency of the IT equipment itself [5]. The measurements were taken in the data center of the Barranquilla campus of the ITSA university institution, and the data necessary to calculate the PUE are listed in Table 5. The measured power usage effectiveness of the Barranquilla campus is 2.96, which indicates that the use of power on that campus is inefficient according to the levels in Table 2, suggesting that measures must be taken in this regard. The measurements were likewise taken in the data center of the Soledad campus, and the data necessary to calculate the PUE are listed in Table 6. The measured power usage effectiveness of the Soledad campus is 2.37, which indicates that the use of power on that campus is average according to Table 2; it is advisable to follow it up.
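The PUE of Eq. (2) is the reciprocal view of the DCiE, and the averaging over a period suggested in [12] is straightforward to express. A minimal sketch (function names are our own):

```python
def pue(total_facility_power_w: float, it_power_w: float) -> float:
    """Power Usage Effectiveness (Eq. 2): facility watts per IT watt."""
    return total_facility_power_w / it_power_w

def average_pue(readings) -> float:
    """Average PUE over (facility_w, it_w) samples taken across a period,
    as [12] recommends, rather than relying on one instantaneous value."""
    return sum(f / i for f, i in readings) / len(readings)

pue_barranquilla = pue(7989, 2699)   # ~2.96 (Table 5)
pue_soledad = pue(18514, 7824)       # ~2.37 (Table 6)
```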
Table 5. Calculation of the PUE in the data center of the Barranquilla campus. The consumption figures repeat those of Table 3: total IT consumption 2,699 W and total data center consumption 7,989 W, giving a PUE of 2.96 for Barranquilla.

Table 6. Calculation of the PUE in the data center of the Soledad campus. The consumption figures repeat those of Table 4: total IT consumption 7,824 W and total data center consumption 18,514 W, giving a PUE of 2.37 for Soledad.

D. HVAC System Efficiency Metric

The HVAC (heating, ventilation, and air conditioning) system is the part of the data center responsible for maintaining appropriate environmental conditions for the proper functioning of the devices housed there. A typical data center includes air conditioning, ventilation, a large central plant, lights, and miscellaneous loads [8]. Data center designs vary; however, a typical design, and good practice, is to organize the cabinets as islands so that the use of resources can be optimized, as shown in Fig. 5 [13].
The efficiency metric of the HVAC system is calculated as the ratio between the energy consumption of the IT equipment of the data center and the consumption of the HVAC system plus the amounts of fuel, steam, and chilled water multiplied by 293, as shown in Eq. (3):

HVAC Efficiency = IT Electrical Energy / (HVAC Electrical Energy + 293 × (Fuel + Steam + Chilled Water)) (3)

A low HVAC efficiency value implies that the HVAC system is using a large amount of energy and therefore has a high potential to be optimized. According to a database of data centers surveyed by Lawrence Berkeley National Laboratory, the effectiveness of the HVAC system can vary from 0.6 to 3.5 [14], as outlined in Table 7. Since the data centers of the ITSA university institution do not use cooling mechanisms such as fuel, steam, or chilled water, the calculation reduces to the ratio between the amount of energy consumed by the IT devices and the amount of energy consumed by the HVAC devices. The measurements were taken in the data centers of the Barranquilla and Soledad campuses of the ITSA university institution, and the data necessary to calculate the HVAC efficiency are listed in Tables 8 and 9.

Table 7. HVAC efficiency levels according to the study conducted by Lawrence Berkeley National Laboratory

HVAC efficiency  Efficiency level
0.7              Standard
1.4              Good
2.5              Best

Table 8. Calculation of the HVAC effectiveness in the data center of the Barranquilla campus. The consumption figures repeat those of Table 3: total IT consumption 2,699 W, air conditioning (ComfortStar) 5,290 W, and total data center consumption 7,989 W, giving an HVAC effectiveness of 1.51 for Barranquilla.

Fig. 5. Typical hot/cold aisle design [13]
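Equation (3) can be evaluated with a small helper. Note that the effectiveness values reported in Tables 8 and 9 (1.51 and 3.5) are reproduced when the total data center consumption, rather than the IT consumption alone, is placed in the numerator; the sketch below follows that convention so the reported figures can be checked (the names and this reading of the tables are our assumptions):

```python
def hvac_effectiveness(energy_w: float, hvac_energy_w: float,
                       fuel: float = 0.0, steam: float = 0.0,
                       chilled_water: float = 0.0) -> float:
    """HVAC system effectiveness (Eq. 3). The 293 factor converts the
    fuel, steam and chilled-water quantities (assumed MMBtu) to kWh."""
    return energy_w / (hvac_energy_w + 293.0 * (fuel + steam + chilled_water))

# Neither ITSA campus uses fuel, steam or chilled water, so the metric
# reduces to a simple ratio against the air-conditioning load:
hvac_barranquilla = hvac_effectiveness(7989, 5290)   # ~1.51 (Table 8)
hvac_soledad = hvac_effectiveness(18514, 5290)       # ~3.50 (Table 9)
```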
The measured efficiency of the HVAC system in the Barranquilla campus is 1.51, which indicates that its effectiveness is good compared with the other metrics and that its optimization potential is therefore small, per the levels in Table 7. Likewise, the measured efficiency of the HVAC system in the Soledad campus is 3.5, which indicates that its effectiveness is good, while it still has potential to be optimized.

E. Space, Watts, and Performance (SWaP) Metric

The space, watts, and performance (SWaP) metric measures energy efficiency as the benchmark performance of the server contrasted with the product of the energy consumed and the space used, measured in rack units (RU), as expressed in Eq. (4):

SWaP = Performance / (Space × Power) (4)

Sun introduced the SWaP metric to compare the efficiency of servers regarding space and energy consumption; however, the metric can also be applied to network or storage equipment [14]. In the case of the data centers of the ITSA university institution, the measurements were made per device and per rack, including servers, network, and storage devices, as shown in Tables 10 and 11. When comparing the SWaP values of the switches located in the Barranquilla data center, it is evident that the Huawei S2300 switch is more efficient concerning space and power consumption than the other installed switch models (Cisco Catalyst 2960 and Cisco Catalyst 3560).
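The SWaP percentages in Tables 10 and 11 can be reproduced from Eq. (4) by dividing the benchmark performance by the product of the occupied rack units and the consumed power. A minimal sketch (the function name is our own):

```python
def swap_pct(performance_w: float, space_ru: float, power_w: float) -> float:
    """SWaP (Eq. 4) expressed as a percentage, as in Tables 10 and 11."""
    return 100.0 * performance_w / (space_ru * power_w)

# Switches in the Barranquilla data center (rows of Table 10):
huawei_s2300 = swap_pct(38, 1, 38)       # 100.00, the most efficient
catalyst_2960 = swap_pct(740, 2, 1740)   # ~21.26
catalyst_3560 = swap_pct(60, 1, 449)     # ~13.36
```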
Table 9. Calculation of the HVAC effectiveness in the data center of the Soledad campus. The consumption figures repeat those of Table 4: total IT consumption 7,824 W, air conditioning (ComfortStar) 5,290 W, and total data center consumption 18,514 W, giving an HVAC effectiveness of 3.5 for Soledad.

Table 10. Calculation of the SWaP in the data center of the Barranquilla campus

Qty  Device       Description     Unit consumption    Total cons. (W)  Unit perf. (W)  Total perf. (W)  RU per device  RU total  SWaP (%)
1    Switch       Huawei S2300    12.8 – 38 W         38               38              38               1              1         100.00
2    Switch       Catalyst 2960   464 – 870 W (PoE)   1,740            370             740              1              2         21.26
1    Switch       Catalyst 3560   449 W               449              60              60               1              1         13.36
1    Transceiver  Raisecom RC001  36 – 72 W DC        72               15              15               2              2         10.42
1    Router       Cisco 3925      85 – 400 W          400              100             100              1              1         25.00
     Total                                            2,699                            953                             7         5.04

When comparing the SWaP values of the switches located in Soledad's data center, it is evident that the Quidway S3300 switch is more efficient regarding space and power consumption than the other installed switch models (Cisco Catalyst 2960 and TP-Link TL-SG3424). On the other hand, the SWaP of the routers shows that the router located in Soledad offers higher efficiency than the one installed in Barranquilla; and regarding the efficiency of the racks, Soledad's rack is more efficient than Barranquilla's concerning power and space, although the difference is not substantial.

IV. Conclusion

During this research, the energy efficiency metrics applied to the data centers of the ITSA university institution, at the Barranquilla and Soledad campuses, were evaluated. Each of these metrics is recognized in the scientific field and is valid to determine whether a data center is friendly to the environment. It is essential to take them into account in order to undertake improvement projects that have a favorable impact on the environment that surrounds us. As a compendium of each metric, the following can be said.
The DCiE metric indicates that the infrastructure of the ITSA data centers can improve its efficiency, since the current performance is inefficient in Barranquilla and average in Soledad. The PUE metric confirms the results obtained with the previous metric; however, it is recommended that this metric be evaluated as an average of measurements over periods of time established by the analyst, in order to control and monitor the energy efficiency strategies implemented. The HVAC system efficiency metric captures the relationship in energy efficiency between the HVAC equipment and the IT equipment. In our case, it indicated that the Barranquilla data center's HVAC system is efficient and that its optimization potential is therefore small. On the other hand, the Soledad data center's HVAC system has good energy efficiency, and it is still possible to optimize it. Overall, the data centers of the ITSA university institution make inefficient use of energy, even though the HVAC systems installed in both data centers are relatively efficient, so it is recommended to carry out an improvement plan that optimizes energy efficiency. SWaP is a metric initially presented by Sun to allow the comparison of different servers regarding power and space; using the same principles, however, it can be applied to other network and storage devices. In our research, we could see which of the installed devices use energy more efficiently than the others; it could also be observed that Soledad's data center is slightly more efficient than Barranquilla's concerning the performance-to-power-and-space ratio. As future work starting from this project, the following could be developed: propose a multicriteria methodology to evaluate the energy performance of data centers, or perform an analysis of the practices
used in different data centers to improve energy consumption [15][16]. One could even propose the design of an application based on the Internet of Things (IoT) to measure the performance and consumption levels of the data centers [6]. Another future work is to calculate the gap between the real values obtained for each metric and the standard values they should reach, and to analyze how far we are from a data center that is truly friendly to the environment.

Table 11. Calculation of the SWaP in the data center of the Soledad campus

Qty  Device    Description              Unit consumption    Total cons. (W)  Unit perf. (W)  Total perf. (W)  RU per device  RU total  SWaP (%)
2    Firewall  Sophos WG450             66 – 180 W          360              83              166              1              2         23.06
3    Switch    Catalyst 2960            464 – 870 W (PoE)   2,610            370             1,110            1              3         14.18
1    Switch    Quidway S3300            100 – 264 V         264              92              92               1              1         34.85
1    Switch    TP-Link TL-SG3424        100 – 240 V         240              23.3            23.3             1              1         9.71
2    Storage   HP StorageWorks P2000    374 – 432 W         390              390             780              2              4         50.00
1    Router    Cisco 2900               80 – 360 W          360              210             210              1              1         58.33
3    Servers   HP ProLiant DL380p Gen8  460, 750 – 1200 W   3,600            1,200           3,600            1              3         33.33
     Total                                                  7,824                            5,981.3                         15        5.10

References

[1] E. Baccour, S. Foufou, R. Hamila, Z. Tari, and A. Y. Zomaya, "PTNet: An efficient and green data center network," J. Parallel Distrib. Comput., vol. 107, pp. 3–18, Sep. 2017.
[2] M. S. Obaidat, A. Anpalagan, I. Woungang, Y. Zhang, and N. Ansari, "Green data centers," in Handbook of Green Information and Communication Systems, Academic Press, 2013, pp. 331–352.
[3] L. Hernandez and G. Jimenez, "Characterization of the current conditions of the ITSA data centers according to standards of the green data centers friendly to the environment," Adv. Intell. Syst. Comput., vol. 574, pp. 329–340, 2017.
[4] J.-M. Pierson, Large-Scale Distributed Systems and Energy Efficiency: A Holistic View. New Jersey, 2015.
[5] A. Shehabi, S. J. Smith, N. Horner, I. Azevedo, R. Brown, J. Koomey, E. Masanet, D. Sartor, M. Herrlin, and W. Lintner, "United States data center energy usage report," 2016.
[6] L. Hernandez, A. Pranolo, I. Riyanto, Y. Calderon, and H. Martinez, "Design of a system for detection of environmental variables applied in data centers," in 2017 3rd International Conference on Science in Information Technology (ICSITech 2017), 2017, pp. 389–395.
[7] R. Hernandez Sampieri, C. Fernandez Collado, and M. del P. Baptista Lucio, Metodología de la Investigación. 2010.
[8] L. Wang and S. U. Khan, "Review of performance metrics for green data centers: a taxonomy study," J. Supercomput., vol. 63, no. 3, pp. 639–656, 2013.
[9] M. Sharma, K. Arunachalam, and D. Sharma, "Analyzing the data center efficiency by using PUE to make data centers more energy efficient by reducing the electrical consumption and exploring new strategies," Procedia Comput. Sci., vol. 48, pp. 142–148, Jan. 2015.
[10] 42U Data Center Solutions, "42U Data Center Solutions," 2018. [Online]. Available: https://www.42u.com/measurement/pue-dcie.htm. [Accessed: 21-Jul-2018].
[11] Pablo Fernández, "Las nuevas métricas de Green Grid para la eficiencia energética" ("The new Green Grid metrics for energy efficiency"), Silicon, Mar. 2009.
[12] EPA, "Report to Congress on server and data center energy efficiency, Public Law 109-431," Berkeley, California, 2007.
[13] J. Yuventi and R. Mehdizadeh, "A critical analysis of power usage effectiveness and its use in communicating data center energy consumption," Energy Build., vol. 64, pp. 90–94, Sep. 2013.
[14] W. Lintner, B. Tschudi, and O. VanGeet, "Best practices guide for energy-efficient data center design," U.S. Dep. Energy, Mar. 2011, pp. i–24.
[15] B. Dennis, "Five ways to reduce data center server power consumption," Green Grid, 2009.
[16] J. Judge, J. Pouchet, A. Ekbote, and S. Dixit, "Reducing data center energy consumption," ASHRAE J., vol. 50, Nov. 2008.
Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 2, No 2, December 2019, pp. 82–89, https://doi.org/10.17977/um018v2i22019p82-89. ©2019 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Handwriting Character Recognition Using Vector Quantization Technique

Haviluddin a,1,*, Rayner Alfred b,2, Ni'mah Moham a,3, Herman Santoso Pakpahan a,4, Islamiyah a,5, Hario Jati Setyadi a,6

a Faculty of Computer Science and Information Technology, Universitas Mulawarman, Jl. Kuaro, Gn. Kelua, Kota Samarinda, East Kalimantan, Indonesia
b Faculty of Computing and Informatics, Universiti Malaysia Sabah, 88400 Kota Kinabalu, Sabah, Malaysia
1 haviluddin@unmul.ac.id *; 2 ralfred121@gmail.com; 3 nimahmoham@gmail.com; 4 herman.pakpahan@fkti.unmul.ac.id; 5 islamiyah@fkti.unmul.ac.id; 6 hario.setyadi@fkti.unmul.ac.id
* corresponding author

I. Introduction

It is unquestionable that Indonesia possesses a rich cultural heritage. One substantial part of this heritage is the manifestation of local languages with their native scripts. The Lontara script of the Buginese language of Makassar is one instance of a native script from South Sulawesi. As a substantial cultural heritage of the Buginese tribe in Makassar, the Lontara script ought to be taken seriously by the public, particularly by the local Buginese community in South Sulawesi; otherwise, it could disappear due to rapid modernization. At present, the Lontara script is endangered and requires meticulous attention, since only limited data and information regarding this script are available. In order to assist the local community as well as the public in recognizing the Buginese Lontara script pattern from Makassar, several attempts have been conducted.
With the rapid development of technology, some attempts employed technology-based approaches. One approach was to use artificial intelligence (AI) to develop a technology that recognizes script patterns. Essentially, AI aims to build systems that enable a computer to perform work normally done by humans; to do so, the computer must be equipped with the knowledge a human possesses so that it can reproduce human behavior. This paper intends to implement learning vector quantization (LVQ) in recognizing handwritten Buginese Lontara script patterns from Makassar. In the near future, this work aims at assisting the local community in learning the Buginese Lontara script. This paper is divided into four parts. First, it discusses the research motivation: why the researchers were intrigued to conduct this research. Second, it presents the methodology and techniques employed by the researchers. Third, it discusses the LVQ assessment results. Finally, it sums up the assessment results and provides suggestions and recommendations for future research.

Article history: Received 13 October 2019; Revised 9 December 2019; Accepted 13 December 2019; Published online 23 December 2019.

Abstract: This paper seeks to explore the learning vector quantization (LVQ) processing stages to recognize the Buginese Lontara script from Makassar, as well as explaining its accuracy. The testing results of LVQ obtained an accuracy of 66.66 %. The most optimal variant of the network architecture in the recognition process was a learning rate of 0.02, a maximum epoch of 5000, and a hidden layer of 90 neurons, which was the result of recognition based on feature 8. With these variations, the obtained performance had a mean square error (MSE) of 0.0306, and the time required for the learning process was quite short, 6 minutes and 38 seconds. Based on the testing results, the LVQ method has not yet been able to provide good recognition results and still requires development to generate better ones.

Keywords: Lontara script; Pattern recognition; Learning vector quantization; Mean square error

II. Materials and Methods

This part briefly discusses the Buginese Lontara script pattern, the LVQ method, and the image processing employed in this research.

A. Lontara Script

Lontara is a traditional script originating from the Buginese community in Makassar. Its scriptwriting is based on the sulapa eppa wala suji. Wala suji derives from the words wala, meaning divider/fence/keeper, and suji, meaning princess. A wala suji is a kind of trapezoid bamboo fence used in traditional rituals. Sulapa eppa (four sides) is a classic mystical belief of the Buginese in Makassar, representing the four elements of the universe: fire, water, air, and earth. Formerly, the Lontara script was used to write governmental regulations and rules as well as the social norms applied within the community; the manuscripts were written using a stick [1]. The type of data used in this study was primary data, obtained from the original or first source. Such data is not available in compiled form or in files; it must be sought through sources, commonly referred to as respondents, who play the role of the persons from whom the data is obtained [2]. Data samples of the 23 Buginese script characters can be seen in Table 1.

B. Pattern Recognition

Pattern recognition is a constituent of artificial intelligence science.
Several scholars have discussed definitions of pattern recognition since the initial research: pattern recognition deals with the recognition of a physical object or a certain event, classified into single or multiple categories [3][4][5]. In the same way, it is a science which emphasizes the description and classification of certain measurements [6][7][8][9].

C. Handwriting Recognition

According to Plamondon and Srihari, handwriting recognition constitutes a process of converting a language manifested in spatial form into a symbolic representation, i.e., converting a handwriting pattern into symbols [10]. In principle, the handwriting processing stages include data acquisition, pre-processing, feature extraction, and machine learning algorithms such as BPNN, RBFNN, SVM, LVQ, and so forth [11][12][13][14].

D. Image Processing

Image processing constitutes a number of processes to improve image quality to meet the demands and requirements of users for accessible and uncomplicated interpretation by humans and/or computer machines. Image processing processes a signal whose input is an image, and generates an image or a group of characteristics or parameters correlated with the image. Image processing is basically a system intended to classify objects into categories and classes based on either a priori knowledge or statistical information taken from the object pattern [3][6][7][15].

Table 1. Script samples. The 23 Lontara characters and their alphabetical transliterations are: ka, ga, nga, ngka, pa, ba, ma, mpa, ta, da, na, nra, ca, ja, nya, nca, ya, ra, la, wa, sa, a, ha. (The Lontara glyphs themselves appear in the original table.)

1) Image acquisition

Image acquisition constitutes the capturing or scanning process of a certain analog image and its conversion to digital form. Commonly, this stage begins by capturing the image of the object using a scanner as the primary medium.
This paper captured the Buginese Lontara script from Makassar from 16 different respondents using A4-sized paper.

2) Cropping and grayscale conversion

First, cropping: object cropping is the process of cutting the image so that only the area containing the object remains. This was done to remove unnecessary empty areas around the object and to avoid learning errors caused by objects (letter writing) being located in different positions. Cropping was performed by cutting 23 points of the image simultaneously, producing 23 character images ready to be processed by the next stage. The cropping step was kept separate from the other parts of the image processing. Next, the RGB image was converted to grayscale; a grayscale image is one whose pixel intensity values are based on gray levels. At this stage, the RGB image was converted into a grayscale image with only one color channel [16][17].

3) Image sharpening

Image sharpening, commonly referred to as image transformation, is one process of image quality enhancement. It is commonly employed to increase the color contrast and brightness of an image, and it is intended to simplify the interpretation and analysis of the image. Image contrast sharpening corrects the image display by maximizing the contrast between the light and dark parts of the image [17][18].

4) Image segmentation

Thresholding is one of the image segmentation methods; it separates the object and the background in an image based on the brightness level. Thresholding changes a colorful or grayscale image into a binary (black-and-white) image. The thresholding process takes the color value of each image pixel and compares it with the threshold value.
Each image pixel is then set to white if its color or grayscale value is above the threshold; conversely, if the value is below the threshold, the pixel is set to black [19].

5) Size feature extraction

First, image resizing: resizing is the process of reducing the size of an image in terms of its number of pixels, for example from 186 × 186 pixels down to 42 × 42 pixels. Resizing is needed to reduce the number of pixels while preserving the shape of the object, so that it does not change significantly. Reducing the number of pixels limits the number of nodes or neurons in the input layer of the artificial neural network, which would otherwise require longer processing time. Then, image thinning: thinning uses an iterative process that removes black pixels (turning them into white pixels) at the edges of the pattern. The purpose of thinning is to remove the unnecessary (redundant) parts so that only the important information remains [20].

E. Learning Vector Quantization (LVQ)

The LVQ method is a classification algorithm that uses connected vectors and works competitively but under supervision. The LVQ network is composed of competitive layers (composed of several neurons) that classify input vectors based on the principle of competition, in which the class is determined by the winner of the competition. The winner is determined by the distance of the input vector from the reference vectors. LVQ is widely used in pattern recognition and data classification; the method is quite simple but very effective [21][22]. The LVQ architecture consists of an input layer, a competitive layer (where inputs compete to be assigned to a class based on the proximity of their distance), and an output layer. The input layer is connected to the competitive layer by weights.
In the competitive layer, the learning process is supervised: inputs compete to be assigned to a class. The stages of LVQ are as follows:

1. Enter the training data; the input data are the digital images resulting from the image processing stages, ready to be processed by the pattern recognition system.
2. Determine the weights (w), the maximum iteration count (maximum epoch), the expected minimum error (eps), and the learning rate (α).
3. Set the initial conditions: epoch = 0, error = 1.
4. Do while (epoch < maximum epoch) or (α > eps):
   a. epoch = epoch + 1;
   b. for i = 1 to n:
      i. determine j such that ||x − wj|| is minimum;
      ii. correct wj with the following rule:
          if t = j, then wj(new) = wj(old) + α(x − wj(old));
          if t ≠ j, then wj(new) = wj(old) − α(x − wj(old));
      iii. then reduce the value of α for the next iteration.

F. Performance Accuracy

Statistical methods to measure the accuracy of an algorithm include the mean absolute error (MAE), mean square error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE); the measurement aims at attaining the best value [23][24][25][26]. In this study, the MSE method was chosen to measure accuracy. The MSE is defined in (2):

MSE = (1/M) Σ (x_i − x̂_i)² (2)

where x_i is the actual data value, x̂_i is the resulting (predicted) value, and M is the number of patterns.

III. Results and Discussions

This part discusses the results of image processing and feature extraction, followed by the assessment using learning vector quantization (LVQ).

A. Image Processing Results

Image processing intends to increase image quality and produce an image in accordance with the demands of the user, so that it can be easily interpreted by humans or machines.
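For concreteness, the LVQ update rule and the MSE of Eq. (2) described above can be sketched in a few lines of Python. This is an illustrative implementation under our own naming; the α decay factor is an assumption, since the paper does not state the decay schedule:

```python
import math

def lvq1_train(samples, targets, weights, weight_classes,
               alpha=0.02, max_epoch=100, eps=1e-6, decay=0.95):
    """LVQ1: pull the winning weight vector toward a sample of the same
    class, push it away otherwise; alpha is reduced every epoch."""
    epoch = 0
    # Loop until either the epoch budget or the minimum alpha is reached.
    while epoch < max_epoch and alpha > eps:
        epoch += 1
        for x, t in zip(samples, targets):
            # Winner j minimises the Euclidean distance ||x - w_j||.
            j = min(range(len(weights)), key=lambda k: math.dist(x, weights[k]))
            sign = 1.0 if weight_classes[j] == t else -1.0
            weights[j] = [w + sign * alpha * (xi - w)
                          for w, xi in zip(weights[j], x)]
        alpha *= decay  # reduce alpha for the next iteration
    return weights

def mse(actual, predicted):
    """Mean square error over M patterns (Eq. 2)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

On the feature vectors of this study, the same loop applies unchanged; only the vector length (9, 18, or 45) and the 23 output classes differ.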
Based on the stages of image processing, the first stage consisted of cropping and its related processes: image acquisition, conversion of the original images to grayscale, image quality improvement, and image sharpening. Second, in the image segmentation process, the grayscale image was converted into a binary image using the threshold method. Then, size feature extraction was done by resizing the image to 42 × 42 and thinning it. Third, the feature extraction process was carried out to obtain the required data. The cropping process results are presented in Figure 1. Next came the segmentation process. In this study, the chessboard method, which divides the image into square objects of a certain size, was used. The image was divided into nine segments, which were then used by the IOC to count the number of black pixels in each segment or section. The results of segmentation can be seen in Figure 2. Last, feature extraction was performed in each segment by using mask direction to calculate several values: vertical (vert), horizontal (horz), left diagonal (dig1), and right diagonal (dig2) masking. The feature extraction process was carried out on 16 × 23 letters of data, which then formed 8 variations of features for each Buginese Lontara script character; these features were used as input to the network. The amount of data used can be seen in Table 2. In this experiment, the results of the feature extraction process were saved into a dataset with the file extension .xlsx; the feature data sets from the 1st to the 8th feature were then used as training and testing data, analyzed by the BPNN method.
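The nine-segment black-pixel count described above (the [black] feature of Table 2) is easy to sketch. This illustration assumes a binary image encoded with 1 for black pixels and 0 for white, and evenly split 3 × 3 segments; the function names are our own:

```python
def threshold(gray, t=128):
    """Binarize a grayscale image: pixels darker than t become black (1)."""
    return [[1 if p < t else 0 for p in row] for row in gray]

def segment_black_counts(binary_img, grid=3):
    """Split the image into grid x grid segments (nine in this study) and
    count the black pixels in each, as the IOC step does."""
    h, w = len(binary_img), len(binary_img[0])
    counts = []
    for gi in range(grid):
        for gj in range(grid):
            counts.append(sum(binary_img[i][j]
                              for i in range(gi * h // grid, (gi + 1) * h // grid)
                              for j in range(gj * w // grid, (gj + 1) * w // grid)))
    return counts

# A 6 x 6 toy image whose left half is dark yields the 9-value feature:
features = segment_black_counts(threshold([[0] * 3 + [255] * 3 for _ in range(6)]))
```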
image calculation with learning vector quantization (lvq) method
the lvq network architecture used was similar to the previous bpnn network architecture, which consists of an input layer, a hidden layer, and an output layer. the number of hidden layers can be changed as required. the input layer consists of 9, 18, or 45 neurons, which are combinations of the 8 different features obtained from the feature extraction process. the output layer consists of 23 neurons, since the target output is the number of lontara alphabets, consisting of 23 letters. network training using the lvq method used 230 training data and 138 test data. in this study, network training using the lvq method was set up in the same way as the bpnn method; it is influenced by the learning rate, the number of epochs, and the number of neurons in the hidden layer.

fig. 1. image processing
fig. 2. image segmentation divided into nine segments

table 2. features extraction (feature: combination = amount of feature data)
feature 1: [black] = 9
feature 2: [dig1] = 9
feature 3: [dig2] = 9
feature 4: [black + dig1] = 9 + 9 = 18
feature 5: [black + dig2] = 9 + 9 = 18
feature 6: [horz + vert] = 9 + 9 = 18
feature 7: [dig1 + dig2] = 9 + 9 = 18
feature 8: [black + dig1 + dig2 + horz + vert] = 9 + 18 + 18 = 45

simulations performed with the lvq method mirrored those of the bpnn method, so that both methods could be compared with similar parameters. simulations were performed 6 times with 3 different parameters: the learning rate, the number of epochs with a maximum limit, and the hidden layer. the variation of each parameter can be seen in table 3. based on the 6 simulations performed, the 4th simulation was the best calculation using the lvq method, with one hidden layer with a total of 90 neurons, a maximum number of epochs of 5000, and a target error of 0. the learning function used was learnlv1, and the output layer had 32 neurons.
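the lvq1 training scheme described in section ii (winner selection by minimum distance, then the signed weight update, with a decaying learning rate) can be sketched as follows. this is a minimal illustration, not the authors' matlab/learnlv1 setup, and the tiny 2-d dataset is hypothetical:

```python
import math

def lvq1_train(data, labels, weights, classes, alpha=0.02, max_epoch=100, eps=1e-4):
    """minimal lvq1: move the winning codebook vector toward the input if the
    class matches (t = j), away from it otherwise (t != j)."""
    epoch = 0
    # stop when either the epoch budget or the learning-rate floor is reached
    while epoch < max_epoch and alpha > eps:
        epoch += 1
        for x, t in zip(data, labels):
            # find codebook j with minimum euclidean distance ||x - wj||
            j = min(range(len(weights)), key=lambda k: math.dist(x, weights[k]))
            # signed update of the winner
            sign = 1.0 if classes[j] == t else -1.0
            weights[j] = [w + sign * alpha * (xi - w) for w, xi in zip(weights[j], x)]
        alpha *= 0.5  # reduce the learning rate each epoch
    return weights

# hypothetical 2-d data, two classes
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
labels = [0, 0, 1, 1]
w = lvq1_train(data, labels, [[0.2, 0.1], [0.8, 0.9]], classes=[0, 1])
```

after training, each codebook vector sits nearer the cluster of its own class, so classification reduces to a nearest-codebook lookup.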
the learning rate given in this simulation was 0.02. this network pattern was applied to feature combinations 1 to 8. the results of the 5th simulation can be seen in table 4. based on table 4, 92 letters out of 138 can be read properly and precisely, with an accuracy of 66.66 %, using feature combination 8. the performance (mse) obtained, 0.0306, was the second closest to 0 after feature 5, with an mse of 0.0276. the best epoch obtained was 184, with a learning time of 6 minutes 38 seconds.

c. learning vector quantization (lvq) method analysis
the analysis covers learning rate variation, and changes in the maximum epoch and the number of neuron units in the hidden layer. learning rate variation in the lvq method during the first and second simulations produced varied changes in accuracy; the accuracy of each feature combination may increase or decrease. overall, the results of this variation were less good, since the accuracy obtained was only around 42.02 %. it was also influenced by the maximum epoch. varying the maximum epoch during the third and fourth simulations can be stated to have an effect, yet it was similar to varying the learning rate: the accuracy increased and decreased for each combination, so accuracy did not consistently increase across feature combinations. when the accuracy increased, it affected the generated mse. all learning in this variation stopped at the best epoch according to the maximum epoch given, which was 10 epochs and 50 epochs, but the time spent in the learning process also increased overall: the greater the epoch, the longer the learning process. thus, the maximum epoch in this study also has a significant influence on learning time. however, changing the number of neuron units itself did not provide an increase that exceeded the highest value from the default value or reference point, the 5th simulation with the highest accuracy of 66.66 %.
the best results can be seen in table 4: taking into account the number of letters read with the shortest learning time, the 5th simulation recognized 92 letters out of 138, with an accuracy of 66.66 %, an mse of 0.0306, and a best epoch of 184 with a learning time of 6 minutes 38 seconds.

table 3. lvq assessment parameters
learning rate: 0.02; 0.03; 0.04
epoch: 10; 25; 50; 5000
hidden layer: 90; 120
learning function: learnlv1

table 4. fourth simulation results on the lvq network
(columns: hidden layer | testing data | correct data | accuracy | performance | best epoch | time)
feature 1: 90 | 138 | 77 | 55.79 % | 0.0367 | 91 | 02:55
feature 2: 90 | 138 | 23 | 16.67 % | 0.0684 | 31 | 11:11
feature 3: 90 | 138 | 21 | 15.21 % | 0.065 | 46 | 01:38
feature 4: 90 | 138 | 83 | 60.14 % | 0.0321 | 510 | 16:11
feature 5: 90 | 138 | 76 | 55.07 % | 0.0376 | 390 | 15:03
feature 6: 90 | 138 | 9 | 6.52 % | 0.0546 | 35 | 01:31
feature 7: 90 | 138 | 40 | 28.98 % | 0.0541 | 44 | 01:34
feature 8: 90 | 138 | 92 | 66.66 % | 0.0306 | 184 | 06:38

iv. conclusion
an analysis of the recognition of the buginese lontara script from makassar using the lvq method has been implemented. based on the results of the experiment, the lvq method has the highest accuracy rate of 66.66 %, obtained from the 5th simulation of feature 8, with 92 data recognized from a total of 138 input data. meanwhile, the testing time needed was 6 minutes 38 seconds. it can be said that parameters such as the learning rate, the number of neurons in the hidden layer, and the maximum number of epochs greatly affect the amount of recognizable data and the accuracy of the recognition results. in this study, the best parameters of the bpnn method, namely learning rate = 0.02, number of neurons in the hidden layer = 90, and epochs = 5000, were used to obtain the best accuracy. thus, the lvq method generated a recognition accuracy measured by an mse of 0.0306.
acknowledgement
this project has been completed thanks to the help and support of various parties who cannot be mentioned one by one. the researchers express significant appreciation to the family of the laboratory of artificial intelligence, faculty of computer science and information technology, universitas mulawarman, and the laboratory of artificial intelligence, faculty of computing and informatics, universiti malaysia sabah, who gave support to complete this project. hopefully, this research can be useful.

declarations

author contribution
all authors contributed equally as the main contributor of this paper. all authors read and approved the final paper.

funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

conflict of interest
the authors declare no conflict of interest.

additional information
no additional information is available for this paper.

references
[1] i. wihanry and p. chyan, “perancangan aplikasi pembelajaran aksara lontara dengan metode game based learning,” telemat. j. informatics inf. syst., vol. 3, no. 1, pp. 1–8, 2015.
[2] s. shilpa and m. kaur, “big data and methodology - a review,” int. j. adv. res. comput. sci. softw. eng., vol. 3, no. 10, pp. 991–995, 2013.
[3] d. oliva, m. abd elaziz, and s. hinojosa, “image processing,” in metaheuristic algorithms for image segmentation: theory and applications, 2019, pp. 27–45.
[4] b. d. ripley, pattern recognition and neural networks. cambridge university press, 2007.
[5] n. abramson, d. braverman, and g. sebestyen, “pattern recognition and machine learning,” ieee trans. inf. theory, vol. 9, no. 4, pp. 257–261, 1963.
[6] r. c. gonzalez and r. e. woods, digital image processing, 3rd ed. upper saddle river, nj, usa: prentice-hall, 2007.
[7] v. hlavac, “fundamentals of image processing,” in optical and digital image processing: fundamentals and applications, 2011.
[8] z. abidin and a.
alamsyah, “wavelet based approach for facial expression recognition,” int. j. adv. intell. informatics, vol. 1, no. 1, pp. 7–14, 2015.
[9] w. novan, “capital letter pattern recognition in text to speech by way of perceptron algorithm,” knowl. eng. data sci., vol. 1, no. 1, pp. 26–32, 2017.
[10] r. plamondon and s. n. srihari, “on-line and off-line handwriting recognition: a comprehensive survey,” ieee trans. pattern anal. mach. intell., vol. 22, no. 1, pp. 63–84, 2000.
[11] s. winardi and h. hamzah, “rancang bangun analisis pengenalan tulisan tangan aksara hanacaraka,” in seminar nasional teknologi informasi dan multimedia 2015, 2015, pp. 6–8.
[12] f. zamora-martínez, v. frinken, s. españa-boquera, m. j. castro-bleda, a. fischer, and h. bunke, “neural network language models for off-line handwriting recognition,” pattern recognit., 2014.
[13] i. yousif and a. shaout, “off-line handwriting arabic text recognition: a survey,” int. j. adv. res. comput. sci. softw. eng., vol. 4, no. 9, pp. 68–82, 2014.
[14] a. andana, r. widyati, and m. irzal, “pengenalan citra tulisan tangan dengan metode backpropagation,” j. mat. ter., vol. 2, no. 1, pp. 36–44, 2018.
[15] r. j. radke, s. andra, o. al-kofahi, and b. roysam, “image change detection algorithms: a systematic survey,” ieee trans. image process., vol. 14, no. 3, pp. 294–307, 2005.
[16] k. iqbal, m. odetayo, a. james, r. a. salam, and a. z. h. talib, “enhancing the low quality images using unsupervised colour correction method,” in conference proceedings ieee international conference on systems, man and cybernetics, 2010, pp. 1703–1709.
[17] r. chityala and s. pudipeddi, image processing and acquisition using python. chapman and hall/crc, 2014.
[18] h. k. sawant and m. deore, “a comprehensive review of image enhancement techniques,” int. j. comput. technol. electron. eng., vol. 1, no. 2, pp. 39–44, 2010.
[19] r. m. haralick and l. g.
shapiro, “image segmentation techniques,” comput. vision, graph. image process., vol. 29, no. 1, pp. 100–132, 1985.
[20] i. guyon and a. elisseeff, “feature extraction, foundations and applications: an introduction to feature extraction,” stud. fuzziness soft comput., pp. 1–25, 2006.
[21] i. afrianto and d. priatama, “aplikasi mobile pengenalan citra menggunakan metode learning vector,” pp. 39–44, 2013.
[22] c. l. liu, k. nakashima, h. sako, and h. fujisawa, “handwritten digit recognition: benchmarking of state-of-the-art techniques,” pattern recognit., vol. 36, no. 10, pp. 2271–2285, 2003.
[23] h. haviluddin, r. alfred, j. h. obit, m. h. a. hijazi, and a. a. a. ibrahim, “a performance comparison of statistical and machine learning techniques in learning time series data,” adv. sci. lett., vol. 21, no. 10, pp. 3037–3041, 2015.
[24] r. rojas, “the backpropagation algorithm,” in neural networks, berlin, heidelberg: springer, 1996, pp. 149–182.
[25] a. susanti, suhartono, h. j. setyadi, m. taruk, haviluddin, and p. p. widagdo, “forecasting inflow and outflow of money currency in east java using a hybrid exponential smoothing and calendar variation model,” j. phys. conf. ser., vol. 979, no. 1, pp. 1–13, mar. 2018.
[26] j. fürnkranz et al., “mean squared error,” in encyclopedia of machine learning, 2010.

knowledge engineering and data science (keds) pissn 2597-4602 vol 2, no 2, december 2019, pp. 47–57 eissn 2597-4637 https://doi.org/10.17977/um018v2i22019p47-57 ©2019 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

neural network classification of brainwave alpha signals in cognitive activities
ahmad azhari a, 1, *, adhi susanto b, 2, andri pranolo a, c, 3, yingchi mao c, 4
a informatics department, universitas ahmad dahlan, jl. prof. dr.
soepomo, sh, janturan, umbulharjo, yogyakarta, indonesia; b electrical and information engineering department, universitas gadjah mada, jl. grafika no 2, kampus ugm, yogyakarta, indonesia; c college of computer and information, hohai university, china, 1 xikang road, nanjing, jiangsu 210098
1 ahmad.azhari@tif.uad.ac.id *; 2 susanto@ugm.ac.id; 3 andri.pranolo@tif.uad.ac.id; 4 maoyingchi@gmail.com
* corresponding author

i. introduction
the brain-wave signal is one of the typical characteristics produced by the body. signals carry information and are represented in electrical signals produced from the brain in a typical waveform. human brain wave activity remains active even during sleep. brain waves produce different characteristics in different individuals, and physical and behavioral characteristics can be identified from patterns of brain wave activity. the electroencephalogram (eeg) is a tool used to measure and record the electrical potential of brain waves. the application of eeg signals in the biomedical field is part of the development of bio-signal processing. biomedicine combines several basic principles of science, including chemistry, biology, mathematics, and engineering. one of the widely pursued goals of biomedical applications is to help solve health-related problems [1]. eeg signals are an essential source used to study, monitor, and evaluate the activity of the human brain. in recent years, much research work has been developed related to eeg signals in the biomedical field [2]–[9]. eeg signals have also been developed and applied in the security field for user authentication [10]–[12]. the growth of eeg applications in the biomedical field is driven by the fact that eeg recordings can be easily obtained without side effects on the brain: eeg electrodes can be placed on the scalp without the need to penetrate it, as a conventional method would.
article info
article history: received 25 march 2019; revised 19 may 2019; accepted 19 may 2019; published online 23 december 2019
keywords: electroencephalogram; brainwave; neural network; feature extraction; cognitive task

abstract: the signal produced by human brain waves is one unique feature. signals carry information and are represented in electrical signals generated from the brain in a typical waveform. human brain wave activity will always be active even when sleeping. brain waves will produce different characteristics in different individuals. physical and behavioral characteristics can be identified from patterns of brain wave activity. this study aims to distinguish signals from each individual based on the characteristics of alpha signals from the brain waves produced. brain wave signals are generated by giving several mental perception tasks, measured using an electroencephalogram (eeg). to get different features, eeg signals are extracted using first-order extraction and are classified using the neural network method. the results of this study are typical of the five first-order features used, namely average, standard deviation, skewness, kurtosis, and entropy. the results of pattern recognition training show that 171 successful iterations were carried out with an execution period of 6 seconds. performance tests are performed using the mean squared error (mse) function. the performance obtained in the pattern test is 0.000994.

48 a. azhari et al. / knowledge engineering and data science 2019, 2 (2): 47–57

the development of the number of eeg electrodes used also makes it easier for researchers to obtain eeg signals. some research work was completed using single-channel eeg signals [1][12][13]. to be able to measure brainwaves, stimulation is needed to stimulate the brain.
the application of cognitive activity as a stimulus is one way to produce brainwaves that are specifically located in the realm of active thinking. brain wave oscillation can be seen in figure 1. the ability to focus attention on a particular activity requires a different level of concentration for each person. several factors, such as fatigue and the environment, are among the causes of loss of concentration. difficulty concentrating is experienced by various occupations, both students and workers. implementing meditation therapy in the process of measuring and recording eeg waves significantly helps reduce disturbances from the environment; in addition, the application of meditation therapy can also help improve concentration on the stimulus provided. the meditation therapy used here is hypnotherapy. the purpose of applying hypnotherapy is to obtain eeg alpha waves without removing the concentration of the study subjects. reference [14], in research related to the effectiveness of hypnotherapy, suggests that hypnotherapy is a conscious state in a person that involves focused attention characterized by an increased capacity to respond to suggestions. a statement regarding the state of hypnosis is also given by reference [15]: hypnosis is the use of fantasy and imagination in a state of consciousness that modifies the attention and concentration of the subjects involved, with new possibilities for self-control and thinking. this research work aims to distinguish signals from each individual based on the characteristics of the alpha signals from the brain waves produced. this research work explores human brain activity based on a cognitive perspective. brainwaves are explicitly measured to capture cognitive activity from the brain; therefore, appropriate stimulation is required. providing the right stimulus will produce the right brainwaves. the stimulus in this research work is derived from several previous studies [2][9][16][17].
here, nine kinds of cognitive tasks are applied. in this paper, eeg signals are obtained from participants by applying hypnotherapy first; the purpose is to get specific cognitive activity and focus from the brain. chebyshev filters and cross-correlation methods are applied to obtain brainwave features. each acquired feature is processed further by performing a matching process on each feature. euclidean distance is applied to show the similarity level. at the end of the research work, a performance evaluation is applied. based on figure 1, the alpha signal is a signal representation of very comfortable conditions for deepening meditation. in the data acquisition process, this condition is the best condition for the resource person, who is protected from sound disturbances and visual disturbances (focus disturbances). the resource person is expected to be able to focus on all the cognitive tasks given, to obtain optimal results from the signal classification process and reduce the existence of natural noise.

fig. 1. oscillation of brainwaves

hypnotherapy is applied in the data acquisition process to get specific alpha signals from the brain. in the classification process, neural networks are expected to be able to optimally acquire knowledge, generalize and extract from a particular data pattern, create a pattern of knowledge through self-organization, tolerate faults, and compute in parallel so that the process is shorter.

ii. materials and methods

a. related works
the application of single-channel eeg in research work [14] aims to detect mental states. there are two types of stimuli, each part consisting of several sentences and isolated words. there are six easy and six difficult parts. in addition to the parts, there are 20 right words, 20 false words, and ten non-words.
the classification tasks are capable of generating 31 % simple classification accuracy for adults, 35 % for adults and children together, and 24 % for children. research [16] applies a single-channel eeg to detect the level of attention. this research work focuses on alpha waves. the stimulus applied takes the form of several questions. interpretation of focus is seen from a combination of information: the time taken to answer each question and the number of correct and wrong answers. the results show that as many as 35 % were aware of errors in answering the questions. eeg analysis was also developed in the research work [17], which aims to establish a basic brainwave index (bbi). the stimulus used was a psychoanalysis test on 51 participants. the brainwaves observed focused on the beta waves. the analysis shows that psd provides a reliable bbi with 80 % conformity. research by [12] classified sleep phases based on time-domain features and structural graph similarity coupled with k-means clustering. the research work found that the 12-feature set produced a better performance of 95.93 % for all stages of sleep. the research work [18] reviews the application of eeg signals used for diagnosis, monitoring, and treatment in patients with epilepsy. the result of several multiparadigm approaches found that the processing of eeg signals using wavelets, linear dynamics, and chaos theory, as well as neural networks, is the most effective method for the diagnosis of eeg-based epilepsy.

b. brainwave data
the data acquisition process was carried out using the neurosky mindset eeg tool. this eeg tool uses a single eeg sensor (called a single electrode). electrode placement is based on the 10-20 system [19][20]. the placement of the electrode in this neurosky mindset eeg device, based on the 10-20 system, is at position fp1, which is on the frontal lobe.
in this study, four men and four women were included as eeg subjects, with eeg signals as objects. the data acquisition process was carried out at two different times. the retrieval process was carried out for 20 seconds. the sampling frequency used in data acquisition is 128 hz.

c. cognitive task
brain cognitive activity is based on several studies related to psychological perception. this cognitive activity aims to get a specific response from the cognitive activities of the brain (called cognitive tasks). there are nine cognitive tasks involved in the data acquisition process of this research: breath, color, face, fingers, mathematics, objects, password thinking, singing, and sports. these nine types of cognitive tasks are based on previous research [18]. table 1 presents a description of the cognitive tasks of the brain, along with detailed work instructions for each person.

d. feature extraction
in this study, feature extraction uses first-order statistical features based on the characteristics of the histogram. first-order feature extraction is better at presenting measurable parameters, including mean, skewness, standard deviation, kurtosis, and entropy. the first-order characteristic values then become the input values in the classification process. the eeg data obtained after feature extraction are grouped according to three categories: cognitive assignment, collection time, and subject. the mean represents the data distribution. the standard deviation represents the variation of the data. the skewness represents the rate of diffusion of asymmetric data. the kurtosis represents the high-low distribution of the data relative to the normal distribution and randomness. the entropy represents the size of the distribution of the data. the five statistical features can be calculated using (1) to (5).
mean = x̄ = (1/N) ∑ xᵢ (1)
standard deviation = σ = √((1/(N−1)) ∑ (xᵢ − x̄)²) (2)
skewness = s = ∑ (xᵢ − x̄)³ / (N σ³) (3)
kurtosis = k = ∑ (xᵢ − x̄)⁴ / (N σ⁴) (4)
entropy = H = E[−log₂ P(x)] = −∑ P(x) log₂ P(x) (5)

table 1. cognitive stimulus (task)
- breathing task (breath): in this task, the stimulus is focused on breathing. breathing is done using a measured time of 20 seconds while closing the eyes. subjects are not permitted to carry out any body movements.
- object counting color task (color): this task is given as a brain stimulus to remember colors. subjects are shown several colors in a certain order to be remembered, then subjects are asked to mention nonverbally the color sequence according to their memories. this stimulus is carried out quietly for 20 seconds.
- simulated movement finger (finger): this task is a stimulus task focused on the finger. without moving a finger, the subject is asked to imagine moving a finger. this stimulus is carried out with a measured tempo of 20 seconds by closing the eyes.
- simulated facial reconstruction (face): this task focuses on simulating a person's face known to the subject. the subject is asked to close his eyes and reconstruct the person's face. without body movements and sound, this stimulus is carried out for 20 seconds.
- simulated object reconstruction (object): this task focuses on the detailed simulation of object reconstruction. the subject is shown one object in a limited time. then, for 20 seconds, the subject is asked to close his eyes and reconstruct the object in detail without making a gesture or a sound.
- mathematical task (math): this task serves to stimulate the brain to do simple mathematical calculations. the calculation includes addition, subtraction, multiplication, and division. the subject is shown some mathematical calculation questions and given 20 seconds to answer without making a sound. wrong and correct answers are ignored in this task.
- simulated password recall task (passthought): this task focuses on the brain stimulus to remember passwords in the form of sentences consisting of a combination of letters and numbers. the subject is shown a line of passwords that must be remembered. the subject is asked to close their eyes and repeat the password without making a sound. this stimulus is carried out for 10 seconds.
- song recitation task (song): this task focuses on the brain stimulus in repeating song lyrics. this stimulus is carried out without movement and sound. by closing the eyes, the subject is asked to imagine repeating the preferred song lyrics in sequence for 20 seconds.
- simulated sport task (sport): this task focuses on simulating the preferred sports movement. this stimulus is carried out in silence and without movement. within 20 seconds, the subject is asked to imagine doing the preferred sports movement.

e. proposed method
the general procedure of the proposed method can be seen in figure 2. eeg data obtained from data acquisition using the eeg neurosky mindset tool are filtered first using a bandpass filter with a frequency range of 8 to 12 hz. neural network classification with the backpropagation algorithm is a supervised learning algorithm in which the learning process is carried out on training data. input data on the input neurons are used as training data, which are propagated forward to the output neurons as output data. each network connection is given a weight; if the output value does not match the expected value, the weights are corrected and the error is propagated back to the previous neuron layers. after the classification stage is complete, a matching phase is carried out by applying the euclidean distance, followed by the evaluation phase of the classification results.

iii. results and discussions
the eeg data that has been obtained from the acquisition is extracted to acquire the typical characteristics of the eeg signal.
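the five first-order features in (1)–(5) can be sketched in plain python. this is a minimal illustration over a raw signal; the histogram-based entropy assumes a simple fixed binning, which the paper does not specify:

```python
import math

def first_order_features(signal, bins=16):
    """mean, standard deviation, skewness, kurtosis, and histogram entropy."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / (n - 1)
    std = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in signal) / (n * std ** 3)
    kurt = sum((x - mean) ** 4 for x in signal) / (n * std ** 4)
    # histogram probabilities for entropy H = -sum p log2 p
    lo, hi = min(signal), max(signal)
    width = (hi - lo) / bins or 1.0  # guard against a constant signal
    counts = [0] * bins
    for x in signal:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    probs = [c / n for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return mean, std, skew, kurt, entropy
```

applied to each 20-second recording, the function returns the five-value vector that serves as input to the classifier.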
feature extraction is used based on first-order statistical features: average, standard deviation, skewness, kurtosis, and entropy. a total of 72 eeg data obtained in the primary collection are grouped by subject, cognitive task (stimulus), and retrieval time. the sources involved in this study were eight people, with two data collection sessions and nine cognitive tasks. the data is filtered first using the 8 to 12 hz frequency band; the eeg data is focused explicitly on alpha waves. figure 3 shows the filter design used for the eeg signal. furthermore, the data is extracted using first-order features to obtain 144 variables for each feature. table 2 shows the features produced after the extraction process is carried out. to get the same range of results, a normalization process is performed on the feature extraction results so that the range of values is -1 to 1. the feature extraction after normalization can be seen in table 3. the comparison histogram of feature extraction results before and after normalization can be seen in figure 4. the next step is to match the feature data using the normalized euclidean distance. the smaller the score, the more similar the two matched feature vectors are; conversely, the bigger the score, the more different the two feature vectors are. the normalized euclidean distance produces results in the range 0 ≤ d̄(u, v) ≤ 2. table 4 and table 5 show the results of matching signals before and after normalization. furthermore, table 4 shows that the scores are in the range 0 ≤ d̄(u, v) ≤ 2; therefore, it can be concluded that the matched vectors have no similarities between them.

fig. 2. general procedure of proposed method
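one common construction that yields the stated range 0 ≤ d̄(u, v) ≤ 2 is the euclidean distance between unit-normalized vectors; the paper does not give its exact formula, so this sketch is an assumption for illustration:

```python
import math

def normalized_euclidean(u, v):
    """euclidean distance between the unit-length versions of u and v.
    identical directions give 0; opposite directions give the maximum, 2."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    return math.dist([x / nu for x in u], [x / nv for x in v])

print(normalized_euclidean([1.0, 0.0], [2.0, 0.0]))   # 0.0
print(normalized_euclidean([1.0, 0.0], [-3.0, 0.0]))  # 2.0
```

normalizing before measuring distance keeps the score comparable across feature vectors of different magnitudes, which matches the bounded scores reported in table 5.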
pattern recognition tests are carried out using five input neurons (mean, standard deviation, skewness, kurtosis, and entropy), a first hidden layer of ten neurons, a second hidden layer of two neurons, and six output neurons. the network architecture for testing patterns using the multilayer perceptron neural network (nn) method can be seen in figure 5. after determining the hidden layers, the second step is to determine the test data and training data from all existing data. the third step is to determine the target pattern of the output. the final step is training using a learning rate of 0.1, a momentum of 0.95, and a training time of 10,000. the activation function used is the sigmoid function.

table 2. description of the elements
(columns: mean | standard deviation | skewness | kurtosis | entropy)
subject 1 min: -0.0332 | 7.1533 | -0.0018 | 3.3917 | 1.8554
subject 1 max: 0.0594 | 25.2712 | 0.0023 | 24.7364 | 2.4369
subject 2 min: -0.0372 | 8.7013 | -0.0039 | 3.0186 | 1.7237
subject 2 max: 0.0571 | 36.5630 | 0.0155 | 29.3101 | 2.3774
subject 3 min: -0.0422 | 9.5379 | -0.0024 | 3.1019 | 1.5620
subject 3 max: 0.0686 | 17.0422 | 0.0071 | 4.4353 | 1.8182
subject 4 min: -0.0487 | 8.8169 | -0.0023 | 3.0937 | 1.4881
subject 4 max: 0.0687 | 21.9282 | 0.0022 | 3.9735 | 1.9093
subject 5 min: -0.0497 | 8.5643 | -0.0023 | 3.0872 | 1.5507
subject 5 max: 0.0582 | 18.1952 | 0.0644 | 4.8329 | 2.3674
subject 6 min: -0.0382 | 7.4248 | -0.0021 | 3.2309 | 1.6165
subject 6 max: 0.0474 | 22.9615 | 0.0015 | 4.9105 | 2.4038
subject 7 min: -0.0666 | 10.6232 | -0.0013 | 2.8828 | 1.4272
subject 7 max: 0.0631 | 31.8212 | 0.0028 | 5.2983 | 1.9224
subject 8 min: -0.0388 | 10.2271 | -0.0024 | 3.0455 | 1.3790
subject 8 max: 0.0503 | 27.5562 | 0.0023 | 4.6097 | 2.2522

table 3.
feature extraction after normalization
(columns: mean | standard deviation | skewness | kurtosis | entropy)
subject 1 min: -0.3120 | 0.1496 | -0.3740 | 0.3176 | 0.3444
subject 1 max: 0.5005 | 0.4292 | 0.3380 | 0.6317 | 0.4253
subject 2 min: -0.3659 | 0.1834 | -0.7768 | 0.2838 | 0.3475
subject 2 max: 0.4817 | 0.6210 | 0.9822 | 0.7485 | 0.4150
subject 3 min: -0.3887 | 0.2498 | -0.4537 | 0.0886 | 0.2925
subject 3 max: 0.5320 | 0.4864 | 0.7211 | 0.4029 | 0.3450
subject 4 min: -0.4481 | 0.2925 | -0.4370 | 0.0790 | 0.2655
subject 4 max: 0.5167 | 0.5245 | 0.5631 | 0.3735 | 0.3682
subject 5 min: -0.4578 | 0.2002 | -0.4789 | 0.0855 | 0.3083
subject 5 max: 0.4859 | 0.3991 | 0.9941 | 0.4302 | 0.4014
subject 6 min: -0.3763 | 0.2299 | -0.5207 | 0.0848 | 0.3214
subject 6 max: 0.4354 | 0.5131 | 0.2410 | 0.4616 | 0.4285
subject 7 min: -0.6131 | 0.1804 | -0.2517 | 0.0736 | 0.2491
subject 7 max: 0.4666 | 0.6656 | 0.7176 | 0.4400 | 0.3879
subject 8 min: -0.3574 | 0.2751 | -0.5064 | 0.0822 | 0.2822
subject 8 max: 0.4393 | 0.5711 | 0.1349 | 0.3828 | 0.3818

the entire eeg data obtained was collected based on the subject or test, so that eeg data was obtained per subject. the results of the pattern recognition test can be seen in figure 6.

fig. 3. filter design for eeg signals
fig. 4. histogram of feature extraction; (a) raw data; (b) normalized data

table 4. signal matching using euclidean distance before normalization
(columns: signal1 | signal2 | signal3 | signal4 | signal5 | signal6)
signal2: 3087.2 | 0 | 0 | 0 | 0 | 0
signal3: 2885.1 | 3193.5 | 0 | 0 | 0 | 0
signal4: 3142 | 3350.6 | 3214.6 | 0 | 0 | 0
signal5: 2647 | 3068.8 | 2826.3 | 3064.1 | 0 | 0
signal6: 2854.8 | 3019.5 | 3022 | 3232.6 | 2855.7 | 0
signal7: 3292 | 3585.5 | 3424.7 | 3559.9 | 3301.7 | 3439.2
signal8: 3154.3 | 3434.7 | 3293.2 | 3437.3 | 3052.5 | 3259.2

table 5. signal matching using euclidean distance after normalization
(columns: signal1 | signal2 | signal3 | signal4 | signal5 | signal6)
signal2: 1 | 0 | 0 | 0 | 0 | 0
signal3: 1 | 1 | 0 | 0 | 0 | 0
signal4: 1 | 1 | 1 | 0 | 0 | 0
signal5: 1 | 1 | 1 | 1 | 0 | 0
signal6: 1 | 1 | 1 | 1 | 1 | 0
signal7: 1 | 1 | 1 | 1 | 1 | 1
signal8: 1 | 1 | 1 | 1 | 1 | 1
The pattern recognition training results in Figure 6 show that 171 iterations were completed successfully with an execution time of 6 seconds. The performance test results can be seen in Figure 7. Performance tests are performed using the mean squared error (MSE) function, which measures network performance as the average squared error. The performance obtained in the pattern test is 0.000994, against a test success threshold on MSE of 0.001 (1 × 10⁻³), reached after 171 iterations. Figure 7 shows the best validation performance of the neural network together with the training, validation, and test curves of the training process. The minimum gradient is set to 1 × 10⁻³ for the training process, and training reached the goal at 171 epochs. The regression plot of the training results can be seen in Figure 8, and the training status of the 171-epoch neural network is shown in Figure 9. The plot in Figure 8 shows how far the output deviates from the actual target; in the ideal case the output and the target are identical, and the deviation then lies on the dashed line shown in the plot. In general, the output will deviate from the target; this deviation is indicated by the blue line. The training state data in Figure 9 shows the changes in the gradient and learning rate of the neural network and the number of validation checks over the 171 epochs of training. The output of this training process shows the state of a successful neural network.

Fig. 5. Recognition of EEG data patterns using the neural network method
Fig. 6. The pattern recognition test results using a neural network
Fig. 7. Performance test curve of the neural network
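The training setup described above (five statistical features in, hidden layers of ten and two neurons, six outputs, sigmoid activation, learning rate 0.1, momentum 0.95, up to 10,000 epochs, stopping tolerance 10⁻³) can be approximated in scikit-learn. This is a sketch on random placeholder data, not the authors' toolchain; note that `MLPClassifier` minimizes log-loss rather than MSE, so the stopping tolerance is only analogous to the MSE goal quoted above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))      # five first-order features per sample (placeholder data)
y = rng.integers(0, 6, size=80)   # six output classes, matching the architecture above

# Backpropagation (SGD) with momentum and sigmoid (logistic) activation
clf = MLPClassifier(hidden_layer_sizes=(10, 2),   # ten then two hidden neurons
                    activation="logistic",
                    solver="sgd",
                    learning_rate_init=0.1,
                    momentum=0.95,
                    max_iter=10000,
                    tol=1e-3,
                    random_state=0)
clf.fit(X, y)
print(clf.n_iter_)  # iterations until the stopping criterion was met
```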
The accuracy test is calculated using the false acceptance rate (FAR) and false reject rate (FRR) methods, referring to sub-section 2.6 on testing accuracy. Calculations are carried out using (6) to (8); the accuracy test results are as follows:

FAR = 8/8 × 100% = 100%  (6)
FRR = 0/8 × 100% = 0%  (7)
GAR = (1 − FRR) × 100% = 100%  (8)

The accuracy test also calculates the minimum error using (9):

E = min(FAR + FRR) = min(1 + 0) = 1  (9)

Fig. 8. Regression plot results from the training data
Fig. 9. Training state data

IV. Conclusion
The results of this study describe the process of identifying alpha signals from brain waves based on the feature characteristics of each subject. The characteristics contained in the EEG signal of each individual can be identified specifically using the five first-order statistical features: mean, standard deviation, skewness, kurtosis, and entropy. The pattern recognition training, in the form of EEG signal classification using a neural network with the backpropagation algorithm, completed 171 successful iterations with an execution time of 6 seconds. Performance tests are performed using the MSE function; the performance obtained in the pattern test is 0.000994, against a test success threshold on MSE of 0.001 (1 × 10⁻³), after 171 iterations. Based on the evaluation results, the identification of alpha brain wave signals is fully successful, with a FAR value of 100%.

Declarations
Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research is supported by LPP Universitas Ahmad Dahlan research grant no. PDP-040/SP3/LPPM-UAD/VI/2018.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 / eISSN 2597-4637, Vol 6, No 2, October 2023, pp. 157–169, https://doi.org/10.17977/um018v6i22023p157-169
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

EEG Classification While Listening to Murottal Al-Quran and Classical Music Using Random Forest Method

Heni Sumarti a,1, Fahira Septiani a,2, Agus Sudarmanto a,3, Wahyu Caesarendra b,4, Rizki Edmi Edison c,d,e,5,*
a Department of Physics, Faculty of Science and Technology, Universitas Islam Negeri Walisongo Semarang, Jl. Prof. Dr. Hamka No. 3-5, Semarang 50185, Indonesia
b Faculty of Integrated Technologies, Universiti Brunei Darussalam, Jalan Tungku Link BE1410, Brunei Darussalam
c Neuroscience Institute, Universitas Prima Indonesia, Jl. Sampul No. 3, Medan 20118, Indonesia
d Institute of Leadership, Innovation, and Advancement, Universiti Brunei Darussalam, Jalan Tungku Link BE1410, Brunei Darussalam
e Research Center of Public Policy, National Research and Innovation Agency, Jl. M.H. Thamrin No. 8, Jakarta Pusat 10340, Indonesia
1 heni_sumarti@walisongo.ac.id; 2 fahira.septiani.fs@gmail.com; 3 agussudarmanto@walisongo.ac.id; 4 wahyu.caesarendra@ubd.edu.bn; 5 rizkiedmiedison@unprimdn.ac.id*
* corresponding author

I. Introduction
The anatomy of the human brain is divided into three main parts: the cerebrum, the cerebellum, and the brainstem. The largest part of the brain is the cerebrum, which consists of 200 million neurons that connect the left and right hemispheres. On the brain's surface is a layer of gray matter called the cerebral cortex. The cerebral cortex is heavily folded in humans, allowing for millions of additional neurons. This differs from the cerebral cortex in animals, which does not have folds this large.
This difference allows humans to read, speak, stretch, write poetry, sing, and do other things. In the frontal lobe, motor areas generate impulses for voluntary movements. In the frontal motor region is the premotor zone, which manages learned motor skills that require a series of movements. The prefrontal or orbital cortex is the part of the frontal lobe that lies just behind the eyes. Its functions include maintaining an appropriate emotional response to a situation, being aware that there are standards of behavior (a cardinal law or rule, or simple decency), respecting them, and predicting and planning for the future [1]. The transition from adolescence to adulthood is marked by an increase in high-level cognitive abilities and improvements in the structure and function of the parts of the brain that support them [2].

Article history: received 06 October 2023; revised 13 October 2023; accepted 18 October 2023; published online 19 October 2023
Abstract: This study aims to classify the brain activity of adolescents associated with two audio stimuli: murottal Al-Quran and classical music. The raw data were filtered using independent component analysis (ICA) followed by a band-pass filter in Python on Google Colab; extraction was processed with power spectral density (PSD), and the random forest method in Weka machine learning was used for classification. The research showed the same result for the two types of stimulation: the order of brain waves from highest to lowest was delta, alpha, theta, and beta. The average brain waves of teenagers given murottal Al-Quran stimulation were 45.32% delta, 31.60% alpha, 17.02% theta, and 6.05% beta. Meanwhile, the average brain waves of teenagers given classical music stimulation were 46.54% delta, 28.64% alpha, 19.21% theta, and 5.50% beta. The classification is obtained as the value that appears most frequently (the mode) among the prediction results for each sample using the random forest method. The accuracy, precision, and recall of classifying adolescent brain waves under murottal and classical music stimuli using the random forest method with the cross-validation technique (optimal at k-fold = 5) were 65.38%, 76.92%, and 70.00%, respectively. The results of this study show that stimulation using murottal Al-Quran and classical music effectively improves adolescents' relaxation condition. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).
Keywords: classification; brain wave; murottal Al-Quran; classical music; random forest

H. Sumarti et al. / Knowledge Engineering and Data Science 2023, 6 (2): 157–169

The electric currents that travel between neurons are called brain waves because they behave like cyclic waves. The relationship between the temporal frequency and the spatial distribution of synaptic activity predicts brain waves. Such a dispersion relationship, which essentially defines the more general phenomenon as a wave, is shown to limit the spatial-temporal dynamics of synaptic action, with many experimental electroencephalogram (EEG) consequences [3]. As described in Bataineh and Jarrah [4], EEG records physiological signals from the electrophysiological processes of brain electrical activity using electrodes placed on the scalp. The results of EEG studies show that there are four main types of brain waves: delta, theta, alpha, and beta. These studies also show that these brain waves correlate with states of mind [5]. EEG produces a signal that is a discrete-time (i.e., many-dimensional) multivariate time series.
The number of EEG channels determines the dimension of each point in the time series. Each time point corresponds to an EEG sample obtained at that instant, and the number of points in the time series depends on the recording time and the sampling rate. This raw signal is rarely used because it includes DC offset and drift, electromagnetic noise, and artifacts that must be filtered out [4].

Adolescent deviant behavior can affect brain and cognitive development, leading to cognitive impairment, which in turn becomes a vicious cycle that perpetuates the deviant behavior. These deviations can take the form of eating disorders [6], drug abuse [7], and cognitive negativity [8]. Brain stimulation approaches have been widely used to gain causal, mechanistic insights into the relevance of the brain's neurophysiological and functional systems to human cognitive function. Previous literature has reviewed transcutaneous vagus nerve stimulation (tVNS), a noninvasive brain stimulation technique based on vagus nerve stimulation; this stimulation can potentially enhance cognition, particularly certain memory functions [9]. Another stimulation often used for therapy in adolescents is classical music. Classical music has been shown to have positive effects in the fields of autism spectrum disorders and neonatal care [10]. In addition, traditional music therapy for 10 weeks reduces students' anxiety and aggression [11], and other research shows that listening to classical music for 60 days causes a significant decrease in participating students' anxiety levels and a significant improvement in their subjective well-being [12]. Furthermore, listening to the murottal Al-Quran can be an alternative therapy for adolescents.
Previous research has shown that listening to murottal Al-Quran can reduce the anxiety level of grade IX students facing exams at Junior High School Muhammadiyah 1 Kalirejo, Central Lampung [13]. Combining yoga and murottal Al-Qur'an has been shown to reduce the dysmenorrhea pain scale in adolescents by increasing beta-endorphin levels [14]. Moreover, combining classical music and murottal can reduce pain levels in breast cancer patients [15]. This study used murottal Al-Quran and classical music as stimuli because both have a positive effect on reducing the samples' anxiety and increasing cognitive ability. Research conducted by Norsiah and Amira [16] showed a significant increase in the scores obtained by participants before and after listening to the Chair Verse. The neurologist Majid [17] found a relationship between memorizing the Quran and increases in scientific thinking and discoveries; when memorizing the Quran, the temporal lobes are engaged in learning and remembering. Another study, by Abdurrochman [18], examined the influence of listening to classical music, relaxing music, and recitation of the Quran. The results showed that the subjects' brain waves were dominated by alpha waves when listening to classical and relaxing music, while delta waves dominated when listening to murottal Al-Quran. Similar research by Abdullah and Omar [19] examined brain waves while listening to the Quran and rock music; the results showed that listening to the recitation of the Quran produces alpha waves and helps individuals stay calm compared to listening to rock music. Research on identifying patterns in brain activity that correspond to certain stimuli, especially the relationship between sound stimuli and brain waves, is of particular concern. Previous research by Rahman et al. [20] identified the relationship between musical stimuli and brain waves and analyzed the effects of three different musical genres.
Statistical features are extracted from the signals for classification models based on k-nearest neighbor (kNN), support vector machine (SVM), and neural network (NN). That study showed that NN with genetic algorithm (GA) feature selection can achieve the highest accuracy of 97.5% in classifying the three music genres; the model also achieved 98.6% accuracy in classifying music based on the participants' subjective emotional ratings. Another study, by Sumarti et al. [21], compared several data classification methods (SVM, naive Bayes, multi-layer perceptron (MLP), multiclass classifier, and random forest) for differentiating malignant and benign cancer based on textural features with not-too-large differences in the data; it showed that random forest is the best method, with an accuracy of up to 100%. We therefore use the random forest method in this study. This study uses the MNE library with the independent component analysis (ICA) algorithm and a band-pass filter to filter the raw EEG signal data in order to obtain the real signal. This is based on previous research by Winkler et al. [22], which shows that ICA and band-pass filters can remove artifacts and significantly improve the SNR (signal-to-noise ratio) and accuracy. Data management uses Google Colab because it has proven efficient and effective in processing data quickly, and the necessary libraries are available without prior installation [23]. No previous study has used the random forest method to classify brain waves under the stimuli of listening to murottal Al-Quran and classical music. This study looks for brain activity patterns based on frequency (delta, theta, alpha, and beta) in adolescents associated with audio stimuli using murottal Al-Quran and classical music. Knowing this pattern can be used to determine the difference between stimulation using murottal Al-Quran and classical music.
It can also help determine the best method for voice therapy in adolescents.

II. Methods

A. Research instrument
This research uses electroencephalography (EEG), type Contec KT88, to measure the weak electromagnetic signals originating from nerve impulses in the brain. Data collection uses EEG with 16 channels plus 2 additional channels. Data is generated every 1 second with a wave amplitude of 7.5 mm/50 μV and a speed of 30 mm/s. These signal data are used to classify brain activity. Meanwhile, an audio recording is played to each sample as the stimulus: murottal Al-Quran (Al-Baqarah) or classical music (Mozart, Eine kleine Nachtmusik, K. 525: I. Allegro). The software used in this research is Python on Google Colab for extraction and Weka machine learning for data classification.

B. Research sample
The population in this study consisted of N = 100 science and technology students in 4 classes of 25 students each. The Slovin formula [24] determines the sample size for an error limit of e = 20% (small population), so the minimum number of samples in this study is n = N/(1 + N·e²) = 100/(1 + 100(0.2)²) = 20. Therefore, 26 volunteers who were final-year adolescents, both male and female, participated in this study. The research sample was divided into two groups: 13 people were given murottal Al-Quran stimulation and 13 received classical music. Informed consent was obtained from each volunteer according to ethical guidelines. The inclusion criteria were 5th- or 6th-semester students of UIN Walisongo Semarang aged 19-23 years who are Muslim, physically and mentally healthy, and not under the influence of drugs. Meanwhile, the exclusion criteria were students who had hearing impairments or were married. C.
Research design
This research was conducted using an experimental method, in which volunteers were measured directly using the EEG in the Integrated Laboratory, Faculty of Science and Technology, Universitas Islam Negeri Walisongo Semarang. The electrodes were placed according to the national standard 10-20 electrode placement, with the electrode direction using the ear reference, as shown in Figure 1. Electrode placement corresponds to the parts of the human cerebrum: frontal (F), parietal (P), occipital (O), and temporal (T). In this study, the data used were only electrodes at several frontal positions, namely Fp1-A1, Fp2-A2, F3-A1, F4-A2, F7-A1, and F8-A2, because frontal lobe function is associated with performance involving cognitive control functions such as suppressing habitual responses, maintaining attention, and managing distractions [25].

Fig. 1. Placement of electrodes with (a) the 10-20 system and (b) electrode direction using the ear reference

Data was collected by placing electrodes on the subject's scalp according to the 10-20 system in a quiet laboratory room. Subjects sat in a comfortable and relaxed position in the measuring cabin's shielded chair and were asked to close their eyes. Brain signals were then recorded using EEG. The recording procedure was performed for 6 minutes with the following protocol: 3 minutes without stimulation and 3 minutes with stimulation. The data processed further are the data for 1 minute in the stimulation condition. Brain signal data is stored as an EDF file and then separated using the EDF browser to obtain the brain signals for one minute.

D. Research procedure
The research procedure is presented in Figure 2.
Before taking measurements with the EEG, the population was filtered based on the inclusion criteria, resulting in a sample of 26 people divided into 2 groups. Next, each sample's brain waves were measured using EEG with murottal Al-Quran or classical music stimulation, with a procedure of 3 minutes in a quiet condition and 3 minutes in a stimulated condition. Brain signal data is exported in EDF form; the data processed further are the brain wave data for 1 minute, selected for the best signal (with no or little noise and artifacts) and separated using the EDF browser. Data processing uses Python on Google Colab for filtering and data extraction. In the final stage, the extracted data is classified using the random forest method with the Weka machine learning software and then analyzed.

E. Data processing flowchart
Data processing is shown in Figure 3. This stage consists of preprocessing and processing. The first stage, preprocessing, consists of filtering the raw data and extracting it using Python on Google Colab: raw EDF data is filtered using the independent component analysis (ICA) algorithm and a band-pass filter, and the data is extracted using the power spectral density (PSD) algorithm to convert it from the time domain to the frequency domain. The second stage, processing, consists of data labeling using MS Excel, followed by data classification using the random forest method in Weka machine learning. This classification consists of training and testing sets. The training set measures the ability of the random forest method to classify the data. The testing set uses a cross-validation technique with k-folds of 5, 10, 15, 20, and 25, and tests the ability of the random forest method to classify the data. The data processing steps are explained as follows.
• Preprocessing data
ICA aims to decompose measured signals, or variables, into a set of basic variables, and has been applied in two research fields relevant to cognitive science: biomedical data analysis and computational modeling. One of the earliest biomedical applications of ICA involved EEG data analysis, with ICA being used to recover signals associated with visual target detection. ICA has been used to recover the temporally independent components (TICs) associated with visual target detection; in this case, each electrode output is a temporal mixture. The signal recorded at each electrode is a mixture of TICs, and temporal ICA (tICA) is used to recover estimates of these temporally independent components [26].

A band-pass filter, often abbreviated BPF, is a frequency filter that passes signals in a certain frequency range, that is, signals between a lower limit frequency and an upper limit frequency. In other words, a band-pass filter rejects or attenuates frequency signals outside the specified range [27].

Power spectral density (PSD) is the most widely used orthogonal signal decomposition due to its computational efficiency and ease of interpretation. It implicitly assumes that the signal is stationary; nevertheless, PSD is often used with transient events of long duration relative to the spectral content. Transients are changes in the value of voltage or current, or both, either momentary or within a certain period (on the order of microseconds), from steady-state conditions. A weighting window function is required because the limited number of samples would cause leakage and distortion of the spectral components, and the estimate also has limited frequency resolution. Even so, this approach is the most widely used industrial tool for vibration analysis in rotating machines [28].

Fig. 2. Research procedure
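As a rough stand-in for the band-pass + PSD steps above, the filtering and conversion to per-band percentages can be sketched with SciPy alone (ICA is omitted here; the sampling rate, filter order, and band edges are illustrative assumptions, and the synthetic signal is not the study's data):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, welch

FS = 256  # sampling rate in Hz (assumed; not stated in the text)
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_pass(signal, low=0.5, high=35.0, fs=FS, order=4):
    """Zero-phase Butterworth band-pass keeping low-high Hz."""
    sos = butter(order, [low, high], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def band_percentages(signal, fs=FS):
    """Relative power per EEG band (percent) from a Welch PSD estimate."""
    freqs, psd = welch(signal, fs=fs, nperseg=2 * fs)
    keep = (freqs >= 0.5) & (freqs < 30)   # total power over the four bands
    total = psd[keep].sum()
    return {name: 100 * psd[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in BANDS.items()}

# Synthetic 60 s signal dominated by a 2 Hz (delta-band) component plus noise
t = np.arange(0, 60, 1 / FS)
eeg = np.sin(2 * np.pi * 2 * t) + 0.2 * np.random.default_rng(0).normal(size=t.size)
print(band_percentages(band_pass(eeg)))
```

For the dominant 2 Hz component, the delta percentage comes out largest, mirroring how the delta/theta/alpha/beta percentages in the results section are obtained from the PSD.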
Fig. 3. Flowchart of data processing

• Processing data
Random forest is a supervised learning algorithm. The "forest" it constructs is an ensemble of decision trees, usually trained using the bagging method. The general idea of the bagging method is that a combination of learning models improves the overall result. The prediction of a random forest consisting of N trees is formulated as in (1) [29]:

l(y) = argmax_c Σ_{n=1}^{N} I(h_n(y) = c)  (1)

where I is the indicator function and h_n is the n-th tree in the random forest. The random forest algorithm has three main parameters, which must be set before training: node size, number of trees, and number of sampled features. From there, random forest classifiers can solve regression or classification problems [30]. The random forest algorithm consists of a collection of decision trees, and each tree in the ensemble is built from a sample of the training set drawn with replacement, called a bootstrap sample. Of the training samples used, one-third is set aside as test data, known as the out-of-bag (OOB) sample. Another instance of randomness is then injected via feature bagging, adding more diversity to the dataset and reducing the correlation between decision trees. Predictions vary depending on the type of problem: the individual decision trees are averaged for a regression task, while for a classification task the majority vote, i.e., the most frequent categorical variable, gives the predicted class.
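Equation (1) is a majority vote over the tree predictions h_n(y). A minimal sketch of that aggregation step follows; the hand-written "trees" below are placeholders, not trained decision trees:

```python
from collections import Counter

def forest_predict(trees, y):
    """Majority vote over an ensemble, as in (1):
    l(y) = argmax_c of the count of n with h_n(y) = c."""
    votes = [h(y) for h in trees]
    return Counter(votes).most_common(1)[0][0]

# Three placeholder "trees" mapping a feature vector to a band label
trees = [
    lambda y: "alpha" if y[0] > 0.5 else "delta",
    lambda y: "alpha" if y[1] > 0.5 else "delta",
    lambda y: "delta",  # this tree always votes delta
]
print(forest_predict(trees, [0.9, 0.8]))  # two of the three votes are "alpha"
```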
Finally, the OOB sample is used for cross-validation, completing the prediction [30]. Random forest method coding is included in Weka machine learning as in Pseudocode 1 [31].

Pseudocode 1. Random forest method
1. Initialize matrix P
   for j = 1 to 10:
       for k = 1 to 1000:
           P[j, k] = 0.2 * random + 0.01
2. Modify matrix P
   for j = 1 to 10:
       for i = 1 to nint(400 * random):
           k = nint(1000 * random)
           P[j, k] = P[j, k] + 0.4 * random
3. Initialize matrix X and array Y
   for n = 1 to N:
       j = nint(10 * random)
       for m = 1 to 1000:
           if random < P[j, m]:
               X[m, n] = 1
           else:
               X[m, n] = 0
       Y[n] = j

F. Data analysis
The confusion matrix is used when solving classification problems; it can be applied to binary classification as well as to multiclass classification problems. Accuracy is misleading when used with unbalanced data sets, so other metrics based on the confusion matrix can be useful for evaluating performance. Precision and recall are widely used for classification. Precision indicates how accurately the model predicts positive values and is known as the positive predictive value. Recall measures the model's power to predict positive outcomes and is also known as model sensitivity. Both measures provide valuable information, but the aim is to increase recall without hurting precision. Precision and recall values can be calculated in Python. The formulas for accuracy, precision, and recall are given in (2) to (4) [32]:

Accuracy = (TP + TN) / (TP + FP + FN + TN) × 100%  (2)
Precision = TP / (TP + FP) × 100%  (3)
Recall = TP / (TP + FN) × 100%  (4)

where TP is the number of positive data predicted correctly, TN is the number of negative data predicted correctly, FP is the number of negative data predicted positive, and FN is the number of positive data predicted negative. III.
Result and discussion
The result after filtering using independent component analysis (ICA) shows reduced artifacts and noise; there are no more signals that intersect between channels. Brain signals below 0.5 Hz and above 35 Hz are automatically attenuated using the band-pass filter (the high-pass and low-pass cut-offs, respectively). The data generated in the extraction process take the form of percentages for each type of wave: delta, theta, alpha, and beta. Figure 4 shows the extraction results in percent, using the PSD algorithm, for the respondents given the murottal Al-Quran and classical music stimuli. Data in the time domain has been converted into the frequency domain and presented as percentages. The extraction results show that the frequency percentages from highest to lowest are delta, alpha, theta, and beta, respectively. However, the delta and alpha data show an interrelated pattern, with the delta and alpha wave lines crossing each other.

Fig. 4. Percentage of waves in the samples given the stimulus of listening to (a) murottal Al-Quran and (b) classical music

Figure 4 compares the average percentages of the respondents' brain wave frequencies when given murottal Al-Quran stimulation and classical music. In the data extraction process, for the samples given the murottal Al-Quran stimulus, the average brain wave activity was dominated by delta waves at 45.32%, followed by alpha waves at 31.50%, theta at 17.02%, and beta at 6.05%. These results agree with previous research by Abdurrochman et al.
[18], which shows that the auditory evoked potential (aep) responses to murottal al-quran are dominated by delta waves, so they can be used as a therapy for sleep disorders. besides that, it is in accordance with research conducted by norsiah and amira [16] and abdullah and omar [19] showing that listening to the recitation of murottal al-qur'an can help a person stay relaxed. this relaxed condition occurs when brain waves with a frequency of 0.5–7 hz (delta and theta) dominate. relaxed states predominate during deep sleep, coma, and anesthesia due to very low-frequency delta activity. theta rhythms are usually observed in drowsiness and a state of low alertness. a specific type of theta, referred to as "frontal midline theta", can be observed during various tasks such as mental computation, working memory, error processing, and meditation [33]. based on figure 4, in the samples given the classical music stimulus, the average brain wave activity was dominated by delta waves at 46.65%, followed by alpha waves at 28.64%, theta at 19.21%, and beta at 5.50%. previous research by maity et al. [34], investigating the response of the frontal brain to musical stimulation (simple acoustics), found a complex increase in alpha and theta waves using the multifractal spectral width. this research shows different results due to the different type of music used, namely simple acoustics, and a different data processing algorithm, empirical mode decomposition (emd); different stimuli can cause differences in respondents' brain responses, and different data processing can produce different results. figure 5 shows that delta and theta wave activity were greater with the classical music stimulus than with murottal al-quran, while alpha and beta wave activity were greater with murottal al-quran than with classical music.
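the band-percentage extraction described above can be sketched in python. this is only a minimal illustration, not the study's actual processing pipeline: it uses a naive dft rather than welch's psd estimate, and the sampling rate and band edges below are assumptions.

```python
import cmath
import math

FS = 128  # assumed sampling rate in Hz; the actual EEG device may differ

# EEG bands as used in the paper (Hz); exact edges are an assumption here
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_percentages(signal, fs=FS):
    """Return each band's share of total spectral power (%), via a naive DFT."""
    n = len(signal)
    half = n // 2  # positive frequencies only
    power = []
    for k in range(half):
        s = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
        power.append(abs(s) ** 2)
    freqs = [k * fs / n for k in range(half)]
    total = {name: 0.0 for name in BANDS}
    for f, p in zip(freqs, power):
        for name, (lo, hi) in BANDS.items():
            if lo <= f < hi:
                total[name] += p
    grand = sum(total.values()) or 1.0
    return {name: 100.0 * p / grand for name, p in total.items()}

# demo: a pure 10 Hz sine should fall almost entirely in the alpha band
demo = [math.sin(2 * math.pi * 10 * t / 128) for t in range(256)]
pct = band_percentages(demo)
```

in practice an fft-based psd estimator with windowing would replace the quadratic-time dft; the sketch only shows how per-band power is converted to the percentages plotted in figure 4.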
this increase in alpha wave activity is consistent with research conducted by abdurrochman et al. [18] and abdullah and omar [19] showing that listening to classical or relaxing music can generate alpha waves and improve cognitive abilities. the alpha rhythm is usually dominant in the rest-awake state, both relaxed and comfortable. beta rhythms are usually associated with cortical integrity, increased alertness, and cognitive processing [35]. beta waves occur mainly during waking, and increased beta strength can be caused by stress, strong emotions, and tension [33]. both sound stimuli, murottal al-quran and classical music, produced dominant delta brain waves; this could mean that the sample was sleepy and some even fell asleep, because the electrode installation process takes quite a long time [33].

fig. 5. the average brain wave activity when given a murottal al-quran stimulus and classical music

data classification used the random forest method in weka machine learning. the data were trained using the training set tool until an accuracy of 100% was obtained, meaning the data could be grouped well. next, the model was tested using cross-validation techniques. this involves holding out a specific sample from the data set on which the model is not trained. the larger the k-folds, the smaller the resampling subsets. the number of k-folds also determines how often the machine learning model is trained [36]. this study tested folds of 5, 10, 15, 20, and 25, as shown in table 1. the best results were obtained at k = 5 folds, i.e., experiments with 5 stages. this is comparable to the research conducted by furqon et al.
[37], who used the modified k-nearest neighbor (mknn) and tested it with folds k = 1 to k = 10; the best result lay at k = 7. this differs from previous research conducted by tapikap et al. [38], which used the transformed complement naïve bayes (tcnb) method tested with folds of 2 to 10; the highest accuracy there was at 10 folds. the difference between previous research and this research is due to the difference in the classification method used.

table 1. confusion matrix
no.  folds     tp (data)  fp (data)  fn (data)  tn (data)  accuracy (%)  precision (%)  recall (%)
1    training  13         0          0          13         100.00        100.00         100.00
2    5         10         3          6          7          65.38         76.92          70.00
3    10        8          5          6          7          57.69         61.54          58.33
4    15        9          4          7          6          57.69         69.23          60.00
5    20        7          6          6          7          53.85         53.85          53.85
6    25        8          5          7          6          53.85         61.54          54.55

in this research, the true positive (tp) value, murottal al-quran detected as murottal al-quran, was 8; the true negative (tn) value, classical music detected as classical music, was 9; the false positive (fp) value, classical music detected as murottal al-quran, was 4; and the false negative (fn) value, murottal al-quran detected as classical music, was 5. the fp and fn values were quite large. this could be due to the non-homogeneous sample conditions; in the research conducted by ul-ain irfan et al. [39], listening to murottal al-quran and classical music both had a positive effect on reducing blood pressure and anxiety levels of patients. the accuracy value obtained in this study was 65.38%. in comparison, previous research by rahman et al. [20] identified the relationship between musical stimuli and brain waves and analyzed the effects of 3 different musical genres based on knn, svm, and nn; the results show that nn with genetic algorithm (ga) feature selection can achieve the highest accuracy of 97.5% in classifying the 3 music genres. the results of this study, which used the random forest method to classify waves based on frequency, were less favorable.
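the metrics in equations (2) to (4) can be computed in python directly from the confusion-matrix counts. the sketch below uses the standard definitions (precision as positive predictive value, recall as sensitivity) together with the counts reported above (tp = 8, tn = 9, fp = 4, fn = 5), which reproduce the 65.38% accuracy; it is an illustration, not the authors' code.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision (positive predictive value) and recall
    (sensitivity) from confusion-matrix counts, in percent."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# counts reported in the paper for the k = 5 evaluation:
# murottal detected as murottal (TP) = 8, classical as classical (TN) = 9,
# classical as murottal (FP) = 4, murottal as classical (FN) = 5
acc, prec, rec = classification_metrics(tp=8, tn=9, fp=4, fn=5)
```

note that with these standard definitions the precision and recall values differ somewhat from those tabulated, which follow the formula variants printed in the paper; only the accuracy is convention-independent.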
low accuracy can be caused by data imbalance, so that the classifier cannot predict the data correctly [40]. the findings in this research include that adolescent brain signals have the same frequency pattern from highest to lowest, namely delta, alpha, theta, and beta, showing that these two stimulations make teenagers more relaxed and reduce concentration. meanwhile, alpha and beta brain waves were higher when stimulated by murottal al-quran, while delta and theta were higher when stimulated by classical music. this shows that al-quran stimulation predominantly increases relaxation and concentration, while music stimulation predominantly increases relaxation and short-term memory. the data classification results using random forest with k-fold 5 show an accuracy of 65.38%, meaning that adolescent brain waves under murottal al-quran and classical music stimulation are difficult to differentiate. this is due to the similarities in adolescent brain patterns when both types of stimulation are heard, and differs from classification tasks for distinguishing a person's emotional and cognitive condition [41][42]. the results of this study show that stimulation using murottal al-quran and classical music effectively improves adolescent relaxation. murottal al-quran can improve concentration, while classical music can improve short-term memory. this can be used in health care for therapy in adolescents who experience anxiety disorders, for increasing concentration in learning, and for improving short-term memory. however, this research is limited to manual data processing and uses several different types of software. in future research, data processing can be done automatically using a graphical user interface (gui), making it easier for health workers.

iv. conclusions
research has been carried out to determine the response of adolescents given murottal al-quran and classical music stimulation, with feature extraction using the psd method. the results showed the same ordering for the two types of stimulation, namely brain waves from highest to lowest were delta, alpha, theta, and beta. the average brain waves of teenagers given murottal al-quran stimulation were 45.32% delta, 31.60% alpha, 17.02% theta, and 6.05% beta. meanwhile, the average brain waves of teenagers given classical music stimulation were 46.65% delta, 28.64% alpha, 19.21% theta, and 5.50% beta. classification takes the value that appears most frequently (the mode) among the prediction results for each sample using the random forest method. the accuracy, precision, and recall of classifying adolescent brain waves given murottal and classical music stimuli using the random forest method with the cross-validation technique (optimum at k-fold = 5) were 65.38%, 76.92%, and 70.00%, respectively. the results of this study show that stimulation using murottal al-quran and classical music effectively improves adolescent relaxation. murottal al-quran can improve concentration, while classical music can improve short-term memory. this can be used in health care for therapy in adolescents who experience anxiety disorders, increasing concentration in learning, and improving short-term memory. however, this research is limited to manual data processing and uses several different types of software. in future research, data processing can be done automatically using a gui, making it easier for health workers.

acknowledgment
we thank lst (laboratorium saintek terpadu) for the eeg instrumentation that has been provided.

declarations
author contribution
all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement
this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest
the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information
reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher's note: department of electrical engineering and informatics, universitas negeri malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

references
[1] h. sumarti, t. nurmar'atin, h. h. kusuma, i. istikomah, and i. s. prastyo, "development of chobmons prototype: cholesterol and blood sugar level monitoring system based on internet of things (iot) using blynk application," j. fis. dan apl., vol. 18, no. 3, p. 53, sep. 2022.
[2] b. larsen and b. luna, "adolescence as a neurobiological critical period for the development of higher-order cognition," neuroscience and biobehavioral reviews, vol. 94, pp. 179–195, 2018.
[3] p. l. nunez and r. srinivasan, "a theoretical basis for standing and traveling brain waves measured with human eeg with implications for an integrated consciousness," clinical neurophysiology, vol. 117, no. 11, pp. 2424–2435, 2006.
[4] a. al bataineh and a. jarrah, "high performance implementation of neural networks learning using swarm optimization algorithms for eeg classification based on brain wave data," vol. 13, no. 1, pp. 1–17, 2019.
[5] m. e. mccullough and e. c. carter, "religion, self-control, and self-regulation: how and why are they related?," apa handbook of psychology, religion, and spirituality (vol 1): context, theory, and research, pp. 123–138, 2013.
[6] g. olivo, s. gaudio, and h. b. schiöth, "brain and cognitive development in adolescents with anorexia nervosa: a systematic review of fmri studies," nutrients, vol. 11, no. 8, 2019.
[7] p. prini et al., "adolescent thc exposure in female rats leads to cognitive deficits through a mechanism involving chromatin modifications in the prefrontal cortex," j. psychiatry neurosci., vol. 43, no. 2, pp. 87–101, 2018.
[8] p. muris, h. otgaar, c. meesters, a. heutz, and m. van den hombergh, "self-compassion and adolescents' positive and negative cognitive reactions to daily life problems," j. child fam. stud., vol. 28, no. 5, pp. 1433–1444, 2019.
[9] l. colzato and c. beste, "a literature review on the neurophysiological underpinnings and cognitive effects of transcutaneous vagus nerve stimulation: challenges and future directions," j. neurophysiol., vol. 123, no. 5, pp. 1739–1755, 2020.
[10] t. stegemann, m. geretsegger, e. phan quoc, h. riedl, and m. smetana, "music therapy and other music-based interventions in pediatric health care: an overview," medicines, vol. 6, no. 1, p. 25, 2019.
[11] m. vaezi, n. shafiee, and a. h. mehdizadeh, "the effectiveness of fusion music on anxiety and aggression in adolescent," pp. 78–82, 2021.
[12] d. e. osmanoğlu and h. yilmaz, "the effect of classical music on anxiety and well-being of university students," int. educ. stud., vol. 12, no. 11, p.
18, 2019.
[13] d. oktarosada, m. masrur, e. yunitasari, and h. mukhlis, "the effect of murottal al-quran on anxiety levels toward ix class students in facing examination at the junior high school of muhammadiyah 1 kalirejo central lampung," j. aisyah j. ilmu kesehat., vol. 7, no. s1, pp. 113–116, 2022.
[14] s. sumarni, s. d. h. atifah, t. ta'adi, and e. r. ambarwati, "does yoga-murottal reduce dysmenorrhea pain and improve beta-endorphin hormone levels in adolescents?," open access maced. j. med. sci., vol. 10, no. t8, pp. 54–57, 2022.
[15] m. e. y. fujianti, h. kristianto, and l. yuliatun, "combination of music therapy and murottal therapy on pain level of breast cancer patients," j. aisyah j. ilmu kesehat., vol. 8, no. 1, pp. 405–414, 2023.
[16] f. norsiah and a. nurul amira, "the effects of neurotherapy (nft) using ayatul kursi as stimulus on memory performance," j. islam. soc. econ. dev., vol. 2, no. 4, pp. 22–31, 2017.
[17] a. majid. s, "islamic post brain research: quranic memorization key to muslim scientific," 2013.
[18] r. a. abdurrochman, "the comparison of classical music, relaxation music and," the 2007 regional symposium on biophysics and medical physics, 2007.
[19] a. a. abdullah and z. omar, "the effect of temporal eeg signals while listening to quran recitation," int. j. adv. sci. eng. inf. technol., vol. 1, no. 4, p. 372, 2011.
[20] j. s. rahman, t. gedeon, s. caldwell, and r. jones, "brain melody informatics: analysing effects of music on brainwave patterns," in 2020 international joint conference on neural networks (ijcnn), jul. 2020, pp. 1–8.
[21] h. sumarti, s. r. anggita, f. r. pratama, and a. n. tasyakuranti, "texture-based classification of benign and malignant mammography images using weka machine learning: an optimal approach," evergr. j., vol. 10, no. 03, 2023.
[22] i. winkler, s. debener, k. r. muller, and m. tangermann, "on the influence of high-pass filtering on ica-based artifact reduction in eeg-erp," proc. annu. int. conf. ieee eng.
med. biol. soc. embs, vol. 2015, pp. 4101–4105, 2015.
[23] g. y. orhan, "4 reasons why you should use google colab for your next project," towar. data sci., pp. 1–9, 2020.
[24] j. j. tejada, j. raymond, and b. punzalan, "on the misuse of slovin's formula," philipp. stat., vol. 61, no. 1, p. 8, 2012.
[25] m. andersson, m. ystad, a. lundervold, and a. j. lundervold, "correlations between measures of executive attention and cortical thickness of left posterior middle frontal gyrus: a dichotic listening study," behav. brain funct., vol. 5, no. 1, p. 41, 2009.
[26] e. palmer, "africa: an introduction," africa an introd., vol. 6, no. 2, pp. 1–308, 2021.
[27] d. kho, "pengertian band pass filter (bpf) atau tapis lolos antara," 2021.
[28] g. louppe, "understanding random forests: from theory to practice," arxiv, 2014.
[29] c. ferreira lemos lima, f. m. assis, and c. p. de souza, "a comparative study of use of shannon, rényi and tsallis entropy for attribute selecting in network intrusion detection," in 2011 ieee international workshop on measurements and networking proceedings (m&n), oct. 2011, pp. 77–82.
[30] m. onesmus, "introduction to random forest in machine learning," section.io, pp. 1–17, 2020.
[31] z. jin, j. shang, q. zhu, c. ling, w. xie, and b. qiang, "rfrsf: employee turnover prediction based on random forests and survival analysis," lect. notes comput. sci., vol. 12343 lncs, pp. 503–515, 2020.
[32] a. kulkarni, d. chong, and f. a. batarseh, foundations of data imbalance and solutions for a data democracy. elsevier inc., 2020.
[33] d. sammler, m. grigutsch, t. fritz, and s. koelsch, "music and emotion: electrophysiological correlates of the processing of pleasant and unpleasant music," psychophysiology, vol. 44, no. 2, pp. 293–304, mar. 2007.
[34] a. k.
maity et al., "multifractal detrended fluctuation analysis of alpha and theta eeg rhythms with musical stimuli," chaos, solitons and fractals, vol. 81, pp. 52–67, 2015.
[35] p. l. nunez and n. orleans, "a window on the mind," vol. 2, pp. 1–12, 2002.
[36] d. kici, g. malik, m. cevik, d. parikh, and a. başar, "a bert-based transfer learning approach to text classification on software requirements specifications," proc. can. conf. artif. intell., pp. 1–13, 2021.
[37] m. t. furqon, indriati, and a. hutapea, "penerapan algoritma modified k-nearest neighbour pada pengklasifikasian penyakit kejiwaan skizofrenia," j. pengemb. teknol. inf. dan ilmu komput., vol. 2, no. 10, pp. 3957–3961, 2018.
[38] h. f. tapikap, b. s. djahi, and t. widiastuti, "klasifikasi spam e-mail menggunakan metode transformed complement naïve bayes (tcnb)," j-icon (jurnal komput. inform.), vol. 7, no. 1, pp. 21–26, 2019.
[39] n. ul-ain irfan, h. atique, a. taufiq, and a. irfan, "differences in brain waves and blood pressure by listening to quran-e-kareem and music," j. islam. med. dent. coll., vol. 8, no. 1, pp. 40–44, 2019.
[40] k. hastuti, "analisis komparasi algoritma klasifikasi data mining untuk prediksi mahasiswa non aktif," semin. nas. teknol. inf. komun. terap. 2012, vol. 14, no. 1, pp. 241–249, 2012.
[41] a. azhari, a. susanto, a. pranolo, and y. mao, "neural network classification of brainwave alpha signals in cognitive activities," knowl. eng. data sci., vol. 2, no. 2, p. 47, 2019.
[42] a. y. saleh and l. k. xian, "stress classification using deep learning with 1d convolutional neural networks," knowl. eng. data sci., vol. 4, no. 2, p. 145, 2021.
knowledge engineering and data science (keds) pissn 2597-4602 vol 2, no 1, june 2019, pp. 19–30 eissn 2597-4637 https://doi.org/10.17977/um018v2i12019p19-30 ©2019 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)

selection of marine security policy using fuzzy-ahp topsis hybrid approach
hozairi a, 1, *, buhari a, 2, heru lumaksono b, 3, marcus tukan c, 4
a informatics eng. study program, islamic university of madura, jl. pp. miftahul ulum bettet, pamekasan 6931, indonesia
b department of shipbuilding engineering, shipbuilding institute of polytechnic surabaya, jl. teknik kimia its, surabaya 60111, indonesia
c faculty of engineering, university of pattimura, jl. ir m putuhena, ambon 97233, indonesia
1 dr.hozairi@gmail.com*; 2 buharinahrawi@gmail.com; 3 heruppns@gmail.com; 4 marcustucan@gmail.com
* corresponding author

i. introduction
indonesia is an archipelago with an ocean area exceeding its mainland. geographically, indonesia is located between two continents and two oceans and has a large wealth of natural resources. as an archipelagic country, indonesia should also be called a maritime country. however, addressing indonesia as a maritime country seems inappropriate because the development between the land and the sea is not balanced [1][2]. for that reason, since 2014, indonesia has focused on organising maritime affairs for the nation's prosperity [3].
the indonesian government, through presidential regulation no. 178 of 2014, established the marine security agency (bakamla), previously named the marine security coordination agency (bakorkamla) [4]. as a consequence, the indonesian government is required to choose the right indonesian maritime security policy so as to be able to realize the ideals of indonesia as the world maritime axis [5][6]. choosing an indonesian maritime security policy is not easy because it has to consider many criteria. therefore, a decision support system (dss) [7] is needed to recommend the most suitable maritime policies [8]. this study aims to select the indonesian maritime security policy by considering many criteria for each decision alternative. the process of selecting indonesia's marine security policy is not easy because it involves complex problems and cannot be solved by linear programming methods. this is a multi-criteria problem and requires a dss approach, namely multi criteria decision making (mcdm), to solve it. mcdm determines the best of many alternatives based on specific criteria. criteria, usually available in the form of measurements, rules, or standards, are used in the decision-making process [9][10][11].

article info
article history: received 3 december 2019; revised 24 january 2019; accepted 22 april 2019; published online 23 june 2019

abstract
the research was focused on the integration of fuzzy set theory with the analytic hierarchy process (ahp) and the technique for order preference by similarity to ideal solution (topsis) to choose the optimum maritime security policy to achieve indonesia's recognition as the world's maritime axis. the method used is ahp with fuzzy-based enhancement. here, the weight of each criterion is calculated to overcome the criticism of the unbalanced rating scale, uncertainty, and inaccuracy in the pairwise comparison process.
the best recommendation for indonesian maritime policies is multi task single agency, which is greatly influenced by several factors such as technology, regulations, infrastructure, economy, politics, and socio-culture. the finding shows that the hybrid approach is able to produce the best recommendation for indonesian maritime security policy. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).
keywords: marine security policy; global maritime axis; fuzzy-ahp; topsis

20 hozairi et al. / knowledge engineering and data science 2019, 2 (1): 19–30

another major advantage of mcdm techniques is their ability to analyze quantitative and qualitative criteria simultaneously. many techniques and methodologies are reported in the literature. some popular approaches are the analytical hierarchy process (ahp) [12], analytic network process (anp), technique for order preference by similarity to ideal solution (topsis) [13], elimination and choice translating reality (electre) [14][15], preference ranking organization method for enrichment evaluation (promethee) [16], decision making trial and evaluation laboratory (dematel), and višekriterijumska optimizacija i kompromisno rešenje (vikor). each technique has its own strengths and weaknesses. therefore, a hybrid technique could be a solution for improving the performance of these stand-alone approaches. in this study, an integrated model of fuzzy-ahp and topsis was established to provide a phased methodology for selecting indonesia's maritime security policy in accordance with presidential regulation no. 178 of 2014. this model was then applied in case studies to demonstrate its application in real-world pilot studies and prove its reliability. fuzzy-ahp has a good ability to resolve uncertainties and ambiguities in various mcdm situations [13][17][18][19].
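as a toy illustration of the mcdm idea, ranking alternatives by weighted criterion scores can be sketched in python with a simple weighted sum. the alternatives, criteria, and weights below are hypothetical; the paper's actual pipeline combines fuzzy-ahp weighting with topsis ranking, detailed in section ii.

```python
def weighted_sum_rank(alternatives, weights):
    """Score each alternative as the weighted sum of its (already
    normalised) criterion values, then rank from best to worst."""
    scores = {name: sum(w * v for w, v in zip(weights, values))
              for name, values in alternatives.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# hypothetical normalised scores on three criteria
# (e.g. technology readiness, regulatory fit, cost-effectiveness)
alts = {
    "policy_a": [0.6, 0.8, 0.5],
    "policy_b": [0.9, 0.4, 0.7],
}
ranking = weighted_sum_rank(alts, weights=[0.5, 0.3, 0.2])
```

the weighted-sum model is the simplest madm technique; fuzzy-ahp replaces the fixed weights with weights derived from pairwise comparisons, and topsis replaces the additive score with distances to ideal solutions.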
on the other hand, topsis is efficient in handling reasonable attributes, and there is no limit to the number of criteria, sub-criteria, or alternatives [20][21][22]. thus, fuzzy-ahp and topsis integration should provide a good basis for the analysis of complex decision problems [23][24][25][26]. the easily programmed hybrid technique is expected to overcome the complex problem of maritime decision making.

ii. methods
this study aims to analyse the indonesian marine security model by considering many criteria for each decision alternative. based on the purpose, mcdm can be divided into 2 (two) models: multi attribute decision making (madm) and multi objective decision making (modm). madm and modm are often used to solve multi-attribute and multi-objective problems. the mcdm methods developed in this research are fuzzy ahp and topsis. in general, the stages of this research process can be seen in figure 1.

a. fuzzy ahp
the use of ahp in the multi criteria decision making (mcdm) problem is often criticized in light of the inadequacy of the ahp approach in overcoming the uncertainty experienced by the decision maker when it must provide a definite value in a pairwise comparison matrix. therefore, to overcome this weakness of ahp, a method called fuzzy ahp was developed. the fuzzy ahp method is a combination of the ahp method with a fuzzy approach.

fig. 1. the framework of fuzzy based ahp-topsis multi criteria decision analysis

the fuzzy ahp method uses the triangular fuzzy number (tfn). tfn is used to describe linguistic variables with certainty. a tfn is symbolized by $M = (l, m, u)$ where $l$ is the lowest value, $m$ is the middle value, and $u$ is the upper value. the ahp and tfn ratings used for the pairwise comparison matrix are shown in table 1.
if we suppose there are 2 (two) triangular fuzzy numbers (tfn), $M_1 = (l_1, m_1, u_1)$ and $M_2 = (l_2, m_2, u_2)$, then the tfn arithmetic operations are:

$M_1 \oplus M_2 = (l_1 + l_2,\; m_1 + m_2,\; u_1 + u_2)$ (1)

$M_1 \otimes M_2 = (l_1 l_2,\; m_1 m_2,\; u_1 u_2)$ (2)

$M^{-1} = (1/u,\; 1/m,\; 1/l)$ (3)

the fuzzy analytic hierarchy process (fahp) stages are: defining the fuzzy synthetic extent value, the confidence level, the level of probability for convex fuzzy numbers, and the normalization of the weight vector.

1) fuzzy synthetic extent value
the fuzzy synthetic extent value for the i-th object is defined as in the following equation:

$S_i = \sum_{j=1}^{m} M_{g_i}^{j} \otimes \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} \right]^{-1}$ (4)

to get $\sum_{j=1}^{m} M_{g_i}^{j}$, a fuzzy addition operation is performed over the values in a row of the pairwise comparison matrix, as can be seen in the following equation:

$\sum_{j=1}^{m} M_{g_i}^{j} = \left( \sum_{j=1}^{m} l_j,\; \sum_{j=1}^{m} m_j,\; \sum_{j=1}^{m} u_j \right)$ (5)

to obtain

$\sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j}$ (6)

the row sums are then added over all $i$, as can be seen in equation (7):

$\sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} = \left( \sum_{i=1}^{n} l_i,\; \sum_{i=1}^{n} m_i,\; \sum_{i=1}^{n} u_i \right)$ (7)

then, the inverse of equation (7) is obtained using the tfn arithmetic operation in equation (3), which results in equation (8):

$\left[ \sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} \right]^{-1} = \left( \dfrac{1}{\sum_{i=1}^{n} u_i},\; \dfrac{1}{\sum_{i=1}^{n} m_i},\; \dfrac{1}{\sum_{i=1}^{n} l_i} \right)$ (8)

2) confidence level
if there are two fuzzy numbers $M_1 = (l_1, m_1, u_1)$ and $M_2 = (l_2, m_2, u_2)$, then the confidence level of $M_1 \geq M_2$ can be defined as follows:

$V(M_1 \geq M_2) = \sup_{x \geq y} \left[ \min\left( \mu_{M_2}(y),\; \mu_{M_1}(x) \right) \right]$ (9)

if $M_1$ and $M_2$ are convex fuzzy numbers, the following conditions are obtained:

$V(M_1 \geq M_2) = 1 \;\text{ iff }\; m_1 \geq m_2$ (10)

$V(M_2 \geq M_1) = \mathrm{hgt}(M_1 \cap M_2) = \mu_{M_1}(d)$ (11)

table 1. triangular fuzzy scale of preference
linguistic term    ahp scale    triangular fuzzy number (tfn)
absolute           9            (7, 9, 9)
very strong        7            (5, 7, 9)
fairly strong      5            (3, 5, 7)
weak               3            (1, 3, 5)
equal              1            (1, 1, 3)
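the extent-analysis computation above, together with the possibility degree and weight normalization given in equations (12) to (16), can be sketched in pure python. this is an illustrative implementation of the method as described here, not the authors' code, and the function names are our own.

```python
def tfn_add(a, b):
    """(l, m, u) addition, per equation (1)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def tfn_inverse(a):
    """(l, m, u)^-1 = (1/u, 1/m, 1/l), per equation (3)."""
    return (1.0 / a[2], 1.0 / a[1], 1.0 / a[0])

def synthetic_extents(matrix):
    """Fuzzy synthetic extent S_i of each row of a pairwise TFN matrix."""
    row_sums = []
    for row in matrix:
        s = (0.0, 0.0, 0.0)
        for m in row:
            s = tfn_add(s, m)
        row_sums.append(s)
    grand = (0.0, 0.0, 0.0)
    for s in row_sums:
        grand = tfn_add(grand, s)
    inv = tfn_inverse(grand)
    # component-wise TFN multiplication of each row sum with the inverse
    return [(s[0] * inv[0], s[1] * inv[1], s[2] * inv[2]) for s in row_sums]

def degree_of_possibility(m1, m2):
    """V(M1 >= M2) for two TFNs, per the piecewise formula (12)."""
    l1, mid1, u1 = m1
    l2, mid2, u2 = m2
    if mid1 >= mid2:
        return 1.0
    if l2 >= u1:
        return 0.0
    return (l2 - u1) / ((mid1 - u1) - (mid2 - l2))

def fahp_weights(matrix):
    """Normalised weight vector, per equations (14) to (16)."""
    s = synthetic_extents(matrix)
    d = [min(degree_of_possibility(s[i], s[k])
             for k in range(len(s)) if k != i) for i in range(len(s))]
    total = sum(d)  # assumed nonzero for a meaningful comparison matrix
    return [x / total for x in d]
```

for a consistent "all criteria equal" matrix of (1, 1, 1) entries, the resulting weights are uniform, which is a quick sanity check for the implementation.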
The confidence level of the fuzzy numbers can therefore be obtained by:

$V(M_2 \geq M_1) = \begin{cases} 1, & \text{if } m_2 \geq m_1 \\ 0, & \text{if } l_1 \geq u_2 \\ \dfrac{l_1 - u_2}{(m_2 - u_2) - (m_1 - l_1)}, & \text{otherwise} \end{cases}$ (12)

The comparison of two fuzzy numbers is illustrated in Figure 2, which shows that $d$ is the abscissa of the highest intersection point between $\mu_{M_1}$ and $\mu_{M_2}$. To compare $M_1 = (l_1, m_1, u_1)$ and $M_2 = (l_2, m_2, u_2)$, both values $V(M_1 \geq M_2)$ and $V(M_2 \geq M_1)$ are required.

3) Level of possibility

The level of possibility for a convex fuzzy number to be greater than $k$ convex fuzzy numbers $M_i$ ($i = 1, 2, \ldots, k$) can be defined as follows:

$V(M \geq M_1, M_2, \ldots, M_k) = V[(M \geq M_1) \text{ and } (M \geq M_2) \text{ and } \ldots \text{ and } (M \geq M_k)] = \min V(M \geq M_i),\ i = 1, 2, \ldots, k$ (13)

It is assumed that:

$d'(A_i) = \min V(S_i \geq S_k)$ (14)

for $k = 1, 2, \ldots, n$; $k \neq i$. The weight vector is then defined as:

$W' = \left( d'(A_1), d'(A_2), \ldots, d'(A_n) \right)^T$ (15)

4) Normalization of the weight vector

The normalized form of the weight vector in equation (15) becomes:

$W = \left( d(A_1), d(A_2), \ldots, d(A_n) \right)^T$ (16)

where $W$ is no longer a fuzzy number.

B. TOPSIS

TOPSIS is one of the main MCDM techniques. The approach selects as best the alternative with the shortest distance from the positive ideal solution (PIS) and the farthest distance from the negative ideal solution (NIS). It has been widely applied in many research fields involving selection among alternatives and risk analysis because of its rationality, logic, and simplicity. Solving a multi-criteria problem with the TOPSIS method requires the following steps.

Step 1. Normalise the decision matrix:

$r_{ij} = \dfrac{x_{ij}}{\sqrt{\sum_{i=1}^{n} x_{ij}^2}},\quad j = 1, 2, \ldots, J;\ i = 1, 2, \ldots, n$ (17)

Fig. 2. The intersection between $M_1$ and $M_2$

Step 2. Apply weights to the normalized decision matrix by multiplying it with the criteria weights:

$v_{ij} = w_j \cdot r_{ij},\quad j = 1, 2, \ldots, J;\ i = 1, 2, \ldots, n$ (18)
Step 3. Determine both the PIS (maximum values) and the NIS (minimum values):

$A^+ = \{ v_1^+, v_2^+, \ldots, v_n^+ \},\quad A^- = \{ v_1^-, v_2^-, \ldots, v_n^- \}$ (19)

Step 4. Calculate the distance of each alternative from the PIS and the NIS:

$D_j^+ = \sqrt{\sum_{i=1}^{n} (v_{ij} - v_i^+)^2},\quad D_j^- = \sqrt{\sum_{i=1}^{n} (v_{ij} - v_i^-)^2},\quad j = 1, 2, \ldots, J$ (20)

Step 5. Calculate the closeness coefficient of each alternative ($C_i$) relative to its distances from the PIS and the NIS:

$C_i = \dfrac{D_i^-}{D_i^+ + D_i^-}$ (21)

Step 6. Compare the $C_i$ values to determine the ranking of the alternatives.

This research begins by agreeing on the criteria to be considered in selecting a suitable Indonesian maritime security policy, and then determines the alternatives to be rated, as listed in Table 2. The proposed approach consists of three stages. In the first stage, the policy criteria are compiled into a structured decision hierarchy; the hierarchy must be approved by the policy-making team after weighing both internal and external factors. In the second stage, the criteria are assessed using fuzzy AHP; as explained in Section II, linguistic values are used to determine the criteria weights. In the third stage, the Indonesian maritime security policy models are ranked using the TOPSIS procedure. The ranking is based on the $C_i$ values in descending order, and the alternative with the maximum $C_i$ value is selected as the best policy model. The schematic diagram of the proposed model is shown in Figure 3.

III. Results and Discussions

An integrated maritime security policy requires the involvement of many actors in decision making, from both the state sector and the civil sector. The concept of the world maritime axis is essentially a maritime security concept with prominent characteristics, focusing on several aspects: national, economic, environmental, and human security.
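The six TOPSIS steps above can be sketched in code. This is a minimal editorial illustration assuming all criteria are benefit criteria; the function and variable names are ours, not from the paper.

```python
import math

def topsis(matrix, weights):
    """TOPSIS steps 1-6 (Eqs. (17)-(21)) for a decision matrix whose
    rows are alternatives and whose columns are benefit criteria.
    Returns the closeness coefficient C_i of each alternative."""
    n_alt, n_crit = len(matrix), len(matrix[0])
    # Step 1: vector-normalise each column, Eq. (17)
    roots = [math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_alt)))
             for j in range(n_crit)]
    r = [[matrix[i][j] / roots[j] for j in range(n_crit)]
         for i in range(n_alt)]
    # Step 2: weight the normalised matrix, Eq. (18)
    v = [[weights[j] * r[i][j] for j in range(n_crit)]
         for i in range(n_alt)]
    # Step 3: positive and negative ideal solutions, Eq. (19)
    a_pos = [max(v[i][j] for i in range(n_alt)) for j in range(n_crit)]
    a_neg = [min(v[i][j] for i in range(n_alt)) for j in range(n_crit)]
    # Step 4: distances to the PIS and the NIS, Eq. (20)
    d_pos = [math.sqrt(sum((v[i][j] - a_pos[j]) ** 2 for j in range(n_crit)))
             for i in range(n_alt)]
    d_neg = [math.sqrt(sum((v[i][j] - a_neg[j]) ** 2 for j in range(n_crit)))
             for i in range(n_alt)]
    # Steps 5-6: closeness coefficients, Eq. (21); rank by descending C_i
    return [d_neg[i] / (d_pos[i] + d_neg[i]) for i in range(n_alt)]
```

An alternative that dominates every criterion receives $C_i = 1$, while a dominated one receives $C_i = 0$, so ranking by descending $C_i$ (step 6) is immediate.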
Maritime security policy

Security policy criteria:
Code | Description
C1 | Politics
C2 | Economics
C3 | Social-culture
C4 | Technology
C5 | Infrastructure
C6 | Regulation

Alternative security policies:
Code | Description
A1 | Multy agency single task
A2 | Multy agency multy task
A3 | Single agency multy task
A4 | Single agency single task

Indonesia has twelve law enforcement agencies at sea. Of the twelve, six institutions operate patrol boats as law enforcement tools and conduct patrols at sea: the Navy, the Indonesian National Police, the Ministry of Defense, the Ministry of Maritime Affairs and Fisheries, the Ministry of Transportation, and Customs and Excise. The other six marine law enforcement agencies do not have patrol boats: the Ministry of Foreign Affairs, the Ministry of Home Affairs, the Ministry of Law and Human Rights, the Attorney General's Office, the Ministry of Finance, and the State Intelligence Agency. The proposed method therefore analyzes this situation and simplifies the decision-making process. For this purpose, a team of ten experts was formed, drawn from the Indonesian Ocean Security Agency, the Navy, the Indonesian National Police, the Ministry of Defense, the Ministry of Maritime Affairs and Fisheries, the Ministry of Transportation, Customs and Excise, and the authors. The experience and perspectives of these members are used throughout the study by following the described procedures.

A. Identification of Criteria

The important criteria for comparing the Indonesian marine security policy models were determined by the team of experts based on their backgrounds and experience. The team agreed that the problem has six important criteria, as shown in Figure 3. The next step is a pairwise comparison between the criteria to determine the most important ones; the assessment between criteria uses the fuzzy AHP approach. The comparison scheme between criteria can be seen in Table 3.
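The pairwise assessment turns each linguistic judgment into a TFN cell of the comparison matrix using the scale of Table 1, with the mirrored cell obtained via the TFN inverse of equation (3). A small sketch (the dictionary and function names are ours):

```python
# Mapping from the AHP scale of Table 1 to TFNs. The reciprocal of a
# judgment, used for the mirrored cell of the pairwise comparison
# matrix, follows the TFN inverse of Eq. (3).
TFN_SCALE = {
    9: (7.0, 9.0, 9.0),   # absolute
    7: (5.0, 7.0, 9.0),   # very strong
    5: (3.0, 5.0, 7.0),   # fairly strong
    3: (1.0, 3.0, 5.0),   # weak
    1: (1.0, 1.0, 3.0),   # equal
}

def reciprocal(tfn):
    """If criterion i over j is (l, m, u), j over i is (1/u, 1/m, 1/l)."""
    l, m, u = tfn
    return (1.0 / u, 1.0 / m, 1.0 / l)
```

For example, a "fairly strong" judgment of C4 over C3 gives the cell (3, 5, 7), and the opposite cell C3 over C4 becomes (1/7, 1/5, 1/3).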
The decision hierarchy consists of three levels: the overall objective, "selection of the Indonesian maritime security policy", at the first level; the criteria at the second level; and the alternatives at the third level.

B. Weight of Criteria and Alternatives

The calculation of the fuzzy synthesis value estimates the overall value of each desired criterion and alternative, starting from the pairwise comparison matrix presented in Table 3. Afterwards, the values of each row of the matrix in Table 3 are summed, and the sums are divided by the number of criteria to obtain the eigenvector, i.e. the weight of each criterion. The calculation results can be seen in Table 4.

Fig. 3. Hierarchy of the selection of Indonesian maritime security policies

Table 3. The pairwise comparison matrix for criteria (each cell is a TFN (l, m, u); columns correspond to C1–C6)

C1: (1.00, 1.00, 3.00) (1.00, 0.33, 0.20) (3.00, 5.00, 7.00) (0.20, 0.14, 0.11) (1.00, 0.33, 0.20) (0.33, 0.20, 0.14)
C2: (1.00, 3.00, 5.00) (1.00, 1.00, 3.00) (3.00, 5.00, 7.00) (0.33, 0.20, 0.14) (1.00, 1.00, 3.00) (1.00, 0.33, 0.20)
C3: (0.33, 0.20, 0.14) (0.33, 0.20, 0.14) (1.00, 1.00, 3.00) (0.20, 0.14, 0.11) (0.33, 0.20, 0.14) (0.20, 0.14, 0.11)
C4: (5.00, 7.00, 9.00) (3.00, 5.00, 7.00) (5.00, 7.00, 9.00) (1.00, 1.00, 3.00) (5.00, 7.00, 9.00) (1.00, 3.00, 5.00)
C5: (1.00, 3.00, 5.00) (1.00, 1.00, 0.33) (3.00, 5.00, 7.00) (0.20, 0.14, 0.11) (1.00, 1.00, 3.00) (1.00, 0.33, 0.20)
C6: (3.00, 5.00, 7.00) (1.00, 3.00, 5.00) (5.00, 7.00, 9.00) (1.00, 0.33, 0.20) (1.00, 3.00, 5.00) (1.00, 1.00, 3.00)
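The row sums in Table 4 can be checked directly from Table 3. As an illustration (our code, the paper's figures), the fuzzy row sum of equation (5) applied to the C4 row reproduces the C4 column of Table 4:

```python
# Check of Table 4's C4 column: the fuzzy row sum (Eq. (5)) of the
# C4 row of the pairwise comparison matrix in Table 3.
c4_row = [(5.0, 7.0, 9.0), (3.0, 5.0, 7.0), (5.0, 7.0, 9.0),
          (1.0, 1.0, 3.0), (5.0, 7.0, 9.0), (1.0, 3.0, 5.0)]
row_sum = tuple(sum(cell[k] for cell in c4_row) for k in range(3))
# row_sum is (20.0, 30.0, 42.0), matching Table 4
```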
After the row sums and column totals are obtained, the fuzzy synthesis value of each criterion, $S_{ki}$ with $i = 1, 2, \ldots, 6$, is calculated as:

$S_{ki}$ = (row sum $(l, m, u)$) $\otimes$ (inverse of the grand total $(l, m, u)$)

$S_{k1}$ = (6.53, 7.01, 10.65) ⊗ (0.01, 0.01, 0.02) = (0.05, 0.09, 0.19)
$S_{k2}$ = (7.33, 10.53, 18.34) ⊗ (0.01, 0.01, 0.02) = (0.06, 0.13, 0.33)
$S_{k3}$ = (2.40, 1.89, 3.65) ⊗ (0.01, 0.01, 0.02) = (0.02, 0.02, 0.07)
$S_{k4}$ = (20.00, 30.00, 42.00) ⊗ (0.01, 0.01, 0.02) = (0.17, 0.38, 0.76)
$S_{k5}$ = (7.20, 10.48, 15.64) ⊗ (0.01, 0.01, 0.02) = (0.06, 0.13, 0.28)
$S_{k6}$ = (12.00, 19.33, 29.20) ⊗ (0.01, 0.01, 0.02) = (0.10, 0.24, 0.53)

The fuzzy synthesis values are summarised in Table 5. The determination of the vector value ($V$) and the ordinate of defuzzification uses a fuzzy approach, namely the fuzzy minimum (min) implication function, from which the ordinate of defuzzification ($d'$), the minimum $V$ value, is obtained. Based on Table 5, the vector values and defuzzification ordinates of each criterion are computed. For instance, for the first criterion, politics (K1), the vector value is $V(S_{k1} \geq S_{k2}, S_{k3}, S_{k4}, S_{k5}, S_{k6})$. Since $m_1 < m_2$ while $l_2 < u_1$, the third case of equation (12) applies when comparing $S_{k1}$ with $S_{k2}$, and the pairwise comparisons give:

$V(S_{k1} \geq S_{k2})$ = 0.86
$V(S_{k1} \geq S_{k3})$ = 1
$V(S_{k1} \geq S_{k4})$ = 0.67
$V(S_{k1} \geq S_{k5})$ = 0.83
$V(S_{k1} \geq S_{k6})$ = 0.73

Afterwards, the ordinate value $d'(V_{k1})$ is obtained as the minimum of these; once finished, the calculation is repeated for the other criteria:

$d'(V_{k1})$ = min(0.86, 1, 0.67, 0.83, 0.73) = 0.67

Table 4. Row sums per criterion, grand totals, and their inverse values

TFN | C1 | C2 | C3 | C4 | C5 | C6 | Total | Inverse
l | 6.53 | 7.33 | 2.40 | 20.00 | 7.20 | 12.00 | 55.47 | 0.01
m | 7.01 | 10.53 | 1.89 | 30.00 | 10.48 | 19.33 | 79.24 | 0.01
u | 10.65 | 18.34 | 3.65 | 42.00 | 15.64 | 29.20 | 119.49 | 0.02
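The synthesis values can be verified numerically; for example, $S_{k4}$ (technology) follows from the Table 4 row sum and the unrounded inverse of the grand totals, per equations (4) and (8). The variable names below are our own:

```python
# Check of S_k4 (technology): the Table 4 row sum (20.00, 30.00, 42.00)
# times the inverse of the grand totals (55.47, 79.24, 119.49),
# following Eqs. (4) and (8).
row_sum = (20.00, 30.00, 42.00)
grand_total = (55.47, 79.24, 119.49)
inverse = (1.0 / grand_total[2], 1.0 / grand_total[1], 1.0 / grand_total[0])
s_k4 = tuple(round(a * b, 2) for a, b in zip(row_sum, inverse))
# s_k4 comes out as (0.17, 0.38, 0.76), matching Table 5
```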
Table 5. Results of the fuzzy synthesis calculations ($S_i$)

Criteria | l | m | u
C1 | 0.05 | 0.09 | 0.19
C2 | 0.06 | 0.13 | 0.33
C3 | 0.02 | 0.02 | 0.07
C4 | 0.17 | 0.38 | 0.76
C5 | 0.06 | 0.13 | 0.28
C6 | 0.10 | 0.24 | 0.53

Based on the ordinate values of K1–K6, the weight vector can be determined as follows. The total of the ordinate values is 3.52, so the vector weight of the first criterion is $V_{k1}$ = 0.67/3.52 = 0.19. The calculation results for the other criteria are presented in Table 6. Finally, the vector weights of all criteria are summed, $W = V_{k1} + V_{k2} + \ldots + V_{k6}$ = 0.19 + 0.20 + 0.18 + 0.28 + 0.20 + 0.23 = 1.28, and each weight is normalized; for the first criterion, $W_{k1}$ = 0.19/1.28 = 0.15. The third column of Table 6 shows the results for the other criteria.

The criteria weights obtained with the fuzzy AHP method are: politics = 0.15; economy = 0.16; social and culture = 0.14; technology = 0.22; infrastructure = 0.16; and regulation = 0.18. The most influential factor is therefore technology.

The next stage distributed questionnaires to several respondents (Ministry of Maritime Affairs and Fisheries, Indonesian Navy, Ministry of Defense, Ministry of Transportation, police, and academics) who understand and hold authority over Indonesian maritime security policies. The questionnaire is used to assess the consistency of each alternative, using the value scale detailed in Table 7.

Table 6. The calculated weight of each criterion

Criteria | d'(Vk) | Vk | Wk
1 | 0.67 | 0.19 | 0.15
2 | 0.71 | 0.20 | 0.16
3 | 0.62 | 0.18 | 0.14
4 | 1.00 | 0.28 | 0.22
5 | 0.71 | 0.20 | 0.16
6 | 0.81 | 0.23 | 0.18
Total | 3.52 | 1.28 |

Table 7. Questionnaire value scale

Variable | Code | Value
Very poor | VP | 1
Poor | P | 2
Fair | F | 3
Good | G | 4
Very good | VG | 5
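The normalisation step that produces the last column of Table 6 can be reproduced from the $V_k$ column; a small check with our own variable names:

```python
# Reproducing the W_k column of Table 6: each normalised weight is the
# criterion's vector weight V_k divided by the total of the V_k values
# (1.28 in the paper).
vk = [0.19, 0.20, 0.18, 0.28, 0.20, 0.23]   # V_k column of Table 6
wk = [round(v / sum(vk), 2) for v in vk]
# wk is [0.15, 0.16, 0.14, 0.22, 0.16, 0.18], matching Table 6
```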
Table 8. Recapitulation of questionnaire results between criteria and decision alternatives

Alternative | C1 | C2 | C3 | C4 | C5 | C6
A1 | 4.00 | 3.00 | 4.00 | 4.00 | 4.00 | 4.00
A2 | 3.00 | 2.00 | 3.00 | 2.00 | 4.00 | 2.00
A3 | 5.00 | 4.00 | 4.00 | 5.00 | 4.00 | 5.00
A4 | 2.00 | 5.00 | 3.00 | 4.00 | 4.00 | 3.00
Average | 3.50 | 3.50 | 3.50 | 3.75 | 4.00 | 3.50

About 100 respondents filled in the questionnaire; the recapitulated results are shown in Table 8. After obtaining the comparative values of criteria and alternatives, the next step is to compute the sums of squares and the roots of the assessment results for TOPSIS; Table 9 captures this calculation. Every element of Table 8 is then divided by its corresponding root in Table 9, producing the TOPSIS normalization matrix in Table 10. The next step obtains the weighted normalization matrix by multiplying the TOPSIS normalization matrix with the criteria weight vector of Table 6. The resulting weighted normalization matrix of TOPSIS with fuzzy AHP can be seen in Table 11.

Table 9. Calculated sums of squares and roots

Criteria | C1 | C2 | C3 | C4 | C5 | C6
Square | 54.00 | 54.00 | 50.00 | 61.00 | 64.00 | 54.00
Root | 7.35 | 7.35 | 7.07 | 7.81 | 8.00 | 7.35

Table 10. Normalization matrix

Alternative | C1 | C2 | C3 | C4 | C5 | C6
A1 | 0.54 | 0.41 | 0.57 | 0.51 | 0.50 | 0.54
A2 | 0.41 | 0.27 | 0.42 | 0.26 | 0.50 | 0.27
A3 | 0.68 | 0.54 | 0.57 | 0.64 | 0.50 | 0.68
A4 | 0.27 | 0.68 | 0.42 | 0.51 | 0.50 | 0.41

Table 11. The weighted normalization matrix

Alternative | C1 | C2 | C3 | C4 | C5 | C6
A1 | 0.081 | 0.064 | 0.078 | 0.071 | 0.078 | 0.098
A2 | 0.061 | 0.043 | 0.059 | 0.035 | 0.078 | 0.049
A3 | 0.101 | 0.085 | 0.078 | 0.088 | 0.078 | 0.123
A4 | 0.040 | 0.106 | 0.059 | 0.071 | 0.078 | 0.074

Table 12. The alternatives' distances to the positive and negative ideal solutions

Value | C1 | C2 | C3 | C4 | C5 | C6
Maximum | 0.101 | 0.106 | 0.078 | 0.088 | 0.078 | 0.123
Minimum | 0.040 | 0.043 | 0.059 | 0.035 | 0.078 | 0.049
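The normalisation and weighting of a single column can be checked numerically; here the C1 column of Tables 9–11 is recomputed from the questionnaire scores of Table 8 and the C1 weight (0.15) of Table 6. The variable names are ours:

```python
import math

# Recomputing the C1 column of Tables 9-11 from the questionnaire
# scores of Table 8 and the C1 criterion weight (0.15) of Table 6.
scores_c1 = [4.0, 3.0, 5.0, 2.0]                 # A1..A4 on criterion C1
square = sum(s ** 2 for s in scores_c1)          # 54.00, as in Table 9
root = math.sqrt(square)                         # about 7.35
normalised = [round(s / root, 2) for s in scores_c1]   # Table 10 column
weighted = [round(0.15 * n, 3) for n in normalised]    # Table 11 column
```

The normalised column comes out as [0.54, 0.41, 0.68, 0.27], matching Table 10, and multiplying by 0.15 reproduces the C1 column of Table 11 to within rounding.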
Table 13. Quadratic values of the alternatives

Alternative | Benefit | Cost
A1 | 0.003 | 0.006
A2 | 0.014 | 0.000
A3 | 0.000 | 0.014
A4 | 0.007 | 0.006

Table 14. Root values of the alternatives

Alternative | Benefit | Cost
A1 | 0.056 | 0.078
A2 | 0.120 | 0.020
A3 | 0.021 | 0.119
A4 | 0.082 | 0.077

The next step finds the positive and negative ideal values and determines the distance of each alternative from them, as shown in Table 12. The following stage determines the square and root values of the distances to the positive ideal solution (PIS) and the negative ideal solution (NIS), presented in Table 13 and Table 14, respectively. The last step of the TOPSIS calculation is to find the preference value of each alternative; a larger $V_i$ indicates that alternative $A_i$ is preferred. The preference values are obtained as follows:

The preference value of the multy agency single task model: $V_1 = \dfrac{0.078}{0.056 + 0.078} = 0.583$

The preference value of the multy agency multy task model: $V_2 = \dfrac{0.020}{0.120 + 0.020} = 0.144$

The preference value of the single agency multy task model: $V_3 = \dfrac{0.119}{0.021 + 0.119} = 0.848$

The preference value of the single agency single task model: $V_4 = \dfrac{0.077}{0.082 + 0.077} = 0.483$

Having obtained the priority value of each alternative, the decision values are normalized, as shown in Table 15. Based on this analysis of the marine security concepts, the ranking of their suitability for Indonesia is as follows. First is the concept of single agency multy task, in which all policies for handling law enforcement at sea reside in one institution. Second is multy agency single task, in which more than one institution interacts to achieve the same goal or to solve similar problems. Third is single agency single task, in which a single institution both regulates and implements the regulations.
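The preference values follow from the root values of Table 14 via equation (21). Recomputing from the rounded table figures (our code) lands within about ±0.01 of the paper's values:

```python
# Step 5 of TOPSIS applied to the root values of Table 14: each
# preference value is D- / (D+ + D-), Eq. (21).
d_pos = [0.056, 0.120, 0.021, 0.082]   # distance to the PIS, A1..A4
d_neg = [0.078, 0.020, 0.119, 0.077]   # distance to the NIS, A1..A4
pref = [n / (p + n) for p, n in zip(d_pos, d_neg)]
# pref is close to the paper's (0.583, 0.144, 0.848, 0.483);
# A3 (single agency multy task) has the highest preference value
```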
This third alternative is not feasible to apply because Indonesia already has many institutions with maritime security authority. Fourth is multy agency multy task, in which many institutions hold the duty and authority of oversight and marine security; this last concept has many weaknesses, especially inter-agency sectoral egos.

Table 15. Priority values of the alternative Indonesian marine security concepts

Alternative | Priority value | Normalized priority value
A1 | 0.583 | 0.283
A2 | 0.144 | 0.070
A3 | 0.848 | 0.412
A4 | 0.483 | 0.235

Fig. 4. Results of the priority model of marine security in Indonesia (priority values of the marine security concepts using fuzzy AHP-TOPSIS)

Figure 4 illustrates that the concept of "single agency multy task" makes the greatest contribution to addressing marine security and law enforcement issues in Indonesia. In contrast, the other concepts may be difficult to implement and may cause problems such as overlapping authority, conflicts between law enforcement officers, and the absence of command and control units at sea. The implementation of the single agency multy task system in Indonesia can be achieved by optimizing the synergy between the authority, strength, and capability of the stakeholders.

IV. Conclusions

In this paper, an integrated two-step model using fuzzy AHP and TOPSIS was established to choose the optimal maritime security policy for realizing Indonesia as the world's maritime axis. The evaluation is based on several criteria: politics, economy, social and culture, technology, infrastructure, and regulation. The fuzzy AHP found that the criterion that most strongly affects the improvement of marine security is technology, followed by regulation, infrastructure, economy, politics, and socio-culture.
The order of priority of Indonesian maritime security policies based on TOPSIS is: single agency multy task (0.412), multy agency single task (0.283), single agency single task (0.235), and multy agency multy task (0.070). The best recommendation for Indonesian marine security policy is therefore “single agency multy task”. This concept is believed to make a major contribution to overcoming various problems in the enforcement of security and safety laws in Indonesian seas; it places maritime security policy under a single institution that issues one command.

Acknowledgment

This research is supported by the Ministry of Research, Technology and Higher Education of the Republic of Indonesia through the National Strategic Research Grant.

References

[1] I. N. P. A, A. Hakim, S. H. Pramono, and A. S. Leksono, “The effect of strategic environment change toward Indonesia maritime security: threat and opportunity,” Int. J. Appl. Eng. Res., vol. 12, no. 16, pp. 6037–6044, 2017.
[2] A. P. Lis Gindarsah, “Indonesia’s maritime doctrine and security concerns,” 2014.
[3] I. Chapsos and J. A. Malcolm, “Maritime security in Indonesia: towards a comprehensive agenda?,” Mar. Policy, vol. 76, pp. 178–184, 2017.
[4] G. Wasito, “The authority of Bakamla in the enforcement of certain criminal acts at sea based on Law No. 32 of 2014 concerning maritime affairs,” 2015.
[5] U. Muawanah et al., “Review of national laws and regulation in Indonesia in relation to an ecosystem approach to fisheries management,” Mar. Policy, vol. 91, pp. 150–160, 2018.
[6] A. H. I Nengah Putra A, “Analyze opportunities and threats of Indonesian maritime security as a result of the development of a strategic environment,” 2016.
[7] M. Ilangkumaran and S. Kumanan, “Selection of maintenance policy for textile industry using hybrid multi-criteria decision making approach,” J. Manuf. Technol. Manag., vol. 20, no. 7, pp. 1009–1022, 2009.
[8] C.
Bueger, “What is maritime security?,” Mar. Policy, vol. 53, pp. 159–164, 2015.
[9] S. A. Ghassemi and S. Danesh, “A hybrid fuzzy multi-criteria decision making approach for desalination process selection,” Desalination, vol. 313, pp. 44–50, 2013.
[10] H. Lumaksono, “The selection of suitable fishing gear for fishermen in Madura island using fuzzy AHP and fuzzy TOPSIS,” Ecoterra, vol. 15, no. 2, pp. 34–51, 2018.
[11] O. Gottfried et al., “SWOT-AHP-TOWS analysis of private investment behavior in the Chinese biogas sector,” J. Clean. Prod., 2018.
[12] L. A. Zadeh, “Fuzzy sets,” Inf. Control, vol. 8, no. 3, pp. 338–353, 1965.
[13] S. H. Zyoud et al., “A framework for water loss management in developing countries under fuzzy environment: integration of fuzzy AHP with fuzzy TOPSIS,” Expert Syst. Appl., vol. 36, no. 1, pp. 61–67, 2012.
[14] X. Yu, S. Zhang, X. Liao, and X. Qi, “ELECTRE methods in prioritized MCDM environment,” Inf. Sci., vol. 424, pp. 301–316, 2018.
[15] A. Zandi and E. Roghanian, “Extension of fuzzy ELECTRE based on VIKOR method,” Comput. Ind. Eng., vol. 66, no. 2, pp. 258–263, 2013.
[16] S. Corrente, S. Greco, and R. Słowiński, “Multiple criteria hierarchy process with ELECTRE and PROMETHEE,” Omega, vol. 41, no. 5, pp. 820–846, 2013.
[17] A. Loganathan and I. Mani, “A fuzzy based hybrid multi criteria decision making methodology for phase change material selection in electronics cooling system,” Ain Shams Eng. J., vol. 9, no. 4, pp. 2943–2950, 2018.
[18] S. Dožić, T. Lutovac, and M. Kalic, “Fuzzy AHP approach to passenger aircraft type selection,” J. Air Transp. Manag., vol. 68, pp. 165–175, 2018.
[19] D.-Y. Chang, “Applications of the extent analysis method on fuzzy AHP,” Eur. J. Oper. Res., vol. 95, no. 3, pp. 649–655, 1996.
[20] P. Sirisawat and T.
Kiatcharoenpol, “Fuzzy AHP-TOPSIS approaches to prioritizing solutions for reverse logistics barriers,” Comput. Ind. Eng., vol. 117, pp. 303–318, 2018.
[21] T. Kaya and C. Kahraman, “Multicriteria decision making in energy planning using a modified fuzzy TOPSIS methodology,” Expert Syst. Appl., vol. 38, no. 6, pp. 6577–6585, 2011.
[22] Y. K. Hozairi, “Decision support system determination of main work unit in WPP-711 using fuzzy TOPSIS,” Knowl. Eng. Data Sci., vol. 1, no. 1, pp. 8–19, 2018.
[23] M. S. Problem, “Combined fuzzy AHP and TOPSIS method for solving location problem,” vol. 8, pp. 373–383, 2006.
[24] A. T. Gumus, “Evaluation of hazardous waste transportation firms by using a two step fuzzy-AHP and TOPSIS methodology,” Expert Syst. Appl., vol. 36, no. 2, pp. 4067–4074, 2009.
[25] G. Büyüközkan and G. Çifçi, “A combined fuzzy AHP and fuzzy TOPSIS based strategic analysis of electronic service quality in healthcare industry,” Expert Syst. Appl., vol. 39, pp. 2341–2354, 2012.
[26] R. K. Shukla, D. Garg, and A. Agarwal, “An integrated approach of fuzzy AHP and fuzzy TOPSIS in modeling supply chain coordination,” Prod. Manuf. Res., vol. 2, no. 1, pp. 415–438, 2014.

Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 2, No 1, June 2019, pp. 31–40, https://doi.org/10.17977/um018v2i12019p31-40. ©2019 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

High Dimensional Data Clustering Using Self-Organized Map

Ruth Ema Febrita a, 1, Wayan Firdaus Mahmudy a, 2, *, Aji Prasetya Wibawa b, 3

a Department of Computer Science, Universitas Brawijaya, Jl. Veteran, Malang 65145, Indonesia
b Department of Electrical Engineering, State University of Malang, Jl. Semarang No.
5, Malang 65145, Indonesia
1 ruthemaf@gmail.com; 2 wayanfm@ub.ac.id*; 3 aji.prasetya.ft@um.ac.id
* corresponding author

I. Introduction

Houses have been a promising investment commodity in the last decade. The house price index in Indonesia has experienced inflation, as has the level of home property sales [1]. The price is influenced by several factors, such as the interest rate, inflation on house ownership loans, inflation in building material prices, and inflation in workers' minimum wages. Many different types of houses are offered with various features, which sometimes confuses prospective buyers when determining their choice. Three further factors that influence house pricing are physical attributes, accessibility, and developer reputation [2]. Physical attributes are house attributes that are visible and measurable, such as the land area, building area, number of rooms, number of bathrooms, and the availability of a living room. Accessibility refers to the house location, which determines the ease of access to public facilities such as hospitals, schools, campuses, and markets. Commonly, the closer a house is to many public facilities, the more expensive its price tends to be. Other economic phenomena that can affect house prices are the interest rate, inflation, and the gross domestic product (GDP) [3]. Given the many features considered in determining house prices, housing data are classified as high-dimensional data. In previous studies, neural networks have been used to predict house prices [4][5][6][7][8]. Several regression approaches for predicting house prices from time-series data have also been reported [9][10]. However, when using a neural network or regression techniques, all feature values must be complete, which is less applicable in real conditions, because the information received from prospective buyers is not always equally complete.

Article history: Received 7 February 2019; Revised 1 April 2019; Accepted 6 April 2019; Published online 23 June 2019.

Abstract: As the population grows and the economy develops, houses are among the basic needs of every family; housing investment therefore has promising future value. This research implements the self-organized map (SOM) algorithm to cluster house data, providing several house groups based on various features. K-means is used as the baseline for the proposed approach. SOM achieves a higher silhouette coefficient (0.4367) than its comparison (0.236). Thus, the method outperforms k-means in visualizing high-dimensional data clusters, and is also better at cluster formation and at regulating the data distribution. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords: house clustering; k-means; self-organized map; SOM; Kohonen

R.E. Febrita et al. / Knowledge Engineering and Data Science 2019, 2 (1): 31–40
Although a neural network can replace a missing input value through an interpolation mechanism, the replacement value is assigned under the assumption that it is related to the other variables, which is not always correct in the house pricing case. For example, suppose the first data point has a land area of 60 square meters and a building area of 30 square meters, while the second has a land area of 100 square meters and a building area of 50 square meters. If a third data point has a land area of 150 square meters but a missing value for the building area, interpolation will return 75 square meters as the replacement value, based on the assumption drawn from the two previous data points that the building area is half of the land area.
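The pitfall can be made concrete with the text's own figures; the code below is an editorial illustration (the variable names are ours):

```python
# Numeric illustration of the interpolation pitfall: imputing a missing
# building area from the land area under the learned
# "building = land / 2" assumption, using the example figures above.
known = [(60.0, 30.0), (100.0, 50.0)]   # (land m^2, building m^2)
ratio = sum(b / l for l, b in known) / len(known)   # 0.5 in both samples
imputed_building = ratio * 150.0        # 75.0 m^2 for the third house
# A two-storey house on a 150 m^2 plot can have more than 150 m^2 of
# floor area, so the imputed 75 m^2 may be far from the truth.
```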
The result of such interpolation in housing cases is not always correct, because the building area of a house may be greater than the land area if the house has more than one floor. Therefore, using interpolation in house price prediction can cause inaccurate results. This study instead performs a clustering approach to produce house price recommendations. The clustering approach can extract the feature values of each cluster, which can then be used as a price recommendation. The clustering process compares all data in the dataset and groups them based on the similarity of their features. It is expected to provide price-range information that matches the features already known to prospective buyers, making the house price prediction process more applicable. Moreover, if each produced cluster can be easily distinguished from the other clusters, the inter-cluster feature values will not overlap. This makes it much easier to deal with data that contain missing values: the price recommendation process can still be carried out by looking at the values of the known attributes and ignoring the unknown ones.

Some previous research has implemented a clustering approach for prediction tasks. Two-stage clustering has been used for predicting rented house prices [11]. The idea of this method is to form the rented house data into clusters based on location, then create a prediction model using a linear regression neural network for each cluster formed. The clustering considers the house location because that research holds that the nearer a house is to many public facilities (landmarks), the higher the rental price. With this hybrid method an effective clustering can be created, although accurate rent price prediction still needs further improvement.
A two-stage clustering method using k-means and a fuzzy inference system has also been applied to cluster house data [12]. The data are clustered into four predefined clusters based on house price: cheap, medium, expensive, and very expensive. The clustering method was used to see how the location of a house affects its price. After the clusters were formed, the centroid feature values of each cluster were obtained and used as initial values to build a fuzzy inference system. That research shows that the fuzzy clustering system cannot predict the same cluster as k-means, meaning that the accuracy of the house price prediction is still low. Similar work has been done by [13], which tried to predict house prices using three different methods: fuzzy logic, an artificial neural network, and k-NN. Another clustering method, fuzzy c-means, has also been used in hybrid prediction methods. A hybrid of fuzzy c-means and a regression technique was used to predict the workload of a new driver [14]: fuzzy c-means generated a driver workload model based on the previously derived regression. Meanwhile, [15][16] developed fuzzy c-means for predicting software faults by using it as a feature extraction method.

On the other hand, SOM has been applied to classify and label transient data signals [17]. Sequences of stable and transient phases are extracted from time-series signal data obtained from aircraft engines during flights. SOM clusters and labels the transient data by checking the similarity of the patterns, and the accuracy of the labeled transient signals is excellent in robustness and visualization. A generalized SOM (GSOM), an improvement of SOM, has also been studied [18]. Its special characteristic is that it can automatically determine the best number of clusters and also the shape of each cluster by using a 1-D neighborhood method.
The 1-D neighborhood method is represented as a chain of neurons that can automatically disconnect from and connect to other neurons. SOM has also been implemented in the health sector to classify and predict female subjects with unhealthy visceral fat levels in Japan [19]. A map topology is formed from the neurons, where each neuron stores 13 health parameters used to detect visceral fat. This map topology is trained using the SOM algorithm, and each neuron is given a label that represents a visceral fat level. Test data prediction is done by finding the winning neuron, i.e. the neuron whose stored feature values are closest or most similar to the data.

In this research, the self-organized map (SOM, Kohonen) is implemented to cluster high-dimensional housing data. SOM was chosen since it works with topologically arranged neurons, where each neuron has different feature values. SOM also has a neighborhood weight-updating mechanism, which causes adjacent neurons to have similar characteristics; in other words, it is expected to improve the cluster visualization. This research uses k-means as the comparison approach, and the performance of the two methods is compared to discover the better algorithm for clustering high-dimensional house data.

II. Methods

A. Dataset

The dataset consists of 189 housing data records obtained from property exhibitions held in March and August 2017. All records have different values of physical attributes and locations, and have valid prices determined by the developer and valid until December 2017. This research uses [20] to obtain the exact number of public facilities around each house location within a 1000-meter radius. All feature values are normalized to optimize the clustering process. The features that build this house dataset are shown in Table 1.
meanwhile, the complete dataset can be accessed through [21].
b. som clustering
som is a type of neural network which is categorized as an unsupervised algorithm. som is built using one or more layers of neurons and can be described as a topological map of neurons. in general, the som algorithm works by finding the neuron whose weights are most similar to the data, which is then called the winning neuron, and then updating the weights of the surrounding neurons within the neighboring radius to form a cluster of neurons with similar weights. the applied som algorithm is detailed as in [22]:
• initialization. in this first step, some som parameters, such as the weight vectors of the neurons, the map size, the learning rate, and the radius of the neighborhood update (nc), need to be initialized. a two-dimensional rectangular map grid will be used in this research, while the size of the map will be tested to obtain the size which performs the best clustering result. meanwhile, each neuron contains the set of feature values already described in table 1. the learning rate represents how fast the algorithm learns in each iteration. the radius of the neighborhood update refers to the number of neurons around the winning neuron that will be updated.
• obtaining the winning neuron. each data vector (x) in the dataset will be compared to each neuron's weights (wi) contained in the topological map, and the data similarity (d) will be calculated using the euclidean distance, as written in (1). the neuron that has the closest distance to the data will be called the winning neuron (c).

d = √( ∑j (xj − wij)² ) (1)

• neighborhood weights update. this step is an effort to make the weights of the adjacent neurons similar. updating the weights is done using equations (2) and (3).
wi(t + 1) = wi(t) + hci(t) [x(t) − wi(t)] (2)

hci(t) = α(t) exp(−‖ri − rc‖² / σ²) (3)

hci(t) is the learning rate α(t) for all neurons within nc, and hci(t) = 0 for all neurons outside nc. ri and rc are the grid positions of neuron i and the winning neuron c, and σ = nc. the distance between neuron i and neuron c (‖ri − rc‖) is calculated based on the neurons' positions in the grid map.
• stopping criteria. the stopping criteria are determined using (4), where e is the minimum allowable change of the neuron weights between the corresponding iteration (t) and the previous iteration (t − 1), and n × m is the size of the map:

e = ∑i ∑j |wij(t) − wij(t − 1)| / (n × m) (4)

c. defining the cluster
there are several assumptions used in defining clusters. for convenience, the topological map is described in figure 1, where each cell describes a neuron and the number written in each cell illustrates the amount of data that best matches the weights of that neuron. the more detailed illustration is explained as follows:
• a cluster should have at least two matching data in one of its neurons, or it may have only one matching data point together with other data in the adjacent neurons. in figure 1a, the red cell does not meet the condition to form a cluster.
• if there are two or more cells separated by empty cells (not adjacent), then every cell is considered a separate cluster (figure 1b).
• if there are two or more adjacent cells which have matching data on them, then all of the adjacent cells are considered the same cluster (figure 1c).
table 1.
the list of observed attributes of a house

data attributes                            original units
regency id                                 -
house id                                   -
distance from km 0 *                       kilometer
building area                              meter square
land area                                  meter square
number of hospital                         item
number of clinic / pharmacy                item
number of schools                          item
number of campuses                         item
number of market / mall                    item
number of hotels                           item
number of restaurants                      item
number of recreational park                item
number of public transportation            item
number of worship place                    item
number of bedrooms                         item
number of bathrooms                        item
living room                                meter square
family room                                meter square
kitchen                                    meter square
dining room                                meter square
clothes horse                              meter square
number of floor                            item
warehouse                                  meter square
garage                                     meter square
number of terrace / balcony                item
garden                                     meter square
swimming pool                              meter square
building permission                        -
electrical installation                    -
water channel                              -
certificate of ownership                   -
free fence                                 -
free kitchen set                           -
the cost of making land certificate        -
housing ownership credit interest rates    %
price                                      idr

* km 0 in malang city is in malang square (merdeka selatan street)

thus, the cells that do not have any numbers describe the neurons that have no compatibility with any data in the dataset.
d. the measurement of cluster validity
in order to measure how well a clustering process performs, the silhouette coefficient and the davies-bouldin index are used. the principle of the silhouette coefficient is that a cluster is good enough if the distance between members of the same cluster is small, while the distance between two clusters is large enough that each cluster can be easily recognized and separated from the other clusters. the davies-bouldin index is used to evaluate clustering results by measuring the ratio of the spread of clusters to the distance between clusters. the silhouette coefficient is shown in (5), while the davies-bouldin index is shown in (6) to (8).
sil = (b − a) / max(a, b) (5)

in (5), a is the mean intra-cluster distance, whereas b is the nearest-cluster distance. the value of the silhouette coefficient is in the range [−1, 1]. the most effective clustering is obtained when sil = 1, while the worst value of the silhouette coefficient is sil = −1. a value of sil = 0 indicates that the clusters overlap. in the davies-bouldin index, the spread of each cluster is calculated using (6).

si = (1 / ti) ∑x∈ci ‖x − zi‖ (6)

ti is the number of members in cluster i (ci) and zi is the center of cluster i. the distance between clusters (dij) is calculated using the euclidean distance between the centroid of cluster i and the centroid of cluster j. the ratio between ci and cj is calculated using (7).

rij = (si + sj) / dij (7)

then, the maximum value of the ratio (di = max j≠i rij) is used to calculate the davies-bouldin index over the n clusters, which is shown in (8).

dbi = (1 / n) ∑i di (8)

unlike the silhouette coefficient, the davies-bouldin index (dbi) has a range of [0, 1]. dbi = 0 indicates that the ratio of data distribution in the clusters is very good, while dbi = 1 shows that the ratio of data distribution in the clusters is very bad.
iii. results and discussions
a. som parameter testing
there are several som parameters that will be tested in this research. each parameter value is tested 7 times. the first tested parameter is the radius of the neighborhood update (nc). this parameter determines the area of weight updates for the neurons located around the winning neuron. the closer a neuron's position is to the winning neuron, the more significant its weight change will be, so that its weights become more similar to the weights of the winning neuron. when testing the neighboring radius, the other parameters are temporarily set by default with the following values: the map size is 15×15, the learning rate is α = 0.05,
fig. 1.
the illustration of defined clusters; (a) the minimum condition of a cluster; (b) a separate cluster; (c) a cluster with two adjacent cells
and the maximum error is e = 0.1. percentage values are used for this parameter testing. for example, if the map size is 15×15 and nc = 60%, the radius of the neighborhood update is nc = 9 (9 neurons above, 9 neurons on the left side, 9 neurons below, and 9 neurons on the right side of the winning neuron). table 2 shows the result of this parameter testing. table 2 shows that for each neighboring radius tested, the silhouette coefficient is always negative. this negative silhouette coefficient value can be influenced by the other parameters. the best silhouette coefficient value is obtained when the neighboring radius is 67% of the map size. the best dbi value (the smallest dbi value) is also obtained when the neighboring radius is 67%. the test results show that when nc = 67%, the resulting clusters have a good ratio in terms of the number of clusters and the distance between them, but still do not form effective clusters. a neighboring radius that is too large (80%) causes the cluster boundaries to be less clear, because the area of neurons whose weights are updated is too broad; this can make the weights of many neurons look as if they belong to one cluster. on the other hand, a neighboring radius that is too small causes the cluster formation process to be very slow. in testing the other parameters, the neighboring radius will be set to 67% of the map size. the next tested parameter is the learning rate (α). the results of the learning rate testing are shown in table 3. based on the tests performed, the best silhouette coefficient is 0.0958, which is obtained at α = 0.06, while the best dbi is obtained at α = 0.05.
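the som procedure of section ii.b can be sketched roughly as follows (a minimal illustration under simplifying assumptions, not the authors' implementation; the defaults mirror the values used in the parameter testing, with the radius of 10 corresponding to 67% of a 15×15 map):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, map_size=15, alpha=0.06, radius=10, max_error=0.1, max_iter=100):
    """minimal rectangular-grid som: winner search (eq. 1), gaussian
    neighborhood update (eqs. 2-3), and a weight-change stopping rule (eq. 4)."""
    n_features = data.shape[1]
    weights = rng.random((map_size, map_size, n_features))
    # grid coordinates r_i of every neuron, used for the distance ||r_i - r_c||
    coords = np.stack(np.meshgrid(np.arange(map_size), np.arange(map_size),
                                  indexing="ij"), axis=-1)
    for _ in range(max_iter):
        previous = weights.copy()
        for x in data:
            # winning neuron c: smallest euclidean distance to x (eq. 1)
            dist = np.linalg.norm(weights - x, axis=2)
            c = np.unravel_index(np.argmin(dist), dist.shape)
            # gaussian neighborhood h centered on c with sigma = radius (eq. 3)
            grid_d2 = ((coords - np.array(c)) ** 2).sum(axis=2)
            h = alpha * np.exp(-grid_d2 / radius ** 2)
            h[grid_d2 > radius ** 2] = 0.0  # no update outside the radius nc
            weights += h[:, :, None] * (x - weights)  # eq. 2
        # stop when the mean absolute weight change is small enough (eq. 4)
        e = np.abs(weights - previous).sum() / (map_size * map_size)
        if e < max_error:
            break
    return weights
```

applied to normalized data in [0, 1], the trained map can then be turned into clusters by counting, for each neuron, how many data vectors select it as their winner, following the rules of section ii.c.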
all of the tested learning rate values show that the algorithm needs only 3-4 iterations for the clustering process. this fact shows that the learning rate does not affect the number of iterations. by considering the average silhouette coefficient, the average dbi, the best silhouette coefficient, and the best dbi value, the next parameter testing will use the learning rate α = 0.06. the next tested parameter is the maximum error (e), the stopping criterion of the clustering process. the maximum error in the som testing can be considered the significant change in the neuron weights compared to the weights in the previous iteration. the test results for the stopping criterion are shown in table 4.

table 2. neighborhood radius testing
nc (%)   average of silhouette coef   best of silhouette coef   average of dbi   best of dbi   number of cluster
13       -0.884                       -0.566                    0.7218           0.1461        2
27       -0.927217                    -0.8611686                0.587514         0.1594        2
40       -0.951212                    -0.7657426                0.827013         0.2678        3
53       -0.959121                    -0.8336872                0.795857         0.18          4
67       -0.739308                    -0.4496412                0.552214         0.106         2
80       -1                           -1                        1                1             1

table 3. learning rate testing
α      average of silhouette coef   best of silhouette coef   average of dbi   best of dbi   number of cluster
0.01   -0.88693                     -0.41373                  0.764971         0.3149        2
0.02   -1                           -1                        1                1             1
0.03   -0.91634                     -0.41435                  0.871257         0.0988        1
0.04   -0.82412                     -0.39095                  0.758229         0.1148        2
0.05   -0.85469                     -0.017196                 0.868429         0.079         2
0.06   -0.77649                     -0.0958                   0.791271         0.0979        2
0.07   -0.7262                      -0.27805                  0.615897         0.08138       2
0.08   -0.94572                     -0.77001                  0.885571         0.268         2
0.09   -0.86777                     -0.61331                  0.52072          0.04434       2
0.1    -0.94549                     -0.61842                  0.769143         0.12          4
0.2    -0.9291                      -0.50367                  0.878            0.146         3
0.3    -0.99687                     -0.97811                  0.766324         0.13327       3
0.4    -0.948                       -0.72479                  0.8899           0.4195        3

in the stopping criterion, the smaller the specified error value, the more similar the weights of the map in the current iteration must be to those of the previous iteration.
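the davies-bouldin index reported throughout these tables can be computed directly from equations (6) to (8); the following is an illustrative sketch, not the authors' code:

```python
import numpy as np

def davies_bouldin(points, labels):
    """davies-bouldin index following eqs. (6)-(8): per-cluster spread s_i,
    pairwise ratio r_ij = (s_i + s_j) / d_ij, dbi = mean of the max ratios."""
    clusters = np.unique(labels)
    # z_i: centroid of each cluster; s_i: mean member distance to z_i (eq. 6)
    centroids = np.array([points[labels == c].mean(axis=0) for c in clusters])
    spreads = np.array([np.linalg.norm(points[labels == c] - z, axis=1).mean()
                        for c, z in zip(clusters, centroids)])
    n = len(clusters)
    d_max = []
    for i in range(n):
        # r_ij (eq. 7); keep the worst-case ratio d_i for cluster i
        ratios = [(spreads[i] + spreads[j])
                  / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(n) if j != i]
        d_max.append(max(ratios))
    return sum(d_max) / n  # eq. 8
```

compact clusters with distant centroids give a value near 0, matching the interpretation used in the discussion above.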
based on the test results, the greater the error value, the fewer iterations are needed for the clustering process. the test results also show that at e = 0.45 and e = 0.5 the silhouette coefficient and dbi values increase, both on average and for the best value. this can happen because the clustering process stops as soon as a still-significant change in weights occurs; in this condition there has not yet been much data transfer from one neuron to another, so the data is still quite scattered. when the clusters are quite diffuse, it is possible for the clustering result to obtain better silhouette coefficient and dbi values. based on the testing, the best stopping criterion occurs at e = 0.5, because it shows the best average values of the silhouette coefficient and dbi. however, the best silhouette coefficient values are obtained when e = 0.4. thus the stopping condition is set to e = 0.5 for the next parameter testing. the last tested parameter is the size of the topological map. the test results are shown in table 5. the test results show that the best map size is 30×30. the 10×10 map shows the worst results because the number of neurons in it is much smaller than the number of training data used, which is 189 house data. thus, a neuron can be matched with a lot of house data, so that a good cluster is very difficult to achieve. the 10×10 map also provides very limited distances for separating clusters. as a consequence, the silhouette coefficient will be very small. after the parameter testing, the best number of clusters obtained using som is n = 2, although in some tests the number of clusters can reach up to n = 4. the visualization of the best clustering result is shown in figure 2.
b. k-means result
the following sub-section discusses the implementation of another clustering algorithm as a comparison to the som algorithm.
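the k-means baseline used in this comparison can be sketched as follows (a minimal illustration under the same euclidean-distance setting, not the authors' implementation; initial centroids are drawn randomly from the data):

```python
import numpy as np

rng = np.random.default_rng(2)

def k_means(data, n_clusters=3, max_error=0.01, max_iter=100):
    """plain k-means: assign each point to the nearest centroid, then move
    every centroid to the mean of its members until centroids barely move."""
    centroids = data[rng.choice(len(data), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # nearest-centroid assignment by euclidean distance
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if empty)
        new_centroids = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)])
        converged = np.abs(new_centroids - centroids).max() < max_error
        centroids = new_centroids
        if converged:  # stopping criterion analogous to the maximum error e
            break
    return labels, centroids
```

unlike som, the cluster count n_clusters must be fixed in advance, which is exactly the difference discussed in the comparison below.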
unlike the som algorithm, the number of clusters in the k-means algorithm must be specified before testing. the parameters tested in k-means are the number of clusters and the stopping criterion (e). table 6 and table 7 show the testing results of the k-means algorithm based on the silhouette coefficient and the dbi. all values written in the tables are the best values for each group of tests.

table 4. maximum error testing
e      average of silhouette coef   best of silhouette coef   average of dbi   best of dbi   number of cluster
0.01   -0.9889                      -0.92231                  0.883414         0.1839        3
0.05   -0.79485                     -0.4469                   0.532481         0.237         2
0.1    -0.55797                     -0.03704                  0.558784         0.0712        2
0.15   -0.61825                     -0.052803                 0.568891         0.07284       2
0.25   -0.67872                     -0.11946                  0.548464         0.10143       2
0.3    -0.77888                     -0.382642                 0.756946         0.04852       2
0.35   -0.69045                     -0.087995                 0.688443         0.0835        2
0.4    -0.80267                     -0.381318                 0.8683           0.0781        2
0.45   -0.28458                     -0.436725                 0.401703         0.0513        2
0.5    -0.31293                     -0.250545                 0.463836         0.0491        2

table 5. map size testing
map size   average of silhouette coef   best of silhouette coef   average of dbi   best of dbi   number of cluster
10 x 10    -0.9786                      -0.8502                   0.871986         0.1039        2
15 x 15    -0.86143                     -0.37508                  0.619            0.197         2
20 x 20    -0.55797                     -0.03704                  0.558784         0.0712        2
25 x 25    -0.84044                     -0.34323                  0.653924         0.12244       2
30 x 30    -0.48226                     -0.351387                 0.151423         0.05173       2
35 x 35    -0.67708                     -0.019051                 0.583639         0.2163        2

based on the tests performed using k-means, the best silhouette is obtained when the number of clusters is specified as n = 6 and e = 0.5. however, the number of clusters actually formed as the clustering result does not exceed n = 3. in testing the k-means method, almost all of the tested parameter values produce a negative silhouette coefficient value. this indicates that the resulting clusters are not right and it is still difficult to distinguish one cluster from another. the best number of clusters obtained is 3 clusters.
c.
som and k-means comparison
based on the test results, both som and k-means still have difficulty achieving good clustering results. this is shown by the negative average silhouette coefficient values, indicating that many data points are assigned to the wrong cluster. however, the best silhouette coefficient achieved by som (0.4367) is better than that of k-means (0.236). in this case, som has a better ability to build valid clusters compared to k-means. the som algorithm represents the data in the form of a two-dimensional topology map. in som, data is placed on neurons. as a result, the internal distance among the members of a cluster and the distance between clusters are easier to measure. in terms of data distribution, som shows better performance compared to k-means. in som, each cluster can be clearly identified by searching for the grid distance between clusters on the map, so that the distance between clusters can be calculated easily and clearly. in clustering the data, som compares the input data vectors to the weights of the neurons, whereas in k-means the input data vector is compared to the value of the centroid. the centroid values in k-means are updated in each iteration with the average value of the members' features. thus, a centroid does not always indicate a point in the dataset. this may create difficulties in determining the cluster area and cluster distribution. although som shows better performance compared to k-means, it still needs further improvement for clustering high-dimensional data. this is because in som, determining the winning neuron is done by calculating the similarity between the data and the weights of the neurons
fig. 2. the best clustering visualization using som
table 6.
k-means clustering result based on the silhouette coefficient
n-cluster   e=0.01      e=0.05      e=0.1        e=0.2      e=0.5
c=3         -0.114403   -0.438083   -0.398426    -0.374     0.072083
c=4         -0.481474   -0.357111   -0.195223    -0.05731   -0.21398
c=5         -0.459957   -0.180032   -0.614222    -0.00155   -0.45512
c=6         -0.268694   -0.855462   -0.1104888   -0.06809   -0.236027

table 7. k-means clustering result based on the dbi
n-cluster   e=0.01      e=0.05      e=0.1       e=0.2      e=0.5
c=3         0.4973335   0.4515325   0.4848922   0.408955   0.51021
c=4         0.3420642   0.4083031   0.4619643   0.423122   0.529518
c=5         0.4847796   0.5156256   0.449523    0.608689   0.553155
c=6         0.4619643   0.2885393   0.2954057   0.513923   0.516404

by using the calculation of the euclidean distance. in calculating the euclidean distance, all features are weighted equally, whereas in high-dimensional data not all features are relevant. considering all features in the calculation can therefore actually disrupt the formation of a valid clustering result. in fact, some features in high-dimensional data can be referred to as noise [23]. considering the characteristics of the data, som can be modified by using different distance measurements, for example the manhattan distance [24][25][26].
iv. conclusion
som can be used to cluster housing data and successfully shows better performance compared to the k-means algorithm. som outperforms k-means in terms of visualizing high-dimensional data clustering. in other words, it provides an easier calculation to obtain the cluster validity. in addition, som also shows better performance in the process of forming good clusters, as indicated by its better silhouette coefficient and dbi values. however, som still needs some improvements to produce better clustering results.
acknowledgement
the authors would like to thank andrini cahyaningrum pratiwi for helping in the data collection stage.
in addition, the authors also wish to acknowledge oliver beattie for providing the openly accessible interface which was used in this research.
references
[1] bank of indonesia, “residential property price survey,” 2017. [online]. available: http://www.bi.go.id/id/publikasi/survei/harga-properti-primer/pages/shpr-tw.iv-2016.aspx. [accessed: 14-mar-2017].
[2] r. a. rahadi, s. k. wiryono, d. p. koesrindartotoor, and i. b. syamwil, “factors influencing the price of housing in indonesia,” international journal of housing market analysis, vol. 8, no. 2, pp. 169–188, 2015.
[3] r. füss and j. zietz, “the economic drivers of differences in house price inflation rates across msas,” journal of housing economics, vol. 31, pp. 35–53, 2016.
[4] w. t. lim, l. wang, and y. wang, “singapore housing price prediction using neural networks,” 12th international conference on natural computation, fuzzy systems and knowledge discovery, pp. 518–522, 2016.
[5] y. feng and k. jones, “comparing multilevel modelling and artificial neural networks in house price prediction,” 2015 2nd ieee international conference on spatial data mining and geographical knowledge services, pp. 108–114, 2015.
[6] j. j. wang et al., “predicting house price with a memristor-based artificial neural network,” ieee access, 2018.
[7] y. yu, s. song, t. zhou, h. yachi, and s. gao, “forecasting house price index of china using dendritic neuron model,” 2016 international conference on progress in informatics and computing, pp. 37–41, 2016.
[8] a. varma et al., “house price prediction using machine learning and neural networks,” 2018 second international conference on inventive communication and computational technologies, pp. 1936–1939, 2018.
[9] s. lu, z. li, z. qin, x. yang, r. siow, and m. goh, “a hybrid regression technique for house prices prediction,” ieee international conference on industrial engineering and engineering management, pp. 319–323, 2017.
[10] f. tan, c. cheng, and z.
wei, “time-aware latent hierarchical model for predicting house prices,” international conference on data mining, pp. 1111–1116, 2017.
[11] y. li, q. pan, t. yang, and l. guo, “reasonable price recommendation on airbnb using multi-scale clustering,” chinese control conference (ccc), vol. 2016–august, pp. 7038–7041, 2016.
[12] r. e. febrita, a. n. alfiyatin, h. taufiq, and w. f. mahmudy, “data-driven fuzzy rule extraction for housing price prediction in malang, east java,” 9th international conference on advanced computer science and information systems, 2017.
[13] m. f. mukhlishin, r. saputra, and a. wibowo, “predicting house sale price using fuzzy logic, artificial neural network and k-nearest neighbor,” 1st international conference on informatics and computational sciences (icicos), vol. 1, pp. 171–176, 2017.
[14] d. yi, j. su, c. liu, and w. chen, “new driver workload prediction using clustering-aided approaches,” ieee transactions on systems, man, and cybernetics: systems, vol. 49, no. 1, pp. 64–70, 2019.
[15] a. arshad, s. riaz, l. jiao, and a. murthy, “semi-supervised deep fuzzy c-mean clustering for software fault prediction,” ieee access, vol. 6, pp. 25675–25685, 2018.
[16] a. arshad, s. riaz, l. jiao, and a. murthy, “the empirical study of semi-supervised deep fuzzy c-mean clustering for software fault prediction,” ieee access, vol. 6, pp. 47047–47061, 2018.
[17] c. faure, m. olteanu, j. m. bardet, and j. lacaille, “using self-organizing maps for clustering and labelling aircraft engine data phases,” 12th international workshop on self-organizing maps, learning vector quantization, clustering and data visualization (wsom 2017) proceedings, 2017.
[18] m. b. gorzalczany and f. rudzinski, “generalized self-organizing maps for automatic determination of the number of clusters and their multiprototypes in cluster analysis,” ieee transactions on neural networks and learning systems, pp.
1–13, 2017.
[19] n. kamiura, s. kobashi, m. nii, t. yumoto, and k. sorachi, “application of self-organizing maps to data classification and data prediction for female subjects with unhealthy-level visceral fat,” 2016 ieee international conference on systems, man, and cybernetics, pp. 001815–001820, 2016.
[20] “draw radius circles on a map.” [online]. available: http://obeattie.github.io/gmaps-radius/?
[21] r. e. febrita, “published house dataset,” 2018. [online]. available: wayanfm.lecture.ub.ac.id/files/2018/10/publisheddataset-ruth-ema.xlsx.
[22] t. kohonen, “the self-organizing map,” proceedings of the ieee, vol. 78, no. 9, pp. 1464–1480, 1990.
[23] c. peng, z. kang, m. yang, and q. cheng, “feature selection embedded subspace clustering,” ieee signal processing letters, vol. 23, no. 7, pp. 1018–1022, 2016.
[24] a. b. rathod, “a comparative study on distance measuring approaches for permutation representations,” 2016 ieee international conference on advances in electronics, communication and computer technology (icaecct), pp. 251–255, 2016.
[25] l. greche, m. jazouli, n. es-sbai, a. majda, and a. zarghili, “comparison between euclidean and manhattan distance measure for facial expressions classification,” 2017 international conference on wireless technologies, embedded and intelligent systems (wits), pp. 1–4, 2017.
[26] j. p. singh and n. bouguila, “proportional data clustering using k-means algorithm: a comparison of different distances,” 2017 ieee international conference on industrial technology (icit), pp. 1048–1052.
knowledge engineering and data science (keds) pissn 2597-4602 vol 2, no 1, june 2019, pp.
10–18 eissn 2597-4637 https://doi.org/10.17977/um018v2i12019p10-18
©2019 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id
this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/)
the diffusion of ict for corruption detection in open government data
darusalam a, 1, *, jamaliah said a, 2, normah omar a, 3, marijn janssen b, 4, kazi sohag c, 5
a accounting research institute, universiti teknologi mara, malaysia, level 12, menara sultan abdul aziz shah, universiti teknologi mara, 40450 shah alam, selangor, malaysia
b department of technology, policy and management, delft university of technology, building 31, jaffalaan 5, 2628 bx delft, netherlands
c graduate school of economics and management, ural federal university, russia, lenin ave, 51, yekaterinburg, sverdlovskaya oblast', 620075, russia
1 darusalam85@gmail.com*; 2 jamaliah533@salam.uitm.edu.my; 3 normah645@salam.uitm.edu.my; 4 m.f.w.h.a.janssen@tudelft.nl; 5 sohagkaziewu@gmail.com
* corresponding author
i. introduction
recently, open data has become popular due to the drastic growth of information technology [1]. government agencies, state-owned companies, and nonprofit organisations have started initiatives to open their data to enhance transparency and accountability toward their stakeholders [1]. open data enables the public to access data freely and subsequently to monitor and participate in government activities. some developing countries have already implemented open data for fighting corruption. information and communication technology (ict) development enables the opening of data to create transparency and has the potential to create anti-corruption tools.
[2] argues that there are five significant ways in which ict can help reduce corruption risks: raising awareness of specific governance problems (types of corruption); providing low-cost online platforms to monitor and promote more inclusive, transparent, and accountable decision-making, which as a result can reduce the cost of distributing, accessing, and collecting government information [3]; reducing the incentives for corruption by reducing the direct contact and familiarity between end-users and decision-makers; enabling more effective control of financial transactions that may put the integrity of politically exposed agents (individual or collective) at stake; and speeding up public awareness of anti-corruption campaigns.
article info
article history: received 29 november 2018; revised 25 april 2019; accepted 13 may 2019; published online 23 june 2019
abstract
corruption occurs in many places within the government. to tackle the issue, open data can be used as one of the tools for creating more insight into the government. the premise of this paper is to support the notion that opening data can bring up new ways of fighting corruption. the current paper aims at investigating how open data can be employed to detect corruption. this is not trivial due to challenges such as information asymmetry among stakeholders: data might only be opened partly, different sources of data need to be combined, and data might not be easy to use and might be biased or even manipulated. the study was conducted using a literature review approach. the review implies that corruption can be detected using open government data; thus, by applying the open data technique within the government, the public could monitor the activities of the governments. the practical contribution of this paper is expected to assist the government in detecting corruption by using a data-driven approach.
furthermore, the scientific contribution will originate from the development of a framework reference architecture to uncover corruption cases. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).
keywords: open data; anti-corruption; activity theory; public sector; private sector; information and communication technology
darusalam et al. / knowledge engineering and data science 2019, 2 (1): 10–18 11
corruption is a problem in the public and private sector and comes in many shapes and forms, including: bribery [4], embezzlement [5], theft [6], extortion [7], abuse of power [8], discretion [9], favoritism [10], conflicting interest [11], and improper political contribution [12]. the consequences of corruption are diverse. corruption can harm society and result in increased poverty, diminish the money available for essential government services, destroy citizens' trust in government, and undermine economic growth. open data has been successfully implemented as a mechanism to fight against corruption. the original mission of open data is to enhance credibility, the democratisation of government decisions, participation, and the promotion of a global culture of transparency and accountability. for example, timor leste, a small island country close to indonesia with 1 million inhabitants, is using financial data to monitor and control corruption [13]. thus, this study aims to investigate how open data can be used to detect corruption. this paper explores the literature on ict development and open data as a mechanism to fight against corruption. in this paper, the institutional setting will be taken into account, but the change of culture is outside the scope. the following section of this research provides background information regarding the developments, including open government data (ogd), data-driven detection of corruption, and an overview of stakeholders. next, the challenges for data-driven corruption detection are described.
after that, the problem statement and research objective are presented. finally, this proposal discusses the research phases and research method.
ii. methods
this study reviews past literature related to the role of open data in mitigating corruption. the literature review is focused on 1) open government data, 2) information architecture, and 3) corruption detection. the critical concepts determined for the literature review include open government data, corruption, and accountability. articles were found by using scopus, jstor, the acm digital library, and google scholar. snowballing will be used, examining the citations in the identified articles and adding those articles. a literature review is one of the critical elements of research and generally serves as a foundation of a research project [14]. webster and watson's 2002 study [15] stated that new researchers often use the literature review method as nothing more than grouping some papers and summarizing them, or compiling multiple research manuscripts into an annotated bibliography. the definition of a literature review given by various scholars is “the use of ideas in the literature to justify the particular approach to the topic, the selection of methods, and demonstration that this research contributes something new” (hart's study, cited in [15]). hart (1998) also noted that for the literature review, “quality means appropriate breadth and depth, rigour and consistency, clarity and brevity and effective analysis and synthesis” (cited in [15]). [15], in reporting shaw's study, emphasised that a literature review should “explain how one piece of research builds on another”. in line with the above definitions, webster and watson [15] (2002) define the literature review as one that “creates a firm foundation for advancing knowledge. it facilitates theory development, closes areas where a plethora of research exists, and uncovers areas where research is needed”.
According to [15], an effective literature review should follow these steps: first, conducting a methodological analysis and synthesis of the quality of the literature; second, providing a firm foundation for a research topic; third, providing a firm foundation for the selection of the research methodology; and finally, demonstrating the usefulness of the proposed research to the overall body of knowledge, or advancing the field's knowledge base. Figure 1 describes the three-step process of a literature review, consisting of 1) inputs, 2) processing, and 3) outputs, and gives an overall view of the proposed process. This proposal uses a literature review as the first step to define the critical constructs of the study and to identify factors that influence the use of open data to detect corruption. Webster and Watson (2002) [15] stated that a useful, quality literature review is based on a concept-centric rather than a chronological or author-centric approach. When reviewing and writing the literature review, researchers must ask themselves whether the presented articles are related to the study [15].

III. Results and Discussion

This part presents a global overview of how ICT, and especially open data, can be used for mitigating corruption.

A. Opening the Government Data

The former president of the United States of America, Barack Obama, stated in a memorandum on transparency and open government that the government should ensure public trust and establish a system of transparency, public participation, and collaboration [16]. Openness will strengthen our democracy and promote efficiency and effectiveness in government.
The European Commission also states that the availability of raw data and documents in various readable formats and languages may maximise the re-use value of public sector information (PSI) (European Commission, 2010, p. 5). Therefore, the government should provide the most practical data for users. Data should be available, for free, over the internet in open, structured, machine-readable formats to anyone who wants to use it [17]. According to [18], open data is freely accessible internet data that can be re-used without limitation. [19] states that everyone can freely see, use and pass on open data to others. Open data can be freely used, re-used and distributed by anyone, who in turn makes their own work available to be shared as well [20]. These definitions show that the main characteristics of open data are (re)use, machine-readability and access. Machine-readability ensures that a massive amount of data can be processed automatically. Open data has the potential to support the detection of corruption; however, how this can be done is not yet known. Data collection, current systems, administrative processes and institutional arrangements might need to be changed. A reference information architecture capturing administrative processes, data, software systems and organisational principles can help to develop systems that support the detection of corruption. Open data has become an important and growing topic in many developing countries [21] because it can improve transparency, accountability and citizen participation [22]. [22] argues that there are five primary drivers of open data initiatives in developing countries. First, an open data initiative can be motivated by politicians' desire to improve the information flow within the government and with other stakeholders, reducing administrative burden, costs and inefficiencies.
Second, increased accountability may strengthen government policies by giving better information on regional, local or sectoral government activities. The third driver for an open data initiative is pressure from civil society, the media, parliamentarians or private companies. The fourth is international pressure to create data transparency. The last driver is the government's motive to gain a reputation for transparency.

Fig. 1. The three stages of an effective literature review process [15]

B. ICT for Corruption Mitigation

This part presents a global overview of how ICT can be used for mitigating corruption. There are three types of ICT indicators: ICT access, ICT use and ICT skill. We focus only on ICT access, which consists of five proxies: fixed telephone subscriptions, mobile phones, internet bandwidth, computers used, and internet access. [2] argues that there are five major ways in which ICT can help reduce corruption risks: 1) raising awareness of specific governance problems (types of corruption); 2) providing low-cost online platforms to monitor and promote more comprehensive transactions and accountable decision-making, which can reduce the cost of distributing, accessing and collecting government information [3]; 3) reducing the incentives for corruption by reducing direct contact and familiarity between end-users and decision-makers; 4) enabling more effective control of financial transactions that may put the integrity of politically exposed agents (individual or collective) at stake; and 5) speeding up public awareness of anti-corruption campaigns. ICT, or information and communications technology, refers to means of connecting to information through electronic technologies such as the internet, wireless networks, cell phones and other communication media [23][24].
Some prior studies argue that ICT has a positive relationship with fighting corruption and improving the quality of governance. Previous studies indicated that ICT can provide countries with new methods of creating transparency and endorsing anti-corruption [3][25][26][27][28][29][30][31]. Consequently, a large number of countries try to implement and connect transparency with ICT-based initiatives, for example through e-government [32]. Furthermore, according to [33], ICT has successfully supported the quality of governance in India. ICT supports the decisions of public administrators by improving the planning and monitoring of programs for more transparent public services through access to information and knowledge, for instance the use of GIS (geographical information systems) for planning the location of rural facilities or identifying disaster areas. Another example is the use of the telephone to foster socio-economic development [34]: people can reduce their communication costs, since the telephone minimises the number of communication links among parties. Much of the ICT literature has commented on ICT approaches to reducing corruption. For instance, [30] notes that ICT can help reduce corruption by implementing good governance, strengthening reform-oriented initiatives, reducing the potential for corrupt actions, improving the connection between citizens and public employees, permitting citizens to follow government activities, and monitoring and controlling the actions of public employees. Also, [35] stated that, to succeed in reducing corruption, ICT initiatives must be designed to disclose information; as a result, society, NGOs, researchers and politicians can track the decisions and actions taken by government employees. At the same time, some governments see the implementation of ICT as a resource to encourage efficiency and transparency [36].
In general, ICT is envisioned as a practical anti-corruption tool, although social culture can reduce its effectiveness [29]. Statistical analyses and case studies show that ICT contributes a great deal to reducing corruption. In particular, ICT can improve the effectiveness of managerial and internal controls over fraudulent behaviour, as well as endorsing transparency and accountability in government [30]. A study by [37] analysing corruption data across ICT-enabled e-government initiatives concluded that corruption could be reduced significantly by implementing e-government, "even after controlling for any propensity for corrupt governments to be more or less aggressive in adopting e-government initiatives" [37, p. 210]. Other studies examined the successes of e-government in reducing corruption in countries across the Americas, Europe and Asia [21][30]. The most prominent successes of e-government in fighting corruption are in the areas of taxes and government contracts [12]. For example, in India, putting property records online in rural areas significantly improved the speed at which records are retrieved and updated, while at the same time erasing the opportunities for local employees to receive bribes, as had previously been widespread [21]. This online property record system, the Bhoomi electronic land record system in Karnataka, India, is estimated to have saved 7 million in bribes to local employees in its first several years. Before the system was implemented, a transfer required a bribe of Rs. 100 to a local employee, whereas the electronic system charges only Rs. 2 [38]. Similarly, Pakistan used e-government for tax transactions: the government restructured the entire tax system and department into an e-government structure, with the aim of decreasing face-to-face contact between tax employees and citizens and thereby decreasing the chances of requests for bribes [37].
Also, the Philippines' Department of Budget and Management created an e-procurement system for bidding on government contracts, both to avoid price fixing and to provide public accountability. In the same way, in Chile an e-procurement system was established to allow the public and citizens to see and compare the costs and services of the bids purchased by the government. The e-procurement system provides 500 outsourced services from more than 6,000 providers [30] and is estimated to save $150 million US per year by avoiding price fixing or inflation by corrupt officials and contractors. Moreover, this new system has made a positive contribution to reducing corruption and allows small businesses to participate in the government bidding process [39]. In Fiji, the success of e-government in reducing corruption has built a positive public perception of government integrity; as a result, it improves the responsiveness of public employees in providing better services to citizens [40]. In addition, the United States has established websites that permit access to government expenditure data, for instance Recovery.gov, general funds (USAspending.gov), and information technology funds (IT.USAspending.gov); the purpose is to involve society in controlling government spending, for earlier identification and removal of wasteful projects [16]. Some US states have adopted similar websites involving citizens in controlling and monitoring government spending for waste and fraud. Additionally, some US government websites permit the tracking of transactions, so that it is possible to monitor the progress of one's application or a request for government services. For example, the US Citizenship and Immigration Services (USCIS) gives immigrants access to check their application progress.
Similarly, the U.S. Department of State allows passport seekers to check the progress of their applications. These services allow a significant number of users (such as citizens, residents and immigrants) to monitor the progress of their applications online; as a result, they save users time, improve efficiency and offer reasonable timeframes for the processing of services, documents and resources [12]. Thus, [2] argues that some conditions should be in place to make ICT a weapon against corruption, such as the training of public officials and suitable institutional arrangements. The latter are necessary for the ability to conduct further analysis and take interventions after spotting possible corruption. Since corruption is rooted in culture, ICT alone is not sufficient to reduce it; it should be combined with public governance, institutions, media and society [2].

C. Open Data-driven Corruption Detection

More and more data is available that can be used to detect possible corruption. The availability of a large volume of data is often called 'big data' [41]. Corruption is often hard to detect and to observe. Anomalies, outliers or changes in patterns might be a sign of corruption. It is likely that multiple data sources need to be combined to be able to detect corruption. Each of these sources might provide complementary insight. Different sources might not be consistent, or might show a different picture of the situation, which makes analysis difficult. Differences might be a sign of corruption, but can also be due to problems in data collection. The availability of data is related to datafication. Datafication can be defined as "the ability to quantify all sorts of information into machine-readable data format" [42]; it has resulted in the need to develop new capabilities to handle large volumes of data. The use of sensors and the Internet of Things (IoT) enables the collection of data at the source.
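As a hedged illustration of the idea that anomalies and outliers can be candidate corruption signals, a simple z-score test can flag unusual records in a series of payments. The data, the threshold and the function name below are invented for illustration; they are not from the paper, and a flagged record is only a lead for investigation, not proof of corruption.

```python
# Minimal sketch: flag records far from the mean as *candidate* anomalies.
from statistics import mean, stdev

def flag_outliers(amounts, threshold=3.0):
    """Return indices of amounts more than `threshold` sample standard
    deviations away from the mean (a simple z-score test)."""
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [i for i, a in enumerate(amounts) if abs(a - mu) / sigma > threshold]

# Hypothetical monthly payments to one contractor; the spike merits a look.
payments = [102, 98, 101, 99, 100, 97, 103, 100, 99, 101, 100, 480]
print(flag_outliers(payments))  # → [11]
```

In practice, such a test would be only one step in a larger pipeline, combined with the cross-source checks the text goes on to describe.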
By collecting data directly, the chance of data manipulation is reduced. The creation of new content, connectivity, analysis software and infrastructure may continuously extend datafication. One of the most significant changes is the transformation from closed data to an open, interconnected world in which the traditional roles of, and relations between, sectors are changing [18]. Examples of IoT data sources include pollution measurements, video cameras, the weighing of goods, and counts of cars (or people) passing. This kind of information might be useful for detecting corruption: for example, more cars might have passed than tolls have been paid. Data-driven corruption detection is a complex process in which data need to be collected, processed, analysed and acted upon, and in which many stakeholders can be involved. Figure 2 shows schematically how corruption can be detected by collecting and using data. First, data is collected from the process under scrutiny, in which various actors can play a role. Different types of data can be collected at various points in time and from multiple stakeholders, and in different areas, including budgeting, actual spending, policy programmes and so on. The categories of data that are collected largely determine whether it is possible to detect corruption.

D. The Information Architecture, Modelling, and Principles

Mechanisms for information sharing need to be in place to detect corruption. A public organisation needs guidance to develop systems and processes for disclosing data to detect corruption. In this research, the focus is on developing a reference architecture for detecting corruption; using a reference information architecture, organisations can develop systems that can be used to detect corruption. There are various views on what constitutes an information architecture.
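The toll example above (more cars counted than tolls paid) can be sketched as a cross-source consistency check combining two data sets. All data, field names and the tolerance value here are invented assumptions for illustration only:

```python
# Hypothetical sources: IoT vehicle counts vs. toll receipts per day.
iot_counts = {"2019-01-01": 1210, "2019-01-02": 1305, "2019-01-03": 1190}
toll_receipts = {"2019-01-01": 1208, "2019-01-02": 1020, "2019-01-03": 1191}

def suspicious_days(counts, receipts, tolerance=0.02):
    """Flag days on which paid tolls fall short of counted vehicles by
    more than `tolerance` (small gaps are normal sensor noise)."""
    flagged = []
    for day, n_cars in counts.items():
        n_paid = receipts.get(day, 0)
        if n_cars - n_paid > tolerance * n_cars:
            flagged.append(day)
    return flagged

print(suspicious_days(iot_counts, toll_receipts))  # → ['2019-01-02']
```

As the text notes, such a discrepancy might indicate corruption but might equally stem from data-collection problems, so a flagged day is a starting point for analysis rather than a conclusion.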
In general, an information architecture (IA) is a coherent whole of principles and models capturing elements such as organisational structure, business processes, data, and IT infrastructure. It is a formal description of a system, or a detailed plan of the system at the component level to guide its implementation, together with the structure of components, their inter-relationships, and the principles and guidelines governing their design and evolution over time [43]. A reference architecture, furthermore, is defined as a set of principal guidelines for the implementation and structure of a system and its components, simultaneously applicable to multiple related specific systems with explicit variation points (Taylor, Medvidovic, & Dashofy, 2009, p. 58). [43] explains the advantages of using a reference architecture as follows: 1) accelerating whole or partial system analysis; 2) improving reusability and connectivity; and 3) decreasing mistakes and errors. An information architecture should provide a blueprint of the desired situation and an overall plan for the implementation. The architecture can be used by stakeholders to make decisions regarding system development strategies. Software architecture, enterprise architecture and reference architecture differ in their generality and their scope [44]. A reference architecture is a generic architecture, while enterprise architecture and software architecture are specific to a situation: the target of enterprise architecture is the enterprise as a whole, while software architecture covers only a specific solution architecture in its scope. An information architecture describes the relationships between business processes, applications and information sources, aiming at storing, processing, reusing and distributing information across information resources. In other words, an information architecture is the organisation of information to aid information sharing among actors.
However, a reference architecture can be applied at all levels, both solution and enterprise [43]. What a reference architecture should contain varies in the literature. We follow [43]: a reference architecture includes 1) architectural principles, 2) implementation guidelines, and 3) system structures and components. On the other hand, [45] categorises the contents of a reference architecture into three separate classifications: 1) customer context, 2) business architecture, and 3) technical architecture. The customer context consists of customer enterprises and users as well as their interactions; the technical architecture, which provides technology solutions, includes design patterns and technology; the business context comprises a business model and life cycle. According to [46], a reference architecture has three primary elements: (1) statements of technical positions that guide system architects in making technical decisions; (2) an organisation-wide consolidated infrastructure blueprint that gives a blueprint for the overall infrastructure and shows how various enterprise components are hooked together; and (3) individual reference architectures for specific types of systems.

Fig. 2. Data-driven corruption detection: open data collection, data processing, data analysing, and measuring the possibility of corruption

Despite the different contents given by various authors, there are some commonalities that should be included in a reference architecture: (1) the architecture should be both descriptive and prescriptive (a blueprint), describing system structures and components as well as their interactions; and (2) it should give practical guidance for implementation (principles, guidelines, or technical positions). Some authors vary the contents by inserting a business context and a customer context into the reference architecture.
In our work, we consider a reference architecture as an information architecture perspective describing the highest level of abstraction of a system. In terms of modelling languages, there are various languages for architecture modelling, such as ArchiMate and BPMN. The ArchiMate language contains concepts for describing the relationships between architecture descriptions at the business, application, and technology levels; it plays a central role in the ubiquitous problem of business–ICT alignment. ArchiMate conforms to existing languages and standards, such as the Unified Modelling Language (UML), for each architectural domain [47]. The Business Process Modelling Notation (BPMN) is a standard for modelling processes which gives unambiguous symbols and constructs for mapping out processes, resulting in simple, communicative models [48]. BPMN will be used for mapping business processes, in which the entity responsible for executing each task can be modelled using swim lanes. The connections among organisations can be designed by exchanging information, with the modeller choosing the aggregation level at which a service is specified. This links the business processes, organisational responsibilities, and the data stored and exchanged in relation to (sub)processes and tasks. Our focus is on data sources, information exchange and business processes; therefore, BPMN will be adopted as the modelling language. Furthermore, architecture principles can be defined as rules that must be followed; they emphasise "doing the right things" and are expected to give significant improvement [44]. Guidelines are supporting practical guides which often cannot be followed completely and require trade-offs; system structures are the levels of structure of the system that are expected to satisfy the requirements; system components are the components of the system that are expected to satisfy the requirements. Architectural principles are part of the reference architecture.
Principles are defined as rules which have to be followed and which are expected to provide significant improvement [49]. In other words, a principle is a normative, reusable and directive statement that guides architects in designing the capabilities needed to achieve overarching goals. Principles are used where complex or structural problems cannot be formulated in clear, quantitative terms or solved with computational techniques [47]. Principle-based design (PBD) is suitable for information system (IS) design in multi-task environments [48], such as those involving different sets of goals and processes of various users; unfamiliar events and processes in non-routine task environments; different audiences (architects, IT developers, IT auditors, system managers and operators); deep uncertainty; and a wide range of dynamic technical solutions and alternatives. Principles can give guidance throughout the multiple levels of the design process, and architects and IT auditors can use them as a checklist for evaluating an existing information system. The implementation of the reference information architecture is aimed at enabling the design of information architectures that allow corruption detection.

IV. Conclusion

Open data is one of the tools for fighting corruption; opening up government data to the public is an excellent strategy for controlling the public and private sectors. The mission of open data is to improve the credibility, democratisation, transparency and accountability of government decisions. Also, some requirements must be in place to make ICT development a tool for fighting corruption, for example training official staff and reforming institutions; ICT by itself is not sufficient to reduce corruption and must work together with public government, NGOs, the media and society. In this paper, we consider creating a framework of reference architecture as a viewpoint information architecture describing the top level of abstraction of a system.
Future research will have to show whether this framework can be applied in case studies. The practical contribution is aimed at helping governments to detect corruption using a data-driven approach. The scientific contribution will originate from the development of a reference architecture and architectural principles which enable the design of information architectures for detecting corruption.

Acknowledgements

We are indebted to the MORA 5000 Doktor scholarship of the Ministry of Religious Affairs of Indonesia, the State Islamic University of Raden Fatah Palembang, Indonesia, the Accounting Research Institute (ARI, UiTM), Malaysia, and the Delft University of Technology, the Netherlands, for giving us the support needed for this project. We appreciate the reviews and comments made by academicians on earlier drafts of this paper. Many thanks to the government agencies and organizations that participated in the project.

References

[1] M. Janssen, et al., "Benefits, adoption barriers and myths of open data and open government," Information Systems Management, vol. 29, pp. 258-268, 2012.
[2] L. D. Sousa, "Open government and the use of ICT to reduce corruption risks," PPT, 2016.
[3] J. C. Bertot, et al., "Crowd-sourcing transparency: ICTs, social media, and government transparency initiatives," in Proceedings of the 11th Annual International Digital Government Research Conference on Public Administration Online: Challenges and Opportunities, 2010, pp. 51-58.
[4] N. Huijboom and T. van den Broek, "Open data: an international comparison of strategies," European Journal of ePractice, vol. 12, pp. 4-16, 2011.
[5] I. Amundsen, Political Corruption: An Introduction to the Issues. Chr. Michelsen Institute, 1999.
[6] A. Shleifer and R. W. Vishny, "Corruption," The Quarterly Journal of Economics, vol. 108, pp. 599-617, 1993.
[7] J.
Hindriks, et al., "Corruption, extortion and evasion," Journal of Public Economics, vol. 74, pp. 395-430, 1999.
[8] D. Waite and D. Allen, "Corruption and abuse of power in educational administration," The Urban Review, vol. 35, pp. 281-296, 2003.
[9] S. Johnson, et al., "Regulatory discretion and the unofficial economy," The American Economic Review, vol. 88, pp. 387-392, 1998.
[10] J. Leitner, et al., "The debate about political risk: how corruption, favoritism and institutional ambiguity shape business strategies in Ukraine," in EU Crisis and the Role of the Periphery. Springer, 2015, pp. 3-19.
[11] O. Fadiaro, et al., "Coverage of corruption news by major newspapers in Nigeria," Publications of New Media and Mass Communication, vol. 24, pp. 53-59, 2014.
[12] J. C. Bertot, et al., "Using ICTs to create a culture of transparency: e-government and social media as openness and anti-corruption tools for societies," Government Information Quarterly, vol. 27, pp. 264-271, 2010.
[13] CIA, "Central Intelligence Agency," https://www.cia.gov/library/publications/the-world-factbook/geos/tt.html, 2006.
[14] Y. K. Dwivedi, et al., "Completing a PhD in business and management: a brief guide to doctoral students and universities," Journal of Enterprise Information Management, vol. 28, pp. 615-621, 2015.
[15] Y. Levy and T. J. Ellis, "A systems approach to conduct an effective literature review in support of information systems research," Informing Science: International Journal of an Emerging Transdiscipline, vol. 9, pp. 181-212, 2006.
[16] W. House, "Memorandum on transparency and open government," 2009.
[17] D. G. Robinson, et al., "Government data and the invisible hand," Yale Journal of Law & Technology, vol. 11, p. 160, 2009.
[18] T. Jetzek, et al., "Data-driven innovation through open government data," Journal of Theoretical and Applied Electronic Commerce Research, vol. 9, pp. 100-120, 2014.
[19] J.
Kučera, et al., "Open government data catalogs: current approaches and quality perspective," in Technology-Enabled Innovation for Democracy, Government and Governance: Proceedings of the Joint International Conference on Electronic Government and the Information Systems Perspective, and Electronic Democracy (EGOVIS/EDEM 2013), Prague, Czech Republic, 2013, pp. 152-166.
[20] B. Ubaldi, "Open government data," 2013.
[21] S. Bhatnagar, "E-government and access to information," Global Corruption Report, vol. 2003, pp. 24-32, 2003.
[22] C. Schwegmann, "Open data in developing countries," European Public Sector Information Platform Topic Report 2013, vol. 2, 2012.
[23] Collins Dictionaries, Collins English Dictionary. HarperCollins Publishers, 2009.
[24] C. Cobuild, Collins COBUILD Advanced Learner's English Dictionary. HarperCollins Publishers, 2003.
[25] C.-K. Kim, "Anti-corruption initiatives and e-government: a cross-national study," Public Organization Review, vol. 14, pp. 385-396, 2014.
[26] S. Kim, et al., "An institutional analysis of an e-government system for anti-corruption: the case of OPEN," Government Information Quarterly, vol. 26, pp. 42-50, 2009.
[27] D. A. Lalountas, et al., "Corruption, globalization and development: how are these three phenomena related?," Journal of Policy Modeling, vol. 33, pp. 636-648, 2011.
[28] J. Carlo Bertot, et al., "Promoting transparency and accountability through ICTs, social media, and collaborative e-government," Transforming Government: People, Process and Policy, vol. 6, pp. 78-91, 2012.
[29] D. C. Shim and T. H. Eom, "Anticorruption effects of information communication and technology (ICT) and social capital," International Review of Administrative Sciences, vol. 75, pp. 99-116, 2009.
[30] D. C. Shim and T. H. Eom, "E-government and anti-corruption: empirical analysis of international data," International Journal of Public Administration, vol.
31, pp. 298-316, 2008.
[31] V. Tanzi and H. Davoodi, "Corruption, public investment, and growth," in The Welfare State, Public Investment, and Growth. Springer, 1998, pp. 41-60.
[32] J. E. Relly and M. Sabharwal, "Perceptions of transparency of government policymaking: a cross-national study," Government Information Quarterly, vol. 26, pp. 148-157, 2009.
[33] S. Bhatnagar, "Social implications of information and communication technology in developing countries: lessons from Asian success stories," The Electronic Journal of Information Systems in Developing Countries, vol. 1, pp. 1-9, 2000.
[34] P. Palvia, et al., "ICT for socio-economic development: a citizens' perspective," Information & Management, 2017.
[35] S. Bhatnagar, "Transparency and corruption: does e-government help?," draft paper prepared for the compilation of CHRI, 2003.
[36] C. von Haldenwang, "Electronic government (e-government) and development," The European Journal of Development Research, vol. 16, pp. 417-432, 2004.
[37] T. B. Andersen, "E-government as an anti-corruption strategy," Information Economics and Policy, vol. 21, pp. 201-210, 2009.
[38] World Bank, "Anti-corruption," http://www.worldbank.org/en/topic/governance/brief/anti-corruption, 2016.
[39] R. Heeks, "E-government as a carrier of context," Journal of Public Policy, vol. 25, pp. 51-74, 2005.
[40] R. D. Pathak, et al., "E-governance to cut corruption in public service delivery: a case study of Fiji," International Journal of Public Administration, vol. 32, pp. 415-437, 2009.
[41] J. S. Ward and A. Barker, "Undefined by data: a survey of big data definitions," arXiv preprint arXiv:1309.5821, 2013.
[42] V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, 2013.
[43] Y. Gong, Engineering Flexible and Agile Services: A Reference Architecture for Administrative Processes. Delft University of Technology, 2012.
[44] D. Greefhorst and E.
Proper, Architecture Principles: The Cornerstones of Enterprise Architecture. Springer Science & Business Media, 2011.
[45] R. Cloutier, et al., "The concept of reference architectures," Systems Engineering, vol. 13, pp. 14-27, 2010.
[46] P. J. Windley, Digital Identity. O'Reilly Media, Inc., 2005.
[47] W. Janssen, et al., "Business case modelling for e-services," in Proceedings of the 18th Bled eConference: eIntegration in Action, Bled, Slovenia, 2005.
[48] N. Bharosa, et al., "An activity theory analysis of boundary objects in cross-border information systems development for disaster management," Security Informatics, vol. 1, p. 1, 2012.
[49] N. Bharosa and M. Janssen, "Principle-based design: a methodology and principles for capitalizing design experiences for information quality assurance," Journal of Homeland Security and Emergency Management, vol. 12, pp. 469-496, 2015.

Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 6, No 2, October 2023, pp. 145–156 eISSN 2597-4637
https://doi.org/10.17977/um018v6i22023p145-156
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Evidence of Students' Academic Performance at the Federal College of Education Asaba Nigeria: Mining Education Data

Arnold Adimabua Ojugo a,1,*, Christopher Chukwufunaya Odiakaose b,2, Frances Emordi c,3, Rita Erhovwo Ako a,4, Winifred Adigwe d,5, Kizito Eluemonor Anazia e,6, and Victor Geteloma a,7

a Department of Computer Science, Federal University of Petroleum Resources, PMB 1221 FUPRE Road, Ugbomro, Effurun 330102, Nigeria
b Department of Computer Science, Faculty of Information Technology, Dennis Osadebay University, Bonsaac, Pantor Drive, Anwai-Asaba 320006, Nigeria
c Department of Cybersecurity, Faculty of Information Technology, Dennis Osadebay University, Bonsaac, Pantor Drive, Anwai-Asaba 320006, Nigeria
d
department of computer science, faculty of information technology, university of science and technology ozoro kwale rd, ozoro 334113, nigeria e department of information technology, faculty of information technology, university of science and technology ozoro kwale rd, ozoro 334113, nigeria 1ojugo.arnold@fupre.edu.ng*; 2osegalaxy@gmail.com; 3frances.emordi@dou.edu.ng; 4ochukorita2@gmail.com; 5adigwew@dsust.edu.ng; 6anaziake@dsust.edu.ng; 7vochuko@gmail.com * corresponding author i. introduction the advent of data technology in various fields has led to massive volumes of data in various forms like files, audio, videos, images, and many new data formats [1][2]. data from diverse applications require a correct method of extracting knowledge from large repositories for better decision-making [3]. knowledge discovery aims at deriving valuable, meaningful information from a collection of data [4][5]. knowledge mining uses various methods and algorithms to extract various forms of data. data processing and mining for knowledge discovery tools have since recorded tremendous success in their impact [6][7][8], and have become an essential facet of various organizations [9][10][11][12]. data processing techniques draw on the fields of statistics, databases, machine learning, pattern recognition, ai, and computational competencies. there is growing research interest in using educational data mining. this recently evolving field, called educational data mining, concerns developing approaches that discover knowledge from data originating from educational environments [13][14][15][16]. educational data mining uses techniques like decision trees, neural networks, naïve bayes, and k-nearest neighbors [17]. these techniques reveal many sorts of knowledge, like association rules, classifications, and clustering [18][19].
article info
article history: received 04 october 2023; revised 09 october 2023; accepted 14 october 2023; published online 19 october 2023
abstract: one main objective of higher education is to provide quality education to its students. one way to achieve the highest level of quality in the higher education system is by discovering knowledge for prediction regarding the enrolment of students in a particular course, alienation of the traditional classroom teaching model, detection of unfair means used in online examinations, detection of abnormal values in the result sheets of students, and prediction of students’ performance. this knowledge is hidden in the educational data set and is extractable through data mining techniques. the present paper is designed to justify the capabilities of data mining techniques in the context of higher education by offering a data mining model for the higher education system in the university. in this research, the classification task is used to evaluate students’ performance, and as many approaches can be used for data classification, the decision tree method is used here. by this, we extract data that describes students’ summative performance at semester’s end, helps to identify dropouts and students who need special attention, and allows the teacher to provide appropriate advising/counseling. this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).
keywords: academic performance; bayesian network; educational data; decision tree; summative testing
the revealed knowledge is used in prediction about the enrolment of students in a particular course, alienation of the traditional classroom teaching model, detection of unfair means used in online examinations, detection of abnormal values in the result sheets of students, prediction of students’ performance, and so on [20][21][22][23]. the study uses data mining methodologies to investigate students’ performance in the various courses. data mining offers many tasks to investigate student performance; among such tasks is classification [24] – we seek to study students’ performance using decision tree classification. data such as class tests, attendance, assignment marks, and examination score(s) were collected and used to predict performance at the end of the semester. ii. methods data mining is often utilized in the educational field to enhance our understanding of the learning process, with a focus on identifying, extracting, and evaluating variables associated with the learning process of students, as described by many scholars. mining in an academic environment is named educational data mining. data mining in education is a recent research field, and this area of research is gaining popularity due to its potential for educational institutes. shiokawa et al. [25] describe data mining transactions as those that permit users to analyze data from different dimensions, categorize it, and summarize the relationships identified during the mining process. a study was conducted on student performance by selecting 600 students from different colleges of awadh university, faizabad, india; employing bayes classification on category, language, and background qualification, it was found whether newcomer students would perform well or not [26]. ahmad et al.
[27] conducted a study on student performance by selecting 300 students (225 males, 75 females) from a group of colleges affiliated with punjab university of pakistan. the hypothesis stated as "student's attitude towards attendance in class, hours spent in study per day after college, student's family income, student's mother's age, and mother's education are significantly related with student performance" was framed. employing simple linear regression analysis, it was found that factors like the mother’s education and the student’s family income were highly correlated with the student's academic performance. brindlmayer [28] conducted a performance study on 400 students comprising 200 boys and 200 girls selected from the senior secondary school of aligarh muslim university, aligarh, india, with the main objective of determining the prognostic value of different measures of cognition, personality, and demographic variables for success at the higher secondary level in the science stream. the selection was based on the cluster sampling technique, in which the whole population of interest was divided into groups or clusters, and a random sample of those clusters was selected for further analysis. it was found that girls with high socio-economic status had relatively higher academic achievement in the science stream, and boys with low socio-economic status generally had relatively higher academic achievement. nguyen et al. [29] gave a case study using student data to analyze their learning behavior to predict results and warn students at risk before their final exams. they applied a decision tree model to predict the final grade of students who studied the c++ course at yarmouk university, jordan, in 2015. three classification methods were used, namely id3, c4.5, and naive bayes. their results indicated that the decision tree model made better predictions than the others.
also, nilam [30] conducted a study on student performance by selecting 60 students from a degree college of awadh university in india. through association rules, they found interesting patterns in the choice of class teaching language. he describes using the k-means clustering algorithm to predict students’ learning activities. the knowledge generated after implementing the data mining technique may be helpful for teachers and students. chen [31], in his study on private tutoring and its implications, observed that the share of students receiving private tutoring in india was relatively higher than in malaysia, singapore, japan, china, and sri lanka. it was also observed that there was an enhancement of educational performance with the intensity of private tutoring, and this variation in the intensity of private tutoring depends on a collective factor, namely socio-economic conditions. haipinge et al. [32] conducted a study on student performance by selecting 300 students from 5 different degree colleges conducting the bca (bachelor of computer application) course. through the bayesian classification method on 17 attributes [33][34][35][36], it was found that factors like students’ grades in the senior secondary exam, living location, medium of teaching, mother’s qualification, students’ other habits, family annual income, and students’ family status were highly correlated with students’ academic performance. data mining, also called knowledge discovery in databases (kdd), often refers to digging out [37][38][39] or “mining” knowledge from large amounts of data. data mining techniques are used to operate on vast volumes of data to discover hidden patterns and relationships helpful in decision-making [40][41][42]. while data mining and knowledge discovery in databases are frequently treated as synonyms, data mining is a component of the knowledge discovery process [43][44][45].
the sequence of steps identified in extracting knowledge from data is shown in figure 1. fig. 1. the steps of extracting knowledge from data numerous heuristic techniques, such as classification, clustering, regression, neural networks, association rules, decision trees, and genetic algorithms, have been successfully used for database knowledge discovery. these techniques and procedures in data mining need to be briefly mentioned for a better understanding [46][47][48][49]. regression techniques can be adapted for prediction. regression or multivariate analyses can be used to model the relationship between one or more independent variables and dependent variables. in data mining, independent variables are attributes already known, and response variables are what we would like to predict. unfortunately, many real-world problems are not simply predictive [50][51][52][53]; prediction is not a declaration of something self-evident that can be assumed as the basis for argument. thus, more complex techniques (e.g., logistic regression, decision trees, or neural nets) are used to forecast future values. an equivalent model type can often be used for both regression and classification. for instance, the cart (classification and regression trees) decision tree algorithm can be built both to classify categorical response variables and to forecast continuous response variables. neural networks can also create both classification and regression models [21][54]. classification is the most commonly applied data mining technique, which employs a group of pre-classified examples to develop a model that can classify the population of records at large. this approach frequently employs decision trees or neural network-based classification algorithms. the data classification process involves learning and classification. in learning, the classification algorithm analyzes the training data [55]. in classification, test data are used to estimate the accuracy of the classification rules.
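the two-phase classification process just described (learning from pre-classified training data, then using test data to estimate accuracy) can be sketched in a few lines. the tiny dataset and the 1-nearest-neighbour rule below are illustrative assumptions for the sketch, not the classifier used later in the paper:

```python
# learning phase: "train" a 1-nearest-neighbour classifier by simply storing
# the pre-classified examples (a minimal sketch of the learn/classify split)
def fit(train_rows, train_labels):
    return list(zip(train_rows, train_labels))

def predict(model, row):
    # classify by the label of the closest stored example (squared distance)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda m: dist(m[0], row))[1]

def accuracy(model, test_rows, test_labels):
    # classification phase: test data estimate the accuracy of the learned model
    hits = sum(predict(model, r) == l for r, l in zip(test_rows, test_labels))
    return hits / len(test_labels)

# hypothetical records: (attendance %, assignment score) -> pass/fail
train = [((90, 80), "pass"), ((85, 70), "pass"), ((40, 30), "fail"), ((55, 20), "fail")]
model = fit([r for r, _ in train], [l for _, l in train])
print(accuracy(model, [(88, 75), (45, 25)], ["pass", "fail"]))  # 1.0
```

if the estimated accuracy is acceptable, the learned model is then applied to new, unseen tuples, which is exactly the step the text describes next.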
if the accuracy is acceptable, the rules can be applied to new data tuples [23]. the classifier-training algorithm uses these pre-classified examples to determine the parameters required for correct discrimination. the algorithm encodes these parameters into a model/classifier [26][56][57][58]. the decision tree is a tree-shaped structure representing sets of decisions. these decisions generate rules for the classification of a dataset. specific decision tree methods include classification and regression trees [59][60][61]. clustering can be described as the identification of similar classes of objects. using clustering techniques, we can further identify dense and sparse regions in object space and discover overall distribution patterns and correlations among data attributes [62][63][64]. the classification approach can also be used to distinguish groups or classes of objects effectively but becomes costly, so clustering is often used as a preprocessing approach for attribute subset selection and classification [65][66][67]. association and correlation usually seek out frequent item set findings among large data sets. this finding helps businesses make certain decisions, like catalog design, cross-marketing, and customer shopping behavior analysis. association rule algorithms need to be able to generate rules with confidence values less than one. also, the number of possible association rules for a given dataset is usually extensive, and many rules are usually of little (if any) value [54]. a neural network is a set of connected input/output units, and every connection has a weight associated with it. in its training phase, the network learns by adjusting weights so as to be able to predict the suitable class labels of the input tuples.
neural networks can derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by humans or other computer techniques [25]. they are well suited for continuous-valued inputs and outputs. neural networks are best at identifying patterns or trends in data and are well suited for prediction or forecasting needs [68][69][70][71]. a further technique classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k is greater than or equal to 1); it is called the k-nearest neighbor technique [72][73]. a. technical experimental framework today, a student’s academic performance is determined by internal assessment (formative tests) and end-of-semester (summative tests) examinations. the teacher administers the internal assessment based on students’ performance in educational activities like class tests, seminars, assignments, general proficiency, attendance, and lab work. the student sits the end-of-semester examination during the semester examination period. each student must obtain minimum marks to pass a semester in both the internal and end-of-semester examinations. the dataset used in this study was obtained from the federal college of education (technical) asaba, delta state, by sampling the postgraduate diploma course pde-technical education (post graduate diploma in education, technical education option) from session 2017 to 2020. initially, the dimension of the data is 50 records. in this step, data stored in several tables was joined into a single table; after the joining process, errors were removed. in this step, only those fields were selected which were required for data mining. a couple of derived variables were selected, while some knowledge of the variables was extracted from the database. table 1 gives all the predictor and response variables derived from the database for reference.
table 1. student related variables
items | description | possible values
cgp | cumulative grade point | {distinction >4.50, credit <4.50 & >3.50, merit <3.50 & >2.50, fail <2.50}
tp | teaching practice | {“a” >70, “b” <70 & >60, “c” <60 & >50, “f” <50}
ass | assignment | {yes, no}
ga | general aptitude | {yes, no}
att | attendance | {good, average, poor}
ep | education project | {“a” >70, “b” <70 & >60, “c” <60 & >50, “f” <50}
cgpa | cumulative grade point average | {distinction >4.50, credit <4.50 & >3.50, merit <3.50 & >2.50, fail <2.50}
the domain values for some of the variables were defined for this investigation as follows: • cgp – cumulative grade point obtained at the end-of-semester examinations. cgp is split into four classes: distinction >4.50, credit <4.50 & >3.50, merit <3.50 & >2.50, fail <2.50. • tp – teaching practice performance obtained in the final semester: teaching practice programs are organized to assess the performance of students in teaching as a profession. teaching practice is evaluated into four classes: “a” >70, “b” <70 & >60, “c” <60 & >50, “f” <50. • ass – assignment performance. in each semester, two assignments are given to students by each teacher. assignment performance is split into two classes: yes – student submitted the assignment, no – student did not submit the assignment. • ga – general aptitude. as with seminars, in each semester general proficiency tests are organized. the general proficiency test is split into two classes: yes – student participated in the general proficiency test, no – student did not participate in the general proficiency test. • att – attendance of student. a minimum of 70% attendance is compulsory to participate in the end-of-semester examination. however, in exceptional cases, low-attendance students also participate in the end-of-semester examination for genuine reasons. attendance is split into three classes: poor <60%, average >60% and <80%, good >80%.
• ep – education project. the education project is split into two classes: yes – student completed the education project, no – student did not complete the education project. the education project as a course is a credit load with a grading system of “a” >70, “b” <70 & >60, “c” <60 & >50, and “f” <50. • cgpa – cumulative grade point average obtained in the pde session(s); it has been declared the response variable. it is split into four class values: distinction >4.50, credit <4.50 & >3.50, merit <3.50 & >2.50, and fail <2.50. b. the proposed id3 decision tree classifier a tree in which each branch node represents a choice between some alternatives and every leaf node represents a decision is referred to as a decision tree. decision trees are commonly used for gaining information for decision-making. the decision tree starts with a root node on which users take action. from this node, users split each node recursively according to the decision tree learning algorithm. the final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome. the three widely used decision tree learning algorithms are id3, assistant, and c4.5. the id3 decision tree is a simple decision tree learning algorithm developed by quinlan in 1986. the essential idea of the id3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. we introduce a metric, information gain, to pick the most useful attribute for classifying a given set. to find an optimal way to classify a learning set, we want to minimize the number of questions asked (i.e., minimize the depth of the tree). thus, some function is needed to measure which questions provide the most balanced splitting. the information gain metric is such a function. c.
measuring impurity given a data table containing attributes and the class of the attributes, we can measure the table's homogeneity (or heterogeneity) based on the classes. we say a table is pure or homogeneous if it contains only one class. if a data table contains several classes, then we say that the table is impure or heterogeneous. there are several indices to measure the degree of impurity quantitatively. the most well-known indices to measure the degree of impurity are entropy, the gini index, and classification error. entropy is calculated as in (1). entropy = − ∑_j p_j log2 p_j (1) the entropy of a pure table (consisting of one class) is zero because the probability is 1 and log(1) = 0. entropy reaches a maximum value when all classes in the table have equal probability. the gini index equation is as in (2). gini index = 1 − ∑_j p_j² (2) the gini index of a pure table containing one class is zero because the probability is 1 and 1 − 1² = 0. like entropy, the gini index reaches a maximum value when all classes in the table have equal probability. the formula for classification error can be seen in (3). classification error = 1 − max(p_j) (3) similar to entropy and the gini index, the classification error index of a pure table (consisting of one class) is zero because the probability is 1 and 1 − max(1) = 0. the value of the classification error index is always between 0 and 1. the maximum gini index for a given number of classes is always equal to the maximum classification error index because, for n classes, if we set each probability equal to p = 1/n, the maximum gini index occurs at 1 − n(1/n²) = 1 − (1/n), while the maximum classification error index also occurs at 1 − max{1/n} = 1 − (1/n). d.
splitting criteria we use the measure called information gain to determine the best attribute for a specific node in the tree. the information gain, gain(s, a), of an attribute a relative to a set of examples s is defined as in (4). gain(s, a) = entropy(s) − ∑_{v ∈ values(a)} (|s_v|/|s|) entropy(s_v) (4) where values(a) is the set of all possible values for attribute a, and s_v is the subset of s for which attribute a has value v (i.e., s_v = {s ∈ s | a(s) = v}). the first term in the equation for gain is simply the entropy of the original collection s, and the second term is the expected entropy after s is partitioned using attribute a. the expected entropy is the sum of the entropies of every subset s_v, weighted by the fraction (|s_v|/|s|) of examples that belong to s_v. gain(s, a) is therefore the expected reduction in entropy caused by knowing the value of attribute a. equations (5) and (6) are used for split information and gain ratio. split information(s, a) = − ∑_{i=1}^{n} (|s_i|/|s|) log2 (|s_i|/|s|) (5) gain ratio(s, a) = gain(s, a) / split information(s, a) (6) choosing a new attribute and partitioning the training examples is now repeated for every non-terminal descendant node. attributes incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. this process continues for every new leaf node until either of two conditions is met: every attribute has already been included along this path through the tree [44], or the training examples related to this leaf node all have the same target attribute value (i.e., their entropy is zero) [51]. the listing of the id3 algorithm framework can be seen in pseudocode 1.
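equations (1)–(6) and the id3 procedure can be sketched together in a short program. the helper names are ours, the tree representation is an illustrative choice, and the demo class counts (12 distinction, 15 credit, 17 merit, 6 fail) are the ones reported for the 50-student dataset in the results section:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # eq. (1): entropy = -sum_j p_j log2 p_j (zero-probability terms vanish)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # eq. (2): gini index = 1 - sum_j p_j^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def classification_error(labels):
    # eq. (3): classification error = 1 - max_j p_j
    n = len(labels)
    return 1 - max(c / n for c in Counter(labels).values())

def partition(examples, labels, attr):
    # group the labels of examples by the value they take on attribute attr
    subsets = {}
    for ex, lab in zip(examples, labels):
        subsets.setdefault(ex[attr], []).append(lab)
    return subsets

def gain(examples, labels, attr):
    # eq. (4): entropy(S) - sum_v (|S_v|/|S|) entropy(S_v)
    n = len(labels)
    return entropy(labels) - sum(
        len(s) / n * entropy(s) for s in partition(examples, labels, attr).values())

def split_information(examples, attr):
    # eq. (5): -sum_i (|S_i|/|S|) log2 (|S_i|/|S|)
    n = len(examples)
    counts = Counter(ex[attr] for ex in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(examples, labels, attr):
    # eq. (6): gain / split information
    si = split_information(examples, attr)
    return gain(examples, labels, attr) / si if si else 0.0

def id3(examples, labels, attrs):
    # pure node or no attributes left -> leaf with the most common label
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # greedy top-down choice: the attribute with the highest information gain
    best = max(attrs, key=lambda a: gain(examples, labels, a))
    branches = {}
    for v in partition(examples, labels, best):
        sub = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == v]
        branches[v] = id3([e for e, _ in sub], [l for _, l in sub],
                          [a for a in attrs if a != best])
    return (best, branches)  # internal node: (attribute, {value: subtree})

def classify(tree, example):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]
    return tree

# entropy of the whole 50-example set: 12 distinction, 15 credit, 17 merit, 6 fail
s = ["distinction"] * 12 + ["credit"] * 15 + ["merit"] * 17 + ["fail"] * 6
print(round(entropy(s), 3))  # 1.911
```

with the rows of table 2 loaded as dicts keyed by the table 1 variable names, calling `id3(rows, cgpa_column, ["cgp", "tp", "ass", "ga", "att", "ep"])` should, by the gain values reported later, select cgp at the root.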
pseudocode 1: listing of the id3 algorithm framework
id3 (examples, target_attribute, attributes):
create a root node
if all examples are positive then return the single-node tree root with label = +
else if all examples are negative then return the single-node tree root with label = −
end if
if the set of predicting attributes is empty then return the single-node tree root with label = most common value of the target attribute in the examples
end if
start: a = the attribute that best classifies the examples
decision tree attribute for root = a
for every possible value vi of a, add a new branch below root, corresponding to the test a = vi
let examples(vi) be the subset of examples that have the value vi for a
if examples(vi) is empty then below this new branch add a leaf node with label = most common target value in the examples
else below this new branch add the subtree id3(examples(vi), target_attribute, attributes – {a})
end if
end
iii. result and discussion a dataset of 50 students was used in this study (table 2) as obtained/retrieved from the professional diploma in education, federal college of education (technical) asaba, pde-technical option from session 2017 to 2020 [74]. table 2.
dataset from pde technical from 2016–2019
s.no cgp tp ass ga att ep cgpa
1 distinction distinction yes yes good yes distinction
2 distinction credit yes no good yes distinction
3 distinction credit no no average no distinction
4 credit distinction no no good yes distinction
5 credit credit no yes good yes distinction
6 merit credit no no average yes distinction
7 merit credit no no poor yes credit
8 credit merit yes yes average no distinction
9 merit merit no no poor no merit
10 credit credit yes yes good no distinction
11 distinction distinction yes yes good yes distinction
12 distinction credit yes yes good yes distinction
13 distinction credit yes no good no distinction
14 credit distinction yes yes good no distinction
15 distinction credit yes yes average yes distinction
16 distinction credit yes yes poor yes credit
17 credit credit yes yes good yes credit
18 credit credit yes yes poor yes credit
19 merit credit no yes good yes credit
20 credit merit yes no average yes credit
21 merit credit no yes poor no merit
22 merit merit yes yes average yes merit
23 merit merit no no average yes merit
24 merit merit yes yes good yes credit
25 merit merit yes yes poor yes merit
26 merit merit no no poor yes fail
27 distinction distinction yes yes good yes distinction
28 credit distinction yes yes good yes credit
29 distinction credit yes yes good yes credit
30 distinction distinction yes yes average yes credit
31 distinction distinction no no good yes credit
32 credit credit yes yes good yes credit
33 credit credit no yes average yes merit
34 credit distinction no no good yes merit
35 distinction credit no yes average yes merit
36 credit merit no no average yes merit
37 merit credit yes no average yes merit
38 merit credit no yes poor yes fail
39 credit credit no yes poor yes merit
40 merit merit no no good no merit
41 merit merit no yes poor yes fail
42 merit merit no no poor no fail
43 distinction distinction yes yes good yes credit
44 distinction distinction yes yes average yes credit
45 credit distinction yes yes average yes merit
46 merit merit yes yes average no fail
47 distinction merit no yes poor yes fail
48 merit merit no no poor yes fail
49 credit credit yes yes good yes credit
50 merit distinction no no poor no fail
to compute the information gain for an attribute a relative to s, we first compute the entropy of s. here, s is the set of all 50 examples, with 12 = “distinction”, 15 = “credit”, 17 = “merit”, and 6 = “fail”. so, we have: entropy(s) = −(12/50) log2(12/50) − (15/50) log2(15/50) − (17/50) log2(17/50) − (6/50) log2(6/50) = 1.911. we use the measure called information gain to determine the best attribute for a specific node in the tree. for an attribute a whose values partition s into the four subsets, the information gain gain(s, a) relative to the set of examples s is: gain(s, a) = entropy(s) − (|s_first|/|s|) entropy(s_first) − (|s_second|/|s|) entropy(s_second) − (|s_third|/|s|) entropy(s_third) − (|s_fourth|/|s|) entropy(s_fourth). cgp has the highest gain; thus, it is used as the root node, as shown in figure 2. table 3 represents the gain values, table 4 shows the split information values, and table 5 represents the gain ratios. fig. 2. the decision tree with cgp as the root node
table 3. gain values
gain value
gain (s, cgp) 1.690616
gain (s, tp) 1.602740
gain (s, ass) 0.995378
gain (s, ga) 0.924819
gain (s, att) 1.560956
gain (s, ep) 0.826746
table 4. split information
split information value
split (s, cgp) 1.448442
split (s, tp) 1.597734
split (s, ass) 1.744987
split (s, ga) 1.91968
split (s, att) 1.511673
split (s, ep) 1.510102
table 5.
gain ratio
gain ratio value
gain ratio (s, cgp) 0.355674
gain ratio (s, tp) 0.229
gain ratio (s, ass) 0.125289
gain ratio (s, ga) 0.022887
gain ratio (s, att) 0.298968
gain ratio (s, ep) 0.30032
this process continues until all data are classified ideally or the attributes run out. the knowledge represented by the decision tree can be extracted and represented in the form of if-then rules, as denoted in pseudocode 2. one classification rule can be generated for each path from each terminal node to the root node. the pruning technique was executed by removing nodes with less than the desired number of objects. if-then rules may be easier to understand.
pseudocode 2: listing of if-then rules for the decision tree
if cgp = “distinction” and att = “good” and tp = “a” or “credit” then ass = “yes”
else if cgp = “distinction” and tp = “a” and att = “good” or “average” then ass = “yes”
else if cgp = “credit” and att = “good” and ass = “yes” then cgpa = “distinction”
else if cgp = “credit” and tp = “b” and ep = “a” then cgpa = “credit”
else if cgp = “merit” and tp = “a” or “b” and att = “good” or “average” then ass = “yes”
else if cgp = “merit” and ass = “no” and att = “average” then cgpa = “merit”
else if cgp = “fail” and tp = “f” and att = “poor” then cgpa = “fail”
iv. conclusion the classification task is used on the student database to predict the students’ division based on the previous database. as many approaches can be used for data classification, the decision tree method was used for this study. also, data like the cumulative grade point, teaching practice marks, assignments, general aptitude, attendance, education project marks, and cumulative grade point average were collected from the students’ previous database to predict performance at the end of the semester.
this study will help students and teachers to enhance the division of the student. this study will also help to identify students who need special attention, to reduce the fail ratio, and to take appropriate action for the subsequent semester's examinations. declarations author contribution all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper. funding statement this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. conflict of interest the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. additional information reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher’s note: department of electrical engineering and informatics universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations. references [1] a. ifeka and a. akinbobola, “trend analysis of precipitation in some selected stations in anambra state,” atmos. clim. sci., vol. 05, no. 01, pp. 1–12, 2015. [2] m. i. akazue, r. e. yoro, b. o. malasowe, o. nwankwo, and a. a. ojugo, “improved services traceability and management of a food value chain using block-chain network: a case of nigeria,” indones. j. electr. eng. comput. sci., vol. 29, no. 3, pp. 1623–1633, 2023. [3] a. a. ojugo, p. o. ejeh, c. c. odiakaose, a. o. eboka, and f. u.
emordi, “improved distribution and food safety for beef processing and management using a blockchain-tracer support framework,” int. j. informatics commun. technol., vol. 12, no. 3, p. 205, dec. 2023. [4] r. e. yoro, f. o. aghware, b. o. malasowe, o. nwankwo, and a. a. ojugo, “assessing contributor features to phishing susceptibility amongst students of petroleum resources varsity in nigeria,” int. j. electr. comput. eng., vol. 13, no. 2, p. 1922, apr. 2023. [5] r. e. yoro, f. o. aghware, m. i. akazue, a. e. ibor, and a. a. ojugo, “evidence of personality traits on phishing attack menace among selected university undergraduates in nigerian,” int. j. electr. comput. eng., vol. 13, no. 2, p. 1943, apr. 2023. [6] s. drummond, k. sudduth, a. joshi, s. birrell, and s. kitchen, “statistics and neural method for site specific yield prediction,” trans. asae, vol. 46, no. 1, pp. 23–32, 2003. [7] p. m. granitto, c. furlanello, f. biasioli, and f. gasperi, “recursive feature elimination with random forest for ptr-ms analysis of agroindustrial products,” chemom. intell. lab. syst., vol. 83, no. 2, pp. 83–90, sep. 2006. [8] a. a. ojugo and a. o. eboka, “memetic algorithm for short messaging service spam filter using text normalization and semantic approach,” int. j. informatics commun. technol., vol. 9, no. 1, p. 9, 2020. [9] q. li et al., “an enhanced grey wolf optimization based feature selection wrapped kernel extreme learning machine for medical diagnosis,” comput. math. methods med., vol. 2017, pp. 1–15, 2017. [10] j. w. hatfield, c. r. plott, and t. tanaka, “understanding price controls and nonprice competition with matching theory,” am. econ. rev., vol. 102, no. 3, pp. 371–375, may 2012. [11] a. a. ojugo and r. e. yoro, “migration pattern as threshold parameter in the propagation of the covid-19 epidemic using an actor-based model for si-social graph,” jinav j. inf. vis., vol. 2, no. 2, pp. 93–105, mar. 2021. [12] a. a. ojugo and o.
nwankwo, “spectral-cluster solution for credit-card fraud detection using a genetic algorithm trained modular deep learning neural network,” jinav j. inf. vis., vol. 2, no. 1, pp. 15–24, jan. 2021. [13] m. i. akazue, a. a. ojugo, r. e. yoro, b. o. malasowe, and o. nwankwo, “empirical evidence of phishing menace among undergraduate smartphone users in selected universities in nigeria,” indones. j. electr. eng. comput. sci., vol. 28, no. 3, pp. 1756–1765, dec. 2022. [14] s. carbó, j. f. de guevara, d. humphrey, and j. maudos, “estimating the intensity of price and non-price competition in banking,” banks bank syst., vol. 4, no. 2, pp. 4–19, 2009. [15] l. a. belanche and f. f. gonzález, “review and evaluation of feature selection algorithms in synthetic problems,” inf. fusion, vol. 23, pp. 34–54, jan. 2011. [16] z. karimi, m. mansour riahi kashani, and a. harounabadi, “feature ranking in intrusion detection dataset using combination of filtering methods,” int. j. comput. appl., vol. 78, no. 4, pp. 21–27, sep. 2013. [17] a. karim, s. azam, b. shanmugam, k. kannoorpatti, and m. alazab, “a comprehensive survey for intelligent spam email detection,” ieee access, vol. 7, pp. 168261–168295, 2019. [18] n. tomar and a. k. manjhvar, “a survey on data mining optimization techniques,” ijste -international j. sci. technol. eng. |, vol. 2, no. 06, pp. 130–133, 2015. [19] a. goldstein, l. fink, a. meitin, s. bohadana, o. lutenberg, and g. ravid, “applying machine learning on sensor data for irrigation recommendations: revealing the agronomist’s tacit knowledge,” precis. agric., vol. 19, no. 3, pp. 421–444, jun. 2018. [20] a. a. ojugo and d. a. oyemade, “boyer moore string-match framework for a hybrid short message service spam filtering technique,” iaes int. j. artif. intell., vol. 10, no. 3, pp. 519–527, 2021. [21] u. 
usman, “effects of price & non-price competition of consumers effects of pricing and non-pricing competition on consumer submitted by : umair usman ghani submitted to : sir raja rub nawaz dated preston university karachi main campus,” pp. 1–16, 2014. [22] f. shirbani and h. soltanian zadeh, “fast sffs-based algorithm for feature selection in biomedical datasets,” amirkabir int. j. sci. res., vol. 45, no. 2, pp. 43–56, 2013. [23] a. a. ojugo, a. o. eboka, r. e. yoro, m. o. yerokun, and f. n. efozia, “hybrid model for early diabetes diagnosis,” math. comput. ind., vol. 50, no. 3–5, pp. 55–65, 2015. [24] g. b. dela cruz, b. d. gerardo, and b. t. tanguilig iii, “agricultural crops classification models based on pca ga implementation in data mining,” int. j. model. optim., vol. 4, no. 5, pp. 375–382, oct. 2014. [25] y. shiokawa, t. misawa, y. date, and j. kikuchi, “application of market basket analysis for the visualization of transaction data based on human lifestyle and spectroscopic measurements,” anal. chem., vol. 88, no. 5, pp. 2714– 2719, 2016. [26] a. patil and p. gupta, “a review on up-growth algorithm using association rule mining,” in 2017 international conference on computing methodologies and communication (iccmc), jul. 2017, pp. 96–99. [27] h. w. ahmad, s. zilles, h. j. hamilton, and r. dosselmann, “prediction of retail prices of products using local competitors,” int. j. bus. intell. data min., vol. 11, no. 1, pp. 19–30, 2016. [28] m. brindlmayer, r. khadduri, a. osborne, a. briansó, and e. cupito, “prioritizing learning during covid -19: the most effective ways to keep children learning during and post-pandemic,” glob. educ. evid. advis. panel, no. january, pp. 1–21, 2022. [29] v.-d. nguyen, d.-n. tran, h.-h. tran, t.-n. phan, t. danh, and h.-n. tran, “blended learning model-based local education for vietnamese primary school students,” rev. int. geogr. educ., vol. 11, no. 8, pp. 1684–1694, 2022. [30] d. nilam, w. sari, and m. 
mulu, “explorative study on the application of learning model in virtual classroom during covid-19 pandemic at the school of yogyakarta province,” proceeding int. webinar educ. 2020 umsurabaya, pp. 54– 64, 2020. [31] d. l. chen, s. ertac, t. evgeniou, x. miao, a. nadaf, and e. yilmaz, “grit and academic resilience during the covid-19 pandemic,” ssrn electron. j., 2022. [32] e. haipinge, n. kadhila, and l. m. josua, “using digital technology in transforming assessment in higher education institutions beyond covid-19,” creat. educ., vol. 13, no. 07, pp. 2157–2167, 2022. [33] h. patrinos, e. vegas, and r. carter-rau, “an analysis of covid-19 student learning loss,” educ. glob. pract. policy res. work. pap. 10033, vol. 10033, no. may, pp. 1–31, 2022. [34] f. agostinelli, m. doepke, g. sorrenti, and f. zilibotti, “when the great equalizer shuts down: schools, peers, and parents in pandemic times,” j. public econ., vol. 206, p. 104574, feb. 2022. [35] u. christian and m. author, “the influence of covid-19 on good governance and democratic behavior in nigeria,” int. j. arts soc. sci., vol. 5, no. july, pp. 50–57, 2022. 
https://doi.org/10.11591/ijict.v12i3.pp205-213 https://doi.org/10.11591/ijict.v12i3.pp205-213 https://doi.org/10.11591/ijict.v12i3.pp205-213 https://doi.org/10.11591/ijece.v13i2.pp1922-1931 https://doi.org/10.11591/ijece.v13i2.pp1922-1931 https://doi.org/10.11591/ijece.v13i2.pp1922-1931 https://doi.org/10.11591/ijece.v13i2.pp1943-1953 https://doi.org/10.11591/ijece.v13i2.pp1943-1953 https://doi.org/10.11591/ijece.v13i2.pp1943-1953 https://doi.org/10.13031/2013.12541 https://doi.org/10.13031/2013.12541 https://doi.org/10.1016/j.chemolab.2006.01.007 https://doi.org/10.1016/j.chemolab.2006.01.007 https://doi.org/10.11591/ijict.v9i1.pp9-18 https://doi.org/10.11591/ijict.v9i1.pp9-18 https://doi.org/10.1155/2017/9512741 https://doi.org/10.1155/2017/9512741 https://doi.org/10.1257/aer.102.3.371 https://doi.org/10.1257/aer.102.3.371 https://doi.org/10.35877/454ri.jinav379 https://doi.org/10.35877/454ri.jinav379 https://doi.org/10.35877/454ri.jinav379 https://doi.org/10.35877/454ri.jinav274 https://doi.org/10.35877/454ri.jinav274 https://doi.org/10.11591/ijeecs.v28.i3.pp1756-1765 https://doi.org/10.11591/ijeecs.v28.i3.pp1756-1765 https://doi.org/10.11591/ijeecs.v28.i3.pp1756-1765 https://books.google.com/books?hl=en&lr=&id=rzbj1gzrhl8c&oi=fnd&pg=pa5&dq=estimating+the+intensity+of+price+and+non-price+competition+in+banking&ots=gb8wbnm1pm&sig=jrlnv9bowum_jlqmxgd49awm6wk https://books.google.com/books?hl=en&lr=&id=rzbj1gzrhl8c&oi=fnd&pg=pa5&dq=estimating+the+intensity+of+price+and+non-price+competition+in+banking&ots=gb8wbnm1pm&sig=jrlnv9bowum_jlqmxgd49awm6wk https://arxiv.org/abs/1101.2320 https://arxiv.org/abs/1101.2320 https://doi.org/10.5120/13478-1164 https://doi.org/10.5120/13478-1164 https://doi.org/10.1109/access.2019.2954791 https://doi.org/10.1109/access.2019.2954791 https://www.academia.edu/download/40979482/ijstev2i6074_ok.pdf https://www.academia.edu/download/40979482/ijstev2i6074_ok.pdf https://doi.org/10.1007/s11119-017-9527-4 
https://doi.org/10.1007/s11119-017-9527-4 https://doi.org/10.1007/s11119-017-9527-4 https://doi.org/10.11591/ijai.v10.i3.pp519-527 https://doi.org/10.11591/ijai.v10.i3.pp519-527 https://www.academia.edu/download/32922684/project.pdf https://www.academia.edu/download/32922684/project.pdf https://www.academia.edu/download/32922684/project.pdf https://eej.aut.ac.ir/article_434_6d0cf5e07bb414dc6eec6c82785c0149.pdf https://eej.aut.ac.ir/article_434_6d0cf5e07bb414dc6eec6c82785c0149.pdf https://doi.org/10.1109/mcsi.2015.35 https://doi.org/10.1109/mcsi.2015.35 https://doi.org/10.7763/ijmo.2014.v4.404 https://doi.org/10.7763/ijmo.2014.v4.404 https://doi.org/10.1021/acs.analchem.5b04182 https://doi.org/10.1021/acs.analchem.5b04182 https://doi.org/10.1021/acs.analchem.5b04182 https://doi.org/10.1109/iccmc.2017.8282605 https://doi.org/10.1109/iccmc.2017.8282605 https://doi.org/10.1504/ijbidm.2016.076418 https://doi.org/10.1504/ijbidm.2016.076418 https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=prioritizing+learning+during+covid-19%3a+the+most+effective+ways+to+keep+children+learning+during+and+post-pandemic&btng= https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=prioritizing+learning+during+covid-19%3a+the+most+effective+ways+to+keep+children+learning+during+and+post-pandemic&btng= https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=prioritizing+learning+during+covid-19%3a+the+most+effective+ways+to+keep+children+learning+during+and+post-pandemic&btng= https://doi.org/10.48047/rigeo.11.08.145 https://doi.org/10.48047/rigeo.11.08.145 https://journal.um-surabaya.ac.id/index.php/pro/article/view/5951 https://journal.um-surabaya.ac.id/index.php/pro/article/view/5951 https://journal.um-surabaya.ac.id/index.php/pro/article/view/5951 https://doi.org/10.2139/ssrn.4001431 https://doi.org/10.2139/ssrn.4001431 https://doi.org/10.4236/ce.2022.137136 https://doi.org/10.4236/ce.2022.137136 https://doi.org/10.1596/1813-9450-10033 https://doi.org/10.1596/1813-9450-10033 
https://doi.org/10.1016/j.jpubeco.2021.104574 https://doi.org/10.1016/j.jpubeco.2021.104574 https://www.ijassjournal.com/2022/v5i7/414665811.pdf https://www.ijassjournal.com/2022/v5i7/414665811.pdf 155 a. a. ojugo et al. / knowledge engineering and data science 2023, 6 (2): 145–156 [36] i. m. ugochukwu-ibe and e. ibeke, “e-learning and covid-19 the nigerian experience: challenges of teaching technical courses in tertiary institutions,” ceur workshop proc., vol. 2872, no. may, pp. 46–51, 2021. [37] w. c. kolberg, “marketing mix theory: integrating price and non-price marketing strategies,” ssrn electron. j., no. 1993, pp. 1–35, 2011. [38] a. a. ojugo and o. nwankwo, “tree-classification algorithm to ease user detection of predatory hijacked journals: empirical analysis of journal metrics rankings,” int. j. eng. manuf., vol. 11, no. 4, pp. 1–9, aug. 2021. [39] a. e. ibor, e. b. edim, and a. a. ojugo, “secure health information system with blockchain technology,” j. niger. soc. phys. sci., vol. 5, no. 992, pp. 1–8, 2023. [40] f. o. aghware, r. e. yoro, p. o. ejeh, c. odiakaose, f. u. emordi, and a. a. ojugo, “sentiment analysis in detecting sophistication and degradation cues in malicious web contents,” kongzhi yu juece/control decis., vol. 38, no. 01, pp. 653–665, 2023. [41] k. vassil, m. solvak, p. vinkel, a. h. trechsel, and r. m. alvarez, “the diffusion of internet voting. usage patterns of internet voting in estonia between 2005 and 2015,” gov. inf. q., vol. 33, no. 3, pp. 453–459, jul. 2016. [42] w. pieters, “acceptance of voting technology: between confidence and trust,” in international conference on trust management, 2006, pp. 283–297. [43] s. okuyama, s. tsuruoka, h. kawanaka, and h. takase, “interactive learning support user interface for lecture scenes indexed with extracted keyword from blackboard,” aust. j. basic appl. sci., vol. 8, no. 4, pp. 319–324, 2014. [44] s. chouhan, d. singh, and a. 
singh, “an improved feature selection and classification using decision tree for crop datasets,” int. j. comput. appl., vol. 142, no. 13, pp. 5–8, may 2016. [45] j. obasi, nwele, n. amuche n, and u. elias a., “economics of optimizing value chain in agriculture sector of nigeria through mechanised crop processing and marketing,” asian j. basic sci. res., vol. 02, no. 01, pp. 80–92, 2020. [46] d. acemoglu, k. bimpikis, and a. ozdaglar, “price and capacity competition: extended abstract,” 44th annu. allert. conf. commun. control. comput. 2006, vol. 3, no. december, pp. 1307–1309, 2006. [47] e. oyebode, k. adekalu, and s. akinboro, “development of rainfall-runoff forecast model,” j. res. natl. dev., vol. 8, no. 2, pp. 56–66, 2011. [48] a. a. ojugo, c. o. obruche, and a. o. eboka, “quest for convergence solution using hybrid genetic algorithm trained neural network model for metamorphic malware detection,” arrus j. eng. technol., vol. 2, no. 1, pp. 12–23, nov. 2021. [49] a. a. ojugo, c. o. obruche, and a. o. eboka, “empirical evaluation for intelligent predictive models in prediction of potential cancer problematic cases in nigeria,” arrus j. math. appl. sci., vol. 1, no. 2, pp. 110–120, nov. 2021. [50] g. g. akin, a. f. aysan, g. i. kara, and l. yildiran, “the failure of price competition in the turkish credit card market,” emerg. mark. financ. trade, vol. 46, no. suppl. 1, pp. 23–35, 2010. [51] d. o. oyewola, e. g. dada, n. j. ngozi, a. u. terang, and s. a. akinwumi, “covid -19 risk factors, economic factors, and epidemiological factors nexus on economic impact: machine learning and structural equation modelling approaches,” j. niger. soc. phys. sci., vol. 3, no. 4, pp. 395–405, 2021. [52] j. h. jeong et al., “random forests for global and regional crop yield predictions,” plos one, vol. 11, no. 6, p. e0156571, jun. 2016. [53] x. e. pantazi, d. moshou, t. alexandridis, r. l. whetton, and a. m. 
mouazen, “wheat yield prediction using machine learning and advanced sensing techniques,” comput. electron. agric., vol. 121, pp. 57–65, feb. 2016. [54] a. a. ojugo and o. d. otakore, “intelligent cluster connectionist recommender system using implicit graph friendship algorithm for social networks,” iaes int. j. artif. intell., vol. 9, no. 3, p. 497~506, 2020. [55] t. avinadav, “the effect of decision rights allocation on a supply chain of perishable products under a revenue-sharing contract,” int. j. prod. econ., vol. 225, p. 107587, jul. 2020. [56] f. o. aghware, r. e. yoro, p. o. ejeh, c. c. odiakaose, f. u. emordi, and a. a. ojugo, “delcluste: protecting users from credit-card fraud transaction via the deep-learning cluster ensemble,” int. j. adv. comput. sci. appl., vol. 14, no. 6, pp. 94–100, 2023. [57] m. armstrong and j. vickers, “patterns of price competition and the structure of consumer choice,” mpra pap., vol. 1, no. 98346, pp. 1–40, 2020. [58] k. parsons, a. mccormac, m. pattinson, m. butavicius, and c. jerram, “the design of phishing studies: challenges for researchers,” comput. secur., vol. 52, pp. 194–206, jul. 2015. [59] s. girish patil, p. shahaji, n. nilesh, g. kishore, and r. gupta, traceability based value chain management in meat sector for achieving food safety and augmenting exports, 2022. [60] c. li, n. ding, h. dong, and y. zhai, “application of credit card fraud detection based on cs-svm,” int. j. mach. learn. comput., vol. 11, no. 1, pp. 34–39, 2021. [61] v. umarani, a. julian, and j. deepa, “sentiment analysis using various machine learning and deep learning techniques,” j. niger. soc. phys. sci., vol. 3, no. 4, pp. 385–394, 2021. [62] b. o. malasowe, m. i. akazue, e. a. okpako, f. o. aghware, a. a. ojugo, and d. v. ojie, “adaptive learner-cbt with secured fault-tolerant and resumption capability for nigerian universities,” int. j. adv. comput. sci. appl., vol. 14, no. 8, pp. 135–142, 2023. [63] s. khaki, l. wang, and s. v. 
archontoulis, “a cnn-rnn framework for crop yield prediction,” front. plant sci., vol. 10, no. january, pp. 1–14, 2020. [64] s. khaki and l. wang, “crop yield prediction using deep neural networks,” front. plant sci., vol. 10, may 2019. [65] a. d. bhavani and n. mangla, “a novel network intrusion detection system based on semi-supervised approach for iot,” int. j. adv. comput. sci. appl., vol. 14, no. 4, pp. 207–216, 2023. [66] m. sharma, “a survey of email spam filtering methods,” int. conf. “new trends stat. optim., vol. 7, no. 6, pp. 14– 21, 2018. [67] z. sun, s. sun, j. zhao, b. ai, and q. yang, “detection of massive oil spills in sun glint optical imagery through super-pixel segmentation,” j. mar. sci. eng., vol. 10, no. 11, p. 1630, 2022. [68] s. do, k. d. song, and j. w. chung, “basics of deep learning : a radiologist ’ s guide to understanding published radiology articles on deep learning,” korean j. radiol., vol. 21, no. 1, pp. 33–41, 2020. [69] a. s. pillai, “multi-label chest x-ray classification via deep learning,” j. intell. learn. syst. appl., vol. 14, pp. 43–56, 2022. [70] s. k. datta, m. a. shaikh, s. n. srihari, and m. gao, “soft-attention improves skin cancer classification performance,” may 2021. 
https://rgu-repository.worktribe.com/output/1317762 https://rgu-repository.worktribe.com/output/1317762 https://doi.org/10.2139/ssrn.986407 https://doi.org/10.2139/ssrn.986407 https://doi.org/10.5815/ijem.2021.04.01 https://doi.org/10.5815/ijem.2021.04.01 https://doi.org/10.46481/jnsps.2022.992 https://doi.org/10.46481/jnsps.2022.992 https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=sentiment+analysis+in+detecting+sophistication+and+degradation+cues+in+malicious+web+contents&btng= https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=sentiment+analysis+in+detecting+sophistication+and+degradation+cues+in+malicious+web+contents&btng= https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=sentiment+analysis+in+detecting+sophistication+and+degradation+cues+in+malicious+web+contents&btng= https://doi.org/10.1016/j.giq.2016.06.007 https://doi.org/10.1016/j.giq.2016.06.007 https://doi.org/10.1007/11755593_21 https://doi.org/10.1007/11755593_21 https://shibaura.elsevierpure.com/en/publications/interactive-learning-support-user-interface-for-lecture-scenes-in https://shibaura.elsevierpure.com/en/publications/interactive-learning-support-user-interface-for-lecture-scenes-in https://doi.org/10.5120/ijca2016909966 https://doi.org/10.5120/ijca2016909966 https://doi.org/10.38177/ajbsr.2020.2109 https://doi.org/10.38177/ajbsr.2020.2109 https://doi.org/10.38177/ajbsr.2020.2109 https://doi.org/10.1016/j.geb.2008.06.004 https://doi.org/10.1016/j.geb.2008.06.004 https://doi.org/10.4314/jorind.v8i2.66854 https://doi.org/10.4314/jorind.v8i2.66854 https://doi.org/10.35877/jetech613 https://doi.org/10.35877/jetech613 https://doi.org/10.35877/jetech613 https://doi.org/10.35877/mathscience614 https://doi.org/10.35877/mathscience614 https://doi.org/10.2753/ree1540-496x4603s102 https://doi.org/10.2753/ree1540-496x4603s102 https://doi.org/10.46481/jnsps.2021.173 https://doi.org/10.46481/jnsps.2021.173 https://doi.org/10.46481/jnsps.2021.173 
https://doi.org/10.1371/journal.pone.0156571 https://doi.org/10.1371/journal.pone.0156571 https://doi.org/10.1016/j.compag.2015.11.018 https://doi.org/10.1016/j.compag.2015.11.018 https://doi.org/10.11591/ijai.v9.i3.pp497-506 https://doi.org/10.11591/ijai.v9.i3.pp497-506 https://doi.org/10.1016/j.ijpe.2019.107587 https://doi.org/10.1016/j.ijpe.2019.107587 https://doi.org/10.14569/ijacsa.2023.0140610 https://doi.org/10.14569/ijacsa.2023.0140610 https://doi.org/10.14569/ijacsa.2023.0140610 https://mpra.ub.uni-muenchen.de/id/eprint/98346 https://mpra.ub.uni-muenchen.de/id/eprint/98346 https://doi.org/10.1016/j.cose.2015.02.008 https://doi.org/10.1016/j.cose.2015.02.008 https://www.pashudhanpraharee.com/wp-content/uploads/2023/05/traceability-based-value-chain-management-in-meat-sector.pdf https://www.pashudhanpraharee.com/wp-content/uploads/2023/05/traceability-based-value-chain-management-in-meat-sector.pdf https://doi.org/10.18178/ijmlc.2021.11.1.1011 https://doi.org/10.18178/ijmlc.2021.11.1.1011 https://doi.org/10.46481/jnsps.2021.308 https://doi.org/10.46481/jnsps.2021.308 https://doi.org/10.14569/ijacsa.2023.0140816 https://doi.org/10.14569/ijacsa.2023.0140816 https://doi.org/10.14569/ijacsa.2023.0140816 https://doi.org/10.3389/fpls.2019.01750 https://doi.org/10.3389/fpls.2019.01750 https://doi.org/10.3389/fpls.2019.00621 https://doi.org/10.14569/ijacsa.2023.0140424 https://doi.org/10.14569/ijacsa.2023.0140424 https://core.ac.uk/download/pdf/234676898.pdf https://core.ac.uk/download/pdf/234676898.pdf https://doi.org/10.3390/jmse10111630 https://doi.org/10.3390/jmse10111630 https://doi.org/10.3348/kjr.2019.0312 https://doi.org/10.3348/kjr.2019.0312 https://doi.org/10.4236/jilsa.2022.144004 https://doi.org/10.4236/jilsa.2022.144004 https://link.springer.com/chapter/10.1007/978-3-030-87444-5_2 https://link.springer.com/chapter/10.1007/978-3-030-87444-5_2 a. a. ojugo et al. / knowledge engineering and data science 2023, 6 (2): 145–156 156 [71] y. kang, m. ozdogan, x. 
zhu, z. ye, c. hain, and m. anderson, “comparative assessment of environmental variables and machine learning algorithms for maize yield prediction in the us midwest,” environ. res. lett., vol. 15, no. 6, p. 064005, jun. 2020. [72] a. a. ojugo and r. e. yoro, “extending the three-tier constructivist learning model for alternative delivery: ahead the covid-19 pandemic in nigeria,” indones. j. electr. eng. comput. sci., vol. 21, no. 3, p. 1673, mar. 2021. [73] a. a. ojugo and r. e. yoro, “forging a deep learning neural network intrusion detection framework to curb the distributed denial of service attack,” int. j. electr. comput. eng., vol. 11, no. 2, pp. 1498–1509, 2021. [74] a. a. ojugo, m. i. akazue, p. o. ejeh, c. odiakaose, and f. u. emordi, “degatramonn : deep learning memetic ensemble to detect spam threats via a content-based processing,” kongzhi yu juece/control decis., vol. 38, no. 01, pp. 667–678, 2023. https://doi.org/10.1088/1748-9326/ab7df9 https://doi.org/10.1088/1748-9326/ab7df9 https://doi.org/10.1088/1748-9326/ab7df9 https://doi.org/10.11591/ijeecs.v21.i3.pp1673-1682 https://doi.org/10.11591/ijeecs.v21.i3.pp1673-1682 https://doi.org/10.11591/ijece.v11i2.pp1498-1509 https://doi.org/10.11591/ijece.v11i2.pp1498-1509 https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=degatramonn%e2%80%af%3a+deep+learning+memetic+ensemble+to+detect+spam+threats+via+a+content-based+processing&btng= https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=degatramonn%e2%80%af%3a+deep+learning+memetic+ensemble+to+detect+spam+threats+via+a+content-based+processing&btng= https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=degatramonn%e2%80%af%3a+deep+learning+memetic+ensemble+to+detect+spam+threats+via+a+content-based+processing&btng= knowledge engineering and data science (keds) pissn 2597-4602 vol 6, no 2, october 2023, pp. 
170–187 eissn 2597-4637 https://doi.org/10.17977/um018v6i22023p170-187 ©2023 knowledge engineering and data science | w : http://journal2.um.ac.id/index.php/keds | e : keds.journal@um.ac.id this is an open access article under the cc by-sa license (https://creativecommons.org/licenses/by-sa/4.0/) deep learning approaches with optimum alpha for energy usage forecasting aji prasetya wibawa a,1,*, agung bella putra utama a,2, ade kurnia ganesh akbari a,3, akhmad fanny fadhilla a,4, alfiansyah putra pertama triono a,5, andien khansa’a iffat paramarta a,6, faradini usha setyaputri a,7, leonel hernandez b,8 a department of electrical engineering and informatics, faculty of engineering, universitas negeri malang jl. semarang no. 5, malang 65145, indonesia b institución universitaria de barranquilla iub cra. 45 #48-31, nte. centro historico, barranquilla 080020, colombia 1aji.prasetya.ft@um.ac.id*; 2agungbpu02@gmail.com; 3ade.kurniaganesh.1905356@students.um.ac.id; 4akhmadfadhil512@gmail.com; 5alfiansyah.putrapt.1905356@student.um.ac.id; 6khansaandien@gmail.com; 7faradini.usha@gmail.com; 8lhernandezc@unibarranquilla.edu.co * corresponding author i. introduction energy usage is a critical factor in various human activities, ranging from individual to industrial scales. it plays a vital role in supporting economic growth, social welfare, and technological development [1]. however, with the increasing global demand for energy and the challenges posed by environmental changes, understanding energy usage patterns has become increasingly important. accurate predictions about future energy use can provide significant benefits in decision-making [2], demand and supply stability [3], and energy efficiency [4]. energy usage data often exhibits a time series nature, where information is recorded over a specific time span [5]. for example, hourly energy consumption data may be challenging to interpret directly due to its temporal nature [6]. 
additionally, energy usage data can involve various attributes that contribute to the patterns and fluctuations of energy usage. therefore, accurately forecasting future energy use poses a complex task. to overcome the complexity of analyzing energy usage data, deep learning (dl) has emerged as a practical approach [7]. dl is a branch of machine learning that utilizes neural networks with multiple layers and parameters to learn complex data representations [8]. various dl models have been developed for time series analysis, including convolutional neural networks (cnn) [9], recurrent neural networks (rnn) [10], long short-term memory (lstm) [11], bidirectional lstm (bi-lstm) [12], and gated recurrent unit (gru) [13].

article info. article history: received 17 october 2023; revised 17 october 2023; accepted 17 october 2023; published online 20 october 2023. abstract: energy use is an essential aspect of many human activities, from individual to industrial scale. however, increasing global energy demand and the challenges posed by environmental change make understanding energy use patterns crucial. accurate predictions of future energy consumption can greatly influence decision-making, supply-demand stability, and energy efficiency. energy use data often exhibits time-series patterns, which creates complexity in forecasting. to address this complexity, this research utilizes deep learning (dl) models: convolutional neural networks (cnn), recurrent neural networks (rnn), long short-term memory (lstm), bidirectional lstm (bi-lstm), and gated recurrent unit (gru). the main objective is to improve the accuracy of energy usage forecasting by optimizing the alpha value in exponential smoothing. the results showed that all dl methods achieved improved accuracy when using the optimum alpha, and lstm has the most optimal mape, rmse, and r2 values compared to the other methods. this research promotes energy management, decision-making, and efficiency by providing an innovative framework for accurate forecasting of energy use, thus contributing to a sustainable and efficient energy system. keywords: energy efficiency; forecasting; deep learning; exponential smoothing; optimum alpha.

cnns have been widely used in image recognition tasks, but they can also be applied to time series data analysis. they can automatically extract essential features from time series data, such as seasonal patterns, trends, cycles, and irregularities. unlike 2d-cnns, which require converting time series data into image format, 1d-cnns [14] can directly process time series data without the need for image conversion. rnns, particularly lstm, are well-suited for modeling temporal dependencies in time series data [15]. rnns maintain a hidden state that captures information about previous time steps, allowing them to capture long-term dependencies. lstm, in particular, addresses the vanishing gradient problem commonly encountered in traditional rnns. the vanishing gradient problem occurs when the gradient approaches zero, preventing updates to the network weights and causing the loss of time series data characteristics. lstm overcomes this issue by using memory cells and gates to store and control the temporary state of the network [16]. bi-lstm is an extension of the lstm model that incorporates information from both past and future time steps. it consists of two lstm layers, one processing the input sequence in the forward direction and the other in the backward direction.
by considering information from both directions, bi-lstm can capture more comprehensive temporal dependencies in the data [17]. this bidirectional nature makes bi-lstm particularly effective in tasks where future information is crucial for accurate predictions, such as energy usage forecasting. gru, on the other hand, is a simplified version of lstm that aims to reduce the computational cost of lstm while retaining its ability to capture temporal dependencies. in this study, we aim to explore the application of dl models with optimum alpha for energy usage forecasting. we will compare the performance of different dl models and evaluate their effectiveness in capturing the complex patterns and fluctuations in energy usage data. additionally, we will investigate the impact of data normalization techniques on the performance of dl models. the findings of this research will contribute to the development of accurate and efficient energy usage forecasting models, which can aid in decision-making and promote energy efficiency in various sectors. overall, this study aims to address the challenges in analyzing energy usage data by leveraging the power of dl models. by utilizing dl models, we can extract meaningful features and capture temporal dependencies in the data, leading to improved energy usage forecasting. the results of this research will provide valuable insights for energy management and planning, contributing to a more sustainable and efficient energy future. ii. methods to facilitate a more systematic research approach, experiments were devised as illustrated in figure 1. in essence, a comparison was made between the smoothed deep learning (s-dl) method using optimum alpha and the primary dl method. various evaluation metrics were also employed to assess the performance of the optimum alpha-enhanced results. further details regarding figure 1 will be expounded upon in the following subsections. a. dataset this study uses the hourly energy demand time series forecast dataset from kaggle [18].
this dataset covers a span of four years (january 2015 to december 2018) and encompasses information regarding electricity usage, production, pricing, and meteorological conditions in spain. specifically, data on electricity consumption and generation was sourced from entsoe, a publicly accessible platform for transmission service operator (tso) data. settlement prices, on the other hand, were acquired from the spanish tso, red eléctrica de españa. additionally, weather data for the five largest cities in spain was procured as part of a personal project, and it was subsequently made available to the public through the open weather api. what sets this dataset apart is its inclusion of detailed hourly records for electricity consumption, alongside forecasts provided by the tso for both consumption and pricing. this dataset consists of 29 attributes with 35064 instances of float data type. the target attribute used in this study is the actual total load attribute, whose data visualization can be seen in figure 2. the total load forecast attribute is not used in the research because it serves only as a benchmark for comparison with the target attribute. in addition, two attributes are deleted because they contain nan values. therefore, 26 attributes are used in total.

fig. 1. experimental schema
fig. 2. total load actual

b. exponential smoothing with optimum α
exponential smoothing is a widely used technique in time series forecasting that aims to eliminate noise and capture underlying patterns in data [19].
it achieves this by assigning weights to previous observations, with higher weights given to more recent data points. the smoothing factor, denoted as α (alpha), determines the weight assigned to the most recent observation [20]. the concept of optimum α arises from the need to find the best value for the smoothing factor that maximizes the accuracy of the forecasting model [21]. the choice of α depends on the specific characteristics of the time series data and the desired forecasting task. the goal is to select the value of α that minimizes the forecasting error or maximizes the accuracy of the predictions. to determine the optimum α, various approaches can be employed. one standard method is to perform a grid search or optimization algorithm to evaluate different values of α and select the one that yields the lowest forecasting error. the process of finding the optimum α involves balancing the trade-off between responsiveness to recent changes in the data and the level of smoothing applied. a higher α value gives more weight to recent observations, making the model more responsive to short-term fluctuations but potentially less stable. conversely, a lower α value places more emphasis on historical data, resulting in a smoother forecast but potentially slower to adapt to changes. this process considers the characteristics of the time series data and the specific forecasting objectives, striking a balance between responsiveness and stability in the model's predictions. equations (1) and (2) give single exponential smoothing [22]; the recurrence applies for 𝑡 > 0, with the initial value taken at 𝑡 = 0. the smoothed data 𝑆𝑡 is the result of smoothing the raw data {𝑋𝑡}. the smoothing factor 𝛼 is a value that determines the level of smoothing. the range of 𝛼 is between 0 and 1 (0 ≤ 𝛼 ≤ 1). when 𝛼 is close to 1, the learning process is fast because the smoothing effect is weaker.
In contrast, values of α closer to 0 have a stronger smoothing effect and are less responsive to recent changes (slow learning).

S_t = \alpha X_t + (1 - \alpha) S_{t-1}, \quad t > 0  (1)

S_t = S_{t-1} + \alpha (X_t - S_{t-1})  (2)

\mathrm{Optimum}\ \alpha = \frac{(X_{max} - X_{min}) - \frac{1}{n}\sum_{i=1}^{n} X_i}{X_{max} - X_{min}}  (3)

Substituting (3) into (2) yields (4). We use the optimally smoothed result S_t to improve the performance of the DL methods [21]. Pseudocode 1 shows how to find the optimum alpha for exponential smoothing.

S_t = S_{t-1} + \frac{(X_{max} - X_{min}) - \frac{1}{n}\sum_{i=1}^{n} X_i}{X_{max} - X_{min}} (X_t - S_{t-1})  (4)

Pseudocode 1. Find the optimum alpha for exponential smoothing
Input: time series data
Output: optimum value of alpha
procedure FindOptimumAlpha(data):
    set alpha_min = 0.1      // minimum value of alpha
    set alpha_max = 0.9      // maximum value of alpha
    set alpha_step = 0.1     // increment step for alpha
    set error_min = infinity // minimum error value
    set alpha_optimum = 0    // optimum value of alpha
    for alpha = alpha_min to alpha_max step alpha_step:
        apply exponential smoothing with alpha to the data
        calculate the error by comparing the predicted values with the actual data
        if error < error_min:
            set error_min = error
            set alpha_optimum = alpha
    return alpha_optimum as the optimum value of alpha
end procedure

C. Data Normalization
In this research, preprocessing is done by transforming the original data so that it can be processed for further testing [23]. Most time-series data exhibit inherently dynamic and non-linear behavior [24]. The preprocessing carried out in this study is data normalization, an essential step in energy usage forecasting that ensures the input data is standardized and comparable across different scales.
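The smoothing step of the preceding section, Equations (1) to (3), can be sketched in Python as follows. This is a minimal sketch with an illustrative toy series, assumed to be already scaled to [0, 1] so that (3) yields a value inside the valid range for α.

```python
def optimum_alpha(series):
    """Optimum smoothing factor per Equation (3):
    ((max - min) - mean) / (max - min)."""
    x_max, x_min = max(series), min(series)
    mean = sum(series) / len(series)
    return ((x_max - x_min) - mean) / (x_max - x_min)

def exponential_smoothing(series, alpha):
    """Single exponential smoothing, Equation (2), with S_0 = X_0."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(smoothed[-1] + alpha * (x - smoothed[-1]))
    return smoothed

toy = [0.0, 0.2, 1.0, 0.4, 0.4]        # illustrative, already in [0, 1]
alpha = optimum_alpha(toy)             # (1 - 0.4) / 1 = 0.6
smoothed = exponential_smoothing(toy, alpha)
```

Applying the recursion with the data-derived α produces the smoothed series that is then fed to the DL models.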
Normalization techniques transform the data into a standard range, typically between 0 and 1, without distorting the original distribution. This process helps to eliminate the influence of outliers and extreme values, making the data more suitable for training DL models. The choice of normalization technique depends on the characteristics of the energy usage data and the specific requirements of the forecasting task, so it is essential to experiment with different techniques and evaluate their impact on the performance of the DL models. Proper data normalization can improve the convergence speed of the models, prevent numerical instability, and enhance the overall accuracy of energy usage forecasting. The normalization technique used in this research is min-max scaling, also known as feature scaling. This method rescales the data by subtracting the minimum value and dividing by the range (maximum value minus minimum value); the resulting values lie within the range of 0 to 1 [25]. Min-max scaling preserves the relative relationships between data points and is particularly useful when the distribution of the data is known to be bounded, as in (5). Pseudocode 2 presents the normalization process.

X_{t(norm)} = \frac{X_t - X_{min}}{X_{max} - X_{min}}  (5)

X_t(norm) is the result of normalization, X_t is the data to be normalized, while X_min and X_max stand for the minimum and maximum values of the entire data.

Pseudocode 2. Normalization using min-max
Input: data to be normalized (x), minimum value of the data (x_min), maximum value of the data (x_max)
Output: normalized data (x_norm)
procedure MinMaxNormalization:
    calculate the range of the data:
        set x_range = x_max - x_min
    normalize the data:
        for each data point x_t in x:
            calculate the normalized value: x_norm_t = (x_t - x_min) / x_range
            append x_norm_t to the normalized data x_norm
    return the normalized data x_norm
end procedure
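A runnable Python counterpart of the min-max procedure, Equation (5), is sketched below; the load values are illustrative.

```python
def min_max_normalize(series):
    """Min-max scaling per Equation (5): maps the series into [0, 1]."""
    x_min, x_max = min(series), max(series)
    x_range = x_max - x_min
    return [(x - x_min) / x_range for x in series]

load = [21000.0, 25000.0, 29000.0]  # illustrative load values (MW)
print(min_max_normalize(load))      # [0.0, 0.5, 1.0]
```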
D. PSO Hyperparameter Tuning
Particle swarm optimization (PSO) is a metaheuristic optimization algorithm inspired by the social behavior of bird flocking or fish schooling [26]. It is commonly used to tune the hyperparameters of machine learning models, including deep learning (DL) models [27]. In this section, we discuss the application of PSO for hyperparameter tuning in DL models for energy usage forecasting. Hyperparameters are parameters that are not learned directly from the data but are set by the user before training the model. They control the behavior and performance of the DL model, such as the learning rate, the number of hidden layers, and the number of neurons in each layer. Finding the optimal values of these hyperparameters is crucial for achieving the best performance of the DL model. PSO works by simulating the movement of particles in a multidimensional search space. Each particle represents a potential solution, and its position in the search space corresponds to a set of hyperparameters. The particles move towards the best solution found so far, called the global best, and are influenced by their own best solution, called the personal best. Through iterations, the particles explore the search space and converge towards the optimal solution. A general outline of the PSO hyperparameter tuning process can be seen in Pseudocode 3. In the context of DL models for energy usage forecasting, PSO can be used to tune hyperparameters such as the number of DL layers, the number of neurons in each layer, the batch size, and the dropout rate, as in Table 1. By searching the hyperparameter space with PSO, we can find the combination of hyperparameters that yields the best performance of the DL model in terms of accuracy and prediction error.
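As a concrete sketch of the particle update rule just described (inertia w, cognitive c1, and social c2, with the same constants as in Pseudocode 3), the following minimal continuous PSO minimizes a toy fitness function; in practice the fitness would be the validation error of a DL model trained with the candidate hyperparameters.

```python
import random

def pso_minimize(fitness, dim, bounds, n_particles=20, iters=50,
                 w=0.7, c1=2.0, c2=2.0, seed=0):
    """Minimal PSO sketch: minimize `fitness` over [lo, hi]^dim."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]   # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:                 # update personal best
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:                # update global best
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Toy fitness with its minimum at (3, 3); stands in for validation error.
best, err = pso_minimize(lambda p: sum((x - 3.0) ** 2 for x in p),
                         dim=2, bounds=(0.0, 10.0))
```

The swarm converges towards (3, 3); with real hyperparameters the continuous positions would be mapped onto the discrete search space of Table 1.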
Pseudocode 3. PSO hyperparameter tuning
Input: data for training and validation; hyperparameter search space
Output: best hyperparameter settings
procedure PSO_Hyperparameter_Tuning(data):
    set population_size = 50 // number of particles in the swarm
    set max_iterations = 100 // maximum number of iterations
    set c1 = 2.0             // cognitive parameter
    set c2 = 2.0             // social parameter
    set w = 0.7              // inertia weight
    initialize_swarm(population_size)
    evaluate_particles(data)  // evaluate initial particle positions
    set_global_best()         // set the global best position and fitness
    // main PSO loop
    for iteration = 1 to max_iterations do:
        for each particle in the swarm do:
            update_velocity(particle, global_best)
            update_position(particle)
            evaluate_particle(data, particle)
            update_personal_best(particle)
            update_global_best(particle)
    // return the best hyperparameter settings
    return global_best_position
end procedure

To apply PSO for hyperparameter tuning, we need to define a fitness function that evaluates the performance of the DL model with a specific set of hyperparameters. The PSO algorithm then iteratively updates the positions of the particles based on their personal best, the global best, and the inertia weight, which controls the balance between exploration and exploitation. By searching the hyperparameter space with PSO, we can find the optimal combination of hyperparameters, leading to improved performance and accurate predictions. This approach can enhance the effectiveness of DL models in energy usage forecasting and contribute to better decision-making and energy management.
Table 1. PSO hyperparameter tuning search space

Parameter       Search space
Batch size      100, 1000
Epoch           50, 100
Hidden layer    2, 5, 10
Loss function   MSE, MAE, HuberLoss
Neuron          32, 64
Optimizer       Adam, RMSProp

E. Performance Analysis
To measure performance in this study, we used DL methods. DL is a subset of machine learning algorithms and is often called a deep neural network [28]. Neural networks are computational models that work by mimicking the behavior of the human brain [29]. Essentially, DL is a neural network with many layers and parameters [30]. The number of layers in DL allows the model to analyze large amounts of data with complex relationships: early layers learn simple features, while deeper layers learn more complex features [31].
• Convolutional Neural Network (CNN)
CNNs, especially 2D-CNNs, have revolutionized image classification; one-dimensional CNNs (1D-CNNs), however, excel at time-series data classification [14]. A 1D-CNN can automatically learn the internal representation of time-series data and detect essential characteristics without operator intervention [21]. This internal representation includes seasonality, trends, cycles, and anomalies. These properties are essential for time-series analysis and prediction, and 1D-CNNs can capture these internal representations and use them for classification. Unlike 2D-CNNs, which require the input data to be transformed first, 1D-CNNs operate directly on time-series data, which simplifies the workflow by eliminating preprocessing steps. By examining sequential data directly, 1D-CNNs can capture temporal connections and identify significant patterns [27]. Overall, 1D-CNNs offer many benefits for time-series data classification: automatic feature extraction enables more efficient and accurate analysis, and direct processing of time-series data eliminates complex data transformations, simplifying modeling.
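As a minimal illustration of what a 1D convolutional layer computes, the pure-Python sketch below slides a filter along a univariate series (valid convolution, stride 1); the difference kernel is a hand-picked, assumed filter rather than a learned one.

```python
# Core 1D-CNN operation: slide a filter along a univariate series.
def conv1d(series, kernel, bias=0.0):
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(series) - k + 1)]

# A difference filter responds strongly where the series jumps,
# i.e. it extracts a local "trend change" feature automatically
# captured by learned filters in a trained 1D-CNN.
series = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(conv1d(series, [-1.0, 1.0]))  # [0.0, 0.0, 4.0, 0.0, 0.0]
```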
Thus, 1D-CNNs are useful for time-series data analysis and classification. The 1D-CNN architecture is presented in Figure 3, and the pseudocode of the CNN forecasting process can be seen in Pseudocode 4.

Fig. 3. 1D-CNN architecture

Pseudocode 4. CNN forecasting process
Input: energy dataset; parameter settings from the PSO hyperparameter tuning results
Output: trained CNN model
procedure Train_CNN(training_data, validation_data, num_conv_layers, num_filters, filter_size, num_fc_layers, num_neurons, learning_rate, num_epochs):
    initialize CNN model
    // add convolutional layers
    for i = 1 to num_conv_layers do:
        add convolutional layer with num_filters[i] filters and filter_size[i] filter size
    // flatten the output from the convolutional layers
    flatten()
    // add fully connected layers
    for i = 1 to num_fc_layers do:
        add fully connected layer with num_neurons[i] neurons
    // compile the model
    compile model with appropriate loss function and optimizer
    // train the model
    train model on training_data with validation_data, using learning_rate and num_epochs
    // return the trained model
    return trained CNN model
end procedure

• Recurrent Neural Network (RNN)
The RNN, developed by Paul Werbos and Ronald J. Williams in the 1980s and 1990s, is among the most commonly used models in deep learning [32]. RNNs are a class of deep learning models designed to process sequential data. Their main characteristic is the presence of recurrent connections in the network, which allow them to maintain a hidden state that captures information about previous time steps [33]. This hidden state makes RNNs particularly suitable for modeling temporal dependencies in time series data. The architecture includes a series of recurrent cells, each processing input data and updating the hidden state through recurrent connections. This recurrent structure allows the RNN to cope with sequences of varying length.
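The recurrent update just described can be sketched with scalar weights; the constants are illustrative, not trained values.

```python
import math

# Minimal recurrent update: each step combines the current input with the
# previous hidden state, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).
def rnn_hidden_states(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0          # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# A single early impulse keeps influencing later hidden states:
states = rnn_hidden_states([1.0, 0.0, 0.0])
```

Even though the later inputs are zero, the hidden state stays non-zero, which is exactly the "memory of previous time steps" property described above.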
The RNN architecture can be seen in Figure 4; Pseudocode 5 presents the RNN forecasting process.

Fig. 4. RNN architecture

Pseudocode 5. RNN forecasting process
Input: energy dataset; parameter settings from the PSO hyperparameter tuning results
Output: trained RNN model
procedure RNN_Training(training_data, num_hidden_units, learning_rate, num_epochs):
    initialize weights and biases for input-to-hidden and hidden-to-hidden connections
    initialize the hidden state
    for epoch = 1 to num_epochs do:
        for each training example in training_data do:
            // forward pass
            for t = 1 to sequence_length do:
                update hidden state using the current input and previous hidden state
            // backward pass
            for t = sequence_length to 1 do:
                calculate the gradient of the loss with respect to the output
                update the weights and biases of the hidden-to-hidden connections
                calculate the gradient of the loss with respect to the hidden state
                update the weights and biases of the input-to-hidden connections
    // return the trained RNN model
    return trained_model
end procedure

• Long Short-Term Memory (LSTM)
The vanishing gradient found in RNNs is a condition in which the gradient approaches 0, so it can no longer update the weights in the network, and the time series data loses its characteristics [34]. The vanishing gradient is caused by using the same weights at each time step. LSTM can overcome the vanishing gradient problem of RNNs. The concept of long short-term memory (LSTM) was first published in 1997 by Hochreiter and Schmidhuber [35]. LSTM analyzes time series data over the long term by applying a collection of short-term memories. The model extends the information storage capacity of RNNs by using "memory cells" [36]. Memory cells have connections that store the temporary state of the network and are controlled through three "gates": the forget gate, the input gate, and the output gate [37].
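A single step of the memory cell with its three gates can be sketched with scalar weights; the per-gate constants below are illustrative, not trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Minimal scalar LSTM cell step: the gates decide what to forget from the
# memory cell, what new information to write, and what to expose as output.
def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev)   # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev)   # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev)   # output gate
    g = math.tanh(p["wc"] * x + p["uc"] * h_prev) # candidate cell state
    c = f * c_prev + i * g                        # update the memory cell
    h = o * math.tanh(c)                          # new hidden state
    return h, c

# Illustrative gate weights (one input weight w*, one recurrent weight u*).
params = dict(wf=0.5, uf=0.5, wi=1.0, ui=0.3, wo=0.8, uo=0.2, wc=1.2, uc=0.4)
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

Because the cell state c is carried forward additively (f * c_prev + i * g) rather than being squashed at every step, gradients can flow across many time steps, which is how the LSTM mitigates the vanishing gradient.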
Figure 5 represents the LSTM memory cell, and the pseudocode of the LSTM forecasting process can be seen in Pseudocode 6.

Fig. 5. LSTM memory cell

Pseudocode 6. LSTM forecasting process
Input: energy dataset; parameter settings from the PSO hyperparameter tuning results
Output: trained LSTM model
procedure LSTM_Model(training_data, testing_data, num_layers, num_hidden_units, num_output_units, learning_rate, num_epochs):
    // initialize LSTM model
    initialize_lstm(num_layers, num_hidden_units, num_output_units)
    set_num_epochs(num_epochs)
    // train LSTM model
    for epoch = 1 to num_epochs do:
        // forward pass
        for each training example in training_data do:
            reset_hidden_state()
            for t = 1 to length(training_example) do:
                lstm_forward_pass(training_example[t])
        // backward pass
        for each training example in training_data do:
            reset_gradients()
            for t = length(training_example) to 1 do:
                lstm_backward_pass(training_example[t])
            update_weights()
    // test LSTM model
    for each testing example in testing_data do:
        reset_hidden_state()
        for t = 1 to length(testing_example) do:
            lstm_forward_pass(testing_example[t])
    // return trained LSTM model
    return lstm_model
end procedure

• Bidirectional LSTM (Bi-LSTM)
The Bi-LSTM model is a variant of the LSTM model that incorporates bidirectional processing. It consists of two LSTM layers, one processing the input sequence in the forward direction and the other processing it in the backward direction.
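The two-directional pass can be sketched generically with a simple recurrent step standing in for the LSTM layer; the scalar weights are illustrative.

```python
import math

# Generic recurrent pass (a stand-in for one LSTM layer).
def recurrent_pass(inputs, w_x=0.5, w_h=0.8):
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Bidirectional processing: run the same recurrence forwards and backwards,
# re-align the backward states, and pair them so every position sees both
# past and future context.
def bidirectional(inputs):
    forward = recurrent_pass(inputs)
    backward = recurrent_pass(inputs[::-1])[::-1]  # reversed pass, re-aligned
    return [(f, b) for f, b in zip(forward, backward)]

out = bidirectional([1.0, 0.0, -1.0])
```

In a real Bi-LSTM the paired states would be concatenated and fed to the fully connected output layer.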
This bidirectional processing allows the model to capture both past and future dependencies in the data, making it particularly effective for time series analysis [38]. In the forward LSTM layer, the input sequence is processed from beginning to end, capturing the temporal dependencies and patterns in the data; this layer maintains a hidden state that stores information about past time steps. The backward LSTM layer, on the other hand, processes the input sequence in reverse order, capturing the dependencies and patterns in the opposite direction; it maintains a separate hidden state that stores information about future time steps. By combining the outputs of both LSTM layers, the Bi-LSTM model can effectively capture the dependencies in both directions [39], giving it a more comprehensive understanding of the temporal dynamics in the data. The outputs of the Bi-LSTM layers are then fed into a fully connected layer, which performs a non-linear transformation on the data and produces the final forecasted values. The Bi-LSTM architecture can be seen in Figure 6. Pseudocode 7 shows the Bi-LSTM forecasting process.

Fig. 6. Bi-LSTM architecture
Pseudocode 7. Bi-LSTM forecasting process
Input: energy dataset; parameter settings from the PSO hyperparameter tuning results
Output: trained Bi-LSTM model
procedure Train_BiLSTM(training_data, validation_data, num_lstm_layers, num_lstm_units, learning_rate, num_epochs):
    initialize Bi-LSTM model
    // add LSTM layers
    for i = 1 to num_lstm_layers do:
        add forward LSTM layer with num_lstm_units[i] units
        add backward LSTM layer with num_lstm_units[i] units
    // compile the model
    compile model with appropriate loss function and optimizer
    // train the model
    train model on training_data with validation_data, using learning_rate and num_epochs
    // return the trained model
    return trained Bi-LSTM model
end procedure

• Gated Recurrent Units (GRU)
The GRU model is a sophisticated RNN variant used for sequential data processing and forecasting; it captures long-term time series dependencies well. Training the GRU model on the training set helps it learn the patterns and relationships in the data. Unlike plain RNNs, the GRU uses gating mechanisms to preserve or discard information from past time steps [40]. The reset and update gates control the flow of information through the network: the reset gate specifies which parts of the previous hidden state to forget, while the update gate determines how much new information to add. The GRU models temporal dependencies in the data by selectively updating and forgetting information. This adaptive capacity to retain or discard information allows the model to capture both short-term and long-term patterns, making it suitable for time series forecasting, speech recognition, and natural language processing. Its gating features and capacity to identify long-term dependencies make the GRU a formidable tool for sequential data analysis and forecasting; it is used in many fields where understanding and predicting time series patterns is crucial, due to its flexibility and efficacy [41].
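A single GRU step with its reset and update gates can be sketched with scalar weights; the constants are illustrative, not trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Minimal scalar GRU step: the reset gate r masks the previous state when
# forming the candidate, and the update gate z interpolates between the old
# state and the candidate.
def gru_step(x, h_prev, wz=1.0, uz=0.5, wr=1.0, ur=0.5, wh=1.0, uh=0.5):
    z = sigmoid(wz * x + uz * h_prev)                # update gate
    r = sigmoid(wr * x + ur * h_prev)                # reset gate
    h_tilde = math.tanh(wh * x + uh * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde          # new hidden state

h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h)
```

When z is near 0 the old state is carried through almost unchanged (long-term memory); when z is near 1 the state is overwritten by the candidate (short-term responsiveness).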
Figure 7 represents the structure of the GRU cell, and the pseudocode of the GRU forecasting process can be seen in Pseudocode 8.

Fig. 7. GRU cell structure

Pseudocode 8. GRU forecasting process
Input: energy dataset; parameter settings from the PSO hyperparameter tuning results
Output: trained GRU model
procedure Train_GRU(training_data, validation_data, num_gru_layers, num_hidden_units, learning_rate, num_epochs):
    initialize GRU model
    // add GRU layers
    for i = 1 to num_gru_layers do:
        add GRU layer with num_hidden_units[i] hidden units
    // compile the model
    compile model with appropriate loss function and optimizer
    // train the model
    train model on training_data with validation_data, using learning_rate and num_epochs
    // return the trained model
    return trained GRU model
end procedure

F. Data Analysis
Performance testing is an essential step in evaluating the effectiveness and efficiency of energy usage forecasting models [42]. It involves assessing the model's ability to accurately predict future energy usage based on historical data. In this section, we discuss the performance testing process and the metrics used to evaluate the forecasting performance of the DL models. This research uses the mean absolute percentage error (MAPE), the root mean square error (RMSE), and the coefficient of determination (R²). MAPE measures, in percentage terms, the extent to which the forecast deviates from the actual energy values, as in (6); a lower MAPE indicates a more accurate model [43]. RMSE calculates the square root of the average squared difference between the predicted and actual energy usage values [44]. RMSE is used to determine how sensitively the DL model can detect outliers in the forecast relative to the original values, as in (7). Additionally, R² is often used to assess the goodness of fit of the model, as in (8).
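The three metrics just described can be written directly in Python (MAPE expressed as a percentage, as in the reported results); the actual (A) and forecast (F) values below are illustrative.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual) * 100.0

def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def r_squared(actual, forecast):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

A = [100.0, 200.0, 300.0]  # illustrative actual values
F = [110.0, 190.0, 330.0]  # illustrative forecasts
```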
R² represents the proportion of the variance in the energy usage data that is predictable by the model; a higher R² value indicates a better fit of the model to the data [45].

MAPE = \frac{1}{n}\sum_{i=1}^{n} \frac{|A_i - F_i|}{A_i}  (6)

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (F_i - A_i)^2}  (7)

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}  (8)

A_i is the actual value, F_i is the predicted value, n is the number of predictions, SS_res is the residual sum of squares, and SS_tot is the total sum of squares. We also logged the computational time of each method as an additional performance metric; the best method is the one with the shortest computational time. By conducting performance testing and evaluating the accuracy and efficiency of the DL models, we can gain insight into their effectiveness in energy usage forecasting. This information can guide decision-making processes, improve energy management strategies, and contribute to the development of more sustainable and efficient energy systems.

III. Results and Discussion
Figures 8 through 11 illustrate the comparison between DL and S-DL across all methods, with a smoothing factor of α = 0.1 applied to the S-DL. The parameter settings of all methods follow the PSO hyperparameter tuning search space result, as presented in Table 2.

Table 2. PSO hyperparameter tuning search space result

Parameter       Result
Batch size      100
Epoch           50
Hidden layer    2
Loss function   MSE
Neuron          32
Optimizer       RMSProp

Figure 8 provides a comparative assessment of MAPE for the prediction methods in two scenarios, "without smoothing" and "smoothing with optimum alpha," highlighting the impact of smoothing techniques, specifically optimized with an alpha value, on predictive accuracy. In both scenarios, the MAPE values indicate that LSTM consistently outperforms the other methods, exhibiting the lowest MAPE values (3.9065%) and demonstrating exceptional predictive accuracy.
Conversely, Bi-LSTM consistently records the highest MAPE values (7.6464%), suggesting lower predictive accuracy regardless of smoothing. Overall, Figure 8 underscores the significance of optimizing smoothing techniques with an alpha value to enhance predictive accuracy in data analysis and forecasting tasks. Although the average increase in MAPE across all methods was a modest 0.1385%, LSTM consistently proves to be the most accurate method, while Bi-LSTM always lags in predictive accuracy in both scenarios. These findings emphasize the importance of judiciously applying smoothing techniques for improved predictive performance. Figure 9 compares the RMSE values of the prediction methods under the same two scenarios. A consistent trend in the data is the slight reduction in RMSE across all methods when smoothing with the optimum alpha is applied, indicating improved prediction accuracy through smoothing. LSTM consistently outperforms the other methods by achieving the lowest RMSE values in both scenarios (0.0624 and 0.0621), underscoring its accuracy. In contrast, Bi-LSTM consistently exhibits the highest RMSE values (0.1252 and 0.1228), suggesting lower prediction accuracy regardless of smoothing. In summary, Figure 9 emphasizes the positive impact of smoothing techniques, particularly when optimized with an alpha value, on the predictive accuracy of these methods. The average decrease in RMSE across all methods after smoothing with the optimum alpha is 0.0061, which shows that smoothing with the optimum alpha can detect outliers better and more sensitively. LSTM is the most accurate method, while Bi-LSTM consistently demonstrates the least accuracy, highlighting the significance of thoughtful smoothing application in data analysis and forecasting tasks.

Fig. 8. MAPE evaluation result
Fig. 9. RMSE evaluation result
Figure 10 presents a comparative analysis of the R² values of the prediction methods in the two scenarios. The data consistently show that R² improves across all methods when smoothing with the optimum alpha is applied, indicating a superior fit to the dataset. LSTM consistently outperforms the other methods by achieving the highest R² values in both scenarios (0.9021 and 0.9027), confirming its strong alignment with the data. In contrast, Bi-LSTM consistently records the lowest R² values (0.6042 and 0.6195), indicating a relatively weaker fit regardless of smoothing. Essentially, Figure 10 underscores the positive impact of optimizing smoothing techniques with an alpha value on improving the goodness of fit. LSTM consistently excels at fitting the data, while Bi-LSTM consistently demonstrates the weakest fit in both scenarios. These findings stress the importance of judiciously applying smoothing techniques to enhance the performance of these methods in data analysis and forecasting.

Fig. 10. R² evaluation result

Figure 11 provides insight into how the application of smoothing techniques, optimized with an alpha value, affects the computational efficiency of these methods. The data reveal that, in most cases, smoothing with the optimum alpha leads to reduced computational times compared to the scenario without smoothing, suggesting that smoothing can improve the computational efficiency of these methods. CNN consistently shows the shortest computational times in both scenarios, highlighting its efficiency. Conversely, Bi-LSTM and GRU require more time for computation, particularly without smoothing. These findings emphasize the importance of considering computational efficiency when choosing prediction methods for data analysis and forecasting tasks.

Fig. 11. Computational time evaluation result
Overall, the use of an optimum alpha value can significantly enhance the forecasting results for energy data across all DL methods. In this study, the LSTM model consistently stands out as the top choice, yielding the lowest MAPE and RMSE values while achieving the highest R² value. In computation time, LSTM sits in the middle, neither the fastest nor the slowest. This indicates that LSTM not only provides a high level of prediction accuracy but also offers the best fit to the existing data compared to the other evaluated methods. The implication of these findings for the energy field is that selecting a suitable model or method, especially when using an optimum alpha value, can significantly improve the accuracy of predictions in energy resource planning and management. In the energy sector, more accurate predictions can have a positive impact on optimizing energy usage, reducing waste, and supporting environmental sustainability. Furthermore, the use of DL and optimized methods such as LSTM in energy forecasting opens up opportunities to develop more intelligent and more efficient solutions for energy supply management, especially as energy sustainability and efficiency become increasingly crucial.

IV. Conclusions
In conclusion, this study underscores the significant impact of optimizing smoothing techniques with an optimum alpha (α) value on enhancing the accuracy of energy usage forecasting using DL models. Among the models tested, LSTM consistently outperforms the others, displaying the lowest MAPE (3.9065%) and RMSE (0.0621) values and the highest R² (0.9027), making it the top choice for accurate predictions. Notably, the application of optimum alpha values has proven successful in improving prediction accuracy across various metrics.
Computational efficiency is also a critical consideration, with CNN demonstrating the shortest computation time (57 s). Limitations of this research include the specific dataset used, which may not be entirely representative of all energy usage scenarios, and the computational resources required by LSTM. Future research should explore the generalizability of these findings across diverse energy datasets and further investigate the computational optimization of LSTM. These findings have crucial implications for energy resource management, as more accurate predictions can aid in optimizing energy usage, reducing waste, and supporting environmental sustainability, emphasizing the relevance of thoughtful model selection and hyperparameter tuning.

Declarations
Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's note: Department of Electrical Engineering and Informatics, Universitas Negeri Malang, remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] A. Sharif, S. Kocak, H. H. A. Khan, G. Uzuner, and S. Tiwari, "Demystifying the links between green technology innovation, economic growth, and environmental tax in ASEAN-6 countries: the dynamic role of green energy and green investment," Gondwana Res., vol. 115, pp. 98–106, Mar. 2023.
[2] P. Ma, S. Cui, M. Chen, S. Zhou, and K. Wang, "Review of family-level short-term load forecasting and its application in household energy management system," Energies, vol.
16, no. 15, p. 5809, Aug. 2023.
[3] L. Malka, F. Bidaj, A. Kuriqi, A. Jaku, R. Roçi, and A. Gebremedhin, "Energy system analysis with a focus on future energy demand projections: the case of Norway," Energy, vol. 272, p. 127107, Jun. 2023.
[4] S. Kapp, J.-K. Choi, and T. Hong, "Predicting industrial building energy consumption with statistical and machine-learning models informed by physical system parameters," Renew. Sustain. Energy Rev., vol. 172, p. 113045, Feb. 2023.
[5] Y. Zou, R. V. Donner, N. Marwan, J. F. Donges, and J. Kurths, "Complex network approaches to nonlinear time series analysis," Phys. Rep., vol. 787, pp. 1–97, Jan. 2019.
[6] A. Pranolo, Y. Mao, A. P. Wibawa, A. B. P. Utama, and F. A. Dwiyanto, "Robust LSTM with tuned-PSO and bifold-attention mechanism for analyzing multivariate time-series," IEEE Access, vol. 10, pp. 78423–78434, 2022.
[7] A. Pranolo, Y. Mao, A. P. Wibawa, A. B. P. Utama, and F. A. Dwiyanto, "Optimized three deep learning models based-PSO hyperparameters for Beijing PM2.5 prediction," Knowl. Eng. Data Sci., vol. 5, no. 1, p. 53, Nov. 2022.
[8] J. Naskath, G. Sivakamasundari, and A. A. S. Begum, "A study on different deep learning algorithms used in deep neural nets: MLP, SOM and DBN," Wirel. Pers. Commun., vol. 128, no. 4, pp. 2913–2936, 2023.
[9] I. Koprinska, D. Wu, and Z.
wang, “convolutional neural networks for energy time series forecasting,” in 2018 international joint conference on neural networks (ijcnn), jul. 2018, pp. 1–8. [10] h. hewamalage, c. bergmeir, and k. bandara, “recurrent neural networks for time series forecasting: current status and future directions,” int. j. forecast., vol. 37, no. 1, pp. 388–427, jan. 2021. [11] g. bathla, r. rani, and h. aggarwal, “stocks of year 2020: prediction of high variations in stock prices using lstm,” multimed. tools appl., vol. 82, no. 7, pp. 9727–9743, mar. 2023. [12] m. yang and j. wang, “adaptability of financial time series prediction based on bilstm,” procedia comput. sci., vol. 199, pp. 18–25, 2022. [13] a. n. . f. faisal, a. rahman, m. t. m. habib, a. h. siddique, m. hasan, and m. m. khan, “neural networks based multivariate time series forecasting of solar radiation using meteorological data of different cities of bangladesh,” results eng., vol. 13, p. 100365, mar. 2022. [14] a. r. f. dewandra, a. p. wibawa, u. pujianto, a. b. p. utama, and a. nafalski, “journal unique visitors forecasting based on multivariate attributes using cnn,” int. j. artif. intell. res., vol. 6, no. 1, 2022. [15] f. kurniawan, s. sulaiman, s. konate, and m. a. a. abdalla, “deep learning approaches for mimo time-series analysis,” int. j. adv. intell. informatics, vol. 9, no. 2, p. 286, jul. 2023. [16] y. mao, a. pranolo, a. p. wibawa, a. b. putra utama, f. a. dwiyanto, and s. saifullah, “selection of precise long short term memory (lstm) hyperparameters based on particle swarm optimization,” in 2022 international conference on applied artificial intelligence and computing (icaaic), may 2022, pp. 1114–1121. [17] x. zhou, a. pranolo, and y. mao, “ab-lstm: attention bidirectional long short-term memory for multivariate time-series forecasting,” in 2023 international conference on computer, electronics & electrical engineering & their applications (ic2e3), jun. 2023, pp. 1–6. [18] m. elsaraiti, g. ali, h. 
musbah, a. merabet, and t. little, “time series analysis of electricity consumption forecasting using arima model,” in 2021 ieee green technologies conference (greentech), apr. 2021. [19] a. b. f. khan, k. kamalakannan, and n. s. s. ahmed, “integrating machine learning and stochastic pattern analysis for the forecasting of time-series data,” sn comput. sci., vol. 4, no. 5, p. 484, jun. 2023. [20] m. skariah and c. d. suriyakala, “forecasting reservoir inflow combining exponential smoothing, arima, and lstm models,” arab. j. geosci., vol. 15, no. 14, p. 1292, jul. 2022. [21] a. p. wibawa, a. b. p. utama, h. elmunsyah, u. pujianto, f. a. dwiyanto, and l. hernandez, “time-series analysis with smoothed convolutional neural network,” j. big data, vol. 9, no. 1, p. 44, dec. 2022. [22] v. prema and k. u. rao, “development of statistical time series models for solar power prediction,” renew. energy, vol. 83, pp. 100–109, nov. 2015. [23] s. huber, h. wiemer, d. schneider, and s. ihlenfeldt, “dmme: data mining methodology for engineering applications – a holistic extension to the crisp-dm model,” procedia cirp, vol. 79, pp. 403–408, 2019. [24] a. tealab, h. hefny, and a. badr, “forecasting of nonlinear time series using ann,” futur. comput. informatics j., vol. 2, no. 1, pp. 39–47, 2017. [25] k. aparna, “evolutionary computing based hybrid bisecting clustering algorithm for multidimensional data,” sādhanā, vol. 44, no. 2, p. 45, feb. 2019. [26] l. vanneschi and s. silva, “particle swarm optimization,” in natural computing series, 2023, pp. 105–111. [27] a. b. p. utama, a. p. wibawa, muladi, and a. nafalski, “pso based hyperparameter tuning of cnn multivariate time-series analysis,” j. online inform., vol. 7, no. 2, pp. 193–202, 2022. [28] m. abo-tabik, n. costen, j. darby, and y. benn, “towards a smart smoking cessation app: a 1d -cnn model predicting smoking events,” sensors, vol. 20, no. 4, p. 1099, feb. 2020. [29] w. j. zhang, g. yang, y. lin, c. ji, and m. m. 
gupta, “on definition of deep learning,” in 2018 world automation congress (wac), jun. 2018, pp. 1–5. [30] d. a. bashar, “survey on evolving deep learning neural network architectures,” j. artif. intell. capsul. netwo rks, vol. 2019, no. 2, pp. 73–82, dec. 2019. [31] p. p. shinde and s. shah, “a review of machine learning and deep learning applications,” in 2018 fourth international conference on computing communication control and automation (iccubea), aug. 2018, pp. 1–6. [32] h. apaydin, h. feizi, m. t. sattari, m. s. colak, s. shamshirband, and k. w. chau, “comparative analysis of recurrent neural network architectures for reservoir inflow forecasting,” water (switzerland), vol. 12, no. 5, pp. 1–18. [33] a. zanfei, b. m. brentan, a. menapace, m. righetti, and m. herrera, “graph convolutional recurrent neural networks for water demand forecasting,” water resour. res., vol. 58, no. 7, jul. 2022. [34] z. hu, j. zhang, and y. ge, “handling vanishing gradient problem using artificial derivative,” ieee access, vol. 9, pp. 22371–22377, 2021. [35] k. smagulova and a. p. james, “a survey on lstm memristive neural network architectures and applications,” eur. phys. j. spec. top., vol. 228, no. 10, pp. 2313–2324, oct. 2019. [36] x. meng, m. liu, and q. wu, “prediction of rice yield via stacked lstm,” int. j. agric. environ. inf. syst., vol. 11, no. 1, pp. 86–95, jan. 2020. [37] f. shahid, a. zameer, and m. muneeb, “predictions for covid-19 with deep learning models of lstm, gru and bi-lstm,” chaos, solitons & fractals, vol. 140, p. 110212, nov. 2020. [38] h. wang, y. zhang, j. liang, and l. liu, “dafa-bilstm: deep autoregression feature augmented bidirectional lstm network for time series prediction,” neural networks, vol. 157, pp. 240–256, jan. 2023. [39] q. cheng, y. chen, y. xiao, h. yin, and w. liu, “a dual-stage attention-based bi-lstm network for multivariate time series prediction,” j. supercomput., vol. 78, no. 14, pp. 16214–16235, sep. 2022. 
https://doi.org/10.1109/access.2022.3193643 https://doi.org/10.1109/access.2022.3193643 https://doi.org/10.17977/um018v5i12022p53-66 https://doi.org/10.17977/um018v5i12022p53-66 https://doi.org/10.1007/s11277-022-10079-4 https://doi.org/10.1007/s11277-022-10079-4 https://doi.org/10.1109/ijcnn.2018.8489399 https://doi.org/10.1109/ijcnn.2018.8489399 https://doi.org/10.1016/j.ijforecast.2020.06.008 https://doi.org/10.1016/j.ijforecast.2020.06.008 https://doi.org/10.1007/s11042-022-12390-5 https://doi.org/10.1007/s11042-022-12390-5 https://doi.org/10.1016/j.procs.2022.01.003 https://doi.org/10.1016/j.procs.2022.01.003 https://doi.org/10.1016/j.rineng.2022.100365 https://doi.org/10.1016/j.rineng.2022.100365 https://doi.org/10.1016/j.rineng.2022.100365 https://doi.org/10.29099/ijair.v6i1.274 https://doi.org/10.29099/ijair.v6i1.274 https://doi.org/10.26555/ijain.v9i2.1092 https://doi.org/10.26555/ijain.v9i2.1092 https://doi.org/10.1109/icaaic53929.2022.9792708 https://doi.org/10.1109/icaaic53929.2022.9792708 https://doi.org/10.1109/icaaic53929.2022.9792708 https://doi.org/10.1109/ic2e357697.2023.10262559 https://doi.org/10.1109/ic2e357697.2023.10262559 https://doi.org/10.1109/ic2e357697.2023.10262559 https://doi.org/10.1109/greentech48523.2021.00049 https://doi.org/10.1109/greentech48523.2021.00049 https://doi.org/10.1007/s42979-023-01981-0 https://doi.org/10.1007/s42979-023-01981-0 https://doi.org/10.1007/s12517-022-10564-x https://doi.org/10.1007/s12517-022-10564-x https://doi.org/10.1186/s40537-022-00599-y https://doi.org/10.1186/s40537-022-00599-y https://doi.org/10.1016/j.renene.2015.03.038 https://doi.org/10.1016/j.renene.2015.03.038 https://doi.org/10.1016/j.procir.2019.02.106 https://doi.org/10.1016/j.procir.2019.02.106 https://doi.org/10.1016/j.fcij.2017.05.001 https://doi.org/10.1016/j.fcij.2017.05.001 https://doi.org/10.1007/s12046-018-1011-y https://doi.org/10.1007/s12046-018-1011-y https://doi.org/10.1007/978-3-031-17922-8_4 
https://doi.org/10.15575/join.v7i2.858 https://doi.org/10.15575/join.v7i2.858 https://doi.org/10.3390/s20041099 https://doi.org/10.3390/s20041099 https://doi.org/10.23919/wac.2018.8430387 https://doi.org/10.23919/wac.2018.8430387 https://doi.org/10.36548/jaicn.2019.2.003 https://doi.org/10.36548/jaicn.2019.2.003 https://doi.org/10.1109/iccubea.2018.8697857 https://doi.org/10.1109/iccubea.2018.8697857 https://doi.org/10.3390/w12051500 https://doi.org/10.3390/w12051500 https://doi.org/10.1029/2022wr032299 https://doi.org/10.1029/2022wr032299 https://doi.org/10.1109/access.2021.3054915 https://doi.org/10.1109/access.2021.3054915 https://doi.org/10.1140/epjst/e2019-900046-x https://doi.org/10.1140/epjst/e2019-900046-x https://doi.org/10.4018/ijaeis.2020010105 https://doi.org/10.4018/ijaeis.2020010105 https://doi.org/10.1016/j.chaos.2020.110212 https://doi.org/10.1016/j.chaos.2020.110212 https://doi.org/10.1016/j.neunet.2022.10.009 https://doi.org/10.1016/j.neunet.2022.10.009 https://doi.org/10.1007/s11227-022-04506-3 https://doi.org/10.1007/s11227-022-04506-3 187 a. p. wibawa et al. / knowledge engineering and data science 2023, 6 (2): 170–187 [40] c. hu, s. martin, and r. dingreville, “accelerating phase-field predictions via recurrent neural networks learning the microstructure evolution in latent space,” comput. methods appl. mech. eng., vol. 397, p. 115128, jul. 2022. [41] x. wang, n. xie, and l. yang, “a flexible grey fourier model based on integral matching for forecasting seasonal pm2.5 time series,” chaos, solitons & fractals, vol. 162, p. 112417, sep. 2022. [42] w. sun and c. huang, “a novel carbon price prediction model combines the secondary decomposition algorithm and the long short-term memory network,” energy, vol. 207, p. 118294, sep. 2020. [43] a. p. wibawa, z. n. izdihar, a. b. p. utama, l. 
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 6, No 2, October 2023, pp. 215–230 eISSN 2597-4637
https://doi.org/10.17977/um018v6i22023p215-230
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Stacked LSTM-GRU Long-Term Forecasting Model for Indonesian Islamic Banks

Yayat Sujatna a,1, Adhitio Satyo Bayangkari Karno b,2, Widi Hastomo c,3,*, Nia Yuningsih b,4, Dody Arif d,5, Sri Setya Handayani d,6, Aqwam Rosadi Kardian e,7, Ire Puspa Wardhani e,8, L.M. Rasdi Rere e,9

a Department of Accounting, Ahmad Dahlan Institute of Technology and Business, Jl. Ir H.
Juanda No.77, Tangerang Selatan 15419, Indonesia
b Department of Information System, Faculty of Engineering, Gunadarma University, Jl. Margonda Raya No. 100, Depok 16424, Indonesia
c Department of Information Technology, Ahmad Dahlan Institute of Technology and Business, Jl. Ir H. Juanda No.77, Tangerang Selatan 15419, Indonesia
d Department of Management, Faculty of Economics, Gunadarma University, Jl. Margonda Raya No. 100, Depok 16424, Indonesia
e Department of Computer Systems, STMIK Jakarta STI&K, Jl. BRI Radio Dalam No.17, Jakarta Selatan 12140, Indonesia
1 yayatsujatna@gmail.com; 2 adh1t10.2@gmail.com; 3 widie.has@gmail.com*; 4 nia_yuningsih@staff.gunadarma.ac.id; 5 dodiarif8@gmail.com; 6 srisetyahandayani@yahoo.com; 7 aqwam@staff.jak-stik.ac.id; 8 irepuspa@staff.jak-stik.ac.id; 9 rasdirere267@gmail.com
* corresponding author

I. Introduction

As the country with the world's largest Muslim-majority population, Indonesia has enormous potential for the future expansion of the Islamic banking financial system, as evidenced by its robust network of Islamic banks [1]. These banks operate on the principles of Islamic law (sharia) and adhere to ethical and moral criteria [2]. The Indonesian government is actively promoting the growth of Islamic banking in response to the rising demand for Islamic financial products and services, and various regulatory frameworks have been put in place to support the formation and expansion of Islamic banks. The Financial Services Authority (OJK) manages and regulates the operations of Islamic banks in order to maintain sharia compliance [3]. In addition to full-service Islamic banks, conventional banks have established Islamic banking branches to accommodate the rising demand for shariah-compliant services.
Article Info

Article history: Received 04 September 2023; Revised 26 September 2023; Accepted 20 October 2023; Published online 06 November 2023.

Keywords: sharia principles; Indonesian banks; long-term forecasts; GRU; LSTM

Abstract

The development of the Islamic banking industry in Indonesia has become a significant concern in recent years, with rapid growth in the number of banks operating based on sharia principles. To face emerging challenges and opportunities, a deep understanding of the long-term financial behavior of Islamic banks is becoming increasingly important. This study aims to predict the share price of PT Bank Syariah Indonesia Tbk over 28 days using stacked LSTM-GRU models. The observation stage includes importing the dataset, data separation, model variations, the training process, output, and evaluation. Observations were conducted using 10 model variations built from 4-layer stacks of LSTM and GRU, each trained at four epoch settings (200, 500, 750, and 1000). The results show that long-term predictions (28 days ahead) using four-layer LSTM-GRU stacks and a day-by-day training-accumulation technique produce better accuracy than the general method (using multiple outputs). For the 28-day-ahead predictions, the model with the LGLG stack arrangement (LSTM-GRU-LSTM-GRU) produced the best accuracy at epoch 750, with an MSE of 63.43762863. This study will continue in order to achieve even better precision, either by utilizing a new design or by further improving the techniques we now employ. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

These institutions provide a wide
range of shariah-compliant goods and services, including savings accounts, financing, investment instruments, takaful (Islamic insurance), and zakat payments [4][5][6]. Islamic banking in Indonesia has experienced rapid growth over the last few decades [7]. This growth not only reflects global trends in sharia finance but also the economic and social development of Indonesia, which has a sizeable Muslim population. Sharia banking provides financial access to people previously not served by conventional banking [8]; the system has helped drive financial inclusion in Indonesia by extending banking products and services to groups previously considered "unbankable". The existence of sharia banking also contributes positively to the stability of the Indonesian economy as a whole [9], and diversifying Islamic banking and financing based on Islamic ethics helps reduce systemic risk [10]. Thus, the growth of sharia banking in Indonesia not only reflects high market demand but also creates a positive impact by encouraging financial inclusion, sustainable economic development, and financial products and services in line with Islamic values [11]; it is an essential aspect of Indonesia's diverse and dynamic economic and financial development. Despite substantial progress in Islamic banking in Indonesia, issues, difficulties, and opportunities remain to be addressed. Evaluating the performance, efficiency, and competitiveness of Islamic banks in comparison to conventional banks, and understanding the dynamics and factors influencing the growth and long-term sustainability of Islamic banking in Indonesia, is critical for policymakers, regulators, and market players [12][13][14].
Long-term stock forecasting is required for investors and financial institutions to make sound long-term investment decisions and strategies in the Indonesian market [15][16]. For investors looking to improve their portfolios, accurate long-term stock price estimates for Islamic banks are invaluable. Previous research still relies on traditional financial models [17] or basic machine learning algorithms [18], with low accuracy [19] and many biases, falling far short of expectations [20]. In recent years, financial markets have seen a considerable surge in the application of artificial intelligence (AI) and machine learning (ML) techniques for stock market prediction [21][22][23][24]. These techniques have demonstrated promising results in identifying complicated patterns and trends in financial data, supporting investors in making informed decisions. Among ML techniques, recurrent neural networks (RNNs) have attracted much interest due to their ability to handle sequential and temporal dependencies in data. The long short-term memory (LSTM) network is one form of RNN with proven efficacy in time series analysis [25]; LSTM networks can capture long-term relationships and mitigate the vanishing gradient issue of standard RNNs [26]. In addition, gated recurrent units (GRUs) have emerged as an alternative RNN architecture offering computational efficiency and performance comparable to LSTM [27]. Individually, LSTM and GRU networks have regularly been used to estimate stock prices [28][29][30][31][32][33][34]; however, improved models that integrate the capabilities of the two architectures are still required to increase forecast accuracy. Despite the growing interest in Islamic banking and the importance of Islamic bank shares in Indonesia, there is a significant gap in the existing literature on long-term forecasts utilizing deep learning techniques.
Most studies focus on traditional bank financial performance and short-term predictions, with minimal discussion of long-term stock projections in the Indonesian setting. This study aims to evaluate the performance of a long-term stock prediction model for PT Bank Syariah Indonesia Tbk. Two novel approaches are proposed. The first is optimizing the model with a separate training process using ten variations of 4-layer LSTM-GRU stacks. The second is an input and target data segmentation technique adjusted to predictions for the next 1 to 28 days. By stacking several models, deep learning models become better and more useful for forecasting time series data [35][36][37], particularly for predicting stock values [38][39][40][41]. Several experiments on merging machine learning approaches to predict time series data have been conducted [42]. Predicting water prices with an LSTM-GRU model is more accurate than using GRU alone or stacks with an LSTM-LSTM arrangement [43]. When predicting complicated stock market data, the hybrid Akima-EMD-LSTM model outperforms the hybrid EMD-LSTM, EEMD-LSTM, and SEMD-LSTM models [44]. Stock price prediction has also employed time-series analysis with LSTM combined with sentiment analysis using the Valence Aware Dictionary and sEntiment Reasoner (VADER); compared to earlier research, this method yields better accuracy [45]. The CNN, RNN, LSTM, CNN-RNN, and CNN-LSTM algorithms have been used to predict Shanghai Composite Index shares, with the CNN-RNN approach outperforming the other methods (CNN, RNN, and LSTM) [46]. For music data, classic tanh, LSTM, and GRU units have been compared, with LSTM and GRU showing benefits over standard tanh units [47]. A stacked LSTM model has been used to detect anomalies in four separate datasets.

II.
Methods

This study was carried out in stages: data collection, separation of training and test data, separation of target data for long-term predictions of the following 28 days, model creation, and assessment. The research flowchart in Figure 1 describes the steps of this investigation in general. The experimental process flow for predicting sharia stock prices using the stacked LSTM and GRU models, from importing the dataset to output, is as follows:
• Import dataset: the stock time series of PT Bank Syariah Indonesia Tbk (BRIS) from 01-07-2020 to 01-07-2023 was taken from https://finance.yahoo.com. The dataset has 728 rows (days) and six columns (Open, High, Low, Close, Adj Close, and Volume); only the "Close" column is used in this study.
• Data separation: the last 28 days of the dataset are held out as prediction data for the next 28 days. The remaining 700 days are divided into training data (600 days) and test data (100 days).
• Modeling: 10 model variations are built from 4-layer LSTM and GRU stack arrangements, namely GGGG, GGGL, GGLL, GLGL, GLLG, LGGL, LGLG, LLGG, LLLG, and LLLL, where G stands for GRU and L for LSTM. Each model is trained on the training data, which includes initializing the model, choosing the loss function, selecting the optimizer (e.g., Adam), and choosing the evaluation metric, the mean squared error (MSE).
• Evaluation: once training is complete, the model is evaluated on the previously separated test data to measure how well it predicts stock prices. This experiment uses MSE to assess the quality of model predictions; visualizations such as graphs comparing predictions with actual data can also provide valuable insights.
• Output: a graph showing the history of actual data against predicted data. To determine the accuracy of the trained models, the predicted results are measured against the actual data using MSE.

Fig. 1. Research flowchart

A. LSTM-GRU

An RNN employing backpropagation was the first deep learning model able to recall prior data and predict data one step ahead [48][49][50][51]. Adding layers can enhance accuracy, but doing so in an RNN can produce a vanishing gradient; as a result, the RNN can only capture short-term dependence [52][53]. Because of this issue, LSTM cells were created [54][55], which have several gates and can overcome long-term dependence. GRU, a cell with a simpler gating mechanism that can also capture long-term dependencies, is a further advancement [46][56]. Figure 2 depicts the architectural development from RNN to LSTM and finally GRU.

Fig. 2. RNN, LSTM, and GRU architecture development

Initialize the hidden state and cell state for each LSTM layer $i$, $H_0^{LSTM_i} = 0$ and $C_0^{LSTM_i} = 0$, and the hidden state for each GRU layer $j$, $H_0^{GRU_j} = 0$. Then iterate through each time step $t$ (from $t = 1$ to $T$, where $T$ is the length of the input sequence). For each LSTM layer $i$, calculate the hidden state $H_t^{LSTM_i}$ and cell state $C_t^{LSTM_i}$ as in (1) to (6), and for each GRU layer $j$, calculate the hidden state $H_t^{GRU_j}$ as in (7) to (10).

$f_t^{LSTM_i} = \sigma(W_f^{LSTM_i} \cdot [H_{t-1}^{LSTM_i}, X_t] + b_f^{LSTM_i})$ (1)

$i_t^{LSTM_i} = \sigma(W_i^{LSTM_i} \cdot [H_{t-1}^{LSTM_i}, X_t] + b_i^{LSTM_i})$ (2)

$\hat{C}_t^{LSTM_i} = \tanh(W_c^{LSTM_i} \cdot [H_{t-1}^{LSTM_i}, X_t] + b_c^{LSTM_i})$ (3)

$C_t^{LSTM_i} = f_t^{LSTM_i} \cdot C_{t-1}^{LSTM_i} + i_t^{LSTM_i} \cdot \hat{C}_t^{LSTM_i}$ (4)

$o_t^{LSTM_i} = \sigma(W_o^{LSTM_i} \cdot [H_{t-1}^{LSTM_i}, X_t] + b_o^{LSTM_i})$ (5)

$H_t^{LSTM_i} = o_t^{LSTM_i} \cdot \tanh(C_t^{LSTM_i})$ (6)

$z_t^{GRU_j} = \sigma(W_z^{GRU_j} \cdot [H_{t-1}^{GRU_j}, X_t] + b_z^{GRU_j})$ (7)

$r_t^{GRU_j} = \sigma(W_r^{GRU_j} \cdot [H_{t-1}^{GRU_j}, X_t] + b_r^{GRU_j})$ (8)

$\hat{H}_t^{GRU_j} = \tanh(W_h^{GRU_j} \cdot [r_t^{GRU_j} \cdot H_{t-1}^{GRU_j}, X_t] + b_h^{GRU_j})$ (9)

$H_t^{GRU_j} = (1 - z_t^{GRU_j}) \cdot \hat{H}_t^{GRU_j} + z_t^{GRU_j} \cdot H_{t-1}^{GRU_j}$ (10)

The output of the last LSTM and GRU layers at the final time step $T$ is the final result of the model, as in (11). Here $X_t$ is the input at time step $t$, $H_t^{LSTM_i}$ and $C_t^{LSTM_i}$ are the hidden state and cell state of the $i$-th LSTM layer at time step $t$, and $H_t^{GRU_j}$ is the hidden state of the $j$-th GRU layer at time step $t$.

$Output = [H_T^{LSTM_{last}}, H_T^{GRU_{last}}]$ (11)

Pseudocode 1 is a Keras-style representation of stacking LSTM and GRU layers in a recurrent neural network (RNN).

Pseudocode 1. LSTM and GRU stack

input_data = Input(shape=(sequence_length, input_size))
hidden_states_lstm = []
hidden_states_gru = []
for i in range(num_layers_lstm):
    lstm_input = input_data if i == 0 else hidden_states_lstm[-1]
    lstm_layer = LSTM(hidden_size_lstm, return_sequences=True)(lstm_input)
    hidden_states_lstm.append(lstm_layer)
for j in range(num_layers_gru):
    gru_input = input_data if j == 0 else hidden_states_gru[-1]
    gru_layer = GRU(hidden_size_gru, return_sequences=True)(gru_input)
    hidden_states_gru.append(gru_layer)
final_lstm_hidden_state = hidden_states_lstm[-1]
final_gru_hidden_state = hidden_states_gru[-1]
combined_hidden_state = Concatenate(axis=-1)([final_lstm_hidden_state, final_gru_hidden_state])
output_layer = Dense(output_size)(combined_hidden_state)
model = Model(inputs=input_data, outputs=output_layer)
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)

The pseudocode for LSTM-GRU
stacks represents a high-level algorithmic outline for constructing a deep neural network architecture that combines LSTM and GRU layers. It specifies the critical steps for building a stacked RNN: defining the hyperparameters and input placeholder, then creating multiple LSTM and GRU layers with their respective hidden states. The final hidden states of these layers can be concatenated or combined as needed for downstream tasks. By stacking LSTM and GRU units, the model aims to capture complex sequential patterns, making it particularly useful for sequential data analysis.

Traditional LSTM and GRU models have several limitations compared to stacks that combine the two:
• Limited ability to handle long-term information [57]. Although LSTM and GRU are designed to overcome the vanishing gradient problem of RNN models, they can remember information from several previous time steps but may still struggle over very long horizons.
• More expensive computation. LSTM and GRU models are relatively computationally complex [58], mainly when used in deep or layered networks, which can result in longer training times and greater computing resource requirements [59].
• Susceptibility to overfitting. LSTM and GRU models are more prone to overfitting on relatively small datasets [60]; because their number of parameters is large, they can "memorize" the training data rather than learn general patterns.
• Not optimal for specific tasks. While LSTM and GRU are reasonable solutions for many time series modeling tasks, some specialized tasks, such as text processing (NLP), call for more specialized architectures such as Transformers [61].
To overcome these limitations, a stack of LSTM and GRU models can provide several advantages:
• Richer representation. With a stack of LSTM and GRU models, multiple LSTM and GRU layers are used sequentially [61], allowing the model to represent the data better and describe more complex relationships in the time series.
• Hierarchical learning. The stack can learn a hierarchy of information: the first layer captures more basic patterns, while subsequent layers capture increasingly abstract and complex patterns [61].
• Reduced risk of overfitting. With additional layers and techniques such as dropout between layers, model stacks can help reduce the risk of overfitting, provided they are managed carefully [62].
• Flexible architectural combinations. Combining LSTM and GRU in various configurations allows flexibility in designing the most appropriate architecture for a particular task [62].
However, stacked LSTM and GRU models also require careful tuning and attention to overfitting; the choice of architecture and parameters significantly influences the quality of model predictions.

B. Data Separation

The dataset is divided into training data (600 days), test data (100 days), and prediction data (28 days). Figure 3 shows the division of training and test data as a history graph. The training procedure is conducted to create a model, and predictions are performed on the training and test data to evaluate the performance of the resulting model, as shown in Figure 4.

Fig. 3. Separation of training data (green), test data (blue), and predictive data (yellow)

Fig. 4. Predicted results of training data (magenta) and predicted results of test data (cyan)

The prediction data (28 days) is hidden from the model and used only to evaluate prediction outcomes; it is not included in the training process.
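The data separation described above can be sketched as follows. This is a minimal illustration with a synthetic series standing in for the 728-day BRIS "Close" column; the `make_windows` helper and its parameter names are ours, not the paper's, and the 7-day-input/1-day-target pairing follows the segmentation scheme the paper applies per forecast horizon:

```python
import numpy as np

# Synthetic stand-in for the 728-day "Close" column (the real series
# comes from the BRIS dataset; values here are illustrative only).
close = np.arange(728, dtype=float)

# Hold out the last 28 days as prediction data, then split the
# remaining 700 days into 600 training days and 100 test days.
prediction = close[-28:]
train, test = close[:600], close[600:700]

def make_windows(series, n_in=7, horizon=1):
    """Pair each n_in-day input window with the target `horizon` days ahead."""
    X, y = [], []
    for i in range(len(series) - n_in - horizon + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in + horizon - 1])
    return np.array(X), np.array(y)

X_train, y_train = make_windows(train)       # pairs for 1-day-ahead training
X28, y28 = make_windows(train, horizon=28)   # pairs for 28-day-ahead training
```

Because each horizon from 1 to 28 days gets its own input/target pairing, one such window set is built per forecast day, matching the per-day training runs of Figure 5.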
We employ recurrent training approaches, carried out individually for predictions from 1 to 28 days ahead, to anticipate the following 28 days without training data. The input data spans 7 days, whereas the target data spans 1 day. The forecast for the first day is based on one day of target data, one day after the training input; the forecast for the second day uses one day of target data collected two days after the input; and so on, until the prediction for the 28th day uses one day of target data collected 28 days after the input. Each training procedure is repeated ten times with distinct 4-layer LSTM-GRU arrangement models [63] to get the best outcomes. Figure 5 depicts the separation of input and target data for forecasts from 1 to 28 days.

Fig. 5. Illustration of training and target data separation for predictions ranging from 1 to 28 days

C. Modeling

Each training procedure is carried out in 10 variations of four distinct layers of the LSTM-GRU arrangement to obtain the best model performance: var-01: GGGG, var-02: GGGL, var-03: GGLL, var-04: GLGL, var-05: GLLG, var-06: LGGL, var-07: LGLG, var-08: LLGG, var-09: LLLG, var-10: LLLL. The letter L represents LSTM and the letter G represents GRU. Each training procedure uses four epoch settings (200, 500, 750, and 1000). Choosing the number of epochs (iterations through the entire training dataset) in training a neural network model is an important decision; the arguments for choosing these four settings are as follows:
• Convergence requirements: the number of epochs used in model training depends mainly on the complexity of the model, the volume of data, and the desired level of convergence.
the more complex the model, the longer it takes to reach convergence; spanning four points (200, 500, 750, and 1000) reflects an attempt to examine how the model behaves at various stages of training, from early to more advanced.
• performance monitoring: during training, it is essential to monitor model performance on validation or test datasets to prevent overfitting. using several different epoch points lets us examine how the model behaves over time; seeing whether performance continues to increase, reaches a peak, or even decreases at a certain point helps decide when to stop training or take other actions, such as reducing the learning rate or adjusting the model architecture.
• probability map exploration: by trying several different epoch points, this process can also explore a map of the model's likely behavior. for example, at the initial epoch (200), the model may not have converged enough and may be biased towards the training data. at the midpoints (500 and 750), the model can approach convergence and begin to fit the validation data. at the endpoint (1000), one can see whether the model continues improving in performance or has reached a saturation point.
• stability evaluation: the stability of the model can also be assessed through these four epoch points. when a model fluctuates strongly at early points in training, this may indicate that the learning rate or the complexity of the model needs to be adjusted.
conversely, if the model shows good stability at specific points, this may indicate that the process has found a suitable training configuration.
• testing and generalization: once training is complete at the endpoint (1000), the process can test the model on never-before-seen data to measure its generalization capability. if the model produces good results on the test data, the training has been successful.
the selection of these four epoch points provides a rich perspective on how the model develops its performance over time. in practice, however, the choice of the number of epochs must be considered together with other factors such as the learning rate, batch size, model complexity, and the characteristics of the data used. the adam optimization function is used to construct the model, with a learning rate of 1.001, 50 nodes per layer, and a batch size of 64. figure 6 depicts the process from the input through the deep learning models with 10 variations to the predictions and the mse value produced for each model variant. adam combines the concepts of momentum (to help handle local minima) and rmsprop (to adapt the learning rate) in one algorithm. it uses exponentially weighted moving estimates of the first moment (mean) and second moment of the gradients to calculate weight updates, so the effective learning rate can differ for each parameter based on its gradient history; these estimates are bias-corrected to compensate for their initialization.
fig. 6. input, model, prediction results, and performance evaluation using mse
learning rate 1.001: the learning rate is the factor that controls the extent to which the model adjusts its weights based on the gradient of the training data. a value of 1.001 is relatively high;
usually, smaller learning rate values (e.g., 0.001) are used to ensure stable convergence.
hidden layer 50: this refers to the number of nodes (neurons) in each hidden layer of the neural network and reflects the complexity of the model. more nodes give the model greater capacity to capture complex patterns in the data, but they also increase the risk of overfitting if the training data is limited.
batch size 64: this is the number of data samples used in each weight-update iteration (mini-batch learning). larger batches can speed up training through more efficient optimization but require more memory, while too small a batch can cause unstable convergence. a batch size of 64 is a commonly used value in most cases.

d. evaluation criteria
to assess model effectiveness, we employ a statistical technique known as mean square error (mse). mse is calculated as the sum of the squared differences between the predicted outcomes and the 28 previously hidden observation data points (actual data), divided by the sample size. a lower mse value indicates better performance [64]. the formulation of mse is shown in equation 1, where p denotes the predicted data, r the concealed actual (observation) data, and n the number of data samples.
MSE = (1/n) ∑ᵢ₌₁ⁿ (pᵢ − rᵢ)²   (1)

a lower mse value indicates that the experimental model predicts stock prices more accurately, meaning that the differences between model predictions and actual stock prices tend to be smaller; conversely, a high mse value indicates a significant mismatch in predicting stock prices. mse is a simple and easy-to-understand metric: the smaller its value, the better the model predicts stock prices. because it squares the errors, mse gives high weight to large errors, which is helpful in cases where outliers (significant differences between predicted and actual values) must be considered. using mse to evaluate the 28-day forecasting models makes it possible to measure prediction quality, to compare different models, and to decide when a model needs updating.

iii. results and discussions
the training procedure used 10 model versions and 4 epoch settings (200, 500, 750, and 1000), resulting in 40 prediction graphs and 120 mse measures. because of page limits, we provide only one graph of the projected outcomes (out of 40) for the training data phase, test data, and 28 days of prediction data (figure 7). to make the 28-day forecast chart more visible, we expanded a smaller section of it (figure 8). figure 8 indicates that the 28-day forecast fluctuates acceptably up to day 28 and continues to follow the original data pattern, in stark contrast to long-term prediction approaches in general, which tend towards a specific (convergent) value with a larger bias for longer forecasts.
fig. 7. training data, test data, and 28-day predicted data prediction results in full size
fig. 8.
expanded sizes for test predictions and 28-day predictions
tables 1 and 2 report all mse values for training-data predictions, test predictions, and 28-day forecasts numerically, while figures 9 to 11 present the same mse values graphically. tables 1 and 2 show that the best model for predicting the training and test data is var-10 with the lstm-lstm-lstm-lstm (llll) stack architecture, with mse values of 1795.1927 and 1485.7672, respectively. meanwhile, var-07 with the lstm-gru-lstm-gru (lglg) stack architecture is the best model for the 28-day prediction data, with an mse of 63.4376. table 1 summarizes the mse evaluation of all training procedures in the 200–500 epoch range. the best 28-day prediction at epoch 200, with an mse of 90.8903, comes from a stack of four sequential layers with two types of memory cells, gru and lstm (gllg). this mse value indicates a relatively large error rate, meaning that the difference between the stock price predicted by the model and the actual stock price at each time point in the dataset is relatively significant; the gru-lstm-lstm-gru stack model therefore needs refinement to improve the quality of its stock price predictions. careful evaluation and model adjustment are essential to overcome these limitations and achieve more accurate predictions.
table 1.
the mse of the whole training procedure in numerical form for epochs 200–500

var     layers         epoch-200                          epoch-500
                train      test       pred-28     train      test       pred-28
var-01  gggg    1857.0976  1541.8996  111.25726   1856.6352  1529.7655  91.91923
var-02  gggl    1858.9776  1533.7580  113.43025   1846.4071  1525.7432  103.7180
var-03  ggll    1924.8036  1591.7732  130.17904   1867.3014  1534.2615  78.1313
var-04  glgl    1839.2664  1519.7423  103.48911   1873.7208  1560.0412  82.2884
var-05  gllg    1809.4829  1495.1547  90.8903     1853.8268  1529.8587  95.7873
var-06  lggl    1811.1107  1495.3051  115.3705    1854.5001  1528.7287  77.9630
var-07  lglg    1854.3511  1534.2149  113.8544    1856.7671  1534.1344  74.1705
var-08  llgg    1850.9032  1527.7330  113.4182    1890.1816  1569.1906  70.1206
var-09  lllg    1854.5735  1535.9987  116.4838    1865.5825  1539.1219  78.8466
var-10  llll    1795.1926  1485.7672  139.1183    1841.9540  1530.2617  82.5902

table 2 summarizes the optimized prediction results for the next 28 days in the epoch-750 training process, with an mse value of 63.4376. these results use var-07, a stack of lstm, gru, lstm, and gru. the mse metric measures the average of the squared differences between model predictions and actual values; in this context, an mse of 63.4376 means that the average squared difference between the predicted and actual stock values for the next 28 days is approximately 63.44 (in the squared units of the stock data). a lower mse value indicates a better-predicting model, because the difference between prediction and actual value is smaller on average; in general, therefore, an mse of 63.44 indicates that the model has fairly good prediction quality. epoch 750 is the number of iterations through the entire training dataset used to train the model.
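the variant codes in table 1 (g = a gru layer, l = an lstm layer) can be expanded mechanically into layer sequences; a minimal sketch using only the naming scheme from the paper:

```python
# variant codes from table 1: g = gru layer, l = lstm layer
VARIANTS = {
    "var-01": "gggg", "var-02": "gggl", "var-03": "ggll", "var-04": "glgl",
    "var-05": "gllg", "var-06": "lggl", "var-07": "lglg", "var-08": "llgg",
    "var-09": "lllg", "var-10": "llll",
}

def expand(code):
    """map a 4-letter variant code to its layer-type sequence."""
    return ["LSTM" if c == "l" else "GRU" for c in code]
```

in an actual training loop each expanded sequence would be instantiated as a 4-layer recurrent stack (e.g., with a deep learning framework) and trained at each of the four epoch settings.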
by the 750th epoch, the model has undergone many iterations through the data and has made repeated adjustments to the weights and parameters used to make predictions.

table 2. the mse of the whole training procedure in numerical form for epochs 750–1000

var     layers         epoch-750                          epoch-1000
                train      test       pred-28     train      test       pred-28
var-01  gggg    1889.6429  1558.6723  215.5718    1844.8476  1521.5133  237.05712
var-02  gggl    1831.7365  1503.1840  156.7340    1838.4467  1524.5493  246.4624
var-03  ggll    1845.0175  1523.0437  152.2221    1845.1898  1529.3063  202.8599
var-04  glgl    1844.9218  1528.9018  122.2034    1843.0441  1522.9162  129.7584
var-05  gllg    1841.6457  1524.5890  63.7433     1843.1699  1531.5649  107.1425
var-06  lggl    1863.3424  1538.3165  120.0539    1855.0585  1531.2086  146.0880
var-07  lglg    1881.2632  1560.3401  63.4376     1845.0715  1529.2144  161.1198
var-08  llgg    1847.7620  1529.9814  87.0923     1845.5463  1521.8910  77.7621
var-09  lllg    1832.0935  1514.4305  65.8479     1868.4365  1541.8586  85.3185
var-10  llll    1813.4162  1499.3618  81.2523     1835.4334  1518.0367  94.5993

the lstm-gru-lstm-gru stack can give the model the ability to capture complex patterns in time series data: lstm is able to remember information over the long term, while gru is more efficient at handling short-term information, and the combination allows the model to exploit the advantages of both. the prediction results for the next 28 days show that the var-07 model with the lstm-gru-lstm-gru stack has the potential to provide fairly good stock price predictions. however, these predicted results must be integrated into a careful investment strategy that takes into account the risk factors that may influence stock prices. figure 9 to figure 11 show the mse values of the training process for the training data, test data, and 28-day data predictions, respectively.
fig. 9. the mse values of the whole training process for training data
fig. 10. the mse values of the whole training process for test data
fig. 11. the mse values of the whole training process for 28-day prediction data
table 3 presents a performance study of existing models. in the previous study by [31], a new model for optimizing stock forecasting is proposed that incorporates a range of technical indicators, including investor sentiment indicators and financial data, and performs dimension reduction on the many factors influencing the retrieved stock price using the lasso and pca approaches. the paper's insight is that lstm and gru models can effectively predict stock prices and that the lasso dimension-reduction method performs better than pca. in the previous study by [65], lstm, bi-lstm, gru, and ordinary neural network (nn) modules are each designed sequentially to forecast the stock price, and the performance of each separate model is compared with that of the suggested hybrid model. the proposed stock price prediction model is implemented on the nifty-50 stock market data, and the model predicts values along with the actual values of stock opening prices for (a) 100 days, (b) 300 days, (c) 500 days, and (d) 1000 days. in the study by [66], the authors propose using deep learning for stock prediction and compare the performance of six deep-learning algorithms in predicting stock closing prices on the indonesian stock exchange; the paper proposes a cnn-lstm-gru hybrid algorithm for stock price prediction, which outperforms the other methods in terms of accuracy.
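for reference, the mse criterion of equation 1, which underlies the values reported in tables 1–3, is straightforward to implement (the sample numbers in the test are illustrative, not from the bris series):

```python
def mse(predicted, actual):
    """mean square error of equation 1: (1/n) * sum((p_i - r_i)^2)."""
    assert len(predicted) == len(actual), "series must be the same length"
    n = len(predicted)
    return sum((p - r) ** 2 for p, r in zip(predicted, actual)) / n
```

for the 28-day evaluation, `predicted` and `actual` would each hold 28 values, with `actual` being the hidden observation data.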
based on the research carried out in [67], a trading strategy is proposed for the moroccan stock market based on two deep learning models, lstm and gru, which predict the close price for the short- and mid-term horizons, respectively. the proposed strategy outperforms benchmark indices in the moroccan market, and future work includes focusing on medium- and long-term predictions.

table 3. performance study of present models

reference   methods                                                        results
[31]        lasso and pca dimension reduction with lstm and gru models     mse 733.8773
[65]        bi-lstm and gru models                                         mse 0.0018
[66]        cnn-lstm-gru hybrid algorithm                                  rmse decreased by 14%, mae reduced by 13.4%, r2 3.9%
[67]        lstm and gru models                                            mse 0.57
[68]        lstm and gru                                                   mape 97.37%
[69]        two-layer stacked lstm (tls-lstm) with correlation analysis    mse 0.0015129
            between different currency pairs
[70]        stacked-bi-lstm                                                rmse 0.025
proposed    lstm-gru-lstm-gru stack                                        mse 63.44

in the study carried out by [68], the authors use lstm and gru to propose eight new architectural models for stock price forecasting by identifying joint movement patterns in the stock market; the models combine lstm and gru with four neural network block architectures, are evaluated using three accuracy measures, and accurately predict stock prices from grouped time-series data. in the research conducted in [69], a tls-lstm neural network was used to forecast the trend of the australian dollar/united states dollar (aud/usd) pair and to conduct a correlation analysis.
tls-lstm outperforms the other models in forex trend prediction, and aud/usd movement affects eur/aud and aud/jpy; the study proposes using a tls-lstm neural network for forex market forecasting together with a correlation analysis between different currency pairs. in the research conducted by [70], the stacked bi-lstm (sbilstm) architecture, a modification of the conventional deep long short-term memory (tdlm), is offered and tested on two time series from oilfield production; the performance of the proposed sbilstm model is compared with those of multi-layer rnns, deep gru, and deep lstm.

iv. conclusions
machine learning can deliver improved long-term predictive performance for pt bank syariah indonesia tbk (bris) shares, which is critical for investors when making stock market decisions; this data may also assist analysts in developing long-term financial strategy indicators. in this paper, we propose a distinct training approach for 1-day to 28-day forecasts utilizing 10 variations of deep learning models built from 4-layer lstm-gru stacks and a tailored input-target data segmentation algorithm. using bris stock history data from 01-07-2020 to 01-07-2023 (728 days), the lstm-lstm-lstm-lstm (llll) stack yields the best model for the prediction phase on the training and test data. furthermore, the lstm-gru-lstm-gru (lglg) stack model gives the most accurate long-term forecast for the next 28 days. the graph results from the altered input-target data segmentation approach exhibit variations and a perfect correlation with the observed data, whereas long-term forecasts using the deep learning approach alone (without input-target data segmentation) do not exhibit significant volatility but tend towards a constant (convergent) value. long-term predictive research with even better accuracy is still possible, either by applying different methodologies or by extending the techniques and procedures we have developed.
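the tailored input-target segmentation referred to above (7-day inputs, a 1-day target offset by the forecast horizon, as described in section b) can be sketched as follows; the function name is illustrative:

```python
def make_windows(series, input_len=7, horizon=1):
    """build (input, target) pairs: each input spans `input_len` days and the
    target is the single value `horizon` days after the input window ends."""
    pairs = []
    for start in range(len(series) - input_len - horizon + 1):
        window = series[start:start + input_len]          # 7 days of input
        target = series[start + input_len + horizon - 1]  # 1 day of target
        pairs.append((window, target))
    return pairs
```

for the day-1 model the target immediately follows the window (horizon=1), while the day-28 model uses horizon=28, so each horizon's model is trained separately on its own offset, as in the paper.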
the lstm-gru-lstm-gru stack model is a complex model that can be very good at handling complex time-series data; however, managing and maintaining such models requires considerable computing resources and a deep understanding of time-series modeling. overall, the lstm-gru-lstm-gru stack model can be a handy tool for forecasting long-term stock prices, but it should be used as one aspect of a broader analysis and decision-making process when investing in the stock market.

declarations
author contribution. all authors contributed equally as the main contributors of this paper. all authors read and approved the final paper.
funding statement. this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
conflict of interest. the authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
additional information. reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. publisher’s note: department of electrical engineering and informatics universitas negeri malang remains neutral with regard to jurisdictional claims and institutional affiliations.

references
[1] world bank, leveraging islamic fintech to improve financial inclusion. world bank, 2020.
[2] m. a. khattak and n. a. khan, “islamic finance, growth, and volatility: a fresh evidence from 82 countries,” j. islam. monet. econ. financ., vol. 9, no. 1, pp. 39–56, 2023.
[3] e. santi, b. budiharto, and h. saptono, “pengawasan otoritas jasa keuangan terhadap financial technology (peraturan otoritas jasa keuangan nomor 77/pojk.01/2016),” diponegoro law journal, vol. 6, no. 3, pp. 1–20, jul. 2017.
[4] s. syarifuddin, r. muin, and a.
akramunnas, “the potential of sharia fintech in increasing micro small and medium enterprises (msmes) in the digital era in indonesia,” j. huk. ekon. syariah, vol. 4, no. 1, p. 23, 2021.
[5] r. a. kasri and m. w. sosianti, “determinants of the intention to pay zakat online: the case of indonesia,” j. islam. monet. econ. financ., vol. 9, no. 2, pp. 275–294, 2023.
[6] h. hiyanti, l. nugroho, c. sukamadilaga, and t. fitrijanti, “sharia fintech (financial technology) opportunities and challenges in indonesia,” j. ilm. ekon. islam, vol. 5, no. 03, pp. 326–333, 2019.
[7] m. a. kurniawan, m. anwar, and s. r. nidar, “developing a strategy for islamic money market model to enhance quality of islamic banking performance during the pandemic in indonesia 2021,” qual. access to success, vol. 23, no. 190, pp. 261–268, 2022.
[8] n. nurdin and k. yusuf, “knowledge management lifecycle in islamic bank: the case of syariah banks in indonesia,” int. j. knowl. manag. stud., vol. 11, no. 1, pp. 59–80, jan. 2020.
[9] s. m. anwar, j. junaidi, s. salju, r. wicaksono, and m. mispiyanti, “islamic bank contribution to indonesian economic growth,” int. j. islam. middle east. financ. manag., vol. 13, no. 3, pp. 519–532, jan. 2020.
[10] m. h. ali, m. a. uddin, m. a. r. khan, and b. goud, “faith-based versus value-based finance: is there any portfolio diversification benefit between responsible and islamic finance?,” int. j. financ. econ., vol. 26, no. 4, pp. 5570–5583, oct. 2021.
[11] s. alhammadi, “expanding financial inclusion in indonesia through takaful: opportunities, challenges and sustainability,” j. financ. report. account., vol. ahead-of-print, no. ahead-of-print, jan. 2023.
[12] a. d. songer, j. diekmann, w. hendrickson, and d. flushing, “situational reengineering: case study analysis,” j. constr. eng. manag., vol. 126, no. 3, pp. 185–190, may 2000.
[13] m. mursyid, h. kusuma, a. tohirin, and j.
sriyana, “performance analysis of islamic banks in indonesia: the maqashid shariah approach,” j. asian financ. econ. bus., vol. 8, no. 3, pp. 307–318, 2021.
[14] x. ding, r. haron, and a. hasan, “the influence of basel iii on islamic bank risk,” j. islam. monet. econ. financ., vol. 9, no. 1, pp. 167–198, 2023.
[15] e. b. boukherouaa et al., powering the digital economy: opportunities and risks of artificial intelligence in finance. international monetary fund, 2021.
[16] m. asutay, p. f. aziz, b. s. indrastomo, and y. karbhari, “religiosity and charitable giving on investors’ trading behaviour in the indonesian islamic stock market: islamic vs market logic,” j. bus. ethics, 2023.
[17] d. defrizal, k. romli, a. purnomo, and h. a. subing, “a sectoral stock investment strategy model in indonesia stock exchange,” j. asian financ. econ. bus., vol. 8, no. 1, pp. 015–022, 2021.
[18] a. thakkar and k. chaudhari, “a comprehensive survey on portfolio optimization, stock price and trend prediction using particle swarm optimization,” arch. comput. methods eng., vol. 28, no. 4, pp. 2133–2164, 2021.
[19] e. i. ardyanta and h. sari, “a prediction of stock price movements using support vector machines in indonesia,” j. asian financ., vol. 8, no. 8, pp. 399–407, 2021.
[20] w. budiharto, “data science approach to stock prices forecasting in indonesia during covid-19 using long short-term memory (lstm),” j. big data, vol. 8, no. 1, p. 47, 2021.
[21] m. kunwar, “artificial intelligence in finance: understanding how automation and machine learning is transforming the financial industry,” aug. 2019.
[22] a. saranya and r. anandan, “stock market prediction using machine learning algorithms,” int. j. recent technol. eng., vol. 8, no. 2 special issue 4, pp. 280–283, 2019.
[23] s. ahmed, m. m. alshater, a. el ammari, and h. hammami, “artificial intelligence and machine learning in finance: a bibliometric review,” res. int. bus. financ., vol. 61, p. 101646, 2022.
[24] c. milana and a. ashta, “artificial intelligence techniques in finance and financial markets: a survey of the literature,” strateg. chang., vol. 30, no. 3, pp. 189–209, may 2021.
[25] w. hastomo, a. s. b. karno, n. kalbuana, e. nisfiani, and l. etp, “optimasi deep learning untuk prediksi saham di masa pandemi covid-19,” j. edukasi dan penelit. inform., vol. 7, no. 2, p. 133, aug. 2021.
[26] n. navarin, b. vincenzi, m. polato, and a.
sperduti, “lstm networks for data-aware remaining time prediction of business process instances,” in 2017 ieee symposium series on computational intelligence (ssci), 2017, pp. 1–7.
[27] m. o. rahman, m. s. hossain, t.-s. junaid, m. s. a. forhad, and m. k. hossen, “predicting prices of stock market using gated recurrent units (grus) neural networks,” int. j. comput. sci. netw. secur., vol. 19, no. 1, pp. 213–222, 2019.
[28] k. a. althelaya, e.-s. m. el-alfy, and s. mohammed, “stock market forecast using multivariate analysis with bidirectional and stacked (lstm, gru),” in 2018 21st saudi computer society national computer conference (ncc), 2018, pp. 1–7.
[29] m. a. i. sunny, m. m. s. maswood, and a. g. alharbi, “deep learning-based stock price prediction using lstm and bi-directional lstm model,” in 2020 2nd novel intelligent and leading emerging sciences conference (niles), 2020, pp. 87–92.
[30] y. liu, z. wang, and b. zheng, “application of regularized gru-lstm model in stock price prediction,” in 2019 ieee 5th international conference on computer and communications (iccc), 2019, pp. 1886–1890.
[31] y. gao, r. wang, and e. zhou, “stock prediction based on optimized lstm and gru models,” sci. program., vol. 2021, p. 4055281, 2021.
[32] m. e. karim, m. foysal, and s. das, “stock price prediction using bi-lstm and gru-based hybrid deep learning approach,” in proceedings of third doctoral symposium on computational intelligence: dosci 2022, 2022, pp. 701–711.
[33] a. sethia and p. raut, “application of lstm, gru and ica for stock price prediction,” in information and communication technology for intelligent systems: proceedings of ictis 2018, volume 2, 2019, pp. 479–487.
[34] j. zhao, d. zeng, s. liang, h. kang, and q. liu, “prediction model for stock price trend based on recurrent neural network,” j. ambient intell. humaniz. comput., vol. 12, no. 1, pp. 745–753, 2021.
[35] k. wang, x. qi, and h.
liu, “photovoltaic power forecasting based lstm-convolutional network,” energy, vol. 189, p. 116225, dec. 2019.
[36] z. karevan and j. a. k. suykens, “transductive lstm for time-series prediction: an application to weather forecasting,” neural networks, vol. 125, pp. 1–9, may 2020.
[37] g. ding and l. qin, “study on the prediction of stock price based on the associated network model of lstm,” int. j. mach. learn. cybern., vol. 11, no. 6, pp. 1307–1317, jun. 2020.
[38] s. chen and l. ge, “exploring the attention mechanism in lstm-based hong kong stock price movement prediction,” quant. financ., vol. 19, no. 9, pp. 1507–1515, sep. 2019.
[39] y. baek and h. y. kim, “modaugnet: a new forecasting framework for stock market index value with an overfitting prevention lstm module and a prediction lstm module,” expert syst. appl., vol. 113, pp. 457–480, 2018.
[40] x. liang, z. ge, l. sun, m. he, and h. chen, “lstm with wavelet transform based data preprocessing for stock price prediction,” math. probl. eng., vol. 2019, p. 1340174, 2019.
[41] p. xu et al., “automatic evaluation of facial nerve paralysis by dual-path lstm with deep differentiated network,” neurocomputing, vol. 388, pp. 70–77, 2020.
[42] a. u. muhammad, a. s. yahaya, s. m. kamal, j. m. adam, w. i. muhammad, and a. elsafi, “a hybrid deep stacked lstm and gru for water price prediction,” in 2020 2nd international conference on computer and information sciences (iccis), 2020, pp. 1–6.
[43] m. ali, d. m. khan, h. m. alshanbari, and a. a.-a. h. el-bagoury, “prediction of complex stock market data using an improved hybrid emd-lstm model,” appl. sci., vol. 13, no. 3, 2023.
[44] a. dutta, g. pooja, n. jain, r. r. panda, and n. k. nagwani, “a hybrid deep learning approach for stock price prediction,” in machine learning for predictive analysis, 2021, pp. 1–10.
[45] s. zaheer et al., “a multi parameter forecasting for stock time series data using lstm and deep learning model,” mathematics, vol. 11, no. 3, 2023.
[46] j.
chung, c. gulcehre, k. cho, and y. bengio, “empirical evaluation of gated recurrent neural networks on sequence modeling,” arxiv prepr. arxiv:1412.3555, 2014.
[47] p. malhotra, l. vig, g. shroff, and p. agarwal, “long short term memory networks for anomaly detection in time series,” in 23rd eur. symp. artif. neural networks, comput. intell. mach. learn. (esann 2015), 2015, pp. 89–94.
[48] j. l. elman, “finding structure in time,” cogn. sci., vol. 14, no. 2, pp. 179–211, 1990.
[49] l. medsker and l. c. jain, recurrent neural networks: design and applications. crc press, 1999.
https://www.researchgate.net/profile/md-sabir-hossain/publication/331385031_predicting_prices_of_stock_market_using_gated_recurrent_units_grus_neural_networks/links/5c93b36492851cf0ae8e96fb/predicting-prices-of-stock-market-using-gated-recurrent-units-grus-neural-networks.pdf https://doi.org/10.1109/ncg.2018.8593076 https://doi.org/10.1109/ncg.2018.8593076 https://doi.org/10.1109/ncg.2018.8593076 https://doi.org/10.1109/niles50944.2020.9257950 https://doi.org/10.1109/niles50944.2020.9257950 https://doi.org/10.1109/niles50944.2020.9257950 https://doi.org/10.1109/iccc47050.2019.9064035 https://doi.org/10.1109/iccc47050.2019.9064035 https://doi.org/10.1155/2021/4055281 https://doi.org/10.1155/2021/4055281 https://link.springer.com/chapter/10.1007/978-981-19-3148-2_60 https://link.springer.com/chapter/10.1007/978-981-19-3148-2_60 https://link.springer.com/chapter/10.1007/978-981-19-3148-2_60 https://link.springer.com/chapter/10.1007/978-981-13-1747-7_46 https://link.springer.com/chapter/10.1007/978-981-13-1747-7_46 https://doi.org/10.1007/s12652-020-02057-0 https://doi.org/10.1007/s12652-020-02057-0 https://doi.org/10.1016/j.energy.2019.116225 https://doi.org/10.1016/j.energy.2019.116225 https://doi.org/10.1016/j.neunet.2019.12.030 https://doi.org/10.1016/j.neunet.2019.12.030 https://doi.org/10.1007/s13042-019-01041-1 https://doi.org/10.1007/s13042-019-01041-1 https://doi.org/10.1080/14697688.2019.1622287 https://doi.org/10.1080/14697688.2019.1622287 https://doi.org/10.1016/j.eswa.2018.07.019 https://doi.org/10.1016/j.eswa.2018.07.019 https://doi.org/10.1016/j.eswa.2018.07.019 https://doi.org/10.1155/2019/1340174 https://doi.org/10.1155/2019/1340174 https://doi.org/10.1016/j.neucom.2020.01.014 https://doi.org/10.1016/j.neucom.2020.01.014 https://doi.org/10.1109/iccis49240.2020.9257651 https://doi.org/10.1109/iccis49240.2020.9257651 https://doi.org/10.1109/iccis49240.2020.9257651 https://doi.org/10.3390/app13031429 https://doi.org/10.3390/app13031429 
https://doi.org/10.1007/978-981-15-7106-0_1 https://doi.org/10.1007/978-981-15-7106-0_1 https://doi.org/10.3390/math11030590 https://doi.org/10.3390/math11030590 https://arxiv.org/abs/1412.3555 https://arxiv.org/abs/1412.3555 https://www.researchgate.net/publication/304782562_long_short_term_memory_networks_for_anomaly_detection_in_time_series https://www.researchgate.net/publication/304782562_long_short_term_memory_networks_for_anomaly_detection_in_time_series https://www.researchgate.net/publication/304782562_long_short_term_memory_networks_for_anomaly_detection_in_time_series https://doi.org/10.1016/0364-0213(90)90002-e https://books.google.com/books?hl=en&lr=&id=me1sakn0pymc&oi=fnd&pg=pa1&dq=recurrent+neural+networks+desihn+and+aplications&ots=7cbzco2ovm&sig=tm414y-mifei4unmxs7wwjimbny y. suyatna et al. / knowledge engineering and data science 2023, 6 (2): 215–230 230 [50] p. j. werbos, “backpropagation through time: what it does and how to do it,” proc. ieee, vol. 78, no. 10, pp. 1550– 1560, 1990. [51] j. l. elman and d. zipser, “learning the hidden structure of speech,” j. acoust. soc. am., vol. 83, no. 4, pp. 1615– 1626, apr. 1988.. [52] j. t. connor, r. d. martin, and l. e. atlas, “recurrent neural networks and robust time series prediction,” ieee trans. neural networks, vol. 5, no. 2, pp. 240–254, 1994. [53] y. bengio, p. simard, and p. frasconi, “learning long-term dependencies with gradient descent is difficult,” ieee trans. neural networks, vol. 5, no. 2, pp. 157–166, 1994. [54] j. brownlee, “how to develop lstm models for time series forecasting (2018).” 2019. [55] s. hochreiter and j. schmidhuber, “long short-term memory,” neural comput., vol. 9, no. 8, pp. 1735–1780, nov. 1997. [56] k. cho et al., “learning phrase representations using rnn encoder-decoder for statistical machine translation,” arxiv prepr., 2014. [57] s. m. al-selwi, m. f. hassan, s. j. abdulkadir, and a. muneer, “lstm inefficiency in long-term dependencies regression problems,” j. 
adv. res. appl. sci. eng. technol., vol. 30, no. 3, pp. 16–31, 2023. [58] c. hu, s. martin, and r. dingreville, “accelerating phase-field predictions via recurrent neural networks learning the microstructure evolution in latent space,” comput. methods appl. mech. eng., vol. 397, p. 115128, jul. 2022. [59] m. r. raza, w. hussain, and j. m. merigó, “cloud sentiment accuracy comparison using rnn, lstm and gru,” in 2021 innovations in intelligent systems and applications conference (asyu), 2021, pp. 1–5. [60] t. limouni, r. yaagoubi, k. bouziane, k. guissi, and e. h. baali, “accurate one step and multistep forecasting of very short-term pv power using lstm-tcn model,” renew. energy, vol. 205, pp. 1010–1024, 2023. [61] n. klyuchnikov et al., “nas-bench-nlp: neural architecture search benchmark for natural language processing,” ieee access, vol. 10, pp. 45736–45747, 2022. [62] s. wang and h. chen, “a novel deep learning method for the classification of power quality disturbances using deep convolutional neural network,” appl. energy, vol. 235, pp. 1126–1140, 2019. [63] w. hastomo, n. aini, a. s. b. karno, and l. m. r. rere, “metode pembelajaran mesin untuk memprediksi emisi manure management,” j. nas. tek. elektro dan teknol. inf., vol. 11, no. 2, pp. 131–139, 2022. [64] w. hastomo, a. s. bayangkari karno, n. kalbuana, a. meiriki, and sutarno, “characteristic parameters of epoch deep learning to predict covid-19 data in indonesia,” j. phys. conf. ser., vol. 1933, no. 1, 2021. [65] m. e. karim, m. foysal, and s. das, “stock price prediction using bi-lstm and gru-based hybrid deep learning approach,” 2023, pp. 701–711. [66] b. sulistio, h. l. h. s. warnars, f. l. gaol, and b. soewito, “energy sector stock price prediction using the cnn, gru & lstm hybrid algorithm,” in 2023 international conference on computer science, information technology and engineering (iccosite), 2023, pp. 178–182. [67] y. touzani and k. 
douzi, “an lstm and gru based trading strategy adapted to the moroccan market,” j. big data, vol. 8, no. 1, p. 126, 2021. [68] a. lawi, h. mesra, and s. amir, “implementation of long short-term memory and gated recurrent units on grouped time-series data to predict stock prices accurately,” j. big data, vol. 9, no. 1, p. 89, 2022. [69] m. ayitey junior, p. appiahene, and o. appiah, “forex market forecasting with two-layer stacked long short-term memory neural network (lstm) and correlation analysis,” j. electr. syst. inf. technol., vol. 9, no. 1, p. 14, 2022. [70] b. sirisha, k. k. c. goud, and b. t. v. s. rohit, “a deep stacked bidirectional lstm (sbilstm) model for petroleum production forecasting,” procedia comput. sci., vol. 218, pp. 2767–2775, 2023. https://doi.org/https:/doi.org/10.1121/1.395916 https://doi.org/https:/doi.org/10.1121/1.395916 https://doi.org/10.1121/1.395916 https://doi.org/10.1121/1.395916 https://doi.org/10.1109/72.279188 https://doi.org/10.1109/72.279188 https://doi.org/10.1109/72.279181 https://doi.org/10.1109/72.279181 https://scholar.google.com/scholar?hl=en&as_sdt=0%2c5&q=how+to+develop+lstm+models+for+time+series+forecasting+%282018%29&btng= https://doi.org/10.1162/neco.1997.9.8.1735 https://doi.org/10.1162/neco.1997.9.8.1735 https://doi.org/10.48550/arxiv.1406.1078 https://doi.org/10.48550/arxiv.1406.1078 https://doi.org/10.37934/araset.30.3.1631 https://doi.org/10.37934/araset.30.3.1631 https://doi.org/10.1016/j.cma.2022.115128 https://doi.org/10.1016/j.cma.2022.115128 https://doi.org/10.1109/asyu52992.2021.9599044 https://doi.org/10.1109/asyu52992.2021.9599044 https://doi.org/10.1016/j.renene.2023.01.118 https://doi.org/10.1016/j.renene.2023.01.118 https://doi.org/10.1109/access.2022.3169897 https://doi.org/10.1109/access.2022.3169897 https://doi.org/10.1016/j.apenergy.2018.09.160 https://doi.org/10.1016/j.apenergy.2018.09.160 https://journal.ugm.ac.id/v3/jnteti/article/view/2586 
Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 2, No 2, December 2019, pp. 58–71 eISSN 2597-4637
https://doi.org/10.17977/um018v2i22019p58-71
©2019 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Comparison of Naïve Bayes Algorithm and Decision Tree C4.5 for Hospital Readmission Diabetes Patients Using HbA1c Measurement

Utomo Pujianto a,1,*, Asa Luki Setiawan a,2, Harits Ar Rosyid a,3, Ali M. Mohammad Salah b,4
a Department of Electrical Engineering, Universitas Negeri Malang, Jl. Semarang No. 5, Malang 65145, Indonesia
b Dept. of Computer Information Systems, Al Quds Open University, Beit Jalla-The Main Road-Khallat Al Badd, Bethlehem, Palestine
1 utomo.pujianto.ft@um.ac.id*; 2 asalukiasa@gmail.com; 3 harits.ar.ft@um.ac.id; 4 asalah@qou.edu
* corresponding author

I. Introduction

Special care for diabetic patients is important for their survival, and the HbA1c examination is useful for monitoring the condition of diabetic patients.
Diabetes is a metabolic disorder in which the body cannot use the insulin it produces effectively [1]. Insulin is the hormone that regulates the balance of blood sugar levels, so an increase in the concentration of glucose in the blood causes an abnormality called hyperglycemia [2]. The International Diabetes Federation (IDF) states that the prevalence of diabetes mellitus in the world is 1.9 % and that diabetes mellitus has become the seventh leading cause of death in the world; in 2012 the number of cases of diabetes mellitus worldwide was 371 million [3]. The high prevalence of diabetes mellitus is caused by risk factors that cannot be changed, such as heredity, and by changeable risk factors such as smoking habits, education level, occupation, physical activity, alcohol consumption, body mass index, waist circumference, and age [4].

HbA1c (glycated hemoglobin) is hemoglobin that binds to glucose. Ordinarily, glucose binds to hemoglobin in the red blood cells, so the amount of HbA1c in the human body tracks blood sugar levels: the higher the blood sugar level, the higher the HbA1c level. HbA1c can therefore measure the average blood sugar level over roughly three months [5].

Hospital readmission is a medical term for re-treating patients who have previously received inpatient services in a hospital [6]. The readmission process relates to measuring the quality of patient handling by the hospital [7].

ARTICLE INFO
Article history: Received 14 May 2019; Revised 25 July 2019; Accepted 19 August 2019; Published online 23 December 2019

ABSTRACT
Diabetes is a metabolic disorder disease in which the pancreas does not produce enough insulin or the body cannot use the insulin it produces effectively. The HbA1c examination, which measures the average glucose level of patients during the last 2-3 months, has become an important step in determining the condition of diabetic patients. Knowledge of the patient's condition can help medical staff predict the possibility of patient readmission, i.e., the occurrence of a patient requiring hospitalization services again at the hospital. The ability to predict patient readmissions will ultimately help the hospital calculate and manage the quality of patient care. This study compares the performance of the naïve Bayes method and the C4.5 decision tree in predicting readmissions of diabetic patients, especially patients who have undergone an HbA1c examination. As part of this study we also compare the performance of the classification model in a number of scenarios involving combinations of preprocessing methods, namely the synthetic minority over-sampling technique (SMOTE) and the wrapper feature selection method, with both classification techniques. The scenario of the C4.5 method combined with SMOTE and feature selection produces the best performance in classifying readmissions of diabetic patients, with an accuracy of 82.74 %, precision of 87.1 %, and recall of 82.7 %.

Keywords: diabetes; naïve Bayes; decision tree C4.5; comparison; classification

Several attributes of a diabetic patient dataset influence the quality of treatment, which refers to the persistence of glycemic serum in the body: the longer the glycemic serum stays at a healthy level, the better the quality of treatment delivered by the hospital. However, the many attributes associated with diabetic patients make the calculation of quality complicated [8]. The readmission process is very important to anticipate diabetic patients who are late in re-treating their disease. Recognizing patterns in data is known in informatics as classification [9].
In studies of the classification of hospital readmission of diabetes patients, one method that has been used is logistic regression [10]. The advantage of logistic regression is that its output is more informative than that of other classification algorithms: like any regression approach, it expresses the relationship between an outcome variable (label) and each of its predictors (features) [11]. The disadvantages of logistic regression include vulnerability to underfitting on imbalanced datasets, which makes the accuracy value uncertain [12]. Another study of the classification of hospital readmission of diabetes patients compared decision tree algorithms, k-nearest neighbor (k-NN), and naïve Bayes with various parameters [8]; the naïve Bayes classification model had better statistics than the decision tree and k-NN models, with an accuracy of 57.52 %, an MAE of 0.512, and a kappa statistic of 0.182. Another study implemented the C4.5 algorithm to classify readmissions of diabetic patients and tested it in several different experiments. In that study, the C4.5 algorithm classified readmissions of diabetic patients with an accuracy of 74.5 % after preprocessing the data into two label classes; with three label classes, the highest accuracy reached only 57 % using C4.5 as the classification method [13]. Based on these considerations, this study uses the naïve Bayes algorithm and compares it with the decision tree C4.5 algorithm, which has the advantage of being able to process numerical (continuous) and categorical (discrete) data, handle missing attribute values, and generate rules that are easily interpreted [14].
Both algorithms are used to evaluate the effect of the preprocessing stage, which is intended to improve classification accuracy, for example by comparing the performance of the two methods on the dataset before and after rebalancing the imbalanced classes using SMOTE (synthetic minority over-sampling technique). SMOTE is a preprocessing method for supervised learning that addresses class imbalance [15]; in this case, SMOTE is used to oversample minority classes so that the data in each class are balanced. The next comparison uses feature selection to reduce the number of attributes. The wrapper approach is used because it can perform feature selection optimally and can be adjusted to the desired algorithm [16]. In this study, the naïve Bayes and decision tree C4.5 methods were tested to classify hospital readmissions of diabetic patients using laboratory test results and other variables of diabetic patients as input. The outcome of this study is the best-performing classification of hospital readmissions among the several trial scenarios carried out. These results can be developed into further research for recommending whether diabetic patients need retreatment within 30 days of previous treatment, after more than 30 days, or not at all. The purpose of this study is to find the best algorithm for classifying hospital readmissions of diabetic patients and the best combination of preprocessing methods.

II. Materials and Methods

Machine learning is a field of science about how a machine can process data as desired [17]. Machine learning is a part of artificial intelligence that focuses on developing systems that are able to learn patterns on their own from training data, without human intervention.
Machine learning is applied in several fields, such as education [18] and games [19]; this research applies machine learning in the medical field. Machine learning has three types of learning methods: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is a structured learning method whose purpose is to assign test data to label classes based on a model learned from the training data. Unsupervised learning is an unstructured learning method: there are no label classes, only data that will be grouped into new groups or label classes. Reinforcement learning is a learning method without prior knowledge, in which the system learns by taking actions and observing their results [20]. Basically, machine learning works by learning from examples, as humans do, and can afterwards answer related questions. This learning process uses data called the training dataset. Unlike static programs, machine learning was created to build programs that can learn on their own. Problems that can be solved by machine learning include regression, clustering, and classification. Classification is a method of grouping data whose classes have been determined. The classification process in this research uses the naïve Bayes and decision tree C4.5 algorithms, combined with SMOTE and feature selection.

A. Dataset

The data used in this study were obtained from the UCI Machine Learning Repository and concern diabetic patients. The data represent 10 years (1999 to 2008) of patient records from diabetes care clinics in 130 US hospitals connected to other networks. The dataset consists of 50 attributes and 101,776 instances.
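As an informal sketch of working with this dataset, the miniature frame below is hypothetical but uses column names from the UCI release (`A1Cresult`, `readmitted`); the real file, commonly distributed as `diabetic_data.csv`, holds the full 101,776 instances. The filter keeps only encounters where an HbA1c test was actually recorded:

```python
import pandas as pd

# Hypothetical 4-row miniature of the UCI "Diabetes 130-US hospitals" data;
# the real diabetic_data.csv has 101,776 rows and 50 columns.
df = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "A1Cresult":    [">8", "None", "Norm", ">7"],
    "readmitted":   ["<30", "NO", ">30", "<30"],
})

# keep only encounters where an HbA1c test was actually performed
with_a1c = df[df["A1Cresult"] != "None"]
```

On the full dataset, the same kind of filter reduces the 101,776 instances to the 17,018 used in this study.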
Table 1 lists the metadata of the dataset used in this study.

B. Data Preprocessing

Comparing the naïve Bayes and C4.5 algorithms requires preprocessing the data before classification [21]. Data preprocessing covers the processes that prepare raw data for subsequent processing [22]. Its purpose is to transform the data into a format that is easier and more effective for the user's needs: more accurate results, reduced computational time for large-scale problems, and smaller data values without changing the information contained. The first preprocessing stage trims the data to patients who have an HbA1c examination: in the a1c test result attribute, the 84,748 instances with the value "None" (patients who did not take the HbA1c examination) are deleted. After trimming, the data amount to only 17,018 instances, which benefits this research because a smaller amount of data improves processing time. Eight different preprocessing scenarios are compared (see Table 2). These scenarios compare the effect of SMOTE and feature selection on the data before the classification phase. All scenarios apply only the data cleaning method as the initial preprocessing stage. The first scenario, without SMOTE or feature selection, uses the initial data with three label classes: "NO", ">30", and "<30". The second scenario applies the SMOTE method to the minority class data so that the distribution of label classes is balanced, with the same three label classes as scenario one. The third scenario applies feature selection using a wrapper.
The features omitted are those with an unbalanced data distribution or with an empty (zero) value in the data distribution. The fourth scenario applies both preprocessing methods: balancing the three label classes using SMOTE and then using feature selection to reduce the number of attributes. The fifth to eighth scenarios apply the same methods, in order, as the first to fourth, but use only the two label classes ">30" and "<30" for the subsequent classification. Testing several scenarios is useful for finding the combination of preprocessing techniques that produces high accuracy values in the next process; the scenarios are combinations of the SMOTE and feature selection preprocessing techniques. This research is tested with the 10-fold cross-validation method, comparing the naïve Bayes and C4.5 algorithms.

1) Data Cleaning

Data cleaning detects and repairs datasets that have missing values, noise, and other imperfections; it identifies data that are incomplete, incorrect, or noisy, so that they can be replaced, modified, or deleted. This process is quite important when modeling machine learning algorithms because it can prevent duplicate data, missing values, ambiguous data, and naming conflicts. Focus areas in data cleaning include missing values, outliers, inconsistent codes, schema integration, and duplicates [23]. One frequently used data cleaning technique is handling missing data. According to Twisk (2002), a method that can handle the case of missing data is to replace missing values [24]. The working principle of this method is to detect each instance that has empty data.
It then takes the average value of the attribute with missing data and fills that average into the empty entries, providing a substitute value that is expected to increase accuracy in the subsequent modeling. The data cleaning applied in this study removes attributes with very high missing-value rates: the "payer code" attribute (52 % missing), which is likely to have no correlation with this study, and the "weight" (97 % missing) and "medical specialty" (53 % missing) attributes, whose missing values make processing ineffective. Apart from these three attributes, attributes with missing values are handled with the replace-with-value method, with the results shown in Table 3.

Table 1. List of attributes in the dataset

Attribute name | Data type | Attribute description
encounter id | numerical | visit number used as ID.
patient number | numerical | number of the patient.
race | nominal | values: Caucasian, Asian, African American, Hispanic, and others.
age | nominal | grouped in 10-year intervals (0 to 10, 10 to 20, ..., 90 to 100).
gender | nominal | values: male, female, and unknown.
weight | nominal | weight in pounds.
admission type | nominal | values: emergency, urgent, elective, newborn, and not available.
discharge disposition | nominal | values: discharged to home, expired, not available.
admission source | nominal | values: physician referral, emergency room, and transfer from a hospital.
time in hospital | numerical | duration of the patient's stay, from enrollment to discharge from the hospital.
payer code | nominal | payment code.
medical specialty | nominal | special handling such as cardiology, internal medicine, etc.
number of lab procedures | numerical | number of lab tests carried out in one visit.
number of procedures | numerical | number of procedures in one visit.
number of medications | numerical | number of medicines given to the patient in one visit.
number of outpatient visits | numerical | number of outpatient visits during the treatment process.
number of emergency visits | numerical | number of emergency visits during the treatment phase.
number of inpatient visits | numerical | number of inpatient visits during the care stage.
diagnosis 1 | nominal | main diagnosis; 848 different values.
diagnosis 2 | nominal | second diagnosis; 923 different values.
diagnosis 3 | nominal | additional diagnosis supporting the second diagnosis; 954 different values.
number of diagnoses | numerical | number of diagnoses entered into the system.
glucose serum test result | nominal | indicates the range of results; values: >200, >300, normal, and none.
a1c test result | nominal | range indication of the HbA1c test: ">8" if the result is more than 8 %, ">7" if more than 7 % but less than 8 %, "norm" if less than 7 %, and "none" if no test was taken.
change of medication | nominal | indicates whether there was a change in treatment (either the dose or the drug used); values: no (no change) or change.
diabetes medication | nominal | indicates whether another diabetes treatment was prescribed; values: yes and no.
24 features for medications | nominal | information about changes in medication dosage during treatment: "up" indicates an increased dose, "down" a lowered dose, "steady" an unchanged dose; 24 types of drugs such as metformin, repaglinide, nateglinide, chlorpropamide, and others.
readmitted | nominal | ">30" for patients readmitted after more than 30 days, "<30" for patients readmitted within less than 30 days, and "no" for those not readmitted.
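The cleaning steps described before Table 1 can be sketched as follows; the tiny frame, the "?" missing-value marker (as used in the UCI release), and the 50 % drop threshold are illustrative rather than the authors' exact procedure:

```python
import pandas as pd

# Hypothetical fragment with "?" as the missing-value marker,
# as in the UCI release of this dataset.
df = pd.DataFrame({
    "race":             ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight":           ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [3, 5, None, 4],
})
df = df.replace("?", pd.NA)

# drop attributes whose missing rate is extreme (e.g. "weight" at ~97 %)
df = df.loc[:, df.isna().mean() <= 0.5]

# replace remaining missing values: mean for numeric, mode for nominal
for col in df.columns:
    fill = df[col].mean() if pd.api.types.is_numeric_dtype(df[col]) else df[col].mode()[0]
    df[col] = df[col].fillna(fill)
```

Here "weight" is dropped for its high missing rate, while "race" and "time_in_hospital" keep all rows with substituted values, mirroring the replace-missing-values idea described above.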
2) SMOTE

Addressing data imbalance requires attention to the unbalanced data distribution of each class. SMOTE is a preprocessing method for supervised learning that overcomes class imbalance [15]; in this case it is used to oversample minority classes so that the data in each class are balanced. The label classes in this dataset show the data imbalance presented in Table 4. The second scenario in this study, following the study of Felix Tamin (2017), eliminates the class label "NO", which is assumed to be the same as the class label "<30" because the label "NO" has no history of readmission [13]. The elimination of the class label "NO" is also based on the fact that diabetes cannot be cured [25]; this makes the class label "NO" irrelevant because, once a person has diabetes, they return to the hospital at certain intervals to control their blood sugar level. When a person has diabetes, the cure that medical personnel can attempt is to control the patient's blood sugar so that it remains at a normal level. The comparison of the data before and after preprocessing with two class labels can be found in Table 5.

Table 2. Experimental scenarios

Scenario | Preprocessing | Label
1 | no SMOTE & no feature selection | 3 classes
2 | SMOTE | 3 classes
3 | feature selection | 3 classes
4 | SMOTE + feature selection | 3 classes
5 | no SMOTE & no feature selection | 2 classes
6 | SMOTE | 2 classes
7 | feature selection | 2 classes
8 | SMOTE + feature selection | 2 classes

Table 3. Attributes with missing values

Attribute name | Data type | % missing values
race | nominal | 2 %
diagnosis 3 | nominal | 1 %

Table 4.
Comparison of SMOTE data distribution for the 3-class dataset

Label class | Before SMOTE (total, percentage) | After SMOTE (total, percentage)
NO | 9542 (56 %) | 9542 (34 %)
>30 | 5800 (34 %) | 9570 (34 %)
<30 | 1676 (10 %) | 9218 (32 %)
Total | 17018 | 28330

Table 5. Comparison of SMOTE data distribution for the 2-class dataset

Label class | Before SMOTE (total, percentage) | After SMOTE (total, percentage)
>30 | 5800 (77 %) | 5800 (50 %)
<30 | 1676 (23 %) | 5866 (50 %)
Total | 7476 | 11666

3) Feature Selection

Optimizing the performance of a classification algorithm model by feature selection is an important part of the process. Feature selection can be based on a large reduction of the feature space, for example by eliminating less relevant attributes, and using the right feature selection algorithm can improve the performance of the algorithm. Feature selection methods can be divided into filters and wrappers. Examples of filter types are information gain (IG), chi-square, and the log likelihood ratio; examples of wrapper types are forward selection, wrapper subset evaluation, and backward elimination. The precision obtained using wrappers is higher than with the filter method, but this result is achieved at a high degree of complexity, which can cause problems [26]. One wrapper method that can be used for feature selection is wrapper subset evaluation, which evaluates a set of attributes using the learning scheme and estimates the accuracy of the learning scheme for each attribute set using cross-validation [27]. This study uses wrapper subset evaluation with the greedy stepwise method for selecting features in several data processing scenarios. In the data scenarios with three label classes, feature selection is applied in scenarios 3 and 4. The number of attributes used before feature selection is 47.
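A minimal sketch of the two preprocessing steps compared in the scenarios above: the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) and greedy stepwise forward selection wrapped around a classifier. The synthetic data, seeds, and parameters are illustrative; this simplification stands in for, and does not reimplement, the actual SMOTE and wrapper subset evaluation tools used in the study.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def smote(X_min, n_new, k=2, seed=0):
    """Synthesize n_new minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])      # a nearest neighbour, not i itself
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.vstack(out)

def greedy_forward_selection(X, y, estimator, max_features):
    """Wrapper-style greedy stepwise search: keep adding the feature that
    most improves the cross-validated accuracy of the estimator."""
    selected, best = [], 0.0
    while len(selected) < max_features:
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=3).mean()
                  for f in range(X.shape[1]) if f not in selected}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break                                    # no remaining feature helps
        selected.append(f_best)
        best = scores[f_best]
    return selected, best

# illustrative imbalanced data: 40 majority vs 8 minority samples
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, size=(40, 3))
X_min = rng.normal(3.0, 1.0, size=(8, 3))
X = np.vstack([X_maj, X_min, smote(X_min, n_new=32)])   # balance to 40 vs 40
y = np.array([0] * 40 + [1] * 40)
features, cv_acc = greedy_forward_selection(X, y, GaussianNB(), max_features=2)
```

Each synthetic sample lies on the segment between two minority points, so the minority region is filled in rather than merely duplicated, which is the key difference from naive oversampling.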
With feature selection for the naïve Bayes algorithm, only 18 attributes are used in scenario 3 and 18 attributes in scenario 4; for the C4.5 algorithm in scenarios 3 and 4, 7 attributes are used after feature selection. In the scenarios with two label classes, feature selection is applied in scenarios 7 and 8: the naïve Bayes feature selection test uses 25 attributes, and the C4.5 algorithm uses 9 attributes.

C. Classification

Classification is the process of finding a model that can distinguish data classes based on rules in order to predict the class of data whose label is unknown. Classification is also a field of research in information retrieval that develops methods to assign data automatically to one or more previously known categories based on the contents of the data. Classification aims to group unstructured data into groups that describe the contents of the dataset [28]. It is useful for finding, from training data, models that assign records to appropriate categories or classes; the model is then used to classify records in the testing data whose classes are not previously known. Classification can also support decisions by predicting a case based on the classification results obtained [29]. The data classification in this study tests two classification algorithms, naïve Bayes and decision tree C4.5, in classifying readmission of diabetes patients.

1) Naïve Bayes

The naïve Bayes algorithm is a simple classification method that calculates probabilities from the frequencies of value combinations in a given dataset [30]. The naïve Bayes algorithm assumes that all attributes are independent given the value of the class variable. It predicts future probabilities based on prior experience, and is therefore known for applying Bayes' theorem.
The defining feature of naïve Bayes is its very "naïve" assumption of independence between conditions or events. The algorithm is popular in machine learning applications because its simplicity lets each attribute contribute to the final decision; this simplicity also translates into computational efficiency, which makes naïve Bayes attractive and suitable for many domains [31]. The algorithm performs pattern recognition and admits several approaches for obtaining the desired results [32]. Naïve Bayes often works very well compared to other classifier models: Xhemali et al. (2009) showed that it achieved a better level of accuracy than several other classifiers [31]. Using naïve Bayes has several important benefits, one of which is that it requires only a relatively small amount of training data to estimate the parameters needed for classification; because the variables are assumed independent, only the variance of each variable within a class is needed, not the whole covariance matrix [33]. The stages of the naïve Bayes algorithm are quite simple:
1. Count the total number of classes.
2. Calculate the probability of each class.
3. Apply the Bayes formula (1), multiplying over all attribute variables.
4. Compare the results for each class.

Bayes' theorem is expressed as in (1):

P(c|x) = P(x|c) P(c) / P(x)  (1)

where x is data with an unknown class, c is the hypothesis that the data belong to a specific class, P(c|x) is the probability of the hypothesis given the observed data, P(c) is the prior probability of the hypothesis, P(x|c) is the probability of the data under the hypothesis, and P(x) is the prior probability of the data x.
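The four steps above can be sketched directly for categorical data. The toy training rows below are invented for illustration, and Laplace-style smoothing is added as an assumption (to avoid zero probabilities); since P(x) is the same for every class, it cancels when comparing classes.

```python
from collections import Counter, defaultdict

# Hypothetical categorical training data: each row is (attribute values, class).
train = [
    (("long_stay", "emergency"), ">30"),
    (("long_stay", "referral"),  ">30"),
    (("short_stay", "referral"), "<30"),
    (("short_stay", "emergency"), "<30"),
    (("long_stay", "emergency"), ">30"),
]

# Steps 1-2: count classes and compute the prior probabilities P(c).
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Conditional frequency tables for P(x_i | c).
cond = defaultdict(Counter)
for features, label in train:
    for i, v in enumerate(features):
        cond[(label, i)][v] += 1

def likelihood(value, label, i, alpha=1.0):
    # Laplace-style smoothing (an added assumption) avoids zero probabilities.
    counts = cond[(label, i)]
    return (counts[value] + alpha) / (sum(counts.values()) + alpha * len(counts))

# Steps 3-4: multiply the prior by all conditional probabilities per (1),
# then pick the class with the largest score; P(x) cancels in the comparison.
def predict(features):
    scores = {}
    for c in priors:
        p = priors[c]
        for i, v in enumerate(features):
            p *= likelihood(v, c, i)
        scores[c] = p
    return max(scores, key=scores.get)

print(predict(("long_stay", "emergency")))
```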
2) Decision tree C4.5
The C4.5 decision tree algorithm has the advantage of being able to process numerical (continuous) and categorical (discrete) data, handle missing values, and produce rules that are easy to interpret [14]. C4.5 is a development of the ID3 algorithm; the two work on a similar principle, but several differences give C4.5 better results than ID3. C4.5 is able to handle attributes with discrete or continuous types. Attribute selection in this family of algorithms uses an entropy-based measure, known as information gain, as a heuristic for selecting the attribute that best separates the examples into classes. All attributes are treated as discrete-valued categories, so attributes with continuous values must be discretized. Attribute discretization groups values according to predetermined criteria, which simplifies the problem and improves learning accuracy [34]. Attribute selection in C4.5 uses the gain ratio in place of the plain information gain value. A good attribute is one that yields the smallest decision tree, or equivalently one that best separates objects according to their class. Heuristically, the chosen attribute is the one that produces the "purest" nodes; purity is expressed as a level of impurity, which can be calculated using the concept of entropy, a measure of the impurity of a collection of objects [35]. Following Harryanto and Hansun (2017) [36], classification with the C4.5 algorithm proceeds in four stages:
1. Select an attribute as the root.
2. Create a branch for each value.
3. Divide the cases among the branches.
4. Repeat the process in each branch until all cases in the branch have the same class.
Calculation starts by counting the attributes and determining which attribute will be used as the root of the decision tree. Subsequently, entropy and gain are calculated to form the leaves of the tree. Once the calculations are complete, a decision tree can be built from the computed gain values: the attribute with the highest gain value gets the highest priority and the highest position in the decision tree. The required formulas are as follows.

a) Entropy. Equation (2) gives the entropy of a dataset:

Entropy(S) = − Σ(i=1..k) p_i log2(p_i)  (2)

where S is the dataset, k is the number of partitions of S, and p_i is the probability of partition i, obtained by dividing its number of cases by the total number of cases.

b) Gain ratio. The gain ratio is found using (3):

GainRatio(a) = Gain(a) / SplitInfo(a)  (3)

where a is the attribute, Gain(a) is the information gain of attribute a, and SplitInfo(a) is the split information of attribute a.

c) SplitInfo. The SplitInfo term in (3) is calculated using (4):

SplitInfo(S, A) = − Σ(i) (|S_i| / |S|) log2(|S_i| / |S|)  (4)

where S is the sample space used for training, A is the attribute, and |S_i| is the number of samples having value i of attribute A.

d) Gain. Finally, the information gain is obtained using (5):

Gain(A) = Entropy(S) − Σ(i) (|S_i| / |S|) × Entropy(S_i)  (5)

where S is the set of cases, A is the attribute whose partitions are indexed by i, |S_i| is the number of samples with value i, |S| is the number of all data samples, and Entropy(S_i) is the entropy of the samples that have value i.

D. Output and evaluation
The evaluation phase of the classification results in this study uses the confusion matrix, an evaluation method in the form of a matrix table that shows the performance of the classification model being tested. The confusion matrix reports, as counts, the amount of data predicted correctly and the amount predicted incorrectly.
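Equations (2) to (5) can be sketched as plain functions. The toy rows and labels below are invented so that attribute 0 separates the classes perfectly while attribute 1 carries no information, which is exactly the contrast the gain ratio is meant to expose.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class partitions, as in (2).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    # Gain(A) = Entropy(S) - sum(|S_i|/|S| * Entropy(S_i)), as in (5).
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder

def split_info(rows, attr_index):
    # SplitInfo(S, A) = -sum(|S_i|/|S| * log2(|S_i|/|S|)), as in (4).
    total = len(rows)
    counts = Counter(row[attr_index] for row in rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(rows, labels, attr_index):
    # GainRatio(a) = Gain(a) / SplitInfo(a), as in (3).
    return info_gain(rows, labels, attr_index) / split_info(rows, attr_index)

# Invented toy data: attribute 0 splits the classes perfectly, attribute 1
# does not, so C4.5 would choose attribute 0 as the root.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [">30", ">30", "<30", "<30"]
print(gain_ratio(rows, labels, 0))  # 1.0: perfect split
print(gain_ratio(rows, labels, 1))  # 0.0: uninformative split
```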
This model is useful for determining the accuracy, precision, and recall of the algorithm being tested. The confusion matrix model for a dataset with two label classes is shown in Table 6. From the confusion matrix, the accuracy, precision, and recall of the algorithm are calculated with the following formulas:

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100 %  (6)
Precision = TP / (TP + FP) × 100 %  (7)
Recall = TP / (TP + FN) × 100 %  (8)

Based on the confusion matrix evaluation, the best classification result is the one with the highest accuracy, precision, and recall. Accuracy measures the overall effectiveness of the classification method; precision measures the agreement between the information requested by the user and the answer given by the system; and recall measures the system's success rate in retrieving the relevant information. Datasets do not always have only two label classes, which changes how positive and negative classes are determined. For data with more than two label classes, the multiclass confusion matrix evaluation shown in Table 7 can be used; its evaluation metric formulas differ from those of the binary confusion matrix.
With the multiclass confusion matrix, the accuracy, precision, and recall of algorithm performance are calculated as follows:

Accuracy = (Σ(i=1..l) TP_i / N) × 100 %  (9)
Precision = (Σ(i=1..l) TP_i / Σ(i=1..l) (TP_i + FP_i)) × 100 %  (10)
Recall = (Σ(i=1..l) TP_i / Σ(i=1..l) (TP_i + FN_i)) × 100 %  (11)

where TP_i (true positives) is the amount of class-i data correctly classified by the system, TN_i (true negatives) is the amount of data outside class i correctly classified as not belonging to class i, FN_i (false negatives) is the amount of class-i data incorrectly classified as another class, FP_i (false positives) is the amount of data from other classes incorrectly classified as class i, l is the number of classes, and N is the total amount of data.

III. Results and Discussion

A. Research results
This research obtains its results from the final evaluation stage. The evaluation compares the performance of the naïve Bayes and decision tree C4.5 classification algorithms under several preprocessing combinations, so that the best combination of the SMOTE and feature selection preprocessing methods can be found. It also determines the better of the two algorithms, based on accuracy, for classifying hospital readmissions of diabetic patients. A comparison of accuracies is given in Table 8. The results in Table 8 show that the best accuracy occurs in scenario 8, which combines SMOTE and feature selection as preprocessing and classifies the two label classes. The decision tree C4.5 algorithm is also the better algorithm for classifying hospital readmissions of diabetic patients, with an accuracy of 82.74 %. The confusion matrices for each scenario are given in Table 9; for the 3-class scenarios, the ">30" class is used as the positive class.
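As a worked example of (6) to (8), the cells below are the scenario 8 C4.5 counts from Table 9. Treating the paper's reported precision and recall as support-weighted averages over both classes is this sketch's assumption, but under that reading the computation reproduces the published 82.74 %, 87.1 %, and 82.7 % figures.

```python
# Scenario 8 C4.5 confusion-matrix cells from Table 9.
tp, tn, fp, fn = 5791, 3861, 2005, 9
total = tp + tn + fp + fn

accuracy = (tp + tn) / total * 100            # (6)
precision_pos = tp / (tp + fp) * 100          # (7), positive class only
recall_pos = tp / (tp + fn) * 100             # (8), positive class only

# Per-class metrics for the negative class (roles of the cells swapped).
precision_neg = tn / (tn + fn) * 100
recall_neg = tn / (tn + fp) * 100

# Support-weighted averages (weights = number of true members of each class);
# assumed to be how the paper's headline precision/recall were averaged.
support_pos, support_neg = tp + fn, tn + fp
precision_w = (precision_pos * support_pos + precision_neg * support_neg) / total
recall_w = (recall_pos * support_pos + recall_neg * support_neg) / total

print(f"accuracy  = {accuracy:.2f} %")    # 82.74 %
print(f"precision = {precision_w:.1f} %")  # 87.1 %
print(f"recall    = {recall_w:.1f} %")     # 82.7 %
```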
The confusion matrix for the best result, scenario 8 with C4.5, is shown in Table 9; it details the amount of data classified correctly and the amount classified incorrectly. From the confusion matrix, the evaluation metrics can also be calculated using (6) to (8) for binary classification and (9) to (11) for multiclass classification. The evaluation metrics for scenario 8 with C4.5 in Table 9, the best result, show an accuracy of 82.74 %, a precision of 87.1 %, and a recall of 82.7 %. In more detail, the results of each trial are compared on these evaluation metrics; a comparison of the performance of all classification trial scenarios is shown in Figure 1 to Figure 3. Based on the results in Figure 1, the accuracies of the naïve Bayes algorithm and decision tree C4.5 differ only slightly in each scenario, but the accuracy of the C4.5 algorithm is consistently better than that of the naïve Bayes algorithm.

Table 6. Confusion matrix

                        Prediction class
True class    +                       −
+             true positives (TP)     false negatives (FN)
−             false positives (FP)    true negatives (TN)

Table 7. Multiclass confusion matrix

                  Prediction class
True class    A          B          C
A             true A     false A    false A
B             false B    true B     false B
C             false C    false C    true C

Table 8. Comparison of experimental results

Scenario   Preprocessing                    Label      Naïve Bayes accuracy   C4.5 accuracy
1          No SMOTE & feature selection     3 classes  59.47 %                59.68 %
2          SMOTE                            3 classes  59.85 %                62.30 %
3          Feature selection                3 classes  59.28 %                60.85 %
4          SMOTE + feature selection        3 classes  60.22 %                61.32 %
5          No SMOTE & feature selection     2 classes  75.61 %                77.58 %
6          SMOTE                            2 classes  77.69 %                78.88 %
7          Feature selection                2 classes  76.39 %                77.58 %
8          SMOTE + feature selection        2 classes  79.39 %                82.74 %
Significant differences in accuracy are instead found between the preprocessing setups of the scenarios. Accuracy differs markedly between scenario 4 and scenario 5: scenarios 1 to 4 use three label classes, which increases the data complexity and lowers the accuracy of both naïve Bayes and C4.5, whereas scenarios 5 to 8 use two label classes, and the lower complexity makes it easier for the algorithms to classify the data. Based on the results in Figure 2, the lowest precision, 55.4 %, is obtained by the naïve Bayes classification in scenario 3, and the highest precision, 87.1 %, by the C4.5 classification in scenario 8. Precision reflects the agreement between the information requested and the results returned, so the C4.5 classification in scenario 8 matches predictions to true classes best among all scenarios. Based on the results in Figure 3, the best recall is also obtained in the scenario 8 trial using the C4.5 algorithm, at 82.7 %. Recall measures how much of the relevant data the system recovers, and the C4.5 classification in scenario 8 recovers the desired data better than the other scenarios.

Fig. 1. Accuracy comparison of naïve Bayes and C4.5 across scenarios 1–8 (values as listed in Table 8)

Table 9.
Confusion matrices of all scenarios

Scenario   Algorithm     TP     TN     FP     FN
1          Naïve Bayes   1655   8466   5454   1443
1          C4.5          2100   8057   5376   1485
2          Naïve Bayes   2225   14730  6745   4630
2          C4.5          4336   13315  6032   4647
3          Naïve Bayes   1319   8769   5838   1092
3          C4.5          2110   8245   5307   1356
4          Naïve Bayes   2676   14385  5639   5630
4          C4.5          2276   15095  6052   4907
5          Naïve Bayes   5424   229    1447   376
5          C4.5          5800   0      1676   0
6          Naïve Bayes   4538   4526   1340   1262
6          C4.5          5201   4001   1865   599
7          Naïve Bayes   5535   176    1500   265
7          C4.5          5800   0      1676   0
8          Naïve Bayes   4774   4488   1378   1026
8          C4.5          5791   3861   2005   9

B. Discussion
The comparison of SMOTE and feature selection shows that combining the two preprocessing methods produces better performance than applying either method on its own. Table 8 shows that applying SMOTE alone gives better results than applying feature selection alone; feature selection applied to the diabetic patient data tends not to increase accuracy significantly because the label classes in the dataset remain imbalanced. This indicates that data imbalance has a negative effect on classification performance in the case of the diabetes patient data. However, feature selection combined with SMOTE can produce excellent accuracy values. SMOTE overcomes the imbalance by adding new data to the minority class based on the values of its nearest neighbors, so that the synthetic data have properties similar to the minority class. Enough new data are added at the SMOTE stage to match the majority class, so the label classes become balanced. Once the label classes are balanced, the feature selection step eliminates the attributes that are less relevant.
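The nearest-neighbor interpolation idea behind SMOTE can be sketched as follows. This simplified, brute-force version is illustrative only (the study does not specify its implementation), and the toy minority points are invented; each synthetic sample lies on the segment between a real minority point and one of its k nearest minority neighbors.

```python
import random

def smote(minority, n_new, k=5, rng=random.Random(0)):
    # Simplified SMOTE sketch: interpolate between a random minority point and
    # one of its k nearest minority neighbors (brute-force distance search).
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical minority class (e.g. the "<30" label). Balancing 1676 minority
# against 5866 majority samples would call for ~4190 synthetic points; here a
# tiny demo is generated instead.
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_points = smote(minority, n_new=3, k=2)
print(len(new_points))  # 3 synthetic samples, each between two real neighbors
```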
Thus, the imbalanced distribution of the data no longer degrades the performance of the algorithm or reduces its accuracy. In the case of the diabetes patient data, the feature selection method is very useful because the initial number of attributes is 47; feature selection reduces complexity by eliminating some irrelevant attributes. Feature selection also helps anticipate the curse of dimensionality, in which classification accuracy begins to decrease beyond a certain point when the number of attributes is too large while the number of data samples is limited. From the experimental results in the tables above, the decision tree C4.5 algorithm performs better than the naïve Bayes algorithm. The best results are found in scenario 8, whose preprocessing combines SMOTE and feature selection. In the scenario 8 trial using the C4.5 algorithm, the results are the best of all the scenario trials, with an accuracy of 82.74 %, a precision of 87.1 %, and a recall of 82.7 %. The best result of scenario 8 shows that, after applying SMOTE and feature selection, this scenario uses 9 of the 47 attributes. The attributes selected for building the C4.5 model in scenario 8 are admission source, time in hospital, number of emergency visits, glucose serum test result, repaglinide, glipizide, glyburide, rosiglitazone, and readmitted.

Fig. 2. Precision comparison of naïve Bayes and C4.5 across scenarios 1–8 (naïve Bayes: 55.5, 58.6, 55.4, 58.4, 69.7, 77.7, 70.0, 79.5 %; C4.5: 59.7, 62.1, 56.6, 59.1, 77.9, 80.3, 77.6, 87.1 %)

Fig. 3. Recall comparison of naïve Bayes and C4.5 across scenarios 1–8 (naïve Bayes: 59.5, 59.8, 59.3, 60.2, 75.6, 77.7, 76.4, 79.4 %; C4.5: 59.7, 62.3, 60.8, 61.3, 77.6, 78.9, 77.6, 82.7 %)
The attributes selected by feature selection can build the best decision tree because they carry high gain values and exclude attributes that would introduce outliers. The attribute with the highest gain value is "time in hospital", a numerical attribute, which is therefore used as the root of the C4.5 decision tree, with the other attributes as branches of the corresponding values. The attribute "time in hospital" is considered relevant in this study because the total length of a patient's stay provides substantial information about whether a diabetic patient will need hospital readmission. The attribute "admission source" is also considered relevant for classifying readmissions of diabetic patients because it identifies where each patient was admitted from. The drug dosage attributes with good data distributions in this dataset are repaglinide, glipizide, glyburide, and rosiglitazone, which helps produce decision trees with high accuracy.

IV. Conclusion
Based on the results of this study, it can be concluded that applying several preprocessing methods can improve the performance of the tested algorithms and thereby maximize the evaluation scores. Combining several preprocessing methods is also recommended to improve accuracy and compensate for weaknesses found in the data being tested. The comparison between applying the preprocessing methods and omitting them shows a very significant difference: with preprocessing, the results have better accuracy. This study also shows better results than previous studies using the naïve Bayes algorithm, and than studies using the decision tree C4.5 algorithm.

Acknowledgement
This research was supported by Universitas Negeri Malang and Al Quds Open University.
We thank our colleagues from both institutions who provided insight and expertise that greatly assisted the research, although they may not agree with all of the interpretations and conclusions of this paper. We thank Dr. Aji P. Wibawa for his suggestions on the methodology and for comments that greatly improved the manuscript.

Declarations

A. Author contribution
All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

B. Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

C. Conflict of interest
The authors declare no conflict of interest.

D. Additional information
No additional information is available for this paper.

References
[1] G. E. Umpierrez, S. D. Isaacs, N. Bazargan, X. You, L. M. Thaler, and A. E. Kitabchi, "Hyperglycemia: an independent marker of in-hospital mortality in patients with undiagnosed diabetes," J Clin Endocrinol Metab, vol. 87, no. 3, pp. 978–982, Mar. 2002.
[2] M. Dewi, "Resistensi insulin terkait obesitas: mekanisme endokrin dan intrinsik sel," Jurnal Gizi dan Pangan, vol. 2, no. 2, pp. 49–54, Jul. 2007.
[3] H. Sonmez, V. Kambo, D. Avtanski, L. Lutsky, and L. Poretsky, "The readmission rates in patients with versus those without diabetes mellitus at an urban teaching hospital," Journal of Diabetes and Its Complications, Oct. 2017.
[4] R. N. Fatimah, "Diabetes melitus tipe 2," Jurnal Majority, vol. 4, no. 5, Jan. 2015.
[5] J.-O. Jeppsson et al., "Approved IFCC reference method for the measurement of HbA1c in human blood," Clinical Chemistry and Laboratory Medicine, vol. 40, no. 1, pp. 78–89, 2005.
[6] H. M. Krumholz et al., "Readmission after hospitalization for congestive heart failure among Medicare beneficiaries," Arch Intern Med, vol. 157, no. 1, pp.
99–104, Jan. 1997.
[7] D. Kansagara et al., "Risk prediction models for hospital readmission: a systematic review," JAMA, vol. 306, no. 15, pp. 1688–1698, Oct. 2011.
[8] M. Yusa, E. Utami, and E. T. Luthfi, "Analisis komparatif evaluasi performa algoritma klasifikasi pada readmisi pasien diabetes," Jurnal Buana Informatika, vol. 7, no. 4, Oct. 2016.
[9] J. Ren, S. D. Lee, X. Chen, B. Kao, R. Cheng, and D. Cheung, "Naive Bayes classification of uncertain data," in 2009 Ninth IEEE International Conference on Data Mining, 2009, pp. 944–949.
[10] B. Strack et al., "Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records," BioMed Research International, 2014. [Online]. Available: https://www.hindawi.com/journals/bmri/2014/781670/. [Accessed: 22-Dec-2019].
[11] D. W. Hosmer, "The multiple logistic regression model," 2013.
[12] J. E. Kolassa, "Inference in the presence of likelihood monotonicity for polytomous and logistic regression," Advances in Pure Mathematics, vol. 6, no. 5, pp. 331–341, Mar. 2016.
[13] F. Tamin and N. M. S. Iswari, "Implementation of C4.5 algorithm to determine hospital readmission rate of diabetes patient," in 2017 4th International Conference on New Media Studies (CONMEDIA), 2017, pp. 15–18.
[14] B. Hssina, A. Merbouha, H. Ezzikouri, and M. Erritali, "A comparative study of decision tree ID3 and C4.5," International Journal of Advanced Computer Science and Applications, vol. 4, no. 2, 2014.
[15] S. Maldonado, J. López, and C. Vairetti, "An alternative SMOTE oversampling strategy for high-dimensional datasets," Applied Soft Computing, vol. 76, pp. 380–389, Mar. 2019.
[16] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1, pp. 273–324, Dec. 1997.
[17] J. Snoek, H. Larochelle, and R. P.
Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 2, USA, 2012, pp. 2951–2959.
[18] M. D. Jaelani, A. P. Wibawa, and U. Pujianto, "Technology acceptance model of student ability and tendency classification system," Bulletin of Social Informatics Theory and Application, vol. 2, no. 2, pp. 47–57, Dec. 2018.
[19] H. A. Rosyid, M. Palmerlee, and K. Chen, "Deploying learning materials to game content for serious education game development: a case study," Entertainment Computing, vol. 26, pp. 1–9, May 2018.
[20] F. Pedregosa et al., "Scikit-learn: machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[21] J. Guedes and N. Kikuchi, "Preprocessing and postprocessing for materials based on the homogenization method with adaptive finite element methods," Computer Methods in Applied Mechanics and Engineering, vol. 83, no. 2, pp. 143–198, Oct. 1990.
[22] R. Schmieder and R. Edwards, "Quality control and preprocessing of metagenomic datasets," Bioinformatics, vol. 27, no. 6, pp. 863–864, Mar. 2011.
[23] A. Riezka, Analisis dan implementasi data-cleaning dengan menggunakan metode multi-pass neighborhood (MPN). Universitas Telkom, 2011.
[24] J. Twisk and W. de Vente, "Attrition in longitudinal studies: how to deal with missing data," Journal of Clinical Epidemiology, vol. 55, no. 4, pp. 329–337, Apr. 2002.
[25] J. B. Buse et al., "How do we define cure of diabetes?," Diabetes Care, vol. 32, no. 11, pp. 2133–2135, Nov. 2009.
[26] S. Visa, B. Ramsay, A. Ralescu, and E. VanDerKnaap, "Confusion matrix-based feature selection," in Proceedings of the 22nd Midwest Artificial Intelligence and Cognitive Science Conference, MAICS 2011, USA, 2011, pp. 120–127.
[27] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1, pp. 273–324, Dec. 1997.
[28] A.
Indriani, "Klasifikasi data forum dengan menggunakan metode Naive Bayes Classifier," Seminar Nasional Aplikasi Teknologi Informasi (SNATI), vol. 1, no. 1, Jun. 2014.
[29] Y. Trisaputra, Indriyani, S. M. Biru, and M. Ervan, "Klasifikasi profil siswa SMA/SMK yang masuk PTN (perguruan tinggi negeri) dengan K-Nearest Neighbor," ResearchGate, 2015. [Online]. Available: https://www.researchgate.net/publication/305917029_klasifikasi_profil_siswa_smasmk_yang_masuk_ptn_perguruan_tinggi_negeri_dengan_k-nearest_neighbor. [Accessed: 22-Dec-2019].
[30] T. R. Patil and S. S. Sherekar, "Performance analysis of Naive Bayes and J48 classification algorithm for data classification," International Journal of Computer Science and Applications, vol. 6, no. 2, pp. 256–261, 2013.
[31] D. Xhemali, C. J. Hinde, and R. G. Stone, "Naïve Bayes vs. decision trees vs. neural networks in the classification of training web pages," International Journal of Computer Science Issues (IJCSI), vol. 4, no. 1, pp. 16–23, 2009.
[32] M. Ridwan, H. Suyono, and M. Sarosa, "Penerapan data mining untuk evaluasi kinerja akademik mahasiswa menggunakan algoritma Naive Bayes Classifier," Jurnal EECCIS, vol. 7, no. 1, pp. 59–64, 2013.
[33] D. L. Naik and R. Kiran, "Naïve Bayes classifier, multivariate linear regression and experimental testing for classification and characterization of wheat straw based on mechanical properties," Industrial Crops and Products, vol. 112, pp. 434–448, Feb. 2018.
[34] R. Al-Otaibi, R. B. C. Prudêncio, M. Kull, and P. A. Flach, "Versatile decision trees for learning over multiple contexts," in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML PKDD) 2015, Portugal, 2015.
[35] Chen Jin, Luo De-Lin, and Mu Fen-Xiang, "An improved ID3 decision tree algorithm," in 2009 4th International Conference on Computer Science Education, 2009, pp.
127–130.
[36] F. F. Harryanto and S. Hansun, "Penerapan algoritma C4.5 untuk memprediksi penerimaan calon pegawai baru di PT WISE," JATISI (Jurnal Teknik Informatika dan Sistem Informasi), vol. 3, no. 2, pp. 95–103, 2017.

Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 6, No 2, October 2023, pp. 231–248, https://doi.org/10.17977/um018v6i22023p231-248
©2023 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)

Comparison of Machine Learning Algorithms for Species Family Classification Using DNA Barcode

Lala Septem Riza a,1,*, M Ammar Fadhlur Rahman a,2, Yudi Prasetyo a,3, Muhammad Iqbal Zain a,4, Herbert Siregar a,5, Topik Hidayat b,6, Khyrina Airin Fariza Abu Samah c,7, Miftahurrahma Rosyda d,8

a Department of Computer Science Education, Universitas Pendidikan Indonesia, Jl. Dr. Setiabudi No. 229, Bandung 40154, Indonesia
b Department of Biology Education, Universitas Pendidikan Indonesia, Jl. Dr. Setiabudi No. 229, Bandung 40154, Indonesia
c Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Melaka, 110 Off Jalan Hang Tuah, Malaysia
d Universitas Ahmad Dahlan, Jl. Kapas No. 9, Yogyakarta 55166, Indonesia
1 lala.s.riza@upi.edu*; 2 mafr@student.upi.edu; 3 yudiprasetyo@upi.edu; 4 iqbalzain99@upi.edu; 5 herbert@upi.edu; 6 topikhidayat@upi.edu; 7 khyrina783@uitm.edu.my; 8 miftahurrahma.rosyda@tif.uad.ac.id
* corresponding author

I. Introduction
The development of living-specimen processing technology [1] in recent decades has produced a large amount of biological data, including deoxyribonucleic acid (DNA) sequence data. The collection of DNA sequences starts with taking samples from living organisms. The sample is then processed through stages such as extraction, enumeration, and amplification to obtain pieces of DNA.
These DNA fragments are then collected and sequenced to obtain the nucleic-acid symbols (adenine (A), guanine (G), cytosine (C), and thymine (T)) that compose the DNA sequence [2]. The pieces of DNA sequence are then analyzed and restructured into a complete genome, and a part of that genome is selected as a barcode representing the species [3][4]. All these stages are depicted in Figure 1.

Article info
Article history: received 25 October 2023; revised 27 October 2023; accepted 03 November 2023; published online 07 November 2023.
Keywords: machine learning; supervised classification; species classification; DNA barcode; rbcL gene; data analysis; bioinformatics.

Abstract
Classifying plant species within the Liliaceae and Amaryllidaceae families presents inherent challenges due to the complex genetic diversity and overlapping morphological traits among species. This study explores the difficulties in accurate classification by comparing 11 supervised learning algorithms applied to DNA barcode data, aiming to enhance the precision of species family classification in these taxonomically intricate plant families. The ribulose-1,5-bisphosphate carboxylase-oxygenase large sub-unit (rbcL) gene, selected as a DNA barcode locus for plants, is used to represent species within the Amaryllidaceae and Liliaceae families. The experimental results demonstrate that nearly all tested models achieve accurate species classification into the appropriate families, with an accuracy rate exceeding 97 %, except for the naïve Bayes model. Regarding computational time, the random forest model requires significantly more time for training than other models. Regarding memory usage, the least squares support vector machine with a polynomial kernel and regularized logistic regression consume more memory than other models. These machine learning models exhibit strong concordance with NCBI's classifications when predicting families using the test dataset, effectively categorizing species into the Amaryllidaceae and Liliaceae families. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

It has long been known that DNA sequences can be used to identify species; nowadays, this activity is better known as DNA barcoding [5][6]. DNA barcoding is a method for identifying unknown specimens by sequencing particular gene regions (loci) that represent species in each kingdom: cytochrome c oxidase subunit I (COI) for animals [7], obtained from the mitochondria in cells; ribulose-1,5-bisphosphate carboxylase-oxygenase large sub-unit (rbcL) and maturase K (matK) for plants [8], obtained from chloroplasts; and the internal transcribed spacer (ITS) for fungi [9], found in the cell nucleus.

Fig. 1. Process of processing living specimens into DNA barcodes

The process of identifying species in DNA barcoding analyzes the similarity of a specimen's barcode to barcodes of species already known in a database. The specimen can be classified as an existing species if its barcode has a high degree of similarity; if no barcode with a high degree of similarity is found, the specimen may be a new species and needs to be verified by a taxonomist. Several approaches are commonly used to classify species from DNA barcodes: tree-based, similarity-based, and character-based [10][11]. The tree-based method classifies a barcode into a species based on its membership in a DNA barcode tree.
The similarity-based method classifies barcodes based on the number of matching characters in the DNA barcode, while the character-based method relies on the presence or absence of specific characters in the DNA barcode. In addition to these three approaches, species classification using DNA barcodes can also be treated as a machine learning problem with supervised learning [12][13][14][15][16].

The Liliaceae family, colloquially called the "lily family", predominantly consists of monocotyledonous plants characterized by notable morphological diversity. Encompassing approximately 16 genera and over 610 species [17], members of this family occur primarily as herbs and shrubs and are predominantly distributed across temperate and subtropical regions [18]. The amphipathic properties inherent to certain compounds within Liliaceae render them effective as surfactants. Beyond their ecological significance, these plants exhibit multifaceted utility: they are esteemed for ornamental purposes and utilized as vegetables, and certain species are acknowledged for their medicinal properties. Given the vast potential inherent to the Liliaceae family, they hold promise for cosmetics and pharmaceutical development [19].

The Amaryllidaceae family, a prominent member of the order Asparagales, is distinguished by its bulbous flowering plants. These plants are celebrated for their visually captivating flowers, which make them popular for ornamental cultivation [20]. From a taxonomic perspective, the Amaryllidaceae family is divided into three subfamilies: Agapanthoideae, Allioideae, and Amaryllidoideae [21]. Historically, these were regarded as distinct families. The term "Amaryllidaceae" is recurrently cited in phytochemical and pharmaceutical literature, particularly in discussions centered on the Amaryllidoideae subfamily [20][22]. The medicinal potential of the Amaryllidaceae family is both historical and contemporary. Tracing back to the classical period, luminaries like Hippocrates and Dioscorides harnessed the therapeutic properties of narcissus oil, particularly for conditions believed to be associated with uterine tumors. In modern traditional medicine, the applications are diverse: for instance, Ammocharis is employed for blood purification and wound treatment, Brunsvigia for respiratory and
tracing back to the classical period, luminaries like hippocrates and dioscorides harnessed the therapeutic properties of narcissus oil, particularly for conditions believed to be associated with uterine tumors. in modern traditional medicine, the applications are diverse: ammocharis is employed for blood purification and wound treatment, brunsvigia for respiratory and hepatic ailments, clivia for snakebites and the facilitation of childbirth, and crinum for a spectrum of conditions ranging from tumors to rheumatism [23].

in previous research, the amaryllidaceae family was classified under the liliaceae family. however, advancements in phylogenetics have led to a taxonomic reorganization. a team of scientists, spearheaded by rolf dahlgren [24], extensively examined monocot characteristics, including numerous microscopic features, culminating in a revised classification. historically, taxonomic experts such as bentham and hooker [25], engler and prantl [26], bessey [27], rendle [28], and hutchinson [29] categorized amaryllidaceae (with an inferior ovary) and liliaceae (with a superior ovary) into distinct families based on differences in ovary position. despite these distinctions, both families exhibited numerous shared characteristics. consequently, cronquist [30] and takhtajan [31] integrated the amaryllidaceae family into liliaceae. further research regarded 'lilies' as a heterogeneous collection of genera and positioned them in families grouped under two orders: asparagales and liliales [32]. the problem in both families is illustrated by the classification of allium albopilosum. allium albopilosum, indigenous to turkestan, is cultivated for its notable utility as a cut flower. while allium species have traditionally been categorized under the liliaceae family due to the presence of superior ovaries in their flowers, there exists a divergence of opinion among botanists.
some propose their reclassification to the amaryllidaceae family, citing the characteristic umbellate inflorescence. conversely, others advocate for a distinct classification, suggesting the establishment of a unique family, alliaceae, to accommodate them [33]. the consortium for the barcode of life [8] advocated the rbcl gene as a barcode for plant taxonomy and phylogenetic analysis. this gene is pivotal for plant species identification, phylogenetics, and relationship studies. the rbcl gene is located in chloroplast dna [8]. several studies have employed the rbcl gene for plant relationship research. for instance, the rbcl gene has been used to elucidate relationships within selaginellaceae [34]. similarly, other research combined the rbcl gene with trnl-f for a phylogenetic study of rhamnaceae [35].

machine learning is a field of study that attempts to extract knowledge from available data using computer programs that learn and improve automatically from experience [36][37]. currently, applications of machine learning can be found in various everyday activities, such as product recommendations in amazon's e-commerce services [38], recommendations on the music streaming platform spotify [39], and recommendations in education assessment [40][41][42]. in bioinformatics, machine learning has been widely used to solve problems in various areas, including genomics, proteomics, systems biology, evolution, microarrays, and text mining [43][44][45]. the application of machine learning in each case handles the different characteristics of the input data. based on the type of feedback from the input data, there are three forms of learning: supervised learning, unsupervised learning, and reinforcement learning [46]. of these three, bioinformatics case studies generally use supervised and unsupervised learning. for example, supervised learning is used in genomics for gene finding [47].
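as a concrete illustration of the supervised-learning framing described above, the sketch below fits a k-nearest-neighbour classifier (one of the algorithm families compared later in this paper) on toy aligned sequences. the sequences and labels are fabricated for illustration; the paper's actual experiments use r packages rather than this python code.

```python
from collections import Counter

def hamming(a: str, b: str) -> int:
    # number of differing positions between two aligned sequences
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, train, k=3):
    """predict a family label by majority vote among the k nearest
    training sequences; `train` is a list of (sequence, family) pairs."""
    neighbours = sorted(train, key=lambda pair: hamming(query, pair[0]))[:k]
    votes = Counter(family for _, family in neighbours)
    return votes.most_common(1)[0][0]

# toy labelled training set (made-up sequences)
train = [
    ("atggca", "amaryllidaceae"),
    ("atggct", "amaryllidaceae"),
    ("ttcgca", "liliaceae"),
    ("ttcgcg", "liliaceae"),
]
```

a query close to the amaryllidaceae examples is voted into that family, and likewise for liliaceae; real barcode classifiers replace the toy strings with aligned rbcl sequences.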
another example is the application of support vector machines (svm) [48] and random forests (rf) [49] for the prediction of phenotypic effects [50]. an example of unsupervised learning in bioinformatics is microarray analysis, where genes are clustered into groups with specific biological meanings [51].

this study compares supervised machine learning algorithms for predicting the family of a species from its dna barcode sequence, implemented in the r programming language. by predicting the family, we can more accurately place the species in the correct family in the taxonomy. the machine learning algorithms used in this research are random ferns, svm linear, svm poly, svm radial, svm radial weights, lssvm poly, naïve bayes, random forest, c5.0, k-nearest neighbours, and regularized logistic regression. the dna barcode sequence employed in this study is derived from a segment of the chloroplast gene specific to the rbcl gene region of each examined species. this research contributes to resolving the existing classification ambiguity between the liliaceae and amaryllidaceae families. it accomplishes this by applying various machine learning methodologies, the results of which are juxtaposed with contemporary, state-of-the-art classification systems from ncbi to yield more definitive insights into the precise familial categorizations.

ii. methods

a. data collection

the data used are dna barcode sequence data obtained from genbank [52] (ncbi.nlm.nih.gov, accessed august 15, 2023). the dataset contains rbcl enzyme sequences from the chloroplast gene of plants in the amaryllidaceae and liliaceae families. information on the number of species, sequences, and file size of each dataset is listed in table 1.

table 1.
descriptions of the used datasets

dataset | number of species | number of dna sequences | file size (kb)
training data: amaryllis | 308 | 689 | 708.4
training data: lily | 331 | 713 | 784.3
testing data: amaryllis | 23 | 113 | 114.7
testing data: lily | 28 | 140 | 136.5
total | 690 | 1,655 | 1,743.9

the amaryllis dataset contains 802 samples from the amaryllidaceae family, of which 689 were used for training and 113 for testing. the lily dataset comes from the liliaceae family and contains 853 samples, of which 713 were used for training and 140 for testing. the sequences in the dataset have varying lengths (in base pairs; bp), with the longest sequence at 1,458 bp and an average length of 903 bp. the training dataset was obtained by downloading all species sequences in each family and omitting several selected species in the amaryllidaceae and liliaceae families. the complete list of species omitted from the training dataset can be seen in table 2. the testing dataset consists of the sequences of the species omitted from the training dataset. the difference between the number of species in the testing dataset in table 1 and the species in table 2 is due to (1) not all species having samples of the rbcl gene sequence in genbank at the time of data collection (example: allium chrysanthum) and (2) genbank distinguishing main species from varieties/sub-species (example: crinum asiaticum and crinum asiaticum var. japonicum). all species collected in the testing dataset are listed in table 3. the entire dataset is downloaded and saved in fasta format. figure 2 shows an example of dataset content containing the genbank accession number, species name, sequence description, and dna sequence. each sequence record begins with a line starting with the greater-than symbol (">") and ends with a blank line.

fig. 2. example of dataset content in fasta format

table 2. list of species selected for test data
no. | amaryllis | lily
1 | agapanthus campanulatus | alstroemeria aurea
2 | allium altaicum | calochortus apiculatus
3 | allium cepa | calochortus lyallii
4 | allium chrysanthum | cardiocrinum cathayanum
5 | allium chrysocephalum | cardiocrinum cordatum
6 | allium fistulosum | cardiocrinum giganteum
7 | allium monanthum | erythronium albidum
8 | allium obliquum | erythronium americanum
9 | allium porrum | fritillaria unibracteata
10 | allium prattii | gagea serotina
11 | allium pskemense | lilium bulbiferum
12 | allium sativum | lilium davidii
13 | allium tuberosum | lilium distichum
14 | allium xichuanense | lilium fargesii
15 | amaryllis minuta | lilium lancifolium
16 | crinum asiaticum | lilium longiflorum
17 | crinum macowanii | lilium pardalinum
18 | hymenocallis caribaea | lloydia oxycarpa
19 | hymenocallis henryae | medeola virginiana
20 | hymenocallis tubiflora | nomocharis aperta
21 | lycoris radiata | scoliopus bigelovii
22 | narcissus poeticus | tricyrtis macropoda
23 | pancratium arabicum | tulipa gesneriana
24 | zephyranthes candida | zigadenus glaberrimus
25 | zephyranthes simpsonii |

table 3. list of species included in test data

no. | amaryllis | lily
1 | agapanthus campanulatus | calochortus apiculatus
2 | allium altaicum | calochortus lyallii
3 | allium ampeloprasum | cardiocrinum cathayanum
4 | allium cepa | cardiocrinum cordatum
5 | allium fistulosum | cardiocrinum giganteum
6 | allium monanthum | cardiocrinum giganteum var. giganteum
7 | allium prattii | cardiocrinum giganteum var. yunnanense
8 | allium pskemense | erythronium albidum
9 | allium sativum | erythronium americanum
10 | allium tuberosum | fritillaria unibracteata
11 | amaryllis minuta | fritillaria unibracteata var. longinectarea
12 | crinum asiaticum | gagea serotina
13 | crinum asiaticum var. japonicum | lilium apertum
14 | crinum macowanii | lilium bulbiferum
15 | hymenocallis caribaea | lilium bulbiferum subsp. croceum
16 | hymenocallis henryae | lilium davidii
17 | hymenocallis tubiflora | lilium davidii var. willmottiae
18 | lycoris radiata | lilium distichum
19 | narcissus poeticus | lilium fargesii
20 | narcissus poeticus var. plenus | lilium lancifolium
21 | pancratium arabicum | lilium longiflorum
22 | zephyranthes candida | lilium longiflorum var. scabrum
23 | zephyranthes simpsonii | lilium pardalinum
24 | | lilium pardalinum subsp. pardalinum
25 | | lloydia oxycarpa
26 | | medeola virginiana
27 | | tricyrtis macropoda
28 | | tulipa gesneriana

b. computational model

the computational model used in this study is depicted in figure 3. this study uses the r programming language (version 4.2.1), run on a computer with an eight-core cpu (intel core i5-1135g7 processor at 2.4 ghz), 16 gb of ram, and a 512 gb solid-state drive (ssd). several stages use package libraries available in the public repositories cran and bioconductor; however, some preparatory steps are still needed to adapt the packages to the research requirements. each stage in the computational model of this research is explained as follows.

fig. 3. computational model of the comparison of machine learning algorithms for species family classification using dna barcodes

the first stage is to retrieve the training/testing datasets. all data are downloaded programmatically with the help of the rentrez package [53]. first, a filter query is made to search for dna sequences that match the following criteria: (1) members of the amaryllidaceae and liliaceae families, (2) more than 450 bp and less than 10,000 bp in length, (3) excluding the species held out for testing (for the training set) or including only those species (for the testing set), and (4) belonging to the rbcl gene. the search results are used to download the complete sequences in fasta format. a series of pre-processing stages is then carried out so that the dna sequences can be used in the classification model; these stages run from dna sequence parsing to family labeling. the second stage is dna sequence parsing.
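the retrieval step above delegates searching and filtering to genbank via rentrez. purely as an illustration of the same criteria applied locally, the stdlib-only python sketch below parses fasta text and keeps only sequences satisfying the 450–10,000 bp length criterion; the record contents are invented for the example.

```python
def parse_fasta(text):
    """minimal fasta parser: returns a list of (header, sequence) tuples.
    a record starts at a line beginning with '>'."""
    records, header, chunks = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def length_filter(records, min_bp=450, max_bp=10_000):
    # keep sequences strictly between min_bp and max_bp, as in the query criteria
    return [(h, s) for h, s in records if min_bp < len(s) < max_bp]

# toy fasta text: one 500 bp record and one 4 bp record
fasta_text = ">seq1 toy rbcl-like record\n" + "acgt" * 125 + "\n>seq2 far too short\nacgt"
kept = length_filter(parse_fasta(fasta_text))
```

only the 500 bp record survives the filter; the 4 bp fragment is rejected, just as the genbank query would exclude it.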
at this stage, sequences in fasta format are converted to the dnastringset format with the help of the biostrings package [54]. the result of the sequence conversion in this stage is exemplified in figure 4.

fig. 4. conversion of dna sequences from fasta format to the dnastringset data type

the third stage is sequence alignment. the datasets are combined and processed so that the symbols in the sequences are aligned and every sequence has the same length. sequence alignment is run using the muscle (multiple sequence comparison by log-expectation) algorithm with the help of the muscle package [55]. fourth is aligned sequence parsing. the sequence alignment results are then converted to the dnabin format with the help of the ape (analyses of phylogenetics and evolution) package [56] so that they can be read by the package used in the next stage. fifth is sequence trimming. the next step is to trim the existing sequences so that there are no gap symbols at each sequence's upstream (left) and downstream (right) ends. the sequences were trimmed with the help of the ips (interfaces to phylogenetic software) package [57] until 99% of the sequences had no gaps upstream or downstream. figure 5 shows an example of dna sequence data before and after sequence alignment and trimming.

fig. 5. dna sequences before and after alignment and trimming

sixth is conversion to a data frame. the trimmed sequences are converted into the data frame structure, the fundamental tabular format commonly used in the r programming language. each symbol in a sequence is converted to a column with the character data type (character; chr). the dna representation in the data frame is shown in figure 6.

fig. 6. dna sequences in the data frame

seventh is splitting into training and testing sets. the data in the data frame are then separated back into training and testing data frames.
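the trimming criterion described above (cut alignment columns from both ends until at least 99% of sequences carry no gap there) can be sketched as follows. this is a simplified python illustration of the behaviour the paper obtains from the ips package in r, run here on a made-up toy alignment.

```python
def trim_alignment(seqs, max_gap_frac=0.01):
    """trim columns from both ends of an alignment while the fraction of
    sequences with a gap ('-') in the edge column exceeds max_gap_frac,
    i.e. until at least 99% of sequences are gap-free at the ends."""
    def gap_frac(col):
        return sum(s[col] == "-" for s in seqs) / len(seqs)

    lo, hi = 0, len(seqs[0])
    while lo < hi and gap_frac(lo) > max_gap_frac:
        lo += 1                       # trim upstream (left) columns
    while hi > lo and gap_frac(hi - 1) > max_gap_frac:
        hi -= 1                       # trim downstream (right) columns
    return [s[lo:hi] for s in seqs]

# toy alignment: ragged gap columns at both ends
aligned = ["--acgt-", "-aacgtt", "--acgtt"]
trimmed = trim_alignment(aligned)
```

with only three toy sequences, any edge column containing a single gap exceeds the 1% tolerance and is trimmed, leaving a gap-free core.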
all sequences whose species names are listed in table 3 are separated into a new data frame used as the testing data frame. eighth is casting dna bases into factors. each column containing a dna sequence symbol in the data frame is cast to an unordered factor data type with five levels. these levels represent the gap symbol and the four nucleobases in the dna sequence: "-", "a", "c", "g", and "t". nucleobase symbols with ambiguous properties are replaced by gaps. ninth is family labeling. a new column filled with family labels according to the data from genbank is added to the training data frame, while the testing data frame receives a new family column left empty. next is one-hot encoding. in this stage the data are transformed into a numeric representation, facilitating subsequent processing. precisely, each character representing a nucleobase or a gap, namely "a", "c", "g", "t", or "-" derived from the alignment process, is mapped to a five-column matrix. within this matrix, the column corresponding to the specific character is assigned a value of 1, while the remaining columns are assigned a value of 0, as illustrated in figure 7 [58].

fig. 7. one-hot encoding process

after that comes model training. at this stage, prediction models are built from the training data frame that has been prepared. the packages used to build the classification models are c5.0, kknn, liblinear, naivebayes, rferns, randomforest, kernlab, and caret. experiments were carried out on the dataset and on the parameters of the random ferns algorithm: the number of ferns and their depth. a validation process is also carried out, through cross-validation with the help of the caret package [59], to ensure the model is neither overfitting nor underfitting. the parallel [60] and doparallel [61] packages speed up the cross-validation resampling, and the foreach package [62] is also used to turn parallel computing mode off again afterwards.
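the one-hot encoding step above maps each of the five symbols to a five-column 0/1 vector. a minimal python sketch of this mapping (with ambiguous symbols collapsed to the gap, as in the casting step) might look like this:

```python
ALPHABET = ["-", "a", "c", "g", "t"]

def one_hot(seq):
    """encode an aligned dna string as a list of 5-element 0/1 vectors;
    the column matching the symbol is 1 and the rest are 0."""
    rows = []
    for ch in seq.lower():
        if ch not in ALPHABET:
            ch = "-"  # ambiguous nucleobase codes (e.g. 'n') become gaps
        rows.append([1 if ch == sym else 0 for sym in ALPHABET])
    return rows

encoded = one_hot("ac-gn")
```

each sequence of length n thus becomes an n-by-5 binary matrix, the numeric representation the classifiers train on.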
the models used in this experiment include: c5.0, knn (k-nearest neighbors), lssvmpoly (least squares support vector machine with a polynomial kernel), naive_bayes, reglogistic (regularized logistic regression), rf (random forest), rferns (random ferns), svmlinear (support vector machine with a linear kernel), svmpoly (support vector machine with a polynomial kernel), svmradial (support vector machine with a radial basis function kernel), and svmradialweights (support vector machine with class weights). next is prediction. class prediction is carried out on the testing data frame using the models built in the previous stage. last is evaluation. the predictions of the classification models are evaluated in terms of accuracy against the family label of each sequence in genbank and against the sequence consensus produced using the decipher package [63]. the duration and memory used when training each model are measured using the profvis package [64].

iii. results and discussions

this study used rbcl gene sequence data from species in the amaryllidaceae and liliaceae families obtained from genbank. each species in the dataset has more than one sample because the sequences come from sequencing carried out at different locations. all dataset downloads are performed using program code. for example, downloading the amaryllis training dataset starts by searching genbank using the entrez_search function from the rentrez package in the following program code. the argument for the term parameter is a variable that contains the search query. search_result