International Journal of Applied Sciences and Smart Technologies
Volume 3, Issue 1, pages 131–142
p-ISSN 2655-8564, e-ISSN 2685-9432

Classification of Toddler Nutrition Using C4.5 Decision Tree Method

Kartono Pinaryanto1,*, Robertus Adi Nugroho1, Yanuarius Basilius1
1Department of Informatics, Faculty of Science and Technology, Sanata Dharma University, Yogyakarta, Indonesia
*Corresponding Author: kartono@usd.ac.id

(Received 07-05-2021; Revised 15-06-2021; Accepted 16-06-2021)

Abstract

Nutrition is essential to the growth of toddlers. It is very important to give babies a balanced nutritional intake at the right stage so that they grow up healthy and become accustomed to a healthy lifestyle. Children under five years of age are a group that is vulnerable to health and nutrition problems. Nutritional status can be determined automatically by a system that applies the C4.5 decision tree classification method to several input attributes. The dataset tested comprised 853 toddlers. Classification is carried out to determine the nutritional status in the weight/age (BB/U), height/age (TB/U) and weight/height (BB/TB) categories. The attributes used for the classification of BB/U are gender, weight and age. The attributes used for TB/U are gender, body length or height, and age. The attributes used for BB/TB are gender, weight, body length or height, and age. The average accuracy is 90.16% for the BB/U category, 76.64% for the TB/U category, and 83.83% for the BB/TB category.

Keywords: Classification, decision tree, C4.5, nutrition for toddlers

1 Introduction

Nutrients are organic substances required for normal functioning of the body's systems, growth and health maintenance.
It is very important to give babies a balanced nutritional intake at the right stage so that they grow up healthy and become accustomed to a healthy lifestyle. Children under five years of age are vulnerable to health and nutrition problems, so the toddler years are an important period of growth that needs serious attention [1]. Based on the results of the Ministry of Health's 2018 Basic Health Research, 17.7% of children under five (toddlers) still experience nutritional problems. This figure consists of 3.9% suffering from severe malnutrition and 13.8% suffering from undernutrition [2].

The nutritional status of toddlers can be measured by anthropometry. The anthropometric indices commonly used are body weight for age (BB/U), height for age (TB/U), and body weight for height (BB/TB). The weight-for-age index (BB/U) is the most commonly used indicator because it is easy and quick for the general public to understand. The reference standard for determining nutritional status by anthropometry is based on the Decree of the Minister of Health No. 920/Menkes/SK/VIII/2002, which adopts the reference of the "World Health Organization-National Center for Health Statistics" (WHO-NCHS) and evaluates the Z-score.

Until now, nutritional status has been determined manually by community health centers, so patients have to come to the health center in person. This is troublesome, especially under the current pandemic conditions. Nutritional status can instead be determined automatically using a classification approach, one of which is the C4.5 decision tree method. C4.5 is an algorithm that applies the concept of a decision tree. A decision tree is a predictive model with a tree, or hierarchical, structure.
The concept of a decision tree is to transform data into a tree of decision rules. In previous research comparing the performance of the C4.5 and Naive Bayes algorithms for the classification of scholarship recipients, Choirul Anam and Harry Budi Santoso found that C4.5 performs better than Naive Bayes, with an accuracy of 96.4% against 95.11% [3]. Research [4] applied the C4.5 decision tree algorithm to the classification of typhoid fever (TF) and dengue hemorrhagic fever (DHF) and concluded that, under k-fold cross-validation, the highest average accuracy was 91.875%, obtained with 32 test data and 128 training data.

Based on the description above, this study uses the C4.5 decision tree method to determine the nutritional status of children under five. It is hoped that applying the C4.5 decision tree method can help classify the nutritional status of toddlers and thereby help monitor their growth.

2 Methodology

The methodology used in this study is shown in Figure 1.

Figure 1. C4.5 decision tree classification research methodology

The research begins with preparing the dataset, which then goes through data cleaning and data selection. Next, the data are divided into testing data and training data. The training data are used to build the decision tree, while the testing data are used to evaluate the resulting system. The following subsections explain each stage in detail.

2.1. Dataset

The dataset used in this study is monitoring data on the nutritional status of toddlers, obtained from the Kebong Health Center, Kelam Permai District, Sintang District, West Borneo, in 2017, covering a total of 853 toddlers. The monitoring data have three categories: body weight for age (BB/U), height for age (TB/U), and body weight for height (BB/TB). The BB/U category has four classification labels: Best, Good, Bad and Worst. The TB/U category has four classification labels: High, Normal, Short and Very Short. The BB/TB category has four classification labels: Fat, Normal, Thin and Very Thin (Table 1).

Table 1. Categories and Labels

No   Category   Label
1    BB/U       Best, Good, Bad, Worst
2    TB/U       High, Normal, Short, Very Short
3    BB/TB      Fat, Normal, Thin, Very Thin

2.2. Data Cleaning

Data cleaning is the process of removing data that cannot be used [5]. In this study, records were deleted because they were incomplete, for example records with no BB/TB label, no PB/TB (body length or height) value, or no TB/PB conversion value.

2.3. Data Selection

The dataset has 19 attributes: name, date of birth, gender M/F, body weight, PB/TB, measured position, age, age calculation process, conversion of TB/PB, age family, code, code1, code2, Poor BB/U nutritional standards, Good BB/U nutritional standards, Short PB/U or TB/U standards, Normal PB/U or TB/U standards, Weight standards BB/TB or BB/TB, and Normal standards of BB/TB or BB/TB. At the data selection stage, the attributes used for the classification were determined (feature selection). The attributes Gender M/F, Body Weight, PB/TB and Age were selected. These attributes were selected based on recommendations from the health center.
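The cleaning and selection steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record keys (`gender`, `weight`, `height`, `age`), the function names `clean` and `select`, and the two sample records are hypothetical, the height value is invented for the example, and cleaning per category (rather than once for the whole dataset) is an assumption. The attribute list for each category is the one the paper assigns to BB/U, TB/U and BB/TB.

```python
# Attributes used for each category, as assigned in the paper.
# The dictionary keys for attributes are hypothetical names.
CATEGORY_ATTRIBUTES = {
    "BB/U": ["gender", "weight", "age"],
    "TB/U": ["gender", "height", "age"],
    "BB/TB": ["gender", "weight", "height", "age"],
}


def clean(records, category):
    """Drop records that lack the category label or any attribute it needs."""
    needed = CATEGORY_ATTRIBUTES[category] + [category]
    return [r for r in records if all(r.get(key) is not None for key in needed)]


def select(records, category):
    """Return (feature_rows, labels) restricted to the category's attributes."""
    attrs = CATEGORY_ATTRIBUTES[category]
    features = [[r[a] for a in attrs] for r in records]
    labels = [r[category] for r in records]
    return features, labels


# Hypothetical records for illustration only (height 70.0 is invented).
records = [
    {"gender": 1, "weight": 8.0, "height": 70.0, "age": 9,
     "BB/U": "Good", "BB/TB": "Normal"},
    {"gender": 2, "weight": 4.6, "height": None, "age": 6,
     "BB/U": "Worst", "BB/TB": None},
]

features, labels = select(clean(records, "BB/U"), "BB/U")
print(features, labels)
```

In this sketch the second record survives cleaning for BB/U, which does not need a height, but is dropped for BB/TB because both its height and its BB/TB label are missing.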
The results of the attribute selection are shown in Table 2.

Table 2. Attributes used by each category

No   Category   Attributes                             Labels
1    BB/U       Gender M/F, Body Weight, Age           Best, Good, Bad, Worst
2    TB/U       Gender M/F, PB/TB, Age                 High, Normal, Short, Very Short
3    BB/TB      Gender M/F, Body Weight, PB/TB, Age    Fat, Normal, Thin, Very Thin

2.4. Dividing the Dataset

The dataset is divided into testing data and training data using $k$-fold cross-validation. The value of $k$ is chosen by the user from 3, 5, 7 and 9 folds. If $k = 3$, the data are divided into 3 parts, of which 2 parts are used as training data and 1 part as testing data; the same applies to 5, 7 and 9 folds.

2.5. Modeling C4.5 Decision Tree

Every fold is modeled using the C4.5 decision tree method, so that $n$ folds produce $n$ models. The C4.5 decision tree method classifies the data by computing the Entropy, Information Gain, Split Info and Gain Ratio. Tree formation begins by finding the highest Gain Ratio value, which determines the root node; the same search is then carried out recursively for the remaining nodes until the decision tree is complete [6]. The following is an example of the tree formation steps:

1. Prepare the data that will be used to form the C4.5 decision tree model. In this example, 9 records of children under five are used for the classification of the BB/U category, with the attributes shown in Table 3.

2. Separate the data into training data (Table 4) and testing data (Table 5), using a total of 3 folds.

Table 3. Dataset

Gender M/F   Body Weight   Age   BB/U
1            8             9     Good
1            7.8           8     Good
1            10.1          8     Good
2            6.1           6     Good
2            4.6           6     Worst
2            10            44    Worst
2            7.3           27    Worst
2            8.9           17    Worst
1            8.1           26    Worst

Table 4. Training data

Gender M/F   Body Weight   Age   BB/U
1            8             9     Good
1            7.8           8     Good
1            10.1          8     Good
2            6.1           6     Good
2            4.6           6     Worst
2            10            44    Worst

Table 5. Testing data

Gender M/F   Body Weight   Age   BB/U
2            7.3           27    Worst
2            8.9           17    Worst
1            8.1           26    Worst

3. Calculate the entropy using formula (1), the information gain using formula (2), the split info using formula (3), and the gain ratio using formula (4) for each attribute. The entropy is formulated as

$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i. \tag{1}$$

In formula (1), $S$ is the set of cases, $n$ is the number of partitions of $S$, and $p_i$ is the proportion of $S_i$ to $S$. The gain is formulated as

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, \mathrm{Entropy}(S_i). \tag{2}$$

In formula (2), $S$ is the sample set, $A$ is an attribute, $n$ is the number of partitions of attribute $A$, $|S_i|$ is the number of samples in partition $i$, and $|S|$ is the number of samples in $S$. The Split Info is formulated as

$$\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{v} \frac{|S_i|}{|S|} \log_2\!\left(\frac{|S_i|}{|S|}\right). \tag{3}$$

In formula (3), $v$ is the number of subsets produced by splitting $S$ on attribute $A$, which has $v$ values. The Gain Ratio is then

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}. \tag{4}$$

Next, find the root node candidates by taking the highest information gain value for each attribute. The root node is the candidate with the highest gain ratio value. Here the highest gain ratio is found in the Body Weight attribute at the value 4.6, so the root node of the tree is Body Weight with a value of 4.6. The decision tree formed from the calculation is shown in Figure 2.

Figure 2. Root Node

4. After the root node is obtained, search for the leaf nodes. Data with a weight value of 4.6 are removed from the dataset before the search for leaf nodes (Table 6).

Table 6. Dataset at node 2

Gender M/F   Body Weight   Age   BB/U
1            8             9     Good
1            7.8           8     Good
1            10.1          8     Good
2            6.1           6     Good
2            4.6           6     Worst
2            10            44    Worst

After the removal, the leaf nodes are found by again searching for the highest information gain value. The highest information gain value is in the Age attribute at the value 9, so the next node is Age: if the age is 9 or below, the classification label is Good, and if it is above 9, the classification label is Worst. The resulting tree is shown in Figure 3.

Figure 3. Leaf Node

2.6. Evaluation

Several experiments were carried out to evaluate this system. Each experiment divided the data into 3, 5, 7 or 9 folds and was repeated for each category, namely BB/U, TB/U and BB/TB. The experiments are shown in Table 7.

Table 7. C4.5 decision tree experiments

Experiment   Number of Folds
1st          3 folds
2nd          5 folds
3rd          7 folds
4th          9 folds

3 Results and Discussion

Based on the experiments, the system is able to classify the nutritional status of toddlers based on BB/U, TB/U and BB/TB. For the BB/U category, the measured accuracy was 89.52% with 3 folds, 90.93% with 5 folds, 90.10% with 7 folds and 90.10% with 9 folds. These results give an average accuracy of 90.16%, with the highest accuracy obtained using 5 folds (Table 8).
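The fold-wise evaluation and the averaging behind these figures can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `k_fold_splits` is hypothetical, and the use of contiguous, unshuffled folds is an assumption, since the paper does not state how records were assigned to folds. The accuracy values are the ones reported above for the BB/U category.

```python
from statistics import mean


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Accuracies (%) reported for the BB/U category, keyed by the number of folds.
bbu_accuracy = {3: 89.52, 5: 90.93, 7: 90.10, 9: 90.10}

average = round(mean(list(bbu_accuracy.values())), 2)
best_k = max(bbu_accuracy, key=bbu_accuracy.get)

# The 9-record example with 3 folds: each split holds out 3 records for testing.
splits = list(k_fold_splits(9, 3))
print(average, best_k, [len(test) for _, test in splits])
```

With 9 records and 3 folds, the third split trains on the first six records and tests on the last three, which is the division used in the worked example of Section 2.5. In a full run, a C4.5 model would be trained and scored once per split, and the per-fold accuracies averaged in the same way.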
These results show that the system classifies the BB/U category well.

Table 8. Results of the BB/U experiment

BB/U Experiment   Number of Folds   Average accuracy (%)
1                 3                 89.52
2                 5                 90.93
3                 7                 90.10
4                 9                 90.10

For the TB/U category, the average accuracy was 76.64%, and the highest accuracy was obtained with 7 folds (Table 9).

Table 9. Results of the TB/U experiment

TB/U Experiment   Number of Folds   Average accuracy (%)
1                 3                 75.27
2                 5                 75.96
3                 7                 78.32
4                 9                 77.03

For the BB/TB category, the average accuracy was 83.83%, and the highest accuracy was obtained with 7 folds (Table 10).

Table 10. Results of the BB/TB experiment

BB/TB Experiment   Number of Folds   Average accuracy (%)
1                  3                 83.27
2                  5                 83.27
3                  7                 84.45
4                  9                 84.34

Based on the test results, we observe that the C4.5 decision tree works reasonably well for classifying the BB/U, TB/U and BB/TB categories using the selected attributes, with average accuracies between 76.64% and 90.16%, although a minority of cases are not classified correctly.

4 Conclusion

Based on the results of the nutritional classification of children under five using the C4.5 decision tree method, the following conclusions can be drawn:

1. The C4.5 decision tree classification method can be used to classify the nutrition of toddlers quite well.
2. The average accuracy for each category is as follows:
   a. The BB/U category classification has an average accuracy of 90.16%.
   b. The TB/U category classification has an average accuracy of 76.64%.
   c. The BB/TB category classification has an average accuracy of 83.83%.

References

[1] P.T. Juniman. "4 Ancaman Bahaya yang Dialami Balita dengan Gizi Buruk," 2008. [Online]. Available: https://www.cnnindonesia.com/gaya-hidup/20180125110614-255-271456/4-ancaman-bahaya-yang-dialami-balita-dengan-gizi-buruk
[2] Kemenkes. Hasil Utama Riset Kesehatan Dasar Kementerian Kesehatan 2018, 2018. [Online]. Available: https://www.depkes.go.id/resources/download/info-terkini/materi_rakorpop_2018/Hasil%20Riskesdas%202018.pdf
[3] C. Anam and H.B. Santoso. "Perbandingan Kinerja Algoritma C4.5 dan Naive Bayes untuk Klasifikasi Penerima Beasiswa," Jurnal ENERGY, 8 (1), 13–19, 2018. [Online]. Available: https://ejournal.upm.ac.id/index.php/energy/article/view/111
[4] U. Febriana, M.T. Furqon, and B. Rahayudi. "Klasifikasi Penyakit Typhoid Fever (TF) dan Dengue Haemorhagic Fever (DHF) dengan Menerapkan Algoritma Decision Tree C4.5 (Studi Kasus : Rumah Sakit Wilujeng Kediri)," Jurnal Pengembangan Teknologi Informasi dan Ilmu Komputer, 2 (3), 1275–1282, 2017. [Online]. Available: https://j-ptiik.ub.ac.id/index.php/j-ptiik/article/view/1124
[5] J. Han and M. Kamber. Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
[6] D.T. Larose. Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, Inc., 2005.