53 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
Risks of Chronic Kidney Disease Prediction using 

various Data Mining Algorithms 

Akalya Devi C*, Fatima Abdul Jabbar**, Kavi Varshini S***, Kriti S Rithanya****, Miruthubashini M*****, 

Naveena K S****** 

*Assistant Professor, 2UG Scholar, Department of Information Technology, PSG College of Technology, Coimbatore, India. 

*Corresponding Email: 1akalya.jk@gmail.com  

 
A B S T R A C T S  A R T I C L E   I N F O 

Twenty million people have chronic kidney disease 
where patients experience a gradual deterioration of 
kidney function, the result of which is kidney failure. 
Early detection of chronic renal disease can help to slow 
its progression, avert complications, and reduce the risk 
of cardiovascular complications. Data mining has been 
broadly used in order to support medical professionals 
and physicians in the prediction and examination. Here, 
in this paper, multiple data mining algorithms are used 
to solve a problem in the field of medical diagnosis and 
examine how effective they were at predicting the 
consequences. The study's focus was on the diagnosis of 
chronic renal disease. This dataset used for this study 
consists 400 instances & 25 attributes. Preprocessing of 
the large amount of raw data is carried out to impute 
any missing data and determine which of the variables 
should be taken into account in the prediction models. 
The accuracy of the prediction is used to compare and 
contrast the various predictive analytic models.  

 
 Article History: 
Received 18 Dec 2021 
Revised 20 Dec 2021 
Accepted 25 Dec 2021 
Available online  26 Dec 2021 
Aug 2018 

__________________ 
Keywords: 
Chronic kidney disease, K-
Nearest Neighbor 
Classification, predictive 
analytics, Decision Tree, 
data mining, Support Vector 
Machine, Random Forest.  

 
International Journal of Informatics, 
Information System and Computer 

Engineering 

International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

mailto:akalya.jk@gmail.com


Akalya et al. Risks of Chronic Kidney Disease Prediction using various Data Mining...| 54 

 
1. INTRODUCTION 

Chronic Kidney Disease (CKD) or 
Chronic Kidney Failure is the increasing 
impairment of the kidney's ability to 
function normally. Chronic Kidney 
Disease is induced primarily due to high 
blood pressure, diabetes, hypertension, 
and several other factors in particular 
smoking, obesity, heart disease, heredity, 
consumption of alcohol, usage of drugs, 
age, race, ethnicity, etc. In India and other 
developing nations, chronic diseases 
remain a leading cause of mortality. The 
number of casualties in India owing to 
chronic disease was anticipated to be 5.21 
million in 2008 and is likely to increase 
from over 7.63 million by 2020. There are 
five distinct stages of disease 
development in which each stage 
increases in severity while as it advances 
between stage 1 and stage 5. Stage 1 is 
when a person's kidney function falls 
below normal. As the affected 
individuals it goes ahead into step 2, they 
may experience a mild to moderate loss 
of renal functions. The worsening 
condition escalates in level 3, where there 
is a moderate to average deterioration in 
the nephrological operation followed by 
acute damage in the functioning of the 
excretory system in stage 4. Stage 5 is the 
absolute collapsing of the urinary organs. 
(Almustafa, 2021).  

The massive increase in the amount 
of medical data available to predict the 
disease has raised the question of being 
effectively classified, managed, and 
transferred. To extract useful insight and 
knowledge from this raw data, effective 
ways are required. Data mining 
techniques are a dependable and 
pragmatic way of accomplishing this. 
Data mining is the process for processing 
massive amounts of data and extracting 

knowledge from all of this. In addition to 
the medical sector, the data are 
sequentially organized and are exploited 
in multiple number of real-time 
applications such as social networking 
sites, online websites, and so on. Data 
mining is categorized in many other 
domains including graphic data 
extraction, web data mining, textual data 
mining, image data extraction domain. 
These data mining sectors facilitate in 
decision-making and the extraction of 
useful information from the dataset 
undergoing investigation.  

Prediction of the risk of chronic 
kidney disease is based on several health 
parameters including random blood 
glucose level, blood pressure, serum 
creatinine level, and others. Supervised 
classification algorithms which are used 
to predict the risk of chronic kidney 
disease are Decision Tree Classification, 
Support Vector Machine Classification 
(SVM), Random Forest Classification, 
and K-Nearest Neighbor Classification 
(KNN) (Aqlan, et al., 2017). From 
experimenting, Random Forest 
Classification and KNN were shown to 
be the best classifiers for classification. 
Random Forest and KNN classifications 
have maximum reliability than Decision 
Tree and SVM classifiers.  

2. LITERATUR REVIEW 

In this research paper, recent data 
mining procedures were used to classify 
and forecast chronic kidney disease 
which considers various influencing 
factors such as blood pressure, red blood 
cells count, haemoglobin, etc. The 
techniques used in this paper provide 
more accuracy than the techniques used 
in other existing works.  


55 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
Kaur G et al. applied two data mining 
classifiers to predict chronic kidney 
disease: KNN and SVM, which gave the 
exactitude and error percentage (Arasu, 
D., & Thirumalaiselvi, R. 2017). 

Bhatla N et al. Has analysed most of 
the dangerous diseases among which 
breast cancer, heart disease, and diabetes 
are the predominant ones (Bhatla, N., & 
Jyoti, K. 2012). On investigating 168 
articles the techniques for implementing 
the diagnosis of various diseases have 
been performed. All techniques, data 
mining approaches, and evaluation 
methodologies are carefully investigated 
and properly considered. 

Kunwar V et al. Using categorization 
approaches such as Naive Bayes and 
Artificial Neural Networks (ANN), 
authors hypothesize chronic renal 
disease. According to the RapidMiner 
tool's trial results, Naive Bayes generates 
further accurate outcomes than ANN. 
(Gharibdousti, M. S et al., 2017). Decision 
Tree, Linear Regression, Super Vector 
Machine, Naive Bayesian, and Artificial 
Neural Networks (ANNs) were one of the 
classification strategies utilized (Ilyas, et 
al., 2021).  

The correlation matrix was used to 
investigate the features' correlation. As a 
result, they observed the influence of 
properties on classification findings. 

Padmanaban K. A et al. On the 
incurable renal disease dataset, 
researchers implemented data extraction 
algorithms such as Naive Bayes and the 
Decision tree algorithm (Padmanaban, K. 
A., & Parthiban, G. (2016). On comparing 
and contrasting several categorization 
algorithms, they recommended decision 
tree classification to reach substantial 
results with suitable accuracy by 

estimating its performance to its 
specificity and sensitivity (Kunwar, et al., 
2021). 

Sharma S et al. Evaluated 12 data 
mining clustering techniques by 
implementing them to the CKD dataset 
(sharma, et al., 2016). To determine 
efficiency, the findings of the prediction 
were contrasted with the factual medical 
outcomes. A few of the metrics used to 
evaluate performance comprise 
predictive accuracy, precision, 
sensitivity, and specificity (Kunwar, et 
al., 2016). With an accuracy at about 
98.6%, a sensitivity of 0.9720, a precision 
of 1, and a specificity of 1, the decision 
tree showed the best performance. 

Arasu D et al. Employed significant 
data extraction methods in particular 
clustering, classification, association 
analysis, and regression to predict renal 
diseases (Milley, A. 2000). These 
techniques had insignificant 
shortcomings in the picturality of 
preprocessing or at any other stages. 
Various data mining techniques are 
evaluated and the major problems are 
briefly explained. 

Vijayarani S et al has focused on 
using a novel machine learning 
classification strategy to predict chronic 
renal illness employing SVM on a data 
sample of 400 observations and 24 
attributes (Vijayarani, et al., 2015).  

3. PROPOSED WORK 

1. Due to CKD millions of 
individuals pass on each year 
since they don't experience 
legitimate treatment. CKD risk 
factors fall under four main 
categories: Susceptibility 
components which lead to a rise in 
renal damage susceptibility, 


Akalya et al. Risks of Chronic Kidney Disease Prediction using various Data Mining...| 56 

 
2. The terminology "initiation 
factors" refers to the elements that 
play a key role in renal damage. 

3. Progression Factors leads to more 
regrettable reality of kidney harm 
and fast decay functionalities once 
the harm gets begun. 

4. Kidney failure occurs as a result of 
end-stage conditions, culminating 
in morbidity and mortality. 

Kidney illnesses are anticipated and 
compared utilizing SVM and ANN 
algorithm stationed on the exactness and 
performance time. SVM, KNN, and some 
other algorithms have been used to assess 
the performance of the CKD dataset from 

the UCI repository and the raw data 
which have been taken was cleaned and 
processed by various steps which have 
been explained in the figure 1.  

Four different classifiers have been 
analyzed majorly established on the 
succeeding approaches:  Decision-Tree, 
Support Vector Machine (SVM), Random 
Forest, K-nearest neighbor (KNN) in 
Section 3. 

These technics were picked for the 
examination and review for the reason 
that of their ubiquity within the later 
important writing. A concise portrayal 
about the chosen strategies has been 
given underneath.  

3.1. Data Mining Algorithms & 
Technuques  

An algorithmic data mining program can 
be a well-specified plan of action that 
takes data as in and out. It includes 
designs in the shape of models. It 
comprises a small number of algorithms 

and strategies namely classification, 
grouping, prediction, association rules, 
neuronal networks, etc., to perform 
knowledge revelation from data banks. 
Table 1 shows the evaluation plans 
employed here. 

 
Table 1. Classification of CKD & Evaluation plans 

S. No Phases of CKD 

GFR 

(Glomerular       

Filtration Rate) 

Evaluation Strategies 

1. 
Nephrological damage 

with common GFR 
90 or beyond 

Treating the coexisting conditions, 

reduction of hazard variables for cardiac 

and vascular illness 

 
2. 

Renal impairment with 

moderate reduction 
60-89 

Approximation of ailment advancement 

3. Reasonable reduction 30-59 
Assessment and medication of sickness 

intricacies 


57 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
4. Rigorous diminution  15-29 
Formulation of excretory organs 

switching remedy (dialysis, granting) 

5. Renal failure Less than 15 Nephrological organs grafting therapy 

The prediction analytics conducted is 
based upon the typically picked data 
columns of data, which comprises of the 
age, blood pressure, number of red blood 
cells, and appetite fields. These above 
mentioned four entries incorporate the 
numeric data in the case of blood 
pressure and age, while   categorical data 

for the number RBGs and appetite. The 
nominal data has indeed been converted 
into numeric types so as to –make 
classification techniques suitable to 
string-based categorical attributes, which 
cannot be handled using statistical 
models. The proposed framework for the 
study is illustrated in Figure 1.  

Figure 1. Proposed framework for CKD analysis and prediction 


Akalya et al. Risks of Chronic Kidney Disease Prediction using various Data Mining...| 58 

 
Here are the basic steps which were 
performed initially;  

1. Acquire the data from the local disk. 

2. With the help of the column identifier 

IDs, manually choose the columns. 

3. To make all the nominal values 

numerical, the conversion is made. 

4. After the categorical transformation, 

make the last data matrix. 

5. Inside the last data matrix, search for 

the missing values.  

6. Compute the average of every column 

that constitutes the variable.  

7. Load in the missing values with the 

appropriate average value from the 

mean values.  

8. To make a non-uniform feature matrix, 

shuffle the data matrix. 

9. Divide the training and testing data 

matrices.  

10. Make the observation vectors ready 

for training and testing. 

3.2. Classification 

The best and most common data mining 
approach is classification. Where entities 
are classified into different categories 
called classes and assigned to them. Each 
and every thing needs to be distributed 
precisely to one class and not more than 
one and never to no classes at all. 
Decision tree, SVM, KNN, and Random 
Forest were the classification algorithms 
included in this model.  

3.2.1. Decision Tree Classification 

This method is especially beneficial for 

deciphering classification problems in 

which a tree is formed to depict the 

categorization process. The tree is linked 

to every tuple in the database to yield 

classification as long as it is established. 

Classification tree analysis and regression 

tree analysis are the two forms of decision 

trees used in data mining, and they have 

been employed for a spectrum of 

potential results such as belonging to a 

specific statistical class or an actual 

number.  

1. Fitting Decision Tree to the 

training set. 

2. Predicting the test result. 

3. Calculating the accuracy. 

4. Displaying the confusion matrix. 

 
3.2.2 SVM Classification 

SVM is a set of rules for supervised 

machine learning that can be used to 

resolve classification and regression 

problems. It uses a strategy called the 

kernel trick to convert your data and after 

that based on these changes it finds an 

ideal boundary between the possible 

outputs (Sinha, P., & Sinha, P. 2015). The 

following steps are the ones performed;  

1. Support Vector Machine (a 

classification technique) is applied 

on the available data for the 

purpose of predictive analysis. 

2. Using the training data matrix and 

the training observation vector the 

classifier is trained. 

3. The testing data matrix with 

unseen data is utilized to examine 

the classifier 


59 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
4. The predictions (Observations 

predicted by SVM classifier) are 

returned as output.  

The entire performance is computed by 

comparing and contrasting the outcomes 

of support vector machine classifier and 

the actual perceptions.  

 
3.2.3 Random – Forest Classifier 

Random Forest is an analyzer that equips 

the average of a number of decision trees 

on discrete subsections of a given set of 

data to advance the dataset's predicted 

accuracy. The following steps are the 

ones performed (Sinha, P., & Sinha, P. 

2015). 

1. Fitting Random Forest 

classification to the training set. 

2. Predicting the test result. 

3. Calculating the accuracy. 

4. Displaying the confusion matrix.  

 
3.2.4 KNN Classifier 

It's a type of distance-based technique 

that's typically used while the values of 

each and every attribute is uninterrupted 

and continual, but it can also be used with 

nominal features (Subasi, et al., 2017). To 

compute the categorization of an 

unknown sample data based on the 

classification of the closest instance or 

instances. More occurrences inside the 

preparation set use the same way to 

group the k-nearest neighbors (also 

known as k-nearest Neighbor), 

(Vijayarani, et al., 2015). The steps that 

were taken were as follows:  

1. K-Nearest Neighbor (one of the 

classification technics) is 

employed over the given data for 

the purpose of predictive 

analytics. 

2. Before initiating the entire process, 

the value of k should be initialized 

which will be symbolizing the 

number of neighbors that has to be 

considered. 

3. The k-nearest neighbor classifier 

needs to be trained with the 

specified k value over the training 

data matrix and the training 

observation vector. 

4. With the help of the test data 

matrix, which contains the unseen 

data the classifier is tested and 

evaluated for the required metrics. 

5. The forecasts (observations 

predicted by the KNN classifier) 

made by the KNN analyzer should 

be returned. 

6. The entire accuracy and 

performance of the KNN classifier 

is estimated by comparing the 

predictions made by KNN and the 

actual observations. 

 
5. RESULT AND DISSCUSION 
The chronic kidney disease (CKD) 

dataset was acquired based on the UCI 
machine learning repository and is 
employed in this study for prediction and 
validation. Both numerical and nominal 
attributes were included in CKD dataset. 
There are 25 attributes and 400 instances. 
This dataset also contains missing values. 
There are 24 attributes and one class 
attribute (i.e.) CKD, NOT-CKD. Table 2 
gives the attribute description of the 
dataset.

 
Akalya et al. Risks of Chronic Kidney Disease Prediction using various Data Mining...| 60 

 
Table 2. CKD Dataset Attributes description 

S. No 
Attribute 

Name 
Expansion S. No 

Attribute 

Name 
Expansion 

1 age Age of the patient 13 pot Potassium  

2 bp Blood pressure 14 hemo Hemoglobin  

3 sg Specific gravity 15 pcv Packed cell volume 

4 al  Albumin  16 wc White blood cell count 

5 su Sugar  17 rc Red blood cell count 

6 rbc Red blood cells 18 htn Hypertension  

7 pc Pus cell 19 dm Diabetes mellitus 

8 pcc Pus cell clumps 20 cad Coronary artery disease 

9 ba Bacteria  21 appet Appetite  

10 bgr Blood glucose random 22 pe Pedal edema 

11 bu  Blood urea 23 ane Anemia  

12 sc Serum creatinine 24 sod Sodium  

 
Data Cleaning and data pre-
processing is the most critical point in the 
data mining procedure as it influences 
the rate of success drastically. The 
categorical attributes were displaced 
with 0s and 1s corresponding to their 
values. The missing values were replaced 
with the mean of that particular attribute. 
As there was a wide range of age, the age 
attribute was grouped in batches 
(Sharma, et al., 2016). The CKD dataset 
includes features that vary in the degree 

of magnitude, range, and units. In order 
to interpret all the features on the same 
scale, Feature Scaling (Data 
Normalization) was carried out. 

The CKD dataset was parted into 70% for 

the purpose training and 30% for the 

purpose of testing data. Four different 

data mining procedures encompassing 

Decision Tree Classification, Support 

Vector Machine Classification, Random 

Forest Classification, KNN Classification 


61 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
were applied to the training and testing 

data and the performance measurement 

using different metrics like precision, f1-

score, recall, accuracy, specificity, and 

sensitivity were observed (Vijayarani, et 

al., 2015). Table 3 presents the different 

performance metrics used in this paper.   

 
Table 3. Different overall performance analysis metrics used 

Metrics Definition  Equation 

Precision The proportion of predicted accurately positive 

considerations to fully predicted positive observations 

is referred as precision.  

Recall 

(Sensitivity) 

Estimates the percentage of number of yes’s that are 

effectively-recognized correctly. 

 
F1-Score Precision and Recall are weighted averages which 

determine the F1 score. 
 

Accuracy Measures the model's ability to accurately estimate 

class label of latest or previously unknown 

information.  

Specificity Here, ratio of negatives (or No's) that have been 

correctly recognized as such is measured. 
 

The performance metrics of the various proposed algorithms were derived using the equations 

listed in Table 3. Table 4 depicts the results obtained for every algorithm. 

Table 4. Performance measures of the proposed algorithms 

 
Model 

Precision Recall F1-Score  

Specificity 

 
Accuracy 
Not-CKD CKD Not-CKD CKD Not-CKD CKD 

Decision 

Tree  

0.91 1.00 1.00 0.95 0.95 0.97 1.00 0.967 

SVM 0.93 1.00 1.00 0.96 0.97 0.98 1.00 0.975 


Akalya et al. Risks of Chronic Kidney Disease Prediction using various Data Mining...| 62 

 
Model 

Precision Recall F1-Score  

Specificity 

 
Accuracy 
Not-CKD CKD Not-CKD CKD Not-CKD CKD 

Random 

Forest  

0.95 0.99 0.98 0.97 0.96 0.98 0.987 

 
0.975 

KNN 0.95 1.00 1.00 0.97 0.98 0.99 1.00 0.983 

The train score is the measurement that states us in what way the model suits the 

training data. Similarly, the test score shows how the model reacts to the unknown 

data. The area under the curve (AUC) score portrays the model’s overall performance 

at differentiating between the positive and negative classes. Figure 2 shows the 

comparison of the training score, test score and mean AUC score. 

 
Figure 2. Depiction of Train, Test, and Mean AUC Scores of the proposed algorithms 

The difference in magnitude between both the observation's prediction and its true 

value is termed as the mean absolute error.  For the proposed algorithms, Figure 3 

illustrates the mean absolute error.  


63 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
Figure 3. Illustration of the proposed algorithms' Mean Absolute Error 

The Receiver operating characteristic Curve (ROC Curve) reflects the classification 

model's overall performance among all class thresholds [10]. The ROC curve for the 

algorithms employed in this study is illustrated in Figure 4. 

 
Figure 4. Plot of ROC curve for the proposed algorithms 

5. CONCLUSION  
The objective of this article is to 

analyze the variety of data mining 

techniques and algorithms utilized to 
predict Chronic Renal Disease. CKD has 
been predicted and diagnosed using data 


Akalya et al. Risks of Chronic Kidney Disease Prediction using various Data Mining...| 64 

 
mining classifiers: Decision Tree, SVM, 
Random Forest, and KNN. It was found 
that KNN results in the best accuracy. 
The performance of the KNN method 
was found to be 98.3% accurate compared 
to Decision Tree (96.7%), SVM (97.5%), 
and Random Forest (97.5%). The work 
can further be extended keeping into 
consideration the other parameters like 

food intake, living conditions like 
sanitation, availability of clean water, 
working environment, environmental 
factors like pollution, etc. for the 
detection of kidney disease. Further 
experimentation can be conducted using 
other classifiers like ANN or by using 
ensemble techniques. 

 
REFERENCES 

Almustafa, K. M. (2021). Prediction of chronic kidney disease using different 

classification algorithms. Informatics in Medicine Unlocked, 100631.Dobrucka, 

R. (2018). Synthesis of MgO nanoparticles using Artemisia abrotanum herba 

extract and their antioxidant and photocatalytic properties. Iranian Journal of 

Science and Technology, Transactions A: Science, 42(2), pp. 547-555. 

Aqlan, F., Markle, R., & Shamsan, A. (2017). Data mining for chronic kidney disease 

prediction. In IIE Annual Conference. Proceedings (pp. 1789-1794). Institute of 

Industrial and Systems Engineers (IISE). 

Arasu, D., & Thirumalaiselvi, R. (2017). Review of chronic kidney disease based on 

data mining techniques. International Journal of Applied Engineering Research, 

12(23), 13498-13505. 

Bhatla, N., & Jyoti, K. (2012). An analysis of heart disease prediction using different 

data mining techniques. International Journal of Engineering, 1(8), 1-4  

Gharibdousti, M. S., Azimi, K., Hathikal, S., & Won, D. H. (2017). Prediction of chronic 

kidney disease using data mining techniques. In IIE Annual Conference. 

Proceedings (pp. 2135-2140). Institute of Industrial and Systems Engineers (IISE).  

Ilyas, H., Ali, S., Ponum, M., Hasan, O., Mahmood, M. T., Iftikhar, M., & Malik, M. H. 

(2021). Chronic kidney disease diagnosis using decision tree algorithms. BMC 

nephrology, 22(1), 1-11.  

Ilyas, H., Ali, S., Ponum, M., Hasan, O., Mahmood, M. T., Iftikhar, M., & Malik, M. H. 

(2021). Chronic kidney disease diagnosis using decision tree algorithms. BMC 

nephrology, 22(1), 1-11.  

Kunwar, V., Chandel, K., Sabitha, A. S., & Bansal, A. (2016, January). Chronic Kidney 

Disease analysis using data mining classification techniques. In 2016 6th International 

Conference-Cloud System and Big Data Engineering (Confluence) (pp. 300-305). 

IEEE.  


65 | International Journal of Informatics Information System and Computer Engineering 2(2) (2021) 53-65 

 
Milley, A. (2000). Healthcare and data mining. Health Management Technology, 21(8), 

44-45.  

Padmanaban, K. A., & Parthiban, G. (2016). Applying machine learning techniques for 

predicting the risk of chronic kidney disease. Indian Journal of Science and 

Technology, 9(29), 1-6.   

Sharma, S., Sharma, V., & Sharma, A. (2016). Performance based evaluation of various 

machine learning classification techniques for chronic kidney disease 

diagnosis. arXiv preprint arXiv:1606.09581.  

Sinha, P., & Sinha, P. (2015). Comparative study of chronic kidney disease prediction 

using KNN and SVM. International Journal of Engineering Research and 

Technology, 4(12), 608-12.  

Subasi, A., Alickovic, E., & Kevric, J. (2017). Diagnosis of chronic kidney disease by 

using random forest. In CMBEBIH 2017 (pp. 589-594). Springer, Singapore.    

Rubini, L. J., & Eswaran, P. (2015). UCI Machine Learning Repository: 

Chronic_Kidney_Disease Data Set.   

Vijayarani, S., Dhayanand, S., & Phil, M. (2015). Kidney disease prediction using SVM 

and ANN algorithms. International Journal of Computing and Business Research 

(IJCBR), 6(2), 1-12.