Discrimination of Patients with Prostate Cancer from Healthy Persons Using a Set of Single Nucleotide 
Polymorphisms

Mir Davood Omrani1, Hossein Mohammad-Rahimi2, Abbas Basiri1, Milad Fallahian3, Rezvan Noroozi4, 
Mohammad Taheri5*, Soudeh Ghafouri-Fard6**

Purpose: Prostate cancer is the second most commonly diagnosed cancer in males. It accounts for about 4% of 
cancer-related mortality in men. Several polymorphisms in different genes have been identified that alter the 
risk of this malignancy. 

Materials and methods: We used the random forest (RF) algorithm to predict prostate cancer risk in an Iranian 
population using 13 single nucleotide polymorphisms (SNPs) in four genes (ANRIL, HOTAIR, IL-6 and IL-8). 
The samples were divided into a training set (n=320) and a test set (n=80) to evaluate the generalization power 
of the trained model. For hyper-parameter tuning, we used randomized search with 5-fold cross-validation over 
the following hyper-parameters: (1) number of trees (estimators) in the forest (3 to 500); (2) maximum number 
of leaf nodes (2 to 32); (3) maximum number of features considered for the best split (5 to 13); and (4) whether 
bootstrap samples were used to build the trees (True or False). Accuracy, sensitivity, specificity, and F1-score 
were reported for both the training and test sets. 

Results: The most important SNP was ANRIL-rs1333048: A/A (Gini index = 0.096), followed by 
ANRIL-rs10757278: G/G (Gini index = 0.059). The training dataset outcomes were as follows: accuracy 0.896, 
sensitivity 0.85, specificity 0.944 and F1 score 0.891. The test dataset outcomes were as follows: accuracy 0.787, 
sensitivity 0.775, specificity 0.800 and F1 score 0.784. The AUC scores were 0.966 and 0.841 for the training and 
test datasets, respectively. 

Conclusion: The proposed panel of SNPs can predict the risk of prostate cancer in the Iranian population with 
appropriate accuracy. 

Keywords: prostate cancer, single nucleotide polymorphism, IL-8, HOTAIR, ANRIL

INTRODUCTION

Prostate cancer ranks second among cancers diagnosed in males. It accounts for about 4% of 
cancer-related mortality in men(1). A comprehensive study 
in Iranian patients has shown that 97% of all cases were 
adenocarcinoma. The other reported pathologies 
were malignant carcinoma and transitional cell 
carcinoma(2). In the early phases of development, 
prostate cancer usually causes no symptoms 
and progresses in an indolent manner, needing minimal 
or even no therapeutic intervention. During its course, 
it can cause difficult urination, increased urinary frequency or 
urgency, nocturia and urinary retention, and, in advanced 
stages, back pain caused by 
metastasis(3). Genome-wide association studies (GWAS) 
conducted in different populations have identified tens 

1Urology and Nephrology Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
2Dental Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences,
Tehran, Iran.
3Faculty of Civil Engineering, Amirkabir University of Technology, Tehran, Iran.
4Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland.
5Skull Base Research Center, Loghman Hakim Hospital, Shahid Beheshti University of Medical Sciences, Tehran,
 Iran.
6Department of Medical Genetics, School of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, 
Iran.
*Correspondence: Urology and Nephrology Research Center, Shahid Beheshti University of Medical Sciences, 
Tehran, Iran. mohammad_823@yahoo.com.
** Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland. s.ghafourifard@sbmu.ac.ir.
Received June 2020 & Accepted May 2021

of genetic polymorphisms that confer risk of this 
malignancy(4-6). We have recently assessed the role of a 
number of single nucleotide polymorphisms (SNPs) in 
different genes in conferring risk of prostate cancer in 
the Iranian population. These SNPs were located in ANRIL 
(rs1333045, rs4977574, rs1333048 and rs10757278)(7), 
HOTAIR (rs12826786, rs1899663 and rs4759314)(8), 
IL-6 (rs1800795 and rs2069845)(9) and IL-8 (rs4073, 
rs2227306 and rs1126647)(10). In the current study, we 
applied the random forest (RF) algorithm to predict 
the risk of prostate cancer based on the genotyping 
results of these 13 distinct SNPs. The RF algorithm is 
an ensemble learning method for supervised 
classification introduced by Breiman(11). This nonparametric 
tree-based approach combines the concepts of adaptive 
nearest neighbors with bagging(12). The RF method can 
assess correlation and interaction among variables. 
Notably, RF can facilitate selection and ranking 
of variables by calculating variable importance values. 
These features make RF well suited to the evaluation of 
genomic data and bioinformatics investigation(13). 

Urology Journal/Vol 18 No. 6/ November-December 2021/ pp. 639-645. [DOI: 10.22037/uj.v18i.6337]
UROLOGICAL ONCOLOGY

MATERIALS and METHODS
We used the RF algorithm to predict prostate 
cancer using 13 different SNPs. The samples were 
divided into a training set (n=320) and a test set (n=80) 
in order to assess how well the trained model 
generalizes. For hyper-parameter tuning, we used 
randomized search with 5-fold cross-validation over the 
following hyper-parameters: (1) number of trees (estimators) 
in the forest (3 to 500); (2) maximum 
number of leaf nodes (2 to 32); (3) maximum number of 
features considered for the best split (5 to 13); and 
(4) whether bootstrap samples were used to build the 
trees (True or False). A total of 1000 combinations 
of these hyper-parameters were evaluated on the 
validation sets. After fixing the hyper-parameters, we 
retrained the model on the whole training set.
In total, 20 percent of the samples were used as the test 
set. These samples were chosen randomly. For evaluation 
of the hyper-parameters, one fifth of the training set was 
used as the validation set in each round of the 5-fold cross-validation.
We did not limit the maximum depth of 
the trees. Therefore, the nodes were expanded 
until all leaves became pure or until all leaves contained 
fewer samples than the "min_samples_split" threshold. To 
avoid overfitting, we used the k-fold cross-validation 
technique.
In the current study, we used the Python programming 
language, version 3.8.2. For the RF algorithm 
and the randomized hyper-parameter search, we used 
Scikit-Learn 0.23.0 (https://scikit-learn.org/).
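The tuning procedure described above could be sketched roughly as follows with scikit-learn. This is a minimal illustration under stated assumptions and not the authors' code: the genotype matrix `X` and the labels `y` are random placeholders, the 0/1/2 genotype coding, the stratified split, the accuracy scoring and the random seeds are all assumptions, and `n_iter` is reduced from the 1000 combinations reported in the text so that the example runs quickly.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)
# Placeholder data: 400 samples x 13 SNPs, genotypes coded 0/1/2 (hypothetical encoding).
X = rng.integers(0, 3, size=(400, 13))
y = rng.integers(0, 2, size=400)

# 80/20 random split, giving 320 training and 80 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Search ranges from the text, read as inclusive; scipy's randint excludes the upper bound.
param_distributions = {
    "n_estimators": randint(3, 501),
    "max_leaf_nodes": randint(2, 33),
    "max_features": randint(5, 14),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # the paper reports 1000 combinations
    cv=5,                # 5-fold cross-validation on the training set
    scoring="accuracy",  # assumed selection metric
    random_state=0,
    refit=True,          # retrain on the whole training set with the best setting
)
search.fit(X_train, y_train)

# Generalization is then estimated on the held-out test set only.
test_accuracy = search.score(X_test, y_test)
```

With `refit=True`, the best hyper-parameter combination is retrained on all 320 training samples, which corresponds to the retraining step described in the text.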
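Accuracy, sensitivity, precision, and F1-score were 
reported for both the training and test sets. Precision, 
sensitivity and F1-score were defined as in equations 1-3(14):

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{3}$$

where TP is the number of true positives, FP is the 
number of false positives, and FN is the number of false 
negatives.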
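As a worked illustration of these definitions, the sketch below computes the metrics from confusion-matrix counts. The counts are hypothetical: they are not read from the paper's confusion-matrix figures, but were chosen to be consistent with the test-set metrics reported in the Results under the assumption that the 80 test samples contain 40 cases and 40 controls. Specificity and accuracy use their standard definitions, which additionally require the number of true negatives (TN).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for an 80-sample test set (assumed 40 cases, 40 controls).
metrics = classification_metrics(tp=31, fp=8, tn=32, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, the sensitivity is 31/40 = 0.775, the specificity is 32/40 = 0.800 and the accuracy is 63/80, about 0.787.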
Furthermore, we used the receiver operating 
characteristic (ROC) curve and the area under the curve (AUC) 
to evaluate the performance of the model. We also 
report the most important SNPs based on the 
impurity-based feature importance (also known as the Gini 
importance). The Gini index measures the importance 
of a feature by computing the level of impurity of the 
samples assigned to a node based on a split at its parent
(15). The Gini index was calculated using Equation 4:

$$i(\tau) = 1 - p_1^2 - p_0^2 \tag{4}$$

SNPs for prostate cancer-Omrani et al.


SNP                 Genotype  Control  Prostate cancer  BPH

ANRIL-rs1333045     C/T       129      65               57
                    C/C       57       30               25
                    T/T       14       5                18
ANRIL-rs1333048     A/A       110      22               27
                    A/C       50       55               32
                    C/C       40       23               41
ANRIL-rs4977574     G/G       82       62               65
                    A/G       82       33               24
                    A/A       36       5                11
ANRIL-rs10757278    A/G       91       58               65
                    G/G       84       21               20
                    A/A       25       21               15
HOTAIR-rs12826786   C/T       108      37               48
                    C/C       60       25               28
                    T/T       32       38               24
HOTAIR-rs4759314    A/A       121      61               54
                    A/G       77       38               44
                    G/G       2        1                2
RORA-rs11639084     C/C       126      59               73
                    C/T       63       41               23
                    T/T       11       0                4
RORA-rs4774388      T/T       105      64               49
                    C/T       75       29               41
                    C/C       19       7                10
IL-6-rs2069845      A/G       97       54               44
                    A/A       82       33               47
                    G/G       21       13               9
IL-6-rs56588968     C/G       87       41               40
                    G/G       77       30               32
                    C/C       36       29               28
IL-8-rs4073         A/T       96       53               27
                    T/T       63       34               51
                    A/A       41       13               22
IL-8-rs2227306      C/T       92       52               43
                    C/C       76       37               35
                    T/T       32       11               22
IL-8-rs1126647      A/T       89       39               37
                    A/A       72       32               46
                    T/T       39       29               17

Table 1.  The frequency and distribution of various polymorphisms.


where n is the total number of samples at node τ, n_k 
is the number of samples from class k ∈ {0, 1}, and p_k is the 
fraction of n_k out of the n samples at node τ.
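To make these quantities concrete, the following sketch computes the Gini impurity of a node, 1 minus the sum of the squared class fractions p_k, and the impurity decrease produced by one split. It uses the control and prostate cancer counts for ANRIL-rs1333048 from Table 1 and splits on A/A versus the other genotypes. This is an illustration only: the function names are hypothetical, and the single-split decrease shown here is not the forest-level Gini importance reported in the Results, which scikit-learn accumulates over all splits and trees and then normalizes.

```python
def gini_impurity(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((n_k / n) ** 2 for n_k in counts)


def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split into left and right children."""
    n = sum(parent)
    return (
        gini_impurity(parent)
        - sum(left) / n * gini_impurity(left)
        - sum(right) / n * gini_impurity(right)
    )


# (control, prostate cancer) counts for ANRIL-rs1333048 in Table 1.
aa = (110, 22)                  # genotype A/A
not_aa = (50 + 40, 55 + 23)     # genotypes A/C and C/C pooled
parent = (aa[0] + not_aa[0], aa[1] + not_aa[1])

parent_impurity = gini_impurity(parent)
decrease = gini_decrease(parent, aa, not_aa)
print(f"parent impurity: {parent_impurity:.4f}, decrease: {decrease:.4f}")
```

A larger decrease means that the split separates the two classes more cleanly, which is why genotypes that produce large decreases across many trees rank highly.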
We measured the generalization power on the held-out 
test data rather than on the training data. The strategy for 
setting the optimum values of the hyper-parameters 
(hyper-parameter tuning) was randomized search with k-fold 
cross-validation.

RESULTS
Samples containing at least one missing (NaN) value were 
excluded. The frequency and distribution of the various 
polymorphisms are summarized in Table 1.
In the hyper-parameter tuning stage, the hyper-parameters 
were set as follows: 1) number of trees = 34; 2) 
maximum number of leaf nodes = 30; 3) maximum number of 
features = 8; and 4) bootstrap = True. The most important 
SNP was ANRIL-rs1333048: A/A (Gini index = 0.096) 

Figure 1. Training Dataset Confusion Matrix. The color bar next 
to the chart shows the frequency.

Figure 2. Test Dataset Confusion Matrix. The color bar next to the 
chart shows the frequency.

Figure 3. Visualization of the first estimator (Decision Tree) in our Random Forest Model



followed by ANRIL-rs10757278: G/G (Gini index = 
0.059).
The training dataset outcomes were as follows: accuracy 
0.896, sensitivity 0.85, specificity 0.944 and F1 
score 0.891. The test dataset outcomes were as follows: 
accuracy 0.787, sensitivity 0.775, specificity 0.800 
and F1 score 0.784.
Figures 1 and 2 show the confusion matrices for the 
training and test datasets, respectively.
We also plotted the ROC curves for both the training 
and test datasets (Figures 4 and 5). The AUC scores 
were 0.966 and 0.841 for the training and test datasets, 
respectively.
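For readers unfamiliar with how an ROC curve and its AUC are obtained from a classifier's predicted probabilities, a minimal sketch with scikit-learn follows. The labels and scores are hypothetical toy values, not data from this study. For a fitted forest, scores of this kind would typically come from `predict_proba`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels (1 = case, 0 = control) and predicted case probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90])

# Points of the ROC curve: false-positive rate versus true-positive rate
# as the decision threshold is lowered.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC computed directly from the scores, and by integrating the curve.
auc_direct = roc_auc_score(y_true, y_score)
auc_from_curve = auc(fpr, tpr)
print(f"AUC = {auc_direct:.4f}")
```

In this toy example, 13 of the 16 case-control pairs are ranked correctly, so the AUC is 13/16 = 0.8125. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.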

The feature importance of the assessed SNPs is shown 
in Figure 6. The two highest-ranked features were 
ANRIL-rs1333048: A/A and ANRIL-rs10757278: 
G/G, in that order.
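Feature names such as "ANRIL-rs1333048: A/A" suggest that each genotype was treated as a separate indicator feature. Assuming such a one-hot encoding (the paper does not state this explicitly), a ranking of this kind could be produced roughly as sketched below. The genotype table and labels are random placeholders, only two SNPs are included, and the `max_features` setting from the paper is omitted because this toy example has fewer features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical genotype table for two SNPs; values are random placeholders.
genotypes = pd.DataFrame({
    "ANRIL-rs1333048": rng.choice(["A/A", "A/C", "C/C"], size=n_samples),
    "ANRIL-rs10757278": rng.choice(["A/A", "A/G", "G/G"], size=n_samples),
})
labels = rng.integers(0, 2, size=n_samples)

# One indicator column per SNP genotype, named "<SNP>: <genotype>".
features = pd.get_dummies(genotypes, prefix_sep=": ")

# Tuned values for the number of trees, leaf nodes and bootstrap taken from the Results.
model = RandomForestClassifier(
    n_estimators=34, max_leaf_nodes=30, bootstrap=True, random_state=0
).fit(features, labels)

# Impurity-based (Gini) importance, normalized by scikit-learn to sum to 1.
importance = pd.Series(model.feature_importances_, index=features.columns)
ranking = importance.sort_values(ascending=False)
print(ranking)
```

Because the importances are normalized to sum to 1 across all features, each value is best read as a relative share of the total impurity reduction rather than as an absolute effect size.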

DISCUSSION
In the current study, we re-analyzed our genotyping 
data for 13 SNPs in a population of Iranian patients 
with prostate cancer using the RF method. This method 
has previously been applied to the analysis of SNPs in 
genetic studies. In GWAS, RF has been shown to be 
able to screen for SNPs with interaction effects. Such 


Figure 4. Training dataset ROC curve showing the AUC value of 0.966 for the proposed approach in the diagnosis of prostate cancer.

Figure 5. Test dataset ROC curve showing the AUC value of 0.841 for the proposed approach in the diagnosis of prostate cancer.



a method has reduced the number of SNPs that need to 
be carried forward for additional study compared with routine 
univariate screening strategies(16). RF has been 
applied to assess the effects of 42 SNPs 
located in the asthma risk gene ADAM33, reaching a 44% 
misclassification rate(17). In coronary artery 
calcification, RF has been applied for prediction of the effects 
of 287 tagged SNPs and 17 risk factors(18). 
RF is considered superior to artificial neural networks because it 
reduces the high variance of a flexible model such as a 
decision tree by integrating several trees into one 
ensemble model. RF offers a different interpretation 
from a single decision tree, yet with superior performance. 
RF classifiers grow a large number of decision trees 
without trimming or pruning. For each variable, this 
approach generates an importance score, which 
quantifies the variable's relative contribution to prediction(19). 
RF classifiers have been successfully used in various 
biomedical studies(20-22). Masetic and Subasi(23) 
reported that RF classifiers had better 
classification performance than decision tree, 
k-nearest neighbor, support vector machine, and 
artificial neural network classifiers in congestive heart failure 
detection. Alam et al.(21) 
suggested that other classifiers, unlike RF, do 
not perform equally well across all of the medical datasets used. 
Similar to our study, they used k-fold cross-validation 
for model evaluation. K-fold cross-validation 
evaluates a predictive model by splitting the 
initial dataset into k folds, training the model on k-1 of them 
and validating it on the remaining fold, in rotation. It can also be used 
for tuning the hyperparameters(24).
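The fold structure can be sketched as follows for a training set of the size used in this study. This is a minimal illustration that assumes scikit-learn's `KFold` with shuffling; the random seed is arbitrary and the indices stand in for the actual samples.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 320  # training-set size reported in the Methods
indices = np.arange(n_train)

# Five folds: each sample is used for validation exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # seed is arbitrary
fold_sizes = [
    (len(train_idx), len(val_idx)) for train_idx, val_idx in kfold.split(indices)
]
print(fold_sizes)
```

Each of the five rounds trains on 256 samples and validates on the remaining 64, that is, one fifth of the training set.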
RF classifiers can also be used for the analysis of 
SNPs, and numerous studies have applied the 
RF algorithm for this purpose(19,25-27). Using 
RF, Van Dyke et al.(25) suggested that an IL1A SNP 
is an important factor in predicting risk for 
non-small cell lung cancer among women. 
Staiano et al.(26) used the RF algorithm to find SNPs 
associated with high cardiovascular risk.
A valuable characteristic of RF is that it provides a readily 
computable internal measure of variable importance. 
This feature can be applied to rank variables, 
particularly in the assessment of high-throughput genomic data. 
Node impurity indices (including the Gini index) are 
frequently used as importance measures
(13). In the current study, we calculated the Gini 
importance from the decrease in node impurity at each 
node split. This approach led to the identification of 
the ANRIL-rs1333048: A/A (Gini index = 0.0967) and 
ANRIL-rs10757278: G/G (Gini index = 0.0599) 
genotypes as the most important genotypes in conferring risk 
of prostate cancer. The ANRIL rs1333048 SNP has 
previously been shown to be associated with both 
generalized and localized aggressive periodontitis. 
Moreover, it resides in a common risk locus for coronary 
artery disease and periodontitis(28). The GG genotype of 
rs10757278 has been associated with 
carotid plaque in female subjects(29). The G allele of this SNP 
interferes with the binding site for STAT1. This SNP 
also alters the expression of ANRIL and its nearby genes
(30,31), with the GG genotype conferring the lowest 
expression levels(30). This SNP also affects 
alternative splicing of ANRIL(32). Future studies are 
needed to unravel the molecular mechanisms underlying 
the importance of the ANRIL rs1333048 and rs10757278 
SNPs in susceptibility to prostate cancer in the 
Iranian population. 
Based on the outcomes for the training and test datasets, 
accuracy, sensitivity, specificity and F1 score values were 

Figure 6. Features Importance of the assessed SNPs.



lower in the test dataset than in the training dataset (for 
example, accuracy 0.787 versus 0.896). Moreover, the AUC 
score was lower in the test dataset, although it remained 
well above chance (0.841). Thus, the proposed panel of SNPs can 
predict the risk of prostate cancer in the Iranian population 
with appropriate accuracy. This panel might be used as 
a screening panel for identification of at-risk 
individuals. Further assessment of the accuracy of this panel in 
larger cohorts of patients at different stages of prostate 
cancer might reveal its significance in the determination 
of disease course or prognosis. 

ACKNOWLEDGMENTS
This study was financially supported by Urology and 
Nephrology Research Center. 

CONFLICT OF INTEREST
The authors declare they have no conflict of interest.

REFERENCES
 1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, 
Torre LA, Jemal A. Global cancer statistics 
2018: GLOBOCAN estimates of incidence 
and mortality worldwide for 36 cancers in 185 
countries. CA: a cancer journal for clinicians. 
2018;68:394-424.

 2. Basiri A, Eshrati B, Zarehoroki A, Golshan S, 
Shakhssalim N, Khoshdel A, et al. Incidence, 
Gleason Score and Ethnicity Pattern of 
Prostate Cancer in the Multi-ethnicity Country 
of Iran During 2008-2010. Urol J. 2020 May 
4;17:602-6.

 3. Rawla P. Epidemiology of Prostate Cancer. 
World J Oncol. 2019;10:63-89. 

 4. Eeles RA, Al Olama AA, Benlloch S, Saunders 
EJ, Leongamornlert DA, Tymrakiewicz 
M, et al. Identification of 23 new prostate 
cancer susceptibility loci using the iCOGS 
custom genotyping array. Nature genetics. 
2013;45:385-91.

 5. Al Olama AA, Kote-Jarai Z, Berndt SI, Conti 
DV, Schumacher F, Han Y, et al. A meta-
analysis of 87,040 individuals identifies 23 
new susceptibility loci for prostate cancer. 
Nature genetics. 2014;46:1103-9.

 6. Kote-Jarai Z, Al Olama AA, Giles GG, Severi 
G, Schleutker J, Weischer M, et al. Seven 
prostate cancer susceptibility loci identified by 
a multi-stage genome-wide association study. 
Nature genetics. 2011;43:785-91.

 7. Taheri M, Pouresmaeili F, Omrani MD, Habibi 
M, Sarrafzadeh S, Noroozi R, et al. Association 
of ANRIL gene polymorphisms with prostate 
cancer and benign prostatic hyperplasia in an 
Iranian population. Biomarkers in medicine. 
2017;11:413-22.

 8. Taheri M, Habibi M, Noroozi R, Rakhshan 
A, Sarrafzadeh S, Sayad A, et al. HOTAIR 
genetic variants are associated with prostate 
cancer and benign prostate hyperplasia in an 
Iranian population. Gene. 2017;613:20-4.

 9. Taheri M, Noroozi R, Rakhshan A, Ghanbari 
M, Omrani MD, Ghafouri-Fard S. IL-6 
genomic variants and risk of prostate cancer. 
Urology journal. 2019;1:463-8.

 10. Taheri M, Noroozi R, Dehghan A, Roozbahani 
GA, Omrani MD, Ghafouri-Fard S. Interleukin 
(IL)-8 polymorphisms and risk of prostate 
disorders. Gene. 2019;692:22-5.

 11. Breiman L. Random Forests. Machine 
Learning. 2001;45:5-32.

 12. Breiman L. Bagging predictors. Machine 
learning. 1996;24:123-40.

 13. Chen X, Ishwaran H. Random forests 
for genomic data analysis. Genomics. 
2012;99:323-9.

 14. Sokolova M, Lapalme G. A systematic analysis 
of performance measures for classification 
tasks. Information processing & management. 
2009;45:427-37.

 15. Menze BH, Kelm BM, Masuch R, 
Himmelreich U, Bachert P, Petrich 
W, et al. A comparison of random forest 
and its Gini importance with standard 
chemometric methods for the feature selection 
and classification of spectral data. BMC 
Bioinformatics. 2009;10:213.

 16. Lunetta KL, Hayward LB, Segal J, Van 
Eerdewegh P. Screening large-scale 
association study data: exploiting interactions 
using random forests. BMC genetics. 
2004;5:32.

 17. Bureau A, Dupuis J, Falls K, Lunetta KL, 
Hayward B, Keith TP, et al. Identifying 
SNPs predictive of phenotype using random 
forests. Genetic Epidemiology: The Official 
Publication of the International Genetic 
Epidemiology Society. 2005;28:171-82.

 18. Sun YV, Bielak LF, Peyser PA, Turner ST, 
Sheedy PF, Boerwinkle E, et al. Application 
of machine learning algorithms to predict 
coronary artery calcification with a sibship‐
based design. Genetic Epidemiology: The 
Official Publication of the International 
Genetic Epidemiology Society. 2008;32:350-
60.

 19. Meng Y, Yang Q, Cuenco KT, Cupples 
LA, DeStefano AL, Lunetta KL. Two-
stage approach for identifying single-
nucleotide polymorphisms associated with 
rheumatoid arthritis using random forests and 
Bayesian networks. BMC Proceedings. 
2007;1:S56.

 20. Yang F, Wang H-z, Mi H, Cai W-w. Using 
random forest for reliable classification and 
cost-sensitive learning for medical diagnosis. 
BMC bioinformatics. 2009;10:1-14.

 21. Alam MZ, Rahman MS, Rahman MS. A 
Random Forest based predictor for medical 
data classification using feature ranking. 
Informatics in Medicine Unlocked. 
2019;15:100180.

 22. Mohapatra SK, Mohanty MN. Big data analysis 
and classification of biomedical signal using 
random forest algorithm.  New Paradigm in 
Decision Science and Management: Springer; 
2020. p. 217-24.

 23. Masetic Z, Subasi A. Congestive heart failure 
detection using random forest classifier. 
Comput Methods Programs Biomed. 2016 
Jul;130:54-64. 

 24. Fushiki T. Estimation of prediction error by 
using K-fold cross-validation. Statistics and 
Computing. 2011;21:137-46.

 25. Van Dyke AL, Cote ML, Wenzlaff AS, 
Chen W, Abrams J, Land S, et al. Cytokine 
and cytokine receptor single-nucleotide 
polymorphisms predict risk for non-small cell 
lung cancer among women. Cancer Epidemiol 
Biomarkers Prev. 2009 Jun;18:1829-40.

 26. Staiano A, Di Taranto MD, Bloise E, 
D’Agostino MN, D’Angelo A, Marotta G, 
et al. Investigation of Single Nucleotide 
Polymorphisms Associated to Familial 
Combined Hyperlipidemia with Random 
Forests. In: Apolloni B, Bassis S, Esposito 
A, Morabito FC, editors. Neural Nets and 
Surroundings: 22nd Italian Workshop on 
Neural Nets, WIRN 2012, May 17-19, Vietri 
sul Mare, Salerno, Italy. Berlin, Heidelberg: 
Springer Berlin Heidelberg; 2013. p. 169-78.

 27. Bao L, Cui Y. Prediction of the phenotypic 
effects of non-synonymous single nucleotide 
polymorphisms using structural and 
evolutionary information. Bioinformatics. 
2005;21:2185-90.

 28. Schaefer AS, Richter GM, Groessner-
Schreiber B, Noack B, Nothnagel M, El 
Mokhtari N-E, et al. Identification of a shared 
genetic susceptibility locus for coronary 
heart disease and periodontitis. PLoS Genet. 
2009;5:e1000378. 

 29. Zivotić I, Djurić T, Stanković A, Djordjević 
A, Končar I, Davidović L, et al. 9p21 locus 
rs10757278 is associated with advanced carotid 
atherosclerosis in a gender-specific manner. 
Exp Biol Med (Maywood). 2016;241:1210-6. 

 30. Liu Y, Sanoff HK, Cho H, Burd CE, Torrice 
C, Mohlke KL, et al. INK4/ARF transcript 
expression is associated with chromosome 
9p21 variants linked to atherosclerosis. PloS 
one. 2009;4.

 31. Burd CE, Jeck WR, Liu Y, Sanoff HK, 
Wang Z, Sharpless NE. Expression of linear 
and novel circular forms of an INK4/ARF-
associated non-coding RNA correlates with 
atherosclerosis risk. PLoS Genet. 2010;6.

 32. Zhang W, Chen Y, Liu P, Chen J, Song 
L, Tang Y, et al. Variants on chromosome 
9p21.3 correlated with ANRIL expression 
contribute to stroke risk and recurrence in a 
large prospective stroke population. Stroke. 
2012;43:14-21.
