306 J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312 Original ISSN 2413-0516 Introduction Liver cancer is the sixth most common cancer in the world- wide.1 The mechanisms of hepatocarcinogenesis are not fully understood; however, the theory of cancer stem cells (CSCs) has recently gained traction as a potential contributor to hepatic cancer. CSCs display great plasticity and self-renewal potential and play a decisive role in tumor formation and growth.2 These cells are highly drug-resistant and metastatic, which may underlie the recurrence and drug resistance of liver cancer.3, 4 Therefore, the identification of liver CSCs markers and therapeutic targets associated with them is necessary for improving treatment outcomes.5, 6 Biomarker is defined as “a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.” Biomarkers play an important role in the early diagnosis of diseases.7, 8 Microarray technology has revolutionized gene expres- sion analysis and has been used widely for the identification of cancer biomarkers. The use of high-throughput techniques has resulted in an exponential growth in the amount of infor- mation available in biomedical databases, which then can be exploited to integrate gene expression data for applications such as biomarker discovery, disease classification, or pheno- type comparisons, among others.9 However, these types of data are characterized by high dimensionality as the number of genes is far bigger than the number of samples. One of the biggest challenges of high-dimensional data are the curse of dimensionality, which describes the exponential increase in volume associated with adding extra dimensions in the Euclidean space. It is responsi- ble for the breakdown of the optimal statistical model fitting.10 To address the problem, researchers have applied various machine learning methods to reduce both cardinality and redundancy of gene expression data during the classification process, and most of these methods were developed to facil- itate the analysis of microarray data to identify the best dis- criminative genes or biomarkers.11 Extreme gradient boosting (XGBoost) is a machine learn- ing algorithm that assigns an importance score to each feature in the training phase, and these scores can then be used as the basis for identification of importance features. Since it uses multiple subsets of features to predict outcomes in the area of dimension cursed problem ensemble models may have better performance. Moreover, XGBoost is an optimized distributed gradient boosting that achieves state-of-the-art prediction performances.12 Feature importance ranking using the common tree ensemble models such as XGBoost and gbm R packages may provide inconsistent results. These methods only consider the effect of splits along the decision path. Therefore, when the model relies more on a given feature, the importance assigned to that feature changes incorrectly.13 Regarding model inter- pretation, which is especially important when using machine learning models that are often difficult to interpret, several studies have used Shapley Additive exPlanations (SHAP).14-16 Proposed by Lundberg and Lee, 17 SHAP is based on game the- ory18 and local explanations19 that offers a means to estimate the contribution of each feature. SHAP provides a consistent importance value, which is an alternative to permutation fea- ture importance.20 In the present study, we applied machine learning to gene expression data from previous studies on liver CSCs to iden- tify putative biomarkers for identification and characterization of these cells. Potential biomarker detection for liver cancer stem cell by machine learning approach Ali Farzane1, Maryam Akbarzadeh2, Reza Ferdousi3, Mohammadreza Rashidi4, Reza Safdari1 1Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran. 2Department of Biochemistry, Erasmus University Medical Center, Rotterdam, Netherlands. 3Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Daneshgah St, Tabriz, Iran. 4Stem Cell and Regenerative Medicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran. Corresponding Author: Reza Safdari (E-mail: rsafdari@tums.ac.ir ) (Submitted: 05 October 2020 – Revised version received: 19 October 2020 – Accepted: 27 October 2020 – Published online: 30 December 2020) Abstract Objectives: In this study, we aimed to identify putative biomarkers for identification and characterization of these cells in liver cancer. Methods: We employed a supervised machine learning method, XGBoost, to data from 13 GEO data series to classify samples using gene expression data. Results: Across the 376 samples [129 cancer stem cell, CSCs and 247 non-CSCs cases], XGBoost displayed high performance in the classification of data. XGBoost feature importance scores and SHAP (Shapley Additive explanation) values were used for the interpretation of results and analysis of individual gene importance. We confirmed that expression levels of a 10-gene set (PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1) were predictive. The results showed that these 10 genes can detect CSCs robustly with accuracy, sensitivity, and specificity of 97%, 100%, and 95%, respectively. Conclusions. We suggest that the 10-gene set may be used as a biomarker set for detecting and characterizing CSCs using gene expression data. Key words: Liver cancer, cancer stem cell, machine learning, biomarker, gene expression 307 Original Potential biomarker detection for liver cancer stem cell by machine learning approachAli Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312 Materials & Methods Data Preparation The study was approved by the ethics committee at Tehran University of Medical Sciences, Tehran, Iran (ethics code: IR.TUMS.REC.1394.1589). Since we used non-identifiable information from a publicly available data set, no specific con- sent was required. Datasets Gene expression data for samples of liver CSC and non-CSC were downloaded from GEO through accession numbers GSE56771, GSE59713, GSE66529, GSE84223, GSE68778, GSE126121, GSE112788, GSE66515, GSE42318, GSE131680, GSE103866, and GSE62905, and a data series including only non-CSC samples were downloaded via accession num- ber GSE112790. The first nine of the above data series were expression profiling by array, while the remaining four were high-throughput sequencing (HiSeq). We excluded samples that had undergone intervention. Finally, a total of 129 CSC and 247 non-CSC samples were included in the study. Preprocessing R language was used for all processing steps, including prepro- cessing, modeling, and pathway analysis. The dropout effect was eliminated in the HiSeq series. We performed log trans- formation on data series, followed by quartile normalization where it was needed. Next, the Ensembl database (https:// www.ensembl.org/index.html) was used to convert IDs from different data series to Ensembl IDs. Data series were inte- grated using the Merge function, and then the ComBat func- tion from the sva R package was applied to remove batch effects. Finally, we had 8409 common transcript genes across all data series. XGBoost model for biomarker signature identification Data classification was performed using XGBoost, which is an efficient implementation of the gradient boosting framework proposed by Chen and Guestrin.21 Gradient boosted decision tree is an ensemble learning method based on sequential deci- sion trees whereby each decision tree learns from the previous tree to improve the model and build a strong learner. Model tuning In XGBoost, several parameters need to be selected to max- imize model performance. However, the multiplicity of parameters may result in a model learning noises and random fluctuations and considering them meaningful, a phenome- non referred to as overfitting. Overfitting is a modeling error that occurs when a function is too closely fit to a limited set of data points.22 Parameter tuning is an essential step in avoiding overfitting or undue complexity. The hyperparameters adopted in this study were “nrounds,” “eta,” “min_child_weight,” “max_ depth,” “gamma,” “colsample_bytree,” “subsample,” “lambda,” and “alpha.”20 The parameter “nrounds” is the number of trees that are fitted in the model. The “eta” parameter refers to the learning rate, which is used to make the model more robust. The “min_ child_weight” is the minimum sum of instance weight needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than “min_child_weight,” the building process will stop further partitioning.21 The “max_depth” parameter defines the maximum num- ber of partitions, with greater maximum depth increasing the risk of overfitting. The minimum loss reduction required to make a further partition on a leaf node of the tree is defined as “gamma,” with a larger “gamma” resulting in a more con- servative algorithm. The “subsample” parameter refers to the fraction of observations randomly selected for the training instances, which is inversely related to overfitting. Another parameter useful in avoiding overfitting is “colsample_bytree.” Finally, the parameters “lambda” and “alpha” are L2 and L1 regularization terms, respectively, that keep the weights small, thus preventing overfitting. The random search method was used for model tuning. Random search means that hyperparameters are randomly picked from the predefined searching domain uniformly and the searching does not depend on the previous boosting result. It has been shown to be efficient for problems with high dimensions in some studies.23 Model evaluation In this study, accuracy, sensitivity, and specificity were assessed to evaluate model performance. They are defined in Equations (1)–(3). The goal is to develop a model with high accuracy, sensitivity, and specificity. 307 Original Potential biomarker detection for liver cancer stem cell by machine learning approachAli Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–313 Materials & Methods Data Preparation The study was approved by the ethics committee at Tehran University of Medical Sciences, Tehran, Iran (ethics code: IR.TUMS.REC.1394.1589). Since we used non-identifiable information from a publicly available data set, no specific con- sent was required. Datasets Gene expression data for samples of liver CSC and non-CSC were downloaded from GEO through accession numbers GSE56771, GSE59713, GSE66529, GSE84223, GSE68778, GSE126121, GSE112788, GSE66515, GSE42318, GSE131680, GSE103866, and GSE62905, and a data series including only non-CSC samples were downloaded via accession num- ber GSE112790. The first nine of the above data series were expression profiling by array, while the remaining four were high-throughput sequencing (HiSeq). We excluded samples that had undergone intervention. Finally, a total of 129 CSC and 247 non-CSC samples were included in the study. Preprocessing R language was used for all processing steps, including prepro- cessing, modeling, and pathway analysis. The dropout effect was eliminated in the HiSeq series. We performed log trans- formation on data series, followed by quartile normalization where it was needed. Next, the Ensembl database (https:// www.ensembl.org/index.html) was used to convert IDs from different data series to Ensembl IDs. Data series were inte- grated using the Merge function, and then the ComBat func- tion from the sva R package was applied to remove batch effects. Finally, we had 8409 common transcript genes across all data series. XGBoost model for biomarker signature identification Data classification was performed using XGBoost, which is an efficient implementation of the gradient boosting framework proposed by Chen and Guestrin.21 Gradient boosted decision tree is an ensemble learning method based on sequential deci- sion trees whereby each decision tree learns from the previous tree to improve the model and build a strong learner. Model tuning In XGBoost, several parameters need to be selected to max- imize model performance. However, the multiplicity of parameters may result in a model learning noises and random fluctuations and considering them meaningful, a phenome- non referred to as overfitting. Overfitting is a modeling error that occurs when a function is too closely fit to a limited set of data points.22 Parameter tuning is an essential step in avoiding overfitting or undue complexity. The hyperparameters adopted in this study were “nrounds,” “eta,” “min_child_weight,” “max_ depth,” “gamma,” “colsample_bytree,” “subsample,” “lambda,” and “alpha.”20 The parameter “nrounds” is the number of trees that are fitted in the model. The “eta” parameter refers to the learning rate, which is used to make the model more robust. The “min_ child_weight” is the minimum sum of instance weight needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than “min_child_weight,” the building process will stop further partitioning.21 The “max_depth” parameter defines the maximum num- ber of partitions, with greater maximum depth increasing the risk of overfitting. The minimum loss reduction required to make a further partition on a leaf node of the tree is defined as “gamma,” with a larger “gamma” resulting in a more con- servative algorithm. The “subsample” parameter refers to the fraction of observations randomly selected for the training instances, which is inversely related to overfitting. Another parameter useful in avoiding overfitting is “colsample_bytree.” Finally, the parameters “lambda” and “alpha” are L2 and L1 regularization terms, respectively, that keep the weights small, thus preventing overfitting. The random search method was used for model tuning. Random search means that hyperparameters are randomly picked from the predefined searching domain uniformly and the searching does not depend on the previous boosting result. It has been shown to be efficient for problems with high dimensions in some studies.23 Model evaluation In this study, accuracy, sensitivity, and specificity were assessed to evaluate model performance. They are defined in Equations (1)–(3). The goal is to develop a model with high accuracy, sensitivity, and specificity. accuracy True positive True negative Total sample = +∑ ∑ ∑ ( )1 sensitivity True positive Predicted positive = ( )∑ ∑ 2 specificity True negative Predictednegative = ∑ ∑ ( )3 Gene selection Two gene importance ranking lists obtained from XGBoost and SHAP values were considered for the selection of candi- date genes. We selected the 10 highest-ranking genes common to both lists as the marker genes. In XGBoost, feature importance is measured using three metrics, namely, gain, cover, and frequency. Gain is the contri- bution of a feature to the accuracy of the branches on them it is located. Cover measures the relative quantity of observations concerned by a feature. Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. We used the gain score to create a ranked list of genes.24 The second ranking list was created with SHAP, which explains the prediction of an instance x by computing the con- tribution of each feature to the prediction. The Shapley value for a feature j is the feature’s contribution to the prediction, weighted and summed over all possible feature value combi- nations that determined through Formula 4: Gene selection Two gene importance ranking lists obtained from XGBoost and SHAP values were considered for the selection of candi- date genes. We selected the 10 highest-ranking genes common to both lists as the marker genes. In XGBoost, feature importance is measured using three metrics, namely, gain, cover, and frequency. Gain is the contri- bution of a feature to the accuracy of the branches on them it is located. Cover measures the relative quantity of observations concerned by a feature. Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. We used the gain score to create a ranked list of genes.24 The second ranking list was created with SHAP, which explains the prediction of an instance x by computing the con- tribution of each feature to the prediction. The Shapley value for a feature j is the feature’s contribution to the prediction, weighted and summed over all possible feature value combi- nations that determined through Formula 4: 308 Original Potential biomarker detection for liver cancer stem cell by machine learning approach Ali Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312308 Original Potential biomarker detection for liver cancer stem cell by machine learning approach Ali Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–313 ∅ = − −( ) { }( )− ( )( ) ⊆ … ∑ ∪j S x x x i p i S p S p v S x v S { , , }\{ } ! ! ! ( ) 1 1 4 SHAP values can be used as feature importance scores: features with large absolute Shapley values are important. Since we wanted global importance, we averaged the absolute Shapley values per feature across the data (Formula 5): I j i n j i = ∅ = ( )∑ 1 5| | ( ) SHAP feature importance is an alternative to permuta- tion feature importance. However, while permutation feature importance is based on the decrease in model performance, SHAP is based on the magnitude of feature attributions.25 Enrichment analysis Pathway analysis for the top 10 genes was carried out using REACTOME, an open-source, open access, manually curated, and peer-reviewed pathway database. REACTOME provides intuitive bioinformatics tools for the visualization, interpreta- tion, and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology, and education. Gene ontology analysis was performed with Enrichr web-based tools and services. Results Data Preparation Datasets. In this study, our aim was to screen and mine for specific biomarkers for liver CSCs using the online available data. At the first stage, we obtained the gene expression profile of liver CSCs cell line and tissue including 376 samples (129 liver CSCs and 247 non-CSCs) from published data in the GEO database. Preprocessing. The ID for each data series, which was cap- tured from its own platforms annotation file, was mapped to an Ensembl ID via the biomaRt package in R to unify the datasets. Then, the data series were merged, followed by a batch correc- tion accomplished using the ComBat function from the sva package. The distributions of expression values before and after the batch effect correction for the combined datasets are shown in Fig. 1a–f. The difference in gene expression distribution between CSCs (Fig. 1a) and non-CSC samples was also magni- fied after batch effect correction (Fig. 1b). The quantile–quan- tile (Q–Q) plots (Fig. 1c–d) revealed a decrease in the distance between dots and the normal distribution line after removal of the batch effect. Finally, the PCA plots display the batch effect due to the integration of data from various studies (Fig. 1e), which has been resolved after applying the batch effect correc- tion (Fig. 1f ). Therefore, the gene expression data sets would be reliable for the subsequent analysis after batch effect correction. XGBoost model for a biomarker signature Model tuning. The hyperparameters adopted in this study were “eta,” “nrounds,” “max_depth,” “gamma,” “lambda,” “alpha,” “min_child_weight,” “subsample,” and “colsample_ bytree.” Table 1 presents the search domains and optimal val- ues for the hyperparameters. Gene selection and models evaluation. We randomly selected 65% of the data to train XGBoost and used the remaining 35% to test the model. Then, we employed a sev- enfold cross-validation process to assess model performance stability. Fig. 2a displays the results of cross-validation for the three performance measures, i.e., accuracy, sensitivity, and specificity. The obtained accuracy was 88.68–94.45, sensitiv- ity 86.68–94.11, and specificity 87.89–94.87. Overall, these indicators are significantly high and suggest that XGBoost can be used to model cell classification. XGBoost was finally retrained on the 75% of the training set and tested on the 25% of the testing set. The final performance indicators achieved were as follows: accuracy: 90%, sensitivity: 94%, and specificity: 89%, which is again indicative of significantly high performance. For gene ranking, we created 1000 models using the same hyperparameters, with each model assigning an importance score to each gene. Then, the median of the 1000 scores for each gene was computed to obtain the average score for the gene. We re-ranked these genes based on SHAP values and the gain scores. The 10 top-ranking genes remained the same in both rankings. We select these genes as a potential biomarker set. The biomarker genes were PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1. To see if the selected marker genes could serve as “univer- sal” markers for cell classification using gene expression data, we trained the XGBoost model using only the selected marker genes. Interestingly, the selected genes did reasonably better than all-gene modeling. The performance indicators achieved were as follows: accuracy: 97%, sensitivity: 100%, and specific- ity: 95% (Fig. 2b). We also calculated the SHAP value for each marker gene (Fig. 3). Enrichment analysis We further tested the top 10 genes for enriched gene ontol- ogy terms by the Enrichr web service and analysis tools of the Reactome website for pathways (Fig. 4). Results A total of 57 were recruited in this study. Of 57 children with G6PD deficiency, 17 (29.8%) were diagnosed as severe cases depending on the level of hemoglobin at time of presenta- tion to the emergency room in Children Welfare Teaching Hospital. The total number is 57 patients. The mean age is 4.35 years, males were constituting 85.9%, 32 patients were living in Urban area mostly from Baghdad, two-thirds of the patients have recorded ingestion of fresh type of fava beans (64.9%), while one-third ingested dried type (35%). Ten patients were already diagnosed with G6PD prior to the presentation; six of them were diagnosed based on previous episodes of hemoly- sis, and four depending on routine assessment for a suspected family member. Analysis of the demographic and clinical characteristics of the 57 patients according to the severity of hemolysis (mild vs severe) and the results of the independent samples t-test (for continuous variables) and Chi square (for categorical vari- ables) are represented in Table 1. Younger age group patients tend to present with the severe form of hemolysis (3.59 years with a P value of 0.001). No significant gender susceptibility SHAP values can be used as feature importance scores: features with large absolute Shapley values are important. Since we wanted global importance, we averaged the absolute Shapley values per feature across the data (Formula 5): 308 Original Potential biomarker detection for liver cancer stem cell by machine learning approach Ali Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–313 ∅ = − −( ) { }( )− ( )( ) ⊆ … ∑ ∪j S x x x i p i S p S p v S x v S { , , }\{ } ! ! ! ( ) 1 1 4 SHAP values can be used as feature importance scores: features with large absolute Shapley values are important. Since we wanted global importance, we averaged the absolute Shapley values per feature across the data (Formula 5): I j i n j i = ∅ = ( )∑ 1 5| | ( ) SHAP feature importance is an alternative to permuta- tion feature importance. However, while permutation feature importance is based on the decrease in model performance, SHAP is based on the magnitude of feature attributions.25 Enrichment analysis Pathway analysis for the top 10 genes was carried out using REACTOME, an open-source, open access, manually curated, and peer-reviewed pathway database. REACTOME provides intuitive bioinformatics tools for the visualization, interpreta- tion, and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology, and education. Gene ontology analysis was performed with Enrichr web-based tools and services. Results Data Preparation Datasets. In this study, our aim was to screen and mine for specific biomarkers for liver CSCs using the online available data. At the first stage, we obtained the gene expression profile of liver CSCs cell line and tissue including 376 samples (129 liver CSCs and 247 non-CSCs) from published data in the GEO database. Preprocessing. The ID for each data series, which was cap- tured from its own platforms annotation file, was mapped to an Ensembl ID via the biomaRt package in R to unify the datasets. Then, the data series were merged, followed by a batch correc- tion accomplished using the ComBat function from the sva package. The distributions of expression values before and after the batch effect correction for the combined datasets are shown in Fig. 1a–f. The difference in gene expression distribution between CSCs (Fig. 1a) and non-CSC samples was also magni- fied after batch effect correction (Fig. 1b). The quantile–quan- tile (Q–Q) plots (Fig. 1c–d) revealed a decrease in the distance between dots and the normal distribution line after removal of the batch effect. Finally, the PCA plots display the batch effect due to the integration of data from various studies (Fig. 1e), which has been resolved after applying the batch effect correc- tion (Fig. 1f ). Therefore, the gene expression data sets would be reliable for the subsequent analysis after batch effect correction. XGBoost model for a biomarker signature Model tuning. The hyperparameters adopted in this study were “eta,” “nrounds,” “max_depth,” “gamma,” “lambda,” “alpha,” “min_child_weight,” “subsample,” and “colsample_ bytree.” Table 1 presents the search domains and optimal val- ues for the hyperparameters. Gene selection and models evaluation. We randomly selected 65% of the data to train XGBoost and used the remaining 35% to test the model. Then, we employed a sev- enfold cross-validation process to assess model performance stability. Fig. 2a displays the results of cross-validation for the three performance measures, i.e., accuracy, sensitivity, and specificity. The obtained accuracy was 88.68–94.45, sensitiv- ity 86.68–94.11, and specificity 87.89–94.87. Overall, these indicators are significantly high and suggest that XGBoost can be used to model cell classification. XGBoost was finally retrained on the 75% of the training set and tested on the 25% of the testing set. The final performance indicators achieved were as follows: accuracy: 90%, sensitivity: 94%, and specificity: 89%, which is again indicative of significantly high performance. For gene ranking, we created 1000 models using the same hyperparameters, with each model assigning an importance score to each gene. Then, the median of the 1000 scores for each gene was computed to obtain the average score for the gene. We re-ranked these genes based on SHAP values and the gain scores. The 10 top-ranking genes remained the same in both rankings. We select these genes as a potential biomarker set. The biomarker genes were PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1. To see if the selected marker genes could serve as “univer- sal” markers for cell classification using gene expression data, we trained the XGBoost model using only the selected marker genes. Interestingly, the selected genes did reasonably better than all-gene modeling. The performance indicators achieved were as follows: accuracy: 97%, sensitivity: 100%, and specific- ity: 95% (Fig. 2b). We also calculated the SHAP value for each marker gene (Fig. 3). Enrichment analysis We further tested the top 10 genes for enriched gene ontol- ogy terms by the Enrichr web service and analysis tools of the Reactome website for pathways (Fig. 4). Results A total of 57 were recruited in this study. Of 57 children with G6PD deficiency, 17 (29.8%) were diagnosed as severe cases depending on the level of hemoglobin at time of presenta- tion to the emergency room in Children Welfare Teaching Hospital. The total number is 57 patients. The mean age is 4.35 years, males were constituting 85.9%, 32 patients were living in Urban area mostly from Baghdad, two-thirds of the patients have recorded ingestion of fresh type of fava beans (64.9%), while one-third ingested dried type (35%). Ten patients were already diagnosed with G6PD prior to the presentation; six of them were diagnosed based on previous episodes of hemoly- sis, and four depending on routine assessment for a suspected family member. Analysis of the demographic and clinical characteristics of the 57 patients according to the severity of hemolysis (mild vs severe) and the results of the independent samples t-test (for continuous variables) and Chi square (for categorical vari- ables) are represented in Table 1. Younger age group patients tend to present with the severe form of hemolysis (3.59 years with a P value of 0.001). No significant gender susceptibility SHAP feature importance is an alternative to permuta- tion feature importance. However, while permutation feature importance is based on the decrease in model performance, SHAP is based on the magnitude of feature attributions.25 Enrichment analysis Pathway analysis for the top 10 genes was carried out using REACTOME, an open-source, open access, manually curated, and peer-reviewed pathway database. REACTOME provides intuitive bioinformatics tools for the visualization, interpreta- tion, and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology, and education. Gene ontology analysis was performed with Enrichr web-based tools and services. Results Data Preparation Datasets. In this study, our aim was to screen and mine for specific biomarkers for liver CSCs using the online available data. At the first stage, we obtained the gene expression profile of liver CSCs cell line and tissue including 376 samples (129 liver CSCs and 247 non-CSCs) from published data in the GEO database. Preprocessing. The ID for each data series, which was cap- tured from its own platforms annotation file, was mapped to an Ensembl ID via the biomaRt package in R to unify the datasets. Then, the data series were merged, followed by a batch correc- tion accomplished using the ComBat function from the sva package. The distributions of expression values before and after the batch effect correction for the combined datasets are shown in Fig. 1a–f. The difference in gene expression distribution between CSCs (Fig. 1a) and non-CSC samples was also magni- fied after batch effect correction (Fig. 1b). The quantile–quan- tile (Q–Q) plots (Fig. 1c–d) revealed a decrease in the distance between dots and the normal distribution line after removal of the batch effect. Finally, the PCA plots display the batch effect due to the integration of data from various studies (Fig. 1e), which has been resolved after applying the batch effect correc- tion (Fig. 1f ). Therefore, the gene expression data sets would be reliable for the subsequent analysis after batch effect correction. XGBoost model for a biomarker signature Model tuning. The hyperparameters adopted in this study were “eta,” “nrounds,” “max_depth,” “gamma,” “lambda,” “alpha,” “min_child_weight,” “subsample,” and “colsample_ bytree.” Table 1 presents the search domains and optimal val- ues for the hyperparameters. Gene selection and models evaluation. We randomly selected 65% of the data to train XGBoost and used the remaining 35% to test the model. Then, we employed a sevenfold cross-vali- dation process to assess model performance stability. Fig. 2a displays the results of cross-validation for the three perfor- mance measures, i.e., accuracy, sensitivity, and specificity. The obtained accuracy was 88.68–94.45, sensitivity 86.68–94.11, and specificity 87.89–94.87. Overall, these indicators are sig- nificantly high and suggest that XGBoost can be used to model cell classification. XGBoost was finally retrained on the 75% of the training set and tested on the 25% of the testing set. The final performance indicators achieved were as follows: accuracy: 90%, sensitivity: 94%, and specificity: 89%, which is again indicative of significantly high performance. For gene ranking, we created 1000 models using the same hyperparameters, with each model assigning an importance score to each gene. Then, the median of the 1000 scores for each gene was computed to obtain the average score for the gene. We re-ranked these genes based on SHAP values and the gain scores. The 10 top-ranking genes remained the same in both rankings. We select these genes as a potential biomarker set. The biomarker genes were PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1. To see if the selected marker genes could serve as “univer- sal” markers for cell classification using gene expression data, we trained the XGBoost model using only the selected marker genes. Interestingly, the selected genes did reasonably better than all-gene modeling. The performance indicators achieved were as follows: accuracy: 97%, sensitivity: 100%, and specific- ity: 95% (Fig. 2b). We also calculated the SHAP value for each marker gene (Fig. 3). Enrichment analysis We further tested the top 10 genes for enriched gene ontol- ogy terms by the Enrichr web service and analysis tools of the Reactome website for pathways (Fig. 4). Discussion Liver cancer is a leading global health issue associated with high morbidity and mortality rate.26 In recent years, CSCs have been reported to make important contributions to tumor recurrence, progression, and therapeutic resistance. Therefore, therapeutic targeting of liver stem cells is necessary.5 In this study, we integrated data series from GEO to iden- tify potential biomarkers for liver CSCs via machine learning classification-based gene selection. One application of inte- grated gene expression is biomarker discovery. The integra- tion of data from multiple studies increases the sample size by incorporating samples from different cohorts, increasing the statistical power and the robustness of the results. However, it should be mentioned that increasing the sample size reduces the gene count in the integrated data set, resulting in informa- tion loss.9, 27, 28 Thirteen (13) data series from GEO were integrated pro- ducing a total of 385 samples in two groups (CSCs and non- CSCs). Batch correction was conducted with ComBat of the sva R package, which is frequently used in this area.10, 29-32 We used XGBoost for cell type prediction, as it offers high pre- diction accuracy and has stronger interpretability owing to its state-of-the-art algorithms. Because of these advantages, 309 Original Potential biomarker detection for liver cancer stem cell by machine learning approachAli Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312 researchers are increasingly using XGBoost in biomarker discovery.12, 33, 34 To improve the prediction performance, we tuned hyperparameters using random search, which is more efficient than either a traditional manual or grid search and evaluates more of the search space, especially when the search space has more than three dimensions.35 As we did not have an external data set to evaluate our model, we performed a sevenfold cross-validation with accuracy as the overall metric and sensitivity and specificity as the class-specific metrics. In our model, the values for the three metrics indicate high per- formance at both training and testing stages, suggesting that XGBoost can effectively distinguish the two classes.12, 20 Table 1. Tuned hyperparameter and Searching domain in XGBoost. Name Domain Transformation function Optimal Hyperparameter Value eta [0.01 , 0.10] - 0.088 nrounds [100 , 1000] - 475 max_depth [4 , 10] - 5 gamma [-1 , 0] F(x) = 10x 0.62 lambda [-1 , 1] F(x) = 10x 1.59 alpha [-1 , 1] F(x) = 10x 1.08 min_child_weight [1 , 12] - 3.28 subsample [0.5 , 1] - 0.61 colsample_bytree [0.5 , 1] - 0.85 Fig 1. The density, Q–Q, and PCA plots for evaluating the effect of the batch removal method on overall data. (a) The density plot before batch effect removal. (b) The density plot after batch effect removal (c) The Q–Q plot before batch effect removal. (d) The Q–Q plot after batch effect removal. (e) The PCA plot before batch effect removal. (f ) The PCA plot after batch effect removal. The dashed line in the density plot represents CSCs samples, and the continuous line represents non-CSCs samples. N, the gene number in the combined data set. 309 Original Potential biomarker detection for liver cancer stem cell by machine learning approachAli Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–313 Table 1. Tuned hyperparameter and Searching domain in XGBoost. Name Domain Transformation function Optimal Hyperparameter Value eta [0.01 , 0.10] - 0.088 nrounds [100 , 1000] - 475 max_depth [4 , 10] - 5 gamma [-1 , 0] F(x) = 10x 0.62 lambda [-1 , 1] F(x) = 10x 1.59 alpha [-1 , 1] F(x) = 10x 1.08 min_child_weight [1 , 12] - 3.28 subsample [0.5 , 1] - 0.61 colsample_bytree [0.5 , 1] - 0.85 Fig 1. The density, Q–Q, and PCA plots for evaluating the effect of the batch removal method on overall data. (a) The density plot before batch effect removal. (b) The density plot after batch effect removal (c) The Q–Q plot before batch effect removal. (d) The Q–Q plot after batch effect removal. (e) The PCA plot before batch effect removal. (f ) The PCA plot after batch effect removal. The dashed line in the density plot represents CSCs samples, and the continuous line represents non-CSCs samples. N, the gene number in the combined data set. 310 Original Potential biomarker detection for liver cancer stem cell by machine learning approach Ali Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312 Fig. 2. Models performance presentation. (a) The XGBoost model’s sevenfold cross-validation plot. (b) Selected gene XGBoost model ROC generates an AUC value greater than that achieved using all-gene XGBoost. Fig. 3. SHAP summary plot. Contribution of each gene to model (XGBoost with top 10 genes) output. 311 Original Potential biomarker detection for liver cancer stem cell by machine learning approachAli Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312 XGBoost assigns an importance score for each feature, which can be used for feature selection. SHAP values also pro- vide a means of ranking features as well as providing a mea- sure of prediction consistency. Therefore, for more certainty, we used both XGBoost importance and SHAP values.13, 20 The SHAP values and gain scores for genes were imputed, as described by Riberio et al and Chen et al, respectively.19, 21 We created 1000 XGBoost models by the same tuning parameters and then obtained feature importance score (gain) and SHAP values for each of the 1000 models. Finally, the median of 1000 scores was calculated for each gene. These scores are reliable and can be used as gene ranks.12 Genes were ordered separately by gain score and SHAP value. The top 10 genes from the SHAP list were selected, 9 of which were also among the top 10 genes in the gain score list and the remaining 1 corre- sponded to the 11th gene in this list. So, the 10 highest-ranking genes from the SHAP list were selected as the biomarker set, including PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1. To ensure that this gene set can be used as a biomarker set, we trained our model using only these 10 genes, which offered better predic- tion performance compared with all-gene models. Many studies have shown that CSCs have one or more abnormalities in signaling pathways that regulate cell cycle and self-renewal. The cellular pathways in which the key genes are most involved are pathways associated with cell cycle regulation.36 For example, the aurora kinase b (Aurkb)- protein phosphatase 1 (PP1) axis has been shown to mediate the resetting of Oct4 during the cell cycle in embryonic stem cells. Aurkb-PP1 axis also plays a critical role in cell cycle-de- pendent changes in kinetochore assembly by regulating the balance between phosphorylation and dephosphorylation of kinetochore substrates.37 SMC1A has also a key role in tumor metastasis and resis- tance to radiation therapy. This gene is associated with CSCs, epithelial-to-mesenchymal transition, and DNA–damage response pathways. Yadav et al demonstrated that suppres- sion of SMC1A expression reduces the self-renewal capacity of prostate cancer cells.38 PTGER3 induces tumorigenesis and drug resistance in ovarian cancer.36 LMBR1 is a regulator of nuclear stemness marker BMI1 in gastrointestinal stromal tumors.39 Another study has reported PFKL to play a vital role in the maintenance of CSC-like phenotype in hepatocellular carcinoma.40 L1CAM has been implicated in maintaining the growth and survival of CD133+ glioma cells both in vitro and in vivo and has been suggested to be a CSC-specific therapeu- tic target for improving the treatment of malignant gliomas and other brain tumors.41 Fig. 4. Pathway and gene ontology analysis of selected genes. (a) Selected genes-involved pathways. (b) Overall view of cell cycle pathways in which the selected genes were involved. (c) Selected genes involved in biological processes. (d) Selected genes involved in molecular functions. (e) Selected genes involved in cellular component. 312 Original Potential biomarker detection for liver cancer stem cell by machine learning approach Ali Farzane J Contemp Med Sci | Vol. 6, No. 6, November–December 2020: 306–312 Conclusions These 10 key genes were found to play important roles in liver CSC maintenance. It seems that AURKB is more important for controlling the stemness and may help in the treatment of liver cancer. This gene may be a therapeutic target for inhib- iting liver cancer stemness characteristics. However, this con- clusion is based on retrospective data, and validation of these findings warrants further biological studies. Acknowledgments This work supported by the Department of Health Information Management, School of Allied Medical Science, Tehran University of Medical Sciences. Conflict of interest The authors of this paper declare that they do not have any conflict of interest. References 1. Dasgupta P, Henshaw C, Youlden DR, et al. Global trends in incidence rates of primary adult liver cancers: A systematic review and meta-analysis. Front Oncol 2020;10:171. 2. Bai KH, He SY, Shu LL, et al. Identification of cancer stem cell characteristics in liver hepatocellular carcinoma by WGCNA analysis of transcriptome stemness index. Cancer Med 2020. 3. Prasad S, Ramachandran S, Gupta N, et al. Cancer cells stemness: a doorstep to targeted therapy. Biochim Biophys Acta (BBA)-Mol Basis Dis 2020;1866:165424. 4. Shibata M and Hoque MO. Targeting cancer stem cells: a strategy for effective eradication of cancer. Cancers 2019; 11: 732. 5. Xiang Y, Yang T, Pang B-y, et al. The progress and prospects of putative biomarkers for liver cancer stem cells in hepatocellular carcinoma. Stem Cells Int 2016;2016. 6. Najafi M, Farhood B and Mortezaee K. Cancer stem cells (CSCs) in cancer progression and therapy. J Cell Physiol 2019;234:8381-8395. 7. Chang N-W, Dai H-J, Shih Y-Y, et al. Biomarker identification of hepatocellular carcinoma using a methodical literature mining strategy. Database 2017;2017. 8. Shahrjooihaghighi A, Frigui H, Zhang X, et al. An ensemble feature selection method for biomarker discovery. In: 2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT) 2017, pp. 416-421. IEEE. 9. Toro-Domínguez D, Martorell-Marugán J, López-Domínguez R, et al. ImaGEO: integrative gene expression meta-analysis from GEO database. Bioinformatics 2019;35:880-882. 10. Diegues I, Vinga S and Lopes MB. Identification of common gene signatures in microarray and RNA-sequencing data using network-based regularization. In: International Work-Conference on Bioinformatics and Biomedical Engineering 2020, pp.15–26. Springer. 11. Shukla AK, Singh P and Vardhan M. A two-stage gene selection method for biomarker discovery from microarray data for cancer classification. Chemomet Intell Lab Syst 2018;183:47-58. 12. Li Y, Umbach DM, Bingham A, et al. Putative biomarkers for predicting tumor sample purity based on gene expression data. BMC Genom 2019;20:1-12. 13. Lundberg SM and Lee S-I. Consistent feature attribution for tree ensembles. arXiv preprint arXiv:170606060 2017. 14. Guthrie NL, Carpenter J, Edwards KL, et al. Emergence of digital biomarkers to predict and modify treatment efficacy: machine learning study. BMJ Open 2019;9:e030710. 15. Caly H, Rabiei H, Coste-Mazeau P, et al. Pregnancy data enable identification of relevant biomarkers and a partial prognosis of autism at birth. bioRxiv 2020. 16. Hathaway QA, Roth SM, Pinti MV, et al. Machine-learning to stratify diabetic patients using novel cardiac biomarkers and integrative genomics. Cardiovasc Diabetol 2019;18:78. 17. Lundberg SM and Lee S-I. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 2017, pp.4765-4774. 18. Štrumbelj E and Kononenko I. Explaining prediction models and individual predictions with feature contributions. Knowl Inform Syst 2014;41:647-665. 19. Ribeiro MT, Singh S and Guestrin C. “ Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 2016, pp.1135-1144. 20. Parsa AB, Movahedi A, Taghipour H, et al. Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis. Accident Anal Prev 2020;136:105405. 21. Chen T and Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining 2016, pp.785-794. 22. Werpachowski R, György A and Szepesvári C. Detecting overfitting via adversarial examples. In: Advances in Neural Information Processing Systems 2019, pp.7858-7868. 23. Wang Y and Ni XS. A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization. arXiv preprint arXiv:190108433 2019. 24. developers x. Understand your dataset with XGBoost R-Project. https:// xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html (2020). 25. Molnar C. Interpretable machine learning. Lulu com 2019. 26. Wang W, Smits R, Hao H, et al. Wnt/β-catenin signaling in liver cancers. Cancers 2019;11:926. 27. Toro-Domínguez D, Villatoro-García JA, Martorell-Marugán J, et al. A survey of gene expression meta-analysis: Methods and applications. Brief Bioinform 2020. 28. Ghosheh N, Küppers-Munther B, Asplund A, et al. Human pluripotent stem cell-derived hepatocytes show higher transcriptional correlation with adult liver tissue than with fetal liver tissue. ACS Omega 2020;5:4816-4827. 29. Bai J, Zhang X, Kang X, et al. Screening of core genes and pathways in breast cancer development via comprehensive analysis of multi gene expression datasets. Oncol Lett 2019;18:5821-5830. 30. Xia L, Su X, Shen J, et al. ANLN functions as a key candidate gene in cervical cancer as determined by integrated bioinformatic analysis. Cancer Manage Res 2018;10:663. 31. Kuang Y, Wang Y, Zhai W, et al. Genome-wide analysis of methylation-driven genes and identification of an eight-gene panel for prognosis prediction in breast cancer. Front Genet 2020;11:301. 32. Guo T, Ma H and Zhou Y. Bioinformatics analysis of microarray data to identify the candidate biomarkers of lung adenocarcinoma. PeerJ 2019;7:e7313. 33. Zhang X, Li T, Wang J, et al. Identification of cancer-related long non-coding RNAs using XGBoost with high accuracy. Front Genet 2019;10:735. 34. Ding W, Chen G and Shi T. Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis. Epigenetics 2019;14:67-80. 35. Si M, Xiong Y, Du S, et al. Evaluation and calibration of a low-cost particle sensor in ambient conditions using machine-learning methods. Atmos Measure Tech 2020;13. 36. Rodriguez-Aguayo C, Bayraktar E, Ivan C, et al. PTGER3 induces ovary tumorigenesis and confers resistance to cisplatin therapy through up- regulation Ras-MAPK/Erk-ETS1-ELK1/CFTR1 axis. EBioMedicine 2019;40:290- 304. 37. Shin J, Kim TW, Kim H, et al. Aurkb/PP1-mediated resetting of Oct4 during the cell cycle determines the identity of embryonic stem cells. Elife 2016;5:e10877. 38. Yadav S, Kowolik CM, Lin M, et al. SMC1A is associated with radioresistance in prostate cancer and acts by regulating epithelial-mesenchymal transition and cancer stem-like properties. Mol Carcinogen 2019;58:113-125. 39. Bai C, Liu X, Xu J, et al. Expression profiles of stemness genes in gastrointestinal stromal tumor. Human Pathol 2018;76:76-84. 40. Lin S-H, Liu T, Ming X, et al. Regulatory role of hexosamine biosynthetic pathway on hepatic cancer stem cell marker CD133 under low glucose conditions. Scient Rep 2016;6:1-10. 41. Bao S, Wu Q, Li Z, et al. Targeting cancer stem cells through L1CAM suppresses glioma growth. Cancer Res 2008;68:6043-6048. This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License which allows users to read, copy, distribute and make derivative works for non-commercial purposes from the material, as long as the author of the original work is cited properly. https://doi.org/10.22317/jcms.v6i6.898