Nova Biotechnol Chim (2020) 19(1): 52-60 DOI 10.36547/nbc.v19i1.577  Corresponding author: jose.isagani.janairo@dlsu.edu.ph Nova Biotechnologica et Chimica A sequence-dependent classification algorithm for Crohn’s Disease – causing NOD2 protein mutations Jose Isagani B. Janairo1, and Marianne Linley L. Sy-Janairo2 1Biology Department, De La Salle University, 2401 Taft Avenue, Manila 0922, Philippines 2Institute of Digestive and Liver Diseases, St. Luke’s Medical Center – Global City, Rizal Drive, Taguig 1634, Philippines Article info Article history: Received: 10th December 2019 Accepted: 3rd March 2020 Keywords: Artificial neural networks Inflammatory bowel disease Machine learning Personalized medicine Abbreviations: ANN artificial neural network CD Crohn’s disease DCM disease-causing mutation NOD2 Nucleotide-binding oligomerization domain-containing protein 2 NPV negative predictive value PPV positive predictive value RF random forest SVM support vector machine Abstract Certain NOD2 protein mutations have been associated with the onset of the inflammatory bowel disease, Crohn’s Disease (CD). NOD2 is involved in the inflammatory response of the gut to the microbial community, wherein its functional impairment through mutations may lead to CD progression. Considering the significant role that NOD2 plays in CD pathogenesis, predicting whether a specific type of NOD2 mutation is the cause of CD can greatly aid the accuracy of the disease diagnosis. Hence, a novel sequence-based classification algorithm built on artificial neural network (ANN) is herein presented that can predict whether a specific NOD2 mutation can cause CD or not. The NOD2 mutant types and their association with CD were taken from literature, and the calculated sequence-order coupling numbers were used as the classification predictors. The formulated ANN classifier exhibited satisfactory predictive ability, with 82.4 % accuracy, 62.5 % sensitivity, 100 % specificity, 100 % positive predictive value, and 75 % negative predictive value. The presented ANN classifier provides a proof-of-concept that predicting the onset of CD from NOD2 protein variant is possible.  University of SS. Cyril and Methodius in Trnava Introduction Crohn’s Disease (CD) is characterized by chronic transmural inflammation of the gastrointestinal tract. Pathogenesis of CD is multifactorial, wherein one of the key drivers of the disease involves mutations in the NOD2 protein (Yamamoto and Ma 2009). NOD2 is encoded by the CARD15 gene in the human chromosome 16 (Strober and Watanabe 2011). This protein plays a critical role in microbe / pathogen sensing, wherein the leucine-rich region of NOD2 binds to the muramyl dipeptide (MDP) of the bacterial cell wall. Once activated by MDP, NOD2 initiates downstream signaling events relevant to the host immune response. Thus, NOD2 mutations may lead to the impaired regulation and response of the host to bacterial interactions, which increases the risk to unusual ileal inflammation (Sidiq et al. 2016). Various NOD2 mutations have been associated with CD susceptibility, wherein the missense mutations and frameshift mutation appear to be the most common type of mutations associated with CD progression (Cuthbert et al. 2002; Hampe et al. 2002; Economou et al. 2004). Other less frequent mutations are also linked with CD pathogenesis, while other NOD2 mutations do not lead to CD (Lesage et al. 2002). Clearly, possible connections between the properties of NOD2 protein mutants and CD susceptibility exist, but remain to be uncovered. Thus, this study aims to use machine learning to formulate a predictive model that can classify NOD2 mutations as disease-causing or non-disease-causing based on mailto:jose.isagani.janairo@dlsu.edu.ph Nova Biotechnol Chim (2020) 19(1): 52-60 53 protein numerical representations. Artificial Neural Network (ANN) is a powerful machine learning technique that can uncover non-obvious patterns or associations from datasets of various characteristics. ANN has been widely used in medical diagnosis, particularly in cancer classification and prediction (Khan et al. 2001), tuberculosis (Er et al. 2010), among others. Having the ability to predict whether a specific NOD2 mutation maybe associated with CD can greatly improve disease detection and therapy. In addition, this ability becomes even more valuable after considering that NOD2 mutation type influences the response of the patient towards a particular treatment (Niess et al. 2012). Early detection of the disease is one of the main challenges in inflammatory bowel disease, such as CD (Flamant and Roblin 2018). Thus, such predictive model can potentially help improve disease diagnosis, as well as lay the groundwork for the greater adoption of personalized medicine for the management of CD. Experimental Data Mining NOD2 protein disease-causing mutants (DCM), and non-disease-causing mutants (NDCM) were taken from (Lesage et al. 2002). The list contains 30 DCMs, and 13 non-DCMs, both which are mostly point mutations. Additional NOD2 mutant variants were taken from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), a database that shows the relationship between genetic variants and phenotypes (Landrum et al. 2014). The archive was searched using the search string [NOD2 AND (("Crohn Disease") AND (BENIGN OR PATHOGENIC))]. The search yielded 76 results, but after removing duplications, silent mutations, and inconclusive medical assignment for each variant, 16 NOD2 mutants were added to the dataset. The resulting 59 NOD2 protein variants (Table 1) were then numerically represented using the 30 sequence-order coupling number (SOCN) based from Schneider-Wrede descriptors (Schneider and Wrede 1994), calculated using ProtrWeb (Xiao et al. 2015). This web server for calculating protein descriptors require the protein sequence as the input. The canonical sequence of human NOD2 was taken from www.uniprot.org (Uniprot ID: Q9HC29), which also served as the basis of the mutant sequences. The functional impact of the mutations on the NOD2 protein was also assessed using the PROVEAN Protein tools as implemented in the PROVEAN web server version 1.1.3 (Choi et al. 2012) (http://provean.jcvi.org/seq_submit.php). The input in the PROVEAN web server is the wild type protein sequence, the position of the mutation, and the amino acid substitution. The utilized classification threshold was the default value of -2.5. From the provided information, the web server will then determine if the submitted mutation for analysis is either deleterious or neutral. The full dataset, which contains the 59 NOD variants and the corresponding 30 SOCNs is available in the supporting information. Statistical Analysis All statistical analyses were carried out using Tibco Statistica version 13.4.0.14. Statistical difference was probed between DCM and NDCM NOD2 protein variants through ANOVA using the 30 SOCN. The same descriptor set was also utilized to segregate the 43 NOD2 protein variants through a two-cluster solution using K-means clustering. After the exploratory data analysis, the dataset was then used to create various machine learning classification models. DCM / NDCM served as the categorical response variable, and the calculated sequence-order coupling numbers served as the continuous descriptors. For the artificial neural network (ANN) based - classification model, a feed-forward multilayer perceptron architecture was adopted, wherein sum of squares was the error function, the hidden unit activation function used was tanh, and identity was the output unit. Bootstrap subsampling was employed, wherein 10 subsamples were gathered in which 50 % was dedicated for training, 30 % for testing, and 20 % for validation. For support vector machine (SVM) classification, the radial basis function (RBF) kernel was utilized, leading to the automatic selection of the best Gamma and C parameters. The Gamma and C parameters are involved in the definition of the hyperplanes which leads to the separation and classification of the cases. 75 % https://www.ncbi.nlm.nih.gov/clinvar/ http://www.uniprot.org/ http://provean.jcvi.org/seq_submit.php Nova Biotechnol Chim (2020) 19(1): 52-60 54 Table 1. Association of NOD2 mutations with CD progression as reported in Lesage et al. (2002) and information from the ClinVar dat. DCM refers to disease-causing mutation, while NDCM means non-disease-causing mutation. NOD2 Variant Association with CD Reference PROVEAN Score Functional impact of mutation based on PROVEAN prediction R138Q DCM Lesage et al. 2002 -2.120 Neutral A140T DCM Lesage et al. 2002 -0.464 Neutral W157R DCM Lesage et al. 2002 1.142 Neutral T189M DCM Lesage et al. 2002 -0.494 Neutral R235C DCM Lesage et al. 2002 -2.779 Deleterious L248R DCM Lesage et al. 2002 -4.286 Deleterious P268S NDCM Lesage et al. 2002 -0.614 Neutral N289S DCM Lesage et al. 2002 -2.417 Neutral D291N DCM Lesage et al. 2002 -2.097 Neutral T294S NDCM Lesage et al. 2002 -3.033 Deleterious A301V NDCM Lesage et al. 2002 -3.631 Deleterious R311W DCM Lesage et al. 2002 -4.602 Deleterious L348V NDCM Lesage et al. 2002 -2.291 Neutral H352R NDCM Lesage et al. 2002 -4.166 Deleterious R373C DCM Lesage et al. 2002 -2.984 Deleterious N414S DCM Lesage et al. 2002 -1.796 Neutral S431L DCM Lesage et al. 2002 -2.000 Neutral A432V NDCM Lesage et al. 2002 -1.104 Neutral E441K DCM Lesage et al. 2002 0.090 Neutral 558delLG DCM Lesage et al. 2002 -11.499 Deleterious A612T DCM Lesage et al. 2002 -3.608 Deleterious A612V NDCM Lesage et al. 2002 -3.645 Deleterious R684W DCM Lesage et al. 2002 -3.092 Deleterious R702W DCM Lesage et al. 2002 -3.285 Deleterious R703C DCM Lesage et al. 2002 -3.313 Deleterious R713C DCM Lesage et al. 2002 -2.838 Deleterious A725G NDCM Lesage et al. 2002 -1.275 Neutral A755V NDCM Lesage et al. 2002 -3.070 Deleterious A758V NDCM Lesage et al. 2002 -0.953 Neutral E778K DCM Lesage et al. 2002 -2.579 Deleterious V793M DCM Lesage et al. 2002 -0.804 Neutral E843K DCM Lesage et al. 2002 0.482 Neutral N853S DCM Lesage et al. 2002 -4.637 Deleterious M863V DCM Lesage et al. 2002 -0.070 Neutral A885T DCM Lesage et al. 2002 -1.407 Neutral G908R DCM Lesage et al. 2002 -5.822 Deleterious A918D DCM Lesage et al. 2002 -4.932 Deleterious G924D DCM Lesage et al. 2002 0.149 Neutral V955I NDCM Lesage et al. 2002 -0.435 Neutral V972I NDCM Lesage et al. 2002 -0.633 Neutral G978E DCM Lesage et al. 2002 -1.646 Neutral 1007fs DCM Lesage et al. 2002 n.d. n.d. A292V NDCM ClinVar ID 319441 -2.391 Neutral A612S NDCM ClinVar ID 319452 -2.850 Deleterious A849V NDCM ClinVar ID 97855 -3.122 Deleterious D154N NDCM ClinVar ID 319426 -0.914 Neutral G1032S NDCM ClinVar ID 319475 0.737 Neutral L682F NDCM ClinVar ID 319457 -3.467 Deleterious Q902K NDCM ClinVar ID 319471 -1.023 Neutral R391H NDCM ClinVar ID 319442 -0.414 Neutral R471C NDCM ClinVar ID 319446 -2.087 Neutral R708H NDCM ClinVar ID 319459 -1.413 Neutral R716H NDCM ClinVar ID 319460 -2.092 Neutral R791Q NDCM ClinVar ID 97850 -0.251 Neutral T245M NDCM ClinVar ID 319434 -0.886 Neutral V92I NDCM ClinVar ID 319425 -0.275 Neutral V162I NDCM ClinVar ID 319427 0.486 Neutral V955I NDCM ClinVar ID 97869 -0.435 Neutral Nova Biotechnol Chim (2020) 19(1): 52-60 55 of the data set was used for training, while the remaining 25 % served as the test set, and the results were validated by applying a 10-fold cross validation. For random forest classification, the random test data proportion was set to 0.30, and 0.5 for the subsample proportion. The stopping parameters were set as follows: minimum n cases =5, maximum n cases = 10, minimum n child in node = 5, maximum n of nodes = 100. For the boosted trees regression, the learning rate was set to 0.1, with the following conditions: number of additive terms = 200, random test data proportion = 0.30, subsample proportion = 0.4. The stopping parameters were set as follows: minimum n of cases = 5, maximum n of levels = 10, minimum n in child node = 1, maximum n of nodes = 3. The predictive performances of the constructed models were evaluated using the diagnostic indices of accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). These indices are calculated as follows Eq. 1 – 5 (Trevethan 2017): (1) (2) (3) (4) (5) The values used for the calculation were taken from the results of the validation sets of the ANN-based classification models, and test sets for the SVM and tree-based classification models. Results Variations between DCM and NDCM NOD2 mutants based on the location, and nature of the mutations were observed. Most of the DCMs were located at the leucine-rich region (LRR) of NOD2, while NDCMs occurred mostly Fig. 1. Frequency of mutations based on the domain location within the NOD2 protein. Blue columns represent disease- causing mutation and gray columns represent non-disease- causing mutations. Designation of mutation type is based on Lesage et al. 2002, and ClinVar. at the non-domain part of the protein (Fig. 1). For the nature of the mutations, conservative mutations accounted for 71 % of the NDCMs and only 3 % for the DCMs. Most NDCMs involved mutations to an aliphatic residue, while the DCMs exhibited a scattered type of mutations (Fig. 2). Fig. 2. Frequency of mutations based on the nature of the mutant amino acid. Blue columns represent disease-causing mutation and gray columns represent non-disease-causing mutations. Designation of mutation type is based on Lesage et al. 2002, and ClinVar. Table 2. Confusion matrix for the comparison between the pathogenicity and functional impact of the NOD2 mutations. Deleterious Neutral DCM 14 15 NDCM 8 20 Nova Biotechnol Chim (2020) 19(1): 52-60 56 Table 3. Model architecture and performance of artificial neural network – based classification algorithms. Model Descriptor / Selection basis Network structure Algorithm Prediction accuracy Diagnostic performance A 1.30 Multilayer perceptron BFGS 6 Training = 70.6 % Sensitivity = 40 % Full model Input layer: 30 Testing = 52.9 % Specificity = 17 % Hidden layer: 22 Validation = 27.3 % PPV = 28.6 % Output layer: 2 NPV = 25 % B 1.10 Multilayer perceptron BFGS 4 Training = 47.1 % Sensitivity = 60 % Segmented Input layer: 10 Testing = 58.8 % Specificity = 67 % Hidden layer: 8 Validation = 63.6 % PPV = 60 % Output layer: 2 NPV = 66.7 % C 11.20 Multilayer perceptron BFGS 12 Training = 77.8 % Sensitivity = 50 % Segmented Input layer: 10 Testing = 52.9 % Specificity = 60 % Hidden layer: 8 Validation = 54.5 % PPV = 60 % Output layer: 2 NPV = 50 % D 21-30 Multilayer perceptron BFGS 16 Training = 80 % Sensitivity = 62.5 % Segmented Input layer: 10 Testing = 82.4 % Specificity = 100 % Hidden layer: 8 Validation = 45.5 % PPV = 100 % Output layer: 2 NPV = 75 % The functional impact of the mutations on NOD2 was also assessed using PROVEAN. As seen in Table 2, a deleterious mutation weakly associates with NOD2 pathogenicity since it only accounts for 48 % of disease-causing mutations. This is in contrast to mutations with neutral functional impact, which account for 71 % of non- disease-causing mutations. The difference between DCM and NDCM NOD2 mutants based on the 30 SOCN was initially probed Table 4. Model architecture and performance of SVM – based classification algorithms. Model Descriptor / Selection basis Prediction accuracy Diagnostic performance Gamma Capacity Support vectors E 1.30 Training = 52.3 % Sensitivity = 0 % 0.033 1.000 44 Full model Test = 46.7 % Specificity = 0 % (44 bounded) Overall = 50.5 % PPV = 0 % Validation = 40.9 % NPV = 0 % F 1.10 Training = 52.3 % Sensitivity = 0 % 0.100 1.000 44 Segmented Test = 46.7 % Specificity = 0 % (44 bounded) Overall = 50.5 % PPV = 0 % Validation = 40.9 % NPV = 0 % G 11.20 Training = 52.3 % Sensitivity = 0 % 0.100 1.000 44 Segmented Test = 46.7 % Specificity = 0 % (44 bounded) Overall = 50.5 % PPV = 0 % Validation = 40.9 % NPV = 0 % H Segmented Test = 46.7 % Specificity = 0 % (44 bounded) Overall = 50.5 % PPV = 0 % Validation = 40.9 % NPV = 0 % Nova Biotechnol Chim (2020) 19(1): 52-60 57 Table 5. Model architecture and performance of Random Forest – based classification algorithms. Model Descriptor / Selection basis Prediction accuracy Diagnostic performance I 1.30 Training = 85 % Sensitivity = 37.5 % Full model Test = 58 % Specificity = 73 % Overall = 76 % PPV = 50 % NPV = 61.5 % J 1.10 Training = 65.7 % Sensitivity = 50 % Segmented Test = 62.5 % Specificity = 71 % Overall = 64.4 % PPV = 55.6 % NPV = 66.7 % K 11.20 Training = 77.1 % Sensitivity = 30 % Segmented Test = 41.7 % Specificity = 50 % Overall = 62.7 % PPV = 30 % NPV = 50 % L 21-30 Training = 74.2 % Sensitivity = 30 % Segmented Test = 41.7 % Specificity = 50 % Overall = 62.7 % PPV = 30 % NPV = 50 % through one way-ANOVA and k-means clustering. As anticipated, ANOVA yielded a non-significant difference (p > 0.05) between the two classes of NOD2 mutants, owing to the small variations in their respective SOCNs. A similar case was observed for the two-cluster solution created through k-means clustering. The first cluster only contained the truncated NOD2, the 1007fs, while the second cluster contained the remaining 42 NOD variants. Evidently, these two statistical methods were unable to discriminate DCM from NDCM NOD2 variants based on the 30 SOCNs. Several binary classification models using the SOCNs as the predictors were then formulated utilizing various machine learning algorithms, which include ANN (Table 3), SVM (Table 4), Random Forest (Table 5), and Boosted Trees (Table 6). The ANN-based classification algorithm and Boosted Trees yielded better predictive models compared with those produced using SVM and Random Forest. Optimization of the selection of the descriptors was done to improve the performance of the classification models, and was conducted systematically through the systematic segmentation of the 30 SOCNs. Comparing Tables 3-6, models D, M, and P yielded the reliable classification models as demonstrated by its satisfactory prediction accuracy for the training, test, and validation sets. However, model D was deemed as the best predictive model since it required only 10 predictors, and it exhibited a good balance in accuracy and diagnostic performance. The diagnostic indices of sensitivity, specificity, PPV, and NPV were used to further probe the performance of the constructed models. Thus, the overall performance of model D is satisfactory, based on the algorithm accuracy, which indicates how often the classifier is correct. Moreover, the sensitivity and specificity of the model are also satisfactory. Sensitivity demonstrates the ability of the model for positive classification, while specificity for negative identification (Wong and Lim 2011). PPV and NPV demonstrate how many were indeed true positives and true negatives from the ratings given by the classifier (Trevethan 2017). Discussion Understanding the relationship between NOD2 mutant type, and CD susceptibility is of paramount importance, considering the pivotal role of NOD2 in CD pathogenesis. However, such connection remains to be fully understood. As what has been previously demonstrated from a study of 612 Nova Biotechnol Chim (2020) 19(1): 52-60 58 Table 6. Model architecture and performance of Boosted Trees – based classification algorithms. Model Descriptor / Selection basis Prediction accuracy Diagnostic performance M 1.30 Training = 90.2 % Sensitivity = 88.9 % Full model Test = 72.2% Specificity = 56 % Overall = 84.7 % PPV = 66.7 % NPV = 83.3 % N 1.10 Training = 73.2 % Sensitivity = 55.6 % Segmented Test = 66.7 % Specificity = 78 % Overall = 71.2 % PPV = 71.4 % NPV = 63.6 % O 11.20 Training = 95.1 % Sensitivity = 55.6 % Segmented Test = 55.6 % Specificity = 56 % Overall = 83.1 % PPV = 55.6 % NPV = 55.6 % P 21-30 Training = 82.9 % Sensitivity = 77.8 % Segmented Test = 77.8 % Specificity = 78 % Overall = 81.4 % PPV = 77.8 % NPV = 77.8 % European CD patients, NOD2 mutations can either be disease-causing, or non-disease-causing (Lesage et al. 2002). Several studies have identified that certain NOD2 mutations, such as the frameshift mutation 1007fs, as markers or indicators of CD susceptibility (Ogura et al. 2001). It is believed that the frameshift mutation leads to a truncated NOD2 in the leucine-rich region (LRR), thereby impairing its function. Consequently, most known CD DCMs occur at the LRR (Fig. 1). Apart from impairing the ligand-binding function of the protein, mutations at the LRR may also lead to a destabilized protein structure resulting to a loss of function for NOD2 (Maekawa et al. 2016). However, analysis on the functional impact of the mutations revealed that less than half of analyzed DCMs have deleterious functional impact. This suggests that mutational pathogenicity observes multiple possible mechanisms apart from impairment brought by the mutation. Aside from the location of the mutation, a striking difference between DCM and NDCM is the nature of the mutation. It was observed that the known DCM are mostly non-conservative (3 % conservative mutations), as opposed to NDCM that are 71 % conservative. While several diseases are also caused by conservative mutations, it is possible that the non-conservative mutations may have greater impact on the NOD2 protein. The non- conservative mutations may significantly alter the microenvironment in which the mutation is located owing to change in property of the amino acid substitution. On the other hand, the conservative non-disease-causing mutations may have little effect on the NOD2 loss-of-function since most conservative NDCM involve aliphatic to aliphatic substitutions (Fig. 2). These findings therefore present a viable opportunity to deploy statistical methodologies in order to uncover associations between NOD2 mutant type and CD progression. However, creating predictive models of protein mutants based on the primary structure is challenging due to the minute variations introduced by the point mutations. Sequence – order coupling numbers are therefore ideal descriptors to be used, since this numerical representation of proteins reflects the sequence-order effect. For example, the NOD2 mutants A432V and A612V have identical amino acid composition, but the substitution is located at different positions. This positional difference is adeptly captured by this class of descriptors since these two mutants have different values for the 30 SOCNs. In addition, SOCN can fully describe the observed differences between DCM Nova Biotechnol Chim (2020) 19(1): 52-60 59 and NDCM with respect to the location and nature of the mutation. This frequently used protein descriptor class is based on distance matrices derived from amino acids, their sequence-order, and physicochemical properties (Schneider and Wrede 1994; Chou 2000). The 30 different SOCN represents the rank of the SOCN. For example, the first SOCN describes the coupling of adjacent residues, the second SOCN describes the coupling of between all second most contiguous residues, so forth. Thus, this protein descriptor class can potentially reveal hidden association between NOD2 mutation effect and CD susceptibility. The presented ANN classifier provides a proof-of- concept that predicting the onset of CD from NOD2 protein variant is possible. The presented classification model (model D) is reliable after considering that the other models exhibited overfitting as characterized by the high training set accuracy but extremely low test accuracy. In addition, the other models were unable to classify NDCM NOD2 variants, as demonstrated by the low scores obtained for specificity and NPV. Out of the 16 classification models created, only model D demonstrated a satisfactory accuracy for the training and test sets, in addition to respectable scores for the diagnostic indices. Statistical endeavors that aimed to enhance CD detection involved the formulation of machine learning classification algorithms based on endoscopic data (Mossotto et al. 2017), multivariate analysis of magnetic resonance spectroscopic data of gastrointestinal tissues (Bezabeh et al. 2001), neuro-fuzzy classifier based on multitudes of clinical data (Ahmed et al. 2017), and serological, genetic, and inflammatory markers-dependent random forest classifier (Plevy et al. 2013). Recently, an SVM classification model that can categorize individuals into healthy or CD patients based on exome variations was reported (Wang et al. 2019). The SVM classifier used over 10,000 genes for the classification, including the NOD2 gene. In the present study, the focus of the classifier is to categorize whether mutations are disease-causing or not, based on variations in the NOD2 protein. Thus, the present study has therefore demonstrated a new way that is solely dependent on the sequence of the NOD2 protein which can potentially enhance detection and diagnostics. While the created algorithm is presently constrained by availability of data for training and validation, it is expected that the model will improve its predictive ability as more mutation types are incorporated in the system. It should also be taken into consideration the relationship between population predisposition, NOD2 mutations, and CD progression. For example, NOD2 mutations were absent in Japanese CD patients (Yamazaki et al. 2002). Thus, the present utility of the algorithm may be restricted to the population group from which the data was taken. Conclusion Differences between NOD2 Crohn’s disease- causing mutations and non-disease-causing mutations were observed. The variations were related to the location and nature of the mutations. Based from these, a comprehensive statistical analyses were conducted which demonstrated the possibility of predicting the association of NOD2 mutations with CD susceptibility. The ANN model exhibited satisfactory capability to predict whether a specific NOD2 mutation is associated with the onset of CD, based on sequence-order coupling numbers. The presented classifier sets itself apart from previously reported algorithms by using the primary structure of NOD2 as the predictor. The formulated predictive model is potentially useful for the enhanced diagnosis and understanding of Crohn’s Disease. Conflict of Interest The authors declare that they have no conflict of interest. References Ahmed SS, Dey N, Ashour AS, Sifaki-Pistolla D, Balas- Timar D, Balas VE, Tavares JMRS (2017) Effect of fuzzy partitioning in Crohn’s disease classification: a neuro- fuzzy-based approach. Med. Biol. Eng. Comput. 55: 101- 115. Bezabeh T, Somorjai RL, Smith IC, Nikulin AE, Dolenko B, Bernstein CN (2001) The use of 1H magnetic resonance spectroscopy in inflammatory bowel diseases: distinguishing ulcerative colitis from Crohn’s disease. Am. J. Gastroenterol. 96: 442-448. Choi Y, Sims GE, Murphy S, Miller JR, Chan AP (2012) Predicting the functional effect of amino acid substitutions Nova Biotechnol Chim (2020) 19(1): 52-60 60 and indels. PLoS One 7: e46688. Chou KC (2000) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 278: 477-483. Cuthbert AP, Fisher SA, Mirza MM, King K, Hampe J, Croucher PJP, Mascheretti S, Sanderson J, Forbes A, Mansfield J, Schreiber S, Lewis CM, Mathew CG (2002) The contribution of NOD2 gene mutations to the risk and site of disease in inflammatory bowel disease. Gastroenterology 122: 867-874. Economou M, Trikalinos TA, Loizou KT, Tsianos EV, Ioannidis JPA (2004) Differential effects of NOD2 variants on Crohn’s disease risk and phenotype in diverse populations: A metaanalysis. Am. J. Gastroenterol. 99: 2393-2404. Er O, Temurtas F, Tanrıkulu AÇ (2010) Tuberculosis disease diagnosis using Artificial Neural Networks. J. Med. Syst. 34: 299-302. Flamant M, Roblin X (2018) Inflammatory bowel disease: towards a personalized medicine. Therap. Adv. Gastroenterol. 11: 1-15. Hampe J, Grebe J, Nikolaus S, Solberg C, Croucher PJP, Mascheretti S, Jahnsen J, Moum B, Klump B, Krawczak M, Mirza MM, Foelsch UR, Vatn M, Schreiber S (2002) Association of NOD2 (CARD 15) genotype with clinical course of Crohn’s disease: A cohort study. Lancet 359: 1661-1665. Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7: 673-679. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR (2014) ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42: D980-5. Lesage S, Zouali H, Cézard J-P, Colombel J-F, Belaiche J, Almer S, Tysk C, O'Morain C, Gassull M, Binder V, Finkel Y, Modigliani R, Gower-Rousseau C, Macry J, Merlin F, Chamaillard M, Jannot A-S, Thomas G, Hugot J-P (2002) CARD15/NOD2 mutational analysis and genotype-phenotype correlation in 612 patients with inflammatory bowel disease. Am. J. Hum. Genet. 70: 845- 857. Maekawa S, Ohto U, Shibata T, Miyake K, Shimizu T (2016) Crystal structure of NOD2 and its implications in human disease. Nat. Commun. 7: 11813. Mossotto E, Ashton JJ, Coelho T, Beattie RM, MacArthur BD, Ennis S (2017) Classification of paediatric inflammatory bowel disease using machine learning. Sci. Rep. 7: 2427. Niess JH, Klaus J, Stephani J, Pfluger C, Degenkolb N, Spaniol U, Mayer B, Lahr G, von Boyen GBT (2012) NOD2 polymorphism predicts response to treatment in Crohn’s disease-first steps to a personalized therapy. Dig. Dis. Sci. 57: 879-886. Ogura Y, Bonen DK, Inohara N, Nicolae DL, Chen FF, Ramos R, Britton H, Moran T, Karaliuskas R, Duerr RH, Achkar J-P, Brant SR, Bayless TM, Kirschner BS, Hanauer SB, Nunez G, Cho JH (2001) A frameshift mutation in NOD2 associated with susceptibility to Crohn’s disease. Nature 411: 603-606. Plevy S, Silverberg MS, Lockton S, Stockfisch T, Croner L, Stachelski J, Brown M, Triggs C, Chuang E, Princen F, Singh S (2013) Combined serological, genetic, and inflammatory markers differentiate Non-IBD, Crohn’s disease, and ulcerative colitis patients. Inflamm. Bowel Dis. 19: 1139-1148. Schneider G, Wrede P (1994) The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys. J. 66: 335-344. Sidiq T, Yoshihama S, Downs I, Kobayashi KS (2016) Nod2: A critical regulator of ileal microbiota and Crohn’s disease. Front. Immunol. 7: 367. Strober W, Watanabe T (2011) NOD2, an intracellular innate immune sensor involved in host defense and Crohn’s disease. Mucosal Immunol. 4: 484-495. Trevethan R (2017) Sensitivity, specificity, and predictive values: foundations, pliabilities, and pitfalls in research and practice. Front. Public Heal. 5: 307. Wang Y, Miller M, Astrakhan Y, Petersen B-S, Schreiber S, Franke A, Bromberg Y (2019) Identifying Crohn’s disease signal from variome analysis. Genome Med. 11: 59. Wong HB, Lim GH (2011) Measures of diagnostic accuracy: Sensitivity, specificity, PPV and NPV. Proc. Singapore Healthc. 20: 316-318. Xiao N, Cao DS, Zhu MF, Xu QS (2015) Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31: 1857-1859. Yamamoto S, Ma X (2009) Role of Nod2 in the development of Crohn’s disease. Microbes Infect. 11: 912-918. Yamazaki K, Takazoe M, Tanaka T, Kazumori T, Nakamura Y (2002) Absence of mutation in the NOD2/CARD15 gene among 483 Japanese patients with Crohn’s disease. J. Hum. Genet. 47: 469-472.