Proceedings of Engineering and Technology Innovation, vol. 15, 2020, pp. 08-15

Analysis of Association between Caesarean Delivery and Gestational Diabetes Mellitus Using Machine Learning

Nisana Siddegowda Prema 1,*, Mullur Puttabuddi Pushpalatha 2

1 Department of Information Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, India
2 Department of Computer Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysuru, India

Received 16 September 2019; received in revised form 23 December 2019; accepted 09 January 2020
DOI: https://doi.org/10.46604/peti.2020.4740

Abstract

The study aims to analyze the association between gestational diabetes mellitus (GDM) and other risk factors of cesarean delivery using machine learning (ML). The dataset used for the analysis is from the Pregnancy Risk Assessment Monitoring System (PRAMS), considered in two scenarios: the complete data, and the data of only the women who developed GDM. Further, the data is divided into two groups, Data-I and Data-II, containing the details of multiparous and primiparous women, respectively. Correlation analysis and the major classification algorithms are applied to the data. It was found that the top risk factors for first-time cesarean delivery are the age, height, weight, and race of the woman, and the presence of hypertension and gestational diabetes mellitus. The major risk factor for repeated cesarean delivery is a previous cesarean delivery. The presence of GDM is also one of the risk factors for cesarean delivery.

Keywords: C-section, cesarean delivery, GDM, machine learning

1. Introduction

A cesarean (C-section) delivery is a surgical procedure in which a fetus is delivered through an incision in the mother's abdomen and uterus. Over the past 30 years, there has been an increase in C-section deliveries. The CDC reported that the highest-ever national cesarean birth rate was 29.1% in November 2005 [1].
The factors associated with deciding the type of delivery are socio-demographic factors, medical and obstetric history, antenatal history, etc. It is very important to know of the presence of any of the above-mentioned risk factors at the earliest, which helps the doctors and the patients make the right decision at the right time.

1.1. Complications associated with C-section delivery

The potential complications for mothers are infection and bleeding, which can lead to anemia. Further complications include pain, negative reactions to medication and anesthesia, and complications in subsequent pregnancies. The rate of maternal mortality is higher in C-section deliveries than in vaginal deliveries. Complications for the baby are low birth weight and respiratory problems, and there is a possibility of injury to the baby during the incision.

1.2. Machine learning in healthcare

Machine learning (ML) is a technique of training a machine to recognize patterns using data and an algorithm. The prediction accuracy of the machine increases with the data and the complexity of the rules fed into the machine. Healthcare, which generates massive amounts of data, is one of the leading application areas of ML. ML assists healthcare professionals in analyzing the data and helps them make the right decisions [2].

* Corresponding author. E-mail address: premans@vvce.ac.in

The purpose of this work is to apply ML and statistical techniques to analyze the association between cesarean delivery, GDM, and other risk factors, as well as to apply prediction models for the prediction of C-section delivery for first-time and repeated deliveries.

2. Literature Survey

Sodsee [3] proposed a cephalopelvic disproportion (CPD)-based nearest neighbors (NN) model for the prediction of cesarean delivery. In the proposed NN model, two determined threshold distances were used to identify the nearest and farthest neighbors.
The highest performance was achieved by the proposed model [3]. Pereira et al. [4] used obstetric and pregnancy factors to predict the type of delivery with data mining models, applying different algorithms in different scenarios; they achieved a highest accuracy of 84%. Soleimanian et al. [5] used the C4.5 decision tree for the prediction of cesarean delivery; the attributes considered were age, blood pressure, and heart problems, and the classifier obtained an accuracy of 86%. Sana et al. [6] used a decision tree and an artificial neural network to classify births as normal or caesarian, obtaining classification accuracies of 80% and 82%, respectively. The authors also used association rules to identify caesarian birth patterns, showing that high blood pressure, lack of education, and pulse rate are associated with caesarian birth. Kamat et al. [7] proposed a prediction model using Naive Bayes and ID3 classifiers to determine the type of delivery based on the ultrasonography, urine, and blood reports of pregnant women. ML techniques such as decision trees and Naïve Bayes were applied for the prediction of pregnancy-related risk factors [8] and to predict normal or abnormal stages of pregnancy [9]. A classification model was proposed that allows an estimation of the interval for the value of the Apgar score depending on mother and newborn data [10]. Abbas et al. [11] proposed a decision support system using ML techniques to assist physicians in making correct decisions; they used neural network, kNN, Naïve Bayes, and SVM classifiers for the analysis of the risk factors of C-section delivery. Hassan et al. [12] proposed a feature selection algorithm coupled with automated classification using ML techniques to analyze and predict IVF pregnancy with greater accuracy.
They used five attributes to assess the prediction ability of IVF pregnancy and five different ML models, namely MLP, SVM, C4.5, CART, and random forest. Prema and Pushpalatha [13] used SVM and logistic regression for the diagnosis of preterm birth in pregnant women having either diabetes mellitus or GDM; the highest accuracy obtained was 86%. They also conducted a review of the applications of data mining techniques in preterm birth prediction [14], finding that the most commonly used data mining technique is classification, typically with support vector machines (SVM) and logistic regression, and that the most commonly considered risk factors are socio-demographic, behavioral (lifestyle), and pregnancy-history factors. Further, a deep neural network was proposed for the prediction of diabetes, evaluated with five-fold cross-validation, and achieved good results [15]. A prediction model using the support vector machine and the nearest neighbor classifier was proposed for breast cancer; the highest accuracy of 99.68% was obtained with the support vector machine in the training phase [16].

3. Materials and Methods

3.1. Dataset

The data used for this work is from the Pregnancy Risk Assessment Monitoring System (PRAMS), collected by the Centers for Disease Control and Prevention (CDC). The PRAMS data are helpful in identifying groups of women and infants with high-risk health issues. The data set contains about 41,000 instances with more than 350 attributes related to maternal and child healthcare. The data set used for this work was reduced to 13,550 instances after removing instances with missing values and outliers. Outliers are detected using the interquartile range. Of the roughly 350 PRAMS attributes, a few are redundant and many are irrelevant to the present study; hence only the required attributes are considered.
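The interquartile-range rule used here for outlier removal can be sketched as follows (a minimal illustration; the multiplier k = 1.5 and the sample values are assumptions for demonstration, not taken from PRAMS):

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Illustrative maternal weights (kg); the 150 is an implausible entry.
weights = np.array([55, 60, 62, 64, 66, 68, 70, 72, 75, 78, 150])
mask = iqr_outlier_mask(weights)
clean = weights[~mask]   # instances kept after outlier removal
```

In the study, rows flagged by this rule (together with rows containing missing values) are dropped before analysis.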
About 20 attributes (as risk factors) are considered, which are related to the following factors:

- Maternal age
- Height and weight of the mother
- Obesity
- Diabetes mellitus (DM and GDM)
- Hypertension
- Lifestyle
- Obstetric history
- Educational level
- Maternal race
- Urban or rural residence

Most of the attributes are categorical; a few are ordinal. The analysis is done by considering the complete data and the details of only the women who developed GDM, respectively. Further, the data is divided into Data-I and Data-II groups containing the details of multiparous and primiparous women, respectively. In the data set containing the details of women having GDM, about 9% and 6% of cases are C-section deliveries in Data-I and Data-II, respectively, as shown in Fig. 1.

Fig. 1 C-section delivery distribution in GDM and non-GDM data

3.2. Association between the features

Feature association refers to the statistical relationship(s) between the feature variables. Nominal variables are measured at the nominal level and have no inherent ranking. Many statistical measures are available to measure the dependency between variables; for our data, the following measures are used:

(1) The correlation coefficient: the strength of the relationship between two variables is measured using the correlation coefficient. Its value ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation); the value 0 indicates there is no relationship.

(2) Cramer's V: when the chi-square test is significant, Cramer's V is used to calculate the strength of association. The V value lies between 0 and 1, indicating little to strong association between the variables, respectively.

(3) Feature selection: the following feature selection techniques are applied to the data to figure out the most important features.
- CfsSubsetEval (correlation-based feature-subset selection): evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low inter-correlation are preferred.
- GainRatioAttributeEval: evaluates the worth of an attribute by measuring the gain ratio with respect to the class:
  GainR(Class, Attribute) = (H(Class) - H(Class|Attribute)) / H(Attribute)
- InfoGainAttributeEval: evaluates the worth of an attribute by measuring the information gain with respect to the class:
  InfoGain(Class, Attribute) = H(Class) - H(Class|Attribute)
- OneRAttributeEval: evaluates the worth of an attribute by using the OneR classifier.
- SymmetricalUncertAttributeEval: evaluates the worth of an attribute by measuring the symmetrical uncertainty with respect to the class:
  SymmU(Class, Attribute) = 2 * (H(Class) - H(Class|Attribute)) / (H(Class) + H(Attribute))

3.3. Classification

Classification is a supervised technique of grouping objects into different labeled classes. The classifiers used for this work are random forest, Naive Bayes, logistic regression, and the nearest neighbor classifier (kNN).

(1) Naive Bayes: a classifier based on Bayes' theorem with independence assumptions between the features, which uses the maximum likelihood method.

(2) Random forest: an ensemble approach that can also be thought of as a form of nearest neighbor predictor. Each decision tree is constructed using a random subset of the training data.
Ensembles are a divide-and-conquer approach used to improve performance.

(3) Logistic regression: the regression analysis to use when the dependent variable is binary. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.

(4) kNN: a non-parametric, lazy learning algorithm that can be used for both classification and regression. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. If k = 1, the object is simply assigned to the class of its single nearest neighbor.

4. Results and Discussion

4.1. The correlation coefficient

The correlations between all the individual features are calculated for Data-I and Data-II, as shown in Fig. 2 and Fig. 3, respectively. It can be observed from the matrices that if a mother is diabetic, then there is a strong chance of developing GDM in the current pregnancy. Features like hypertension, age, weight, and GDM are associated with C-section delivery. For repeated C-section, the previous C-section feature has a strong correlation.

Fig. 2 Correlation matrix of Data-I
Fig. 3 Correlation matrix of Data-II

Cramer's V test is done on Data-I and Data-II separately, and the values are tabulated in Table 1. From the V values, it can be observed that there is a strong association between previous C-section delivery and repeated C-section delivery in Data-I. There is a moderate association between C-section delivery and the age, weight, and height of the woman in both data sets. Maternal chronic conditions like diabetes and hypertension are also associated with C-section delivery.
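The Cramer's V values reported in Table 1 can be computed from a contingency table via the chi-square statistic, V = sqrt(chi2 / (n * (min(r, k) - 1))). A minimal sketch using `scipy.stats.chi2_contingency` (the binary example arrays are illustrative, not the PRAMS values):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramer's V between two categorical arrays, plus the chi-square p-value."""
    # Build the contingency table of observed co-occurrence counts.
    cats_x, xi = np.unique(x, return_inverse=True)
    cats_y, yi = np.unique(y, return_inverse=True)
    table = np.zeros((len(cats_x), len(cats_y)))
    for i, j in zip(xi, yi):
        table[i, j] += 1
    chi2, p, _, _ = chi2_contingency(table)
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1))), p

# Illustrative binary features: previous C-section vs. repeated C-section,
# constructed to be strongly associated (echoing the Data-I finding).
prev = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0] * 20)
repeat = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0] * 20)
v, p = cramers_v(prev, repeat)   # v near 1 means strong association
```

Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, which slightly shrinks V.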
Table 1 Top features after Cramer's V test

Data set used                  Feature             Cramer's V
Data-I (Repeated C-section)    Previous C-section  0.9038
                               GDM                 0.0469
                               Hypertension        0.0602
                               Mother's weight     0.1695
                               Mother's height     0.0900
Data-II (First C-section)      Mother's age        0.0581
                               GDM                 0.0449
                               Mother's weight     0.1454
                               Mother's height     0.0720
                               Hypertension        0.0799

4.2. Feature selection

The top-ranked attributes from the above-mentioned feature selection techniques are shown in Table 2. It can be observed that, for Data-II, the top-ranked features are hypertension, body mass index (BMI), and the height, age, and weight of the woman. For Data-I, previous C-section is the top-ranked attribute across all the techniques; along with it, the other top-ranked attributes are the height, weight, BMI, and age of the woman.

Table 2 The ranked features using various feature selection techniques

Sl. no  Method                          Selected features: Data-II                            Selected features: Data-I
1       CfsSubsetEval (best fit)        Hypertension; Mother's height; Previous live birth    Previous C-section; Mother's BMI
2       CorrelationAttributeEval        Hypertension; Mother's weight; GDM                    Previous C-section; Mother's weight; Mother's age
3       GainRatioAttributeEval          Hypertension; GDM; Diabetes                           Previous C-section; Mother's weight; Hypertension
4       InfoGainAttributeEval           Hypertension; Mother's BMI; Mother's age              Previous C-section; Mother's BMI; Mother's age
5       OneRAttributeEval               Mother's race; Maternal education; Mother's height    Previous C-section; Mother's race; Smoking
6       SymmetricalUncertAttributeEval  Hypertension; Mother's BMI; Mother's age              Previous C-section; Mother's BMI; Mother's weight
7       ReliefFAttributeEval            Mother's race; Mother's BMI; Maternal education       Previous C-section; Mother's BMI; Maternal education

4.3.
Classification

The above-mentioned classifiers were applied to Data-I and Data-II in both scenarios, and the accuracies obtained by the algorithms are shown in Table 3.

Table 3 Accuracy obtained for the classifiers

                                                  Accuracy in %
Data set used                Classifier           All features  Top-7 features  GDM data
Data-I (Repeated C-section)  kNN (k=15)           96            96              95
                             Naive Bayes          96.5          96.5            96.9
                             Logistic regression  96.5          96.5            96.8
                             Random forest        95            96.5            95
Data-II (First C-section)    kNN (k=15)           88.9          88.8            84.8
                             Naive Bayes          88.6          88.9            84.6
                             Logistic regression  88.9          88.9            84.8
                             Random forest        88.9          88.9            84.9

The classifiers were configured as follows. The random forest used about ten trees, with the maximum number of instances for splitting taken as five. For the k-nearest neighbor classifier, the value of k chosen is 15, and the distance measure used is Euclidean; different k values were experimented with, and k = 15 achieved the highest accuracy. The classifiers were evaluated using 10-fold cross-validation, in which the original sample is randomly partitioned into 10 equal-sized subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining subsamples are used as training data. The cross-validation process is then repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. As Table 3 shows, the accuracy of the classifiers is higher for Data-I than for Data-II in both scenarios. Logistic regression and Naïve Bayes give the highest accuracy for both data sets. For the first C-section (Data-II), there is a decrease in the accuracy of the classifiers on the dataset containing only GDM women's details when compared with the accuracy on the complete data.
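The evaluation protocol described above can be sketched with scikit-learn (a minimal illustration on synthetic stand-in data, since the PRAMS extract is not reproduced here; the class-imbalance weights and the mapping of the splitting parameter to `min_samples_split` are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the ~20 risk-factor attributes; the 90/10 class
# split mimics the imbalance toward non-C-section cases noted in the paper.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

classifiers = {
    "kNN (k=15, Euclidean)": KNeighborsClassifier(n_neighbors=15,
                                                  metric="euclidean"),
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest (10 trees)": RandomForestClassifier(
        n_estimators=10, min_samples_split=5, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    # 10-fold cross-validation: each fold serves exactly once as validation.
    scores[name] = cross_val_score(clf, X, y, cv=10).mean()
```

With an imbalanced split like this, plain accuracy is dominated by the majority class, which is why the paper also reports ROC analysis.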
The same classifiers were applied to the top features selected by the above-mentioned feature selection methods, but there is not much improvement in the accuracy of the classifiers. The ROC analysis curves of both data sets are shown in Fig. 4 and Fig. 5.

Fig. 4 ROC for Data-I
Fig. 5 ROC for Data-II

5. Conclusions

A machine learning-based decision support system has been presented, and the top risk factors for cesarean delivery have been analyzed for both first-time and repeated pregnancy cases. The major risk factors for first-time cesarean delivery are the age, height, weight, and race of the mother, and the presence of hypertension. For repeated cesarean delivery, a previous cesarean delivery is the major risk factor. There is an association between GDM and cesarean delivery in the data set considered for the study. The classification model achieved the highest accuracy of 96% for multiparous and 89% for primiparous mothers in predicting C-section delivery. In the future, the analysis will be extended with paternal risk factors, including ethnicity, financial status, and stress-related factors. The data set is imbalanced, with far more negative than positive instances; therefore, there is considerable scope to improve classification accuracy by handling the imbalanced dataset problem.

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] "Reasons for a Cesarean Birth," https://americanpregnancy.org/labor-and-birth/reasons-for-a-cesarean/, October 13, 2019.
[2] P. Chowriappa, S. Dua, and Y. Todorov, "Introduction to machine learning in healthcare informatics," Machine Learning in Healthcare Informatics, Springer, 2014, pp. 1-23.
[3] S. Sodsee, "Predicting caesarean section by applying nearest neighbor analysis," Procedia Computer Science, vol. 31, pp. 5-14, December 2014.
[4] S. Pereira, F. Portela, M. F. Santos, J. Machado, and A.
Abelha, "Predicting type of delivery by identification of obstetric risk factors through data mining," Procedia Computer Science, vol. 64, pp. 601-609, December 2015.
[5] F. Soleimanian, P. Mohammadi, and P. Hakimi, "Application of decision tree algorithm for data mining in healthcare operations: a case study," International Journal of Computer Applications, vol. 52, no. 6, pp. 21-26, August 2012.
[6] A. Sana, S. Razzaq, and J. Ferzund, "Automated diagnosis and cause analysis of cesarean section using machine learning techniques," International Journal of Machine Learning and Computing, vol. 2, no. 5, pp. 677-680, October 2012.
[7] A. Kamat, V. Oswal, and M. Datar, "Implementation of classification algorithms to predict mode of delivery," International Journal of Computer Science and Information Technologies, vol. 6, no. 5, pp. 4531-4534, 2015.
[8] B. Lakshmi, T. Indumathi, and N. Ravi, "A comparative study of classification algorithms for predicting gestational risks in pregnant women," International Conference on Computers, Communications, and Systems, November 2015, pp. 42-46.
[9] R. Sawant and N. Gaikwad, "Hybrid prediction method for pregnancy data set," 2015 1st International Conference on Next Generation Computing Technologies, September 2015, pp. 918-920.
[10] R. Robu and Ş. Holban, "The analysis and classification of birth data," Acta Polytechnica Hungarica, vol. 12, no. 4, pp. 77-96, July 2015.
[11] S. A. Abbas, R. Riaz, S. Z. H. Kazmi, S. S. Rizvi, and S. J. Kwon, "Cause analysis of caesarian sections and application of machine learning methods for classification of birth data," IEEE Access, November 2018, pp. 67555-67561.
[12] M. R. Hassan, S. Al-Insaif, M. I. Hossain, and J. Kamruzzaman, "A machine learning approach for prediction of pregnancy outcome following IVF treatment," Neural Computing and Applications, September 2018, pp. 1-15.
[13] N. S. Prema and M. P.
Pushpalatha, "Machine learning approach for preterm birth prediction based on maternal chronic conditions," Emerging Research in Electronics, Computer Science and Technology, Springer, 2019.
[14] N. S. Prema and M. P. Pushpalatha, "Prediction of preterm birth using data mining: a survey," IIOAB Journal, vol. 10, no. 2, pp. 13-17, January 2019.
[15] S. I. Ayon and M. Islam, "Diabetes prediction: a deep learning approach," International Journal of Information Engineering and Electronic Business, vol. 2, pp. 21-27, March 2019.
[16] M. M. Islam, H. Iqbal, M. R. Haque, and M. K. Hasan, "Prediction of breast cancer using support vector machine and K-nearest neighbors," 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), December 2017, pp. 226-229.

Copyright © by the authors. Licensee TAETI, Taiwan. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC) license (https://creativecommons.org/licenses/by-nc/4.0/).