A Hybrid Data Mining Method for Customer Churn Prediction

Elham Jamalian, Faculty of Technology and Information, Department of Information Technology, University of Qom, Qom, Iran
Rahim Foukerdi, Faculty of Management, Department of Industrial Management, University of Qom, Qom, Iran

Abstract—Due to increasing competition and business saturation, the expenses of attracting new customers are much higher than those of retaining existing ones, so customer retention is one of the leading factors in companies' marketing. Customer retention requires churn management, and effective management requires an accurate and effective model for churn prediction. A variety of techniques and methodologies have been used for churn prediction, such as logistic regression, neural networks, genetic algorithms and decision trees. In this article, a hybrid method that predicts customer churn more accurately is presented, using data fusion and feature extraction techniques. After data preparation and feature selection, two algorithms, LOLIMOT and C5.0, were trained on feature subsets of different sizes and applied to test data. The outputs of the individual classifiers were then combined with weighted voting. The results of applying this method to real data of a telecommunication company prove its effectiveness.

Keywords-customer churn; data mining; hybrid method; LOLIMOT; C5.0; weighted voting

I. INTRODUCTION

The way companies communicate with their customers has become a key point of competition in marketing. Concepts such as customer acquisition, retention and satisfaction have been internalized in companies. According to studies, the expenses of attracting a new customer are 5-10 times higher than those of retaining an old one [1]. On the other hand, customer retention has its own expenses, and it is not possible to spend them on all customers, because not all customers are worth retaining. Thus, churn management systems look for the customers that are about to leave the company. A key point showing the importance of churn management systems is that, according to studies, a 5% increase in customer retention can increase company profits by 25% to 95% [2]. Many studies have been carried out on churn management, and different methods have been used for recognizing the reasons of churn and for churn prediction and prevention. Among these, data analysis methods have been widely used. Customer churn is a situation in which customers decide to leave the company. The meaning of churn differs between areas [3-5]. Most definitions consider the behaviors related to a product and a threshold defined by business rules: when a customer's transactions fall below the threshold, churn occurs [6]. Authors in [7] defined customer churn in the banking field as customers who close their accounts, while authors in [6] define churners as those who hold less than 2500 Euro (savings and all other kinds of assets) in the bank. Technically, according to (1), churn is the customer loss in a defined time period:

$\text{Monthly churn} = \frac{C_0 + A_1 - C_1}{C_0}$ (1)

Here $C_0$ is the number of customers at the beginning of the period, $C_1$ is the number of customers at the end of the period, and $A_1$ is the number of new customers acquired during the period.
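As a concrete illustration of (1), the following minimal sketch computes the monthly churn rate; the customer counts are hypothetical and not taken from the study's dataset.

```python
def monthly_churn(c0: int, a1: int, c1: int) -> float:
    """Monthly churn per (1): customers lost during the period
    (start count + new sign-ups - end count) divided by the start count."""
    return (c0 + a1 - c1) / c0

# Hypothetical example: 10,000 customers at the start of the month,
# 500 new sign-ups, 10,200 at the end -> 300 customers churned.
print(monthly_churn(c0=10_000, a1=500, c1=10_200))  # 0.03, i.e. 3%
```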
Generally, customer churn prediction is a binary classification task whose outcome indicates the probability of customer churn [10, 11]. However, the special nature of the churn prediction problem imposes some limitations on the analysis algorithms. For example, the data are imbalanced: lost customers form only a small part of the data. In addition, extended learning programs face various kinds of noise. In any case, churn prediction needs to classify customers according to their probability of churn [12]. Different approaches address this problem, each with its own defects. For example, although algorithms based on decision trees are used for classification, some leaves may end up with the same class probability, and the method is noise sensitive. Neural networks search for suboptimal solutions, and when the number of model parameters increases, overfitting occurs. Although genetic algorithms produce accurate prediction models, they cannot express the probability of occurrence. Finally, methods such as support vector machines usually do not lead to the best results. Table I summarizes the above mentioned methods and their characteristics; several of the reviewed studies rebalance the imbalanced training data by under- or oversampling (a minimal undersampling sketch is given after the table). In this study, using data fusion and feature extraction techniques, a hybrid method is presented for a more accurate prediction of customer churn. After data preparation and feature selection, two algorithms, LOLIMOT and C5.0, are trained on feature subsets of different sizes, and the outputs of the individual classifiers are combined with weighted voting.

TABLE I. CHURN PREDICTION MODELS REVIEW

[13] Goal: explanation and application of modeling systems for automatic churn prediction for mobile service providers. Techniques: neural network. Data: telecommunication, 500,000 customers, 200 features (1). Metrics: lift. Sampling: undersampling. Feature selection: forward feature selection with decision tree and genetic algorithm.
[14] Goal: building a churn prediction model with the C4.5 algorithm and a limited number of features. Techniques: C4.5 decision tree. Data: telecommunication, 114,000 customers, 12 features (1). Metrics: lift, false positive rate, decision error. Sampling: oversampling. Feature selection: visible.
[15] Goal: using a new data mining method for churn prediction. Techniques: C4.5 decision tree, neural network, data mining by evolutionary learning. Data: telecommunication, 100,000 customers, 251 features (1). Metrics: top-5 lift. Sampling: undersampling. Feature selection: visible.
[16] Goal: churn prediction as a part of a customer lifetime value (CLV) model. Techniques: logistic regression, decision tree, neural network. Data: telecommunication, 16,384 customers, 200 features (2). Metrics: error rate, lift. Sampling: -. Feature selection: R2 method.
[17] Goal: studying variables and modeling methods for customer churn prediction. Techniques: logistic regression, linear regression, random forests, regression forests. Data: financial services, 100,000 customers, 30 features (1). Metrics: AUC. Sampling: -. Feature selection: -.
[18] Goal: comparative and functional study of customer churn prediction. Techniques: logistic regression, decision tree, neural network applied on clustered parts. Data: telecommunication, 160,000 customers, 40 features (1). Metrics: hit rate, top decile lift. Sampling: oversampling. Feature selection: visible and EDA.
[10] Goal: improving and developing churn prediction models and testing them on real customer retention campaigns. Techniques: logistic regression (with Markov chain), random forests. Data: pay-TV, 143,198 customers, 81 features (1). Metrics: accuracy, lift, AUC. Sampling: -. Feature selection: -.
[19] Goal: using support vector machines for churn prediction for newspaper subscribers. Techniques: logistic regression, support vector machine (SVM), random forests. Data: newspaper subscribers, 90,000 customers, 82 features (1). Metrics: accuracy, lift, AUC. Sampling: undersampling. Feature selection: -.
[20] Goal: improving churn prediction accuracy by combining cost-sensitive randomly weighted forests with oversampling in random forests. Techniques: improved balanced random forests (IBRF). Data: Chinese bank, 20,000 customers, 27 features (1). Metrics: accuracy, top decile lift. Sampling: increasing. Feature selection: incremental selection of the best features.
[21] Goal: analyzing customer churn behavior for predicting churn and developing retention strategies. Techniques: decision tree, neural network, C4.5, Logit, PLS, naive Bayes. Data: telecommunication, 100,000 customers, 171 features (2). Metrics: hit rate. Sampling: -. Feature selection: feature selection with PLS using VIP scores.
[22] Goal: presenting a framework for churn prediction in the electronic commerce area. Techniques: neural network, decision tree, SVM, extended SVM. Data: Chinese e-commerce website, 50,000 customers, 27 features (1). Metrics: accuracy, hit rate, cover rate, lift. Sampling: -. Feature selection: -.
[23] Goal: direct usage of behavioral data in the prediction model, without converting them to static data, by using a hierarchical model. Techniques: hierarchical multiple kernel SVM (H-MKSVM). Data: telecommunication and food shop databases, 633-8,842 customers, 47-79 features (2). Metrics: -. Sampling: -. Feature selection: -.

(1) private data, (2) public data
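As referenced above, a minimal sketch of random undersampling, the simplest rebalancing scheme in the Sampling column of Table I: all churners are kept and an equally sized random subset of non-churners is drawn. The column name `churn` is an assumption for illustration, not a field of the study's dataset.

```python
import numpy as np
import pandas as pd

def undersample(df: pd.DataFrame, label: str = "churn", seed: int = 0) -> pd.DataFrame:
    """Random undersampling: keep every minority-class (churner) row plus
    an equal-sized random sample of majority-class (non-churner) rows."""
    rng = np.random.default_rng(seed)
    churners = df[df[label] == 1]
    loyal = df[df[label] == 0]
    keep = rng.choice(loyal.index.to_numpy(), size=len(churners), replace=False)
    # Concatenate and shuffle so the classes are interleaved.
    return pd.concat([churners, loyal.loc[keep]]).sample(frac=1.0, random_state=seed)
```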
II. PROPOSED MODEL

Since combining individual algorithms usually leads to better and more accurate predictions, in this study a hybrid method for customer churn prediction is suggested. Generally, three factors should be kept in mind when combining algorithms: (1) creation of the training sets, (2) selection of the individual algorithms to be combined, and (3) the method or rules by which the results are combined. The suggested model for churn prediction includes the following steps.

A. Data Preparation

The information needed for data mining models includes demographic data such as living location, age, gender, number of children, salary, other financial data and bills, and data related to customer usage [14]. There is no need for customers' personal information such as name, email address, mailbox, father's name, etc., so it can be deleted from the database at this stage. Missing data can be omitted or replaced with special values such as the average or an amount estimated by prediction techniques.

B. Feature Selection or Extraction and Ordering

This step is important because it keeps the important features and omits the extra and noisy ones, as well as those carrying little information, thus helping data cleansing and dimension reduction [8]. In this study, principal component analysis (PCA) is used for this purpose: the goal is to reach a set of features that are the most effective in prediction. With $F_k$ denoting the k-th ordered feature:

$F_1 \geq F_2 \geq \ldots \geq F_n$ (2)

C. Constructing Subsets of Data According to the Extracted Features and Training the Model(s)

Considering the literature and the high performance of boosted C5.0 and LOLIMOT [26], these models are used as the individual classifiers, and their accuracy is improved by combining them with weighted voting.
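A minimal sketch of steps B and C, assuming scikit-learn and a placeholder data matrix: PCA yields components already ordered by explained variance, matching the ordering in (2), and nested feature subsets are then prefixes of that ordering. The subset sizes shown are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 20)   # placeholder for the prepared customer data

# PCA components come out sorted by explained variance, i.e. F1 >= F2 >= ... as in (2).
# Passing a float keeps just enough components to explain 90% of the variance,
# the variance-ratio rule applied later in Section V.
pca = PCA(n_components=0.9)
Z = pca.fit_transform(StandardScaler().fit_transform(X))

# Nested feature subsets A_1, A_2, ...: progressively longer prefixes of the
# ordered features; each subset later gets its own trained classifier.
subsets = [Z[:, :k] for k in range(3, Z.shape[1] + 1)]
```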
1) C5.0 Algorithm

C5.0 is a classification decision tree. The C5.0 algorithm is an improved form of the ID3 algorithm that constructs the decision tree based on information theory. The training data are classified samples of the form $S = \{s_1, s_2, \ldots\}$, where each sample is a vector of feature values, $s_i = (x_1, x_2, \ldots)$. The training data also include the vector that shows the class each sample belongs to, $C = \{c_1, c_2, \ldots\}$. At each tree node, the feature that produces the best classification is selected in order to assign the sample set to the related classes. This selection is done by computing the entropy-based information gain: the feature with the highest information gain is selected as the decision maker.

2) Boosting Algorithm

Boosting is an ensemble method that trains a set of classification models [27], where each training set is created based on the accuracy of the previous classification models. In this algorithm, new classifiers are constructed to predict more accurately the samples on which classification accuracy was weak. This is done by adaptive resampling, which means that samples classified incorrectly get a higher chance of being selected in the next step. Each sample has a weight, and at the end of each classification round these weights are updated. Finally, the results of the different models are ensembled by voting. Boosting can increase the accuracy of the C5.0 algorithm, but it needs more training time.
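The adaptive reweighting described above can be sketched as follows. This is an AdaBoost.M1-style loop with scikit-learn decision stumps, not C5.0's own boosting procedure; labels are assumed to be coded as +1/-1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, rounds: int = 10):
    """AdaBoost.M1-style boosting with decision stumps; y must be +1/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # start with uniform sample weights
    models, alphas = [], []
    for _ in range(rounds):
        tree = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = tree.predict(X)
        err = w[pred != y].sum()                # weighted training error
        if err >= 0.5:                          # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)          # raise weights of misclassified samples
        w /= w.sum()                            # renormalize to a distribution
        models.append(tree)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, X):
    """Combine the trees by voting, each weighted by its training-round alpha."""
    votes = sum(a * m.predict(X) for a, m in zip(alphas, models))
    return np.sign(votes)                       # ties (sign 0) are left undecided here
```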
3) LOLIMOT Algorithm

The local linear model tree (LOLIMOT) is based on the divide and conquer strategy: a complicated problem is solved by dividing it into several smaller ones [9]. To achieve an outcome with smaller error, LOLIMOT divides the problem space into several local linear models (LLMs) and, after finding the worst LLM, continues by dividing it into two LLMs. The algorithm is used as a fast tree learning algorithm in many pattern recognition and prediction problems, with remarkable results. LOLIMOT was first introduced as a locally linear neuro-fuzzy model [28]. The basic strategy of these networks is dividing the input space into small linear subsets using fuzzy activation functions. The locally linear neuro-fuzzy model is a neural network with one hidden layer, but the computation in its neurons is much more complex than that of a regular neuron. The network has $(p+1) \cdot M$ weights, where $M$ is the number of neurons and $p$ is the number of inputs. The weight and input vectors are given in (3) and (4). Each neuron has an LLM and a validity function, which indicates the validity area of the LLM. The validity function is also known as the activation function, because it controls the action of the locally linear model.

$\omega_i = [\omega_{i0}, \omega_{i1}, \ldots, \omega_{ip}]$ (3)

$u = [u_1, u_2, \ldots, u_p]^T$ (4)

The output of the LLM of the i-th neuron is obtained by (5):

$y_i = \omega_{i0} + \omega_{i1} u_1 + \ldots + \omega_{ip} u_p$ (5)

The activation function is computed using (6) and (7):

$\phi_i(u) = \dfrac{\mu_i(u)}{\sum_{j=1}^{M} \mu_j(u)}$ (6)

$\mu_i(u) = \exp\left(-\dfrac{(u_1 - c_{i1})^2}{2\sigma_{i1}^2}\right) \cdots \exp\left(-\dfrac{(u_p - c_{ip})^2}{2\sigma_{ip}^2}\right)$ (7)

where $\mu_i(u)$ is a Gaussian function with two parameters, center $c_{ij}$ and standard deviation $\sigma_{ij}$. The network output is then:

$y = \sum_{i=1}^{M} y_i \, \phi_i(u)$ (8)

In LOLIMOT, the centers and standard deviations are considered fixed, and the weights are calculated by least squares. Assuming $M$ neurons and $N$ training vectors, the regression matrix and the output vector of the training data are defined as in (9)-(11):

$X = [X_1^{(sub)}, X_2^{(sub)}, \ldots, X_M^{(sub)}]$ (9)

$X_i^{(sub)} = \begin{bmatrix} 1 & u_1(1) & \cdots & u_p(1) \\ \vdots & \vdots & & \vdots \\ 1 & u_1(N) & \cdots & u_p(N) \end{bmatrix}$ (10)

$y_i = [y_i(1), y_i(2), \ldots, y_i(N)]$ (11)

Finally, the network output is calculated by (12):

$y = X\omega$ (12)

Considering the target function (13), the error is minimized by changing the weights:

$I = \sum_{j=1}^{N} e^2(j) \rightarrow \min$ (13)

where $e(j) = y(j) - \hat{y}(j)$. LOLIMOT starts the training process from one neuron (or a specified number of neurons) and repeats the following three stages:

- The worst neuron, i.e. the one with the highest local error, is recognized by (14):

$I_i = \sum_{j=1}^{N} e^2(j)\, \phi_i(u(j))$ (14)

- All the ways this neuron can be divided along one of its dimensions are considered, and the division with the least error is selected.
- The worst neuron is recomputed.

After selecting suitable data mining algorithms, subsets of the training data are constructed based on the feature sets, and the proposed models are trained and tested on each of these training sets. The models are then evaluated and pruned according to their accuracy.

4) Combining the Outputs of the Classifiers

In this phase, the outputs of the classifiers remaining from the previous step are ensembled with one of the voting algorithms explained below. The simplest method to combine classifiers is majority voting, which gathers the outputs of the individual classifiers $C_1, C_2, \ldots, C_n$ and selects the output with the largest number of votes as the final decision:

$CT = \arg\max_i \operatorname{Count}(V(C_i))$ (15)

where $V(C_i)$ is the vote of the i-th classifier. The other voting method is weighted voting, an improved but more complicated version of majority voting. In this strategy, each classifier is assigned a weight according to its classification performance. The weights can be obtained by the following formula [29]:

$W_i = (1 - \alpha)\,\alpha^{i-1}$ (16)

where $W_i$ is the weight of the i-th classifier when the classifiers are ordered by their accuracy, $\alpha$ is a regulatory parameter with $0 < \alpha < 1$, and $\sum_{i=1}^{n} W_i \approx 1$. The voting output then becomes:

$CT = \arg\max_k \sum_i W_i \, V_k(C_i)$ (17)

In this method, different weights are given to the outputs according to their accuracy. If the total weight of the classifiers that estimate that a customer is a churner exceeds the total weight of the remaining classifiers, the customer is predicted as a churner; otherwise the customer is predicted as loyal.
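A minimal sketch of this voting step, assuming the geometric reconstruction of (16) given above (rows of the vote matrix sorted from the most to the least accurate classifier, votes coded 0/1); the example numbers are hypothetical.

```python
import numpy as np

def vote_weights(n: int, alpha: float = 0.5) -> np.ndarray:
    """Weights per (16) for n classifiers ordered by accuracy (best first);
    they sum to 1 - alpha**n, which is approximately 1."""
    return (1 - alpha) * alpha ** np.arange(n)

def weighted_vote(votes: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """votes: (n_classifiers, n_customers) array of 0/1 churn votes.
    A customer is a predicted churner when the total weight voting
    'churn' exceeds the total weight voting 'loyal'."""
    w = vote_weights(votes.shape[0], alpha)
    churn_mass = w @ votes                     # total weight voting "churn"
    loyal_mass = w @ (1 - votes)               # total weight voting "loyal"
    return (churn_mass > loyal_mass).astype(int)

# Hypothetical example: 3 classifiers (best first), 4 customers.
votes = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 0]])
print(weighted_vote(votes, alpha=0.5))         # -> [1 0 1 0]
```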
III. EVALUATION CRITERIA

The lift criterion is a performance measure defined as the ratio between the outcomes obtained with and without the use of the prediction model. A higher lift means a more accurate model and, intuitively, a more profitable targeted proactive churn management program [5]. Calculating the lift criterion over the whole churn management database is neither logical nor practical: it is not economically feasible for the organization to apply churn management to all customers, so it concentrates on those with higher churn probability. For this reason, evaluation usually concentrates on the top 10-20% of customers in marketing applications [26, 30]. As the top decile lift is the main criterion in the churn management area [5], it has been used to evaluate the performance of the applied algorithms. The top decile lift focuses on the customers predicted most likely to churn; the first 10% are the most critical customers (i.e. the ones with high churn probability), and this percentage is an ideal portion for targeting a retention marketing campaign [9]. To compute the top decile lift, the customers are first sorted from the most to the least likely to churn, as predicted by the model. Then the proportion of actual churners among the first 10% is calculated and divided by the overall churn proportion. If all customers in the first 10% are recognized correctly and the ratio of churners to other customers is 50-50, the top decile lift equals 100/50 = 2. In addition to lift, the proposed algorithm was compared with other algorithms from the literature using two further criteria: accuracy and the area under the ROC curve (AUC).
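A hedged sketch of the top decile lift computation described above: customers are ranked by predicted churn probability, and the churn rate within the top 10% is divided by the overall churn rate.

```python
import numpy as np

def top_decile_lift(churn_prob: np.ndarray, churned: np.ndarray) -> float:
    """Churn rate among the 10% of customers ranked most likely to churn,
    divided by the overall churn rate (lift = 1 means no better than random)."""
    order = np.argsort(churn_prob)[::-1]          # most likely churners first
    top = order[: max(1, len(order) // 10)]
    return churned[top].mean() / churned.mean()

# With a 50-50 churner split and a perfect top decile (all churners),
# the lift is 1.0 / 0.5 = 2, matching the 100/50 figure quoted above.
```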
IV. DATASET

The dataset is provided by the Teradata Center at Duke University. It contains data of mature subscribers (i.e. customers who had been with the company for at least six months) of a major U.S. telecommunication company, organized in three datasets: calibration data, current score data and future score data. A total of 172 variables are included, one for churn indication and 171 for prediction. The prediction variables are of three types: behavioral data such as minutes of use, revenue and handset equipment, company interaction data such as customer calls to the customer service center, and customer household demographics. The churn response is coded as a dummy variable, with churn=1 if the customer churns and churn=0 otherwise. There are 100,000 records in the calibration dataset, 51,306 in the current score dataset and 100,462 in the future score dataset. The actual average monthly churn rate is reported to be around 1.8%. The calibration sample is balanced, with a 50-50 split between churners and non-churners, while the churn rate in the current score and future score datasets is about 1.8%. In our experiments, the calibration dataset (the balanced one) is used as the training set and the future score dataset as the test set. It should be mentioned that the customers in each dataset are different. Figure 1 shows the training set in two dimensions: churners (red circles) and non-churners (blue squares) overlap noticeably, which makes their recognition difficult.

Fig. 1. Training set in two dimensions

V. METHODOLOGY

A. Data Preparation

At first, all features with more than 30% missing values were omitted. Then, missing non-numerical data were given a new value and missing numerical data were filled by linear interpolation, which replaces a missing value using the linear relations between the known and the missing values. This procedure was carried out on the SPSS 18.0 platform.

B. Feature Selection

There are 171 features in the used dataset. After omitting the features with more than 30% missing values, 144 features remained, of which 120 were numerical and 24 non-numerical. The non-numerical features were excluded from the subsequent stages.

C. PCA Implementation

Like most other CRM databases, the database used in this study is big and has many highly correlated dimensions, as can be seen in Figure 2. Correlation between features reduces the accuracy of the models, while the large amount of data makes the analysis difficult. So, for reducing the dimensions and choosing the best and most informative features, PCA was used. After applying PCA, the number of principal components is defined; these components are used as the features of the next phase. For defining their number, the variance ratio criterion was used, which is one of the most important ones in this field; this ratio is usually between 70-90% [31].

Fig. 2. Correlation matrix with values over 0.5

In this phase, 55 features that account for 90% of the variance were selected. The result of this phase is 55 features, ordered by descending importance.

D. Feature Subsets Development

Feature subsets are constructed and the training subsets are developed based on them:

$A_1 = \{F_1, F_2, F_3\},\ A_2 = \{F_1, F_2, F_3, F_4\},\ \ldots,\ A_{50} = \{F_1, F_2, F_3, \ldots, F_{55}\}$

E. Model Training

The LOLIMOT algorithm was implemented on the MATLAB (R2011a) platform, and C5.0 was run in the Clementine software. At first, LOLIMOT and boosted C5.0 were applied to the training subsets separately and then tested. After testing, the classifiers were ordered according to their accuracy, and those below 50% were omitted. Table II shows the classifiers with accuracy higher than 50%; finally, 18 classifiers remained.

TABLE II. CLASSIFIERS WITH HIGHER THAN 50% ACCURACY

C5.0-boosting: 10, 12, 13, 24, 37, 50 and 55 features
LOLIMOT: 5, 10, 15, 20, 25, 30, 32, 45, 48, 50 and 55 features

F. Individual Classifiers Combination

To obtain the final output, the outcomes of these classifiers must be combined with one of the voting methods. The results of weighted voting with different values of the regulatory parameter (0.1 ≤ α ≤ 1) were compared with majority voting (Figure 3). Figure 4 shows the sensitivity of weighted voting to the different values of α. Tables III-IV compare the different methods mentioned in the churn prediction literature with the proposed model according to the top decile lift and accuracy criteria. Figure 5 represents the results of the methods mentioned in Table III, where the proposed model is named HWV.
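The α sensitivity sweep behind Figures 3-4 can be reproduced in outline as below; the vote matrix and labels are random placeholders standing in for the 18 trained classifiers and the held-out test labels, so the printed accuracies are meaningless except as a demonstration of the sweep.

```python
import numpy as np

# Hypothetical stand-ins: 0/1 churn votes of 18 ordered classifiers
# (best first) on 1,000 test customers, plus true churn labels.
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(18, 1000))
y_true = rng.integers(0, 2, size=1000)

# Majority voting baseline (ties go to "loyal").
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("majority voting accuracy:", (majority == y_true).mean())

# Weighted voting accuracy for a sweep of the regulatory parameter alpha.
for alpha in np.arange(0.1, 1.0, 0.1):
    w = (1 - alpha) * alpha ** np.arange(votes.shape[0])   # weights per (16)
    weighted = ((w @ votes) > (w @ (1 - votes))).astype(int)
    print(f"alpha={alpha:.1f} accuracy: {(weighted == y_true).mean():.3f}")
```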
In Table V the different models are compared according to the AUC criterion; it can be seen that the suggested model has the highest AUC. The suggested method also has better top decile lift than the other hybrid methods. Figure 6 shows the comparison between the different models according to AUC.

Fig. 3. Comparison between majority and weighted voting
Fig. 4. Sensitivity of weighted voting to the regulatory parameter (α)
Fig. 5. Comparison between different models according to top decile lift
Fig. 6. Comparison between different models according to AUC

TABLE III. MODEL COMPARISON ACCORDING TO TOP DECILE LIFT (FIGURE 5)

Suggested model: 2.51
LOLIMOT: 1.94
C5.0: 1.87
Neural network: 1.84
Logistic regression: 1.43
Bayesian network: 1.35

TABLE IV. MODEL ACCURACY COMPARISON

Proposed model: 92%
LOLIMOT: 70%
Logistic regression: 60%
C5.0: 50%
Neural network: 45%
Bayesian network: 44%

TABLE V. MODEL COMPARISON ACCORDING TO AUC (FIGURE 6)

Proposed model (1): 0.67
LOLIMOT (2): 0.62
Logistic regression (3): 0.58
C5.0 (4): 0.62
Neural network (5): 0.62
Bayesian network (6): 0.57

VI. CONCLUSION

As churn data are usually noisy and imbalanced, non-hybrid models are often not accurate enough, so in this study a hybrid model was presented to increase the accuracy of the classifiers. With the presented hybrid method, after gathering suitable data and applying the necessary preprocessing phases, the most important features were selected and given to the predictors, and the results were combined to achieve the final outcome. The method was evaluated using data from a telecommunication company, and its higher effectiveness in comparison with the available methods was demonstrated. Two algorithms, LOLIMOT and C5.0, were used in this method, but a larger number of algorithms can be combined, and other ensemble methods that lead to more accurate prediction can be examined (see fuzzy regression [32], ANFIS [33-34], GA-ANN [35] and ICA [36]) to produce a more efficient scheme. The number of features, the feature subsets and the training subsets can affect the prediction accuracy, so improving the methods for selecting the optimal number of these subsets can also improve the model.

REFERENCES

[1] E. Ko, S. H. Kim, M. Kim, J. Y. Woo, "Organizational characteristics and the CRM adoption process", Journal of Business Research, Vol. 61, No. 1, pp. 65–74, 2008
[2] S. Gupta, D. Hanssens, B. Hardie, W. Kahn, V. Kumar, N. Lin, N. Ravishanker, S. Sriram, "Modeling customer lifetime value", Journal of Service Research, Vol. 9, No. 2, pp. 139–155, 2006
[3] G. Nie, W. Rowe, L. Zhang, Y. Tian, Y. Shi, "Credit card churn forecasting by logistic regression and decision tree", Expert Systems with Applications, Vol. 38, No. 12, pp. 15273–15285, 2011
[4] J. Lu, "Predicting customer churn in the telecommunications industry: An application of survival analysis modeling using SAS", SAS User Group International (SUGI27) Online Proceedings, pp. 114–127, 2002
[5] S. A. Neslin, S. Gupta, W. Kamakura, J. Lu, C. Mason, "Defection detection: Improving predictive accuracy of customer churn models", Tuck School of Business, Dartmouth College, 2004
[6] N. Glady, B. Baesens, C. Croux, "Modeling churn using customer lifetime value", European Journal of Operational Research, Vol. 197, No. 1, pp. 402–411, 2009
[7] D. Van den Poel, B. Larivière, "Customer attrition analysis for financial services using proportional hazard models", European Journal of Operational Research, Vol. 157, No. 1, pp. 196–217, 2004
[8] J. Hadden, A. Tiwari, R. Roy, D. Ruta, "Computer assisted customer churn management: State-of-the-art and future trends", Computers & Operations Research, Vol. 34, No. 10, pp. 2902–2917, 2007
[9] A. Ghorbani, F. Taghiyareh, C. Lucas, "The application of the locally linear model tree on customer churn prediction", International Conference of Soft Computing and Pattern Recognition (SOCPAR'09), Malacca, Malaysia, pp. 472–477, December 4-7, 2009
[10] J. Burez, D. Van den Poel, "CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services", Expert Systems with Applications, Vol. 32, No. 2, pp. 277–288, 2007
[11] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006
[12] P. C. Pendharkar, "Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services", Expert Systems with Applications, Vol. 36, No. 3, pp. 6714–6720, 2009
[13] P. Datta, B. Masand, D. R. Mani, B. Li, "Automated cellular modeling and prediction on a large scale", Artificial Intelligence Review, Vol. 14, No. 6, pp. 485–502, 2000
[14] C.-P. Wei, I. Chiu, "Turning telecommunications call details to churn prediction: a data mining approach", Expert Systems with Applications, Vol. 23, No. 2, pp. 103–112, 2002
[15] W.-H. Au, K. C. Chan, X. Yao, "A novel evolutionary data mining algorithm with applications to churn prediction", IEEE Transactions on Evolutionary Computation, Vol. 7, No. 6, pp. 532–545, 2003
[16] H. Hwang, T. Jung, E. Suh, "An LTV model and customer segmentation based on customer value: a case study on the wireless telecommunication industry", Expert Systems with Applications, Vol. 26, No. 2, pp. 181–188, 2004
[17] B. Larivière, D. Van den Poel, "Predicting customer retention and profitability by using random forests and regression forests techniques", Expert Systems with Applications, Vol. 29, No. 2, pp. 472–484, 2005
[18] S.-Y. Hung, D. C. Yen, H.-Y. Wang, "Applying data mining to telecom churn management", Expert Systems with Applications, Vol. 31, No. 3, pp. 515–524, 2006
[19] D. Anil Kumar, V. Ravi, "Predicting credit card customer churn in banks using data mining", International Journal of Data Analysis Techniques and Strategies, Vol. 1, No. 1, pp. 4–28, 2008
[20] Y. Xie, X. Li, E. W. T. Ngai, W. Ying, "Customer churn prediction using improved balanced random forests", Expert Systems with Applications, Vol. 36, No. 3, pp. 5445–5449, 2009
[21] H. Cho, Y. Lee, H. Lee, "Toward Optimal Churn Management: A Partial Least Square (PLS) Model", Proceedings of the Sixteenth Americas Conference on Information Systems (AMCIS), Lima, Peru, August 12-15, 2010
[22] X. Yu, S. Guo, J. Guo, X. Huang, "An extended support vector machine forecasting framework for customer churn in e-commerce", Expert Systems with Applications, Vol. 38, No. 3, pp. 1425–1430, 2011
[23] Z.-Y. Chen, Z.-P. Fan, M. Sun, "A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data", European Journal of Operational Research, Vol. 223, No. 2, pp. 461–472, 2012
[24] K. Coussement, K. W. De Bock, "Customer churn prediction in the online gambling industry: The beneficial effect of ensemble learning", Journal of Business Research, Vol. 66, No. 9, pp. 1629–1636, 2013
[25] K. W. De Bock, K. Coussement, D. Van den Poel, "Ensemble classification based on generalized additive models", Computational Statistics & Data Analysis, Vol. 54, No. 6, pp. 1535–1546, 2010
[26] J. Basiri, F. Taghiyareh, B. Moshiri, "A Hybrid Approach to Predict Churn", 2010 IEEE Asia-Pacific Services Computing Conference (APSCC), Hangzhou, China, pp. 485–491, December 6-10, 2010
[27] Y. Freund, R. E. Schapire, "Experiments with a new boosting algorithm", Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, July 3-6, 1996
[28] M. Mirmomeni, C. Lucas, B. N. Araabi, M. Shafiee, "Forecasting sunspot numbers with the aid of fuzzy descriptor models", Space Weather, Vol. 5, No. 8, 2007
[29] M. Siami, M. R. Gholamian, J. Basiri, "An application of locally linear model tree algorithm with combination of feature selection in credit scoring", International Journal of Systems Science, Vol. 45, No. 10, pp. 2213–2222, 2014
[30] A. Lemmens, C. Croux, "Bagging and boosting classification trees to predict churn", Journal of Marketing Research, Vol. 43, No. 2, pp. 276–286, 2006
[31] I. Jolliffe, Principal Component Analysis, Wiley Online Library, 2005
[32] S. Nazari-Shirkouhi, A. Keramati, "Modeling customer satisfaction with new product design using a flexible fuzzy regression-data envelopment analysis algorithm", Applied Mathematical Modelling, Vol. 50, pp. 755–771, 2017
[33] S. Nazari-Shirkouhi, A. Keramati, K. Rezaie, "Improvement of customers' satisfaction with new product design using an adaptive neuro-fuzzy inference systems approach", Neural Computing and Applications, Vol. 23, No. 1, pp. 333–343, 2013
[34] V. Majazi Dalfard, M. Nazari Asli, S. Nazari-Shirkouhi, S. M. Sajadi, S. M. Asadzadeh, "Incorporating the effects of hike in energy prices into energy consumption forecasting: A fuzzy expert system", Neural Computing and Applications, Vol. 23, No. 1, pp. 153–169, 2013
[35] A. Azadeh, S. M. Asadzadeh, R. Jafari-Marandi, S. Nazari-Shirkouhi, G. Baharian Khoshkhou, S. Talebi, A. Naghavi, "Optimum estimation of missing values in randomized complete block design by genetic algorithm", Knowledge-Based Systems, Vol. 37, pp. 37–47, 2013
[36] S. Nazari-Shirkouhi, H. Eivazy, R. Ghodsi, K. Rezaie, E. Atashpaz-Gargari, "Solving the integrated product mix-outsourcing problem using the imperialist competitive algorithm", Expert Systems with Applications, Vol. 37, No. 12, pp. 7615–7626, 2010