FACTA UNIVERSITATIS
Series: Economics and Organization Vol. 18, No 1, 2021, pp. 29 - 43
https://doi.org/10.22190/FUEO201028001R
© 2021 by University of Niš, Serbia | Creative Commons Licence: CC BY-NC-ND

Original Scientific Paper

CREDIT SCORING WITH AN ENSEMBLE DEEP LEARNING CLASSIFICATION METHODS – COMPARISON WITH TRADITIONAL METHODS

UDC 336.77:519.2

Ognjen Radović, Srđan Marinković, Jelena Radojičić
University of Niš, Faculty of Economics, Serbia

Abstract. Credit scoring attracts special attention of financial institutions. In recent years, deep learning methods have been particularly interesting. In this paper, we compare the performance of ensemble deep learning methods based on decision trees with the best traditional method, logistic regression, and the machine learning method benchmark, support vector machines. Each method tests several different algorithms. We use different performance indicators. The research focuses on standard datasets relevant for this type of classification, the Australian and German datasets. The best method, according to the MCC indicator, proves to be the ensemble method with boosted decision trees. Also, on average, ensemble methods prove to be more successful than SVM.

Key words: credit scoring, classifier ensemble, deep learning, support vector machine

JEL Classification: C38, C45, C55, G17, G24

1. INTRODUCTION

Credit scoring is a quantitative method for assessing the credit risk involved in granting a loan to a borrower. Credit scoring is one of the key stages in credit analysis. It is a method of assessing the creditworthiness of a client applying for a loan. The goal of creditworthiness assessment is to classify credit applications into acceptable and unacceptable, but also to provide the necessary inputs for the next phases of credit analysis, such as determining the credit volume, interest rates, collateral, restrictive clauses and the like.
Traditional credit analysis relies on historical and verifiable information or accounting data. The most general credit analysis framework many traditional lenders use to assess credit users is the 5C approach. Credit analysts make decisions based on the following criteria: the debtor’s character, capacity (ability to repay), capital, collateral and market conditions. The main advantage of this method is that it can be used to make credit decisions in various types of business and consumer loans without significant adjustments.

Received October 28, 2020 / Accepted December 02, 2020
Corresponding author: Jelena Radojičić
University of Niš, Faculty of Economics, Trg kralja Aleksandra 11, 18000 Niš, Serbia
E-mail: jelena.radojicic@eknfak.ni.ac.rs

Traditionally, creditworthiness assessment of a corporate borrower is based on financial indicators that indicate possible problems in loan repayment and play an important role in calculating credit risk levels. These indicators are characterized by objectivity because they are calculated on the basis of data in borrowers’ financial statements. The analyst’s experience and impressions can play a significant role in creditworthiness assessment of the entity applying for a loan. Prior to the introduction of credit scoring models, the decision to grant a loan was made based on the credit analyst’s assessment. One of the disadvantages of such an approach is the inability to process a large number of applications per day, which has given rise to various credit scoring models to quantify the credit risk (Dastile, Celik, & Potsane, 2020). Banks used information received within a loan application (e.g., number of dependents, period in the current job, etc.) to calculate the borrowers’ numerical score (Lewis, 1992). Credit scoring, as a precise and automatic creditworthiness assessment tool, is a particularly important factor in the expansion of consumer lending (Thomas, Crook, & Edelman, 2002).
Credit scoring is mostly applied in consumer loans, credit cards and mortgage loans (Einav, Jenkins, & Levin, 2013). Advances in information technology facilitate further development of credit scoring models for making objective and quick decisions (Thomas, Crook, & Edelman, 2002). Credit scoring is a technique that financial organizations use when deciding whether to approve or reject their clients’ loan applications (application scoring). Credit scoring models can also be applied to analyze the behavior of existing clients, in which case the score represents a numerical summary of the bank’s experience with the client (behavior scoring) (Hui, Li, & Zongfang, 2017). The Basel II agreement of 2004 expanded the field for developing more sophisticated credit scoring models, thus allowing banks to assess the probability of default on their own under the internal ratings-based (IRB) approach (Goh & Lee, 2019).

2. TRADITIONAL STATISTICAL METHODS USED IN CREDIT SCORING

Credit scoring is a multi-stage process (Bequé & Lessmann, 2017), and it basically compares the borrower’s characteristics with the characteristics of other clients from the previous period. The statistical model relies on historical data. Its goal is to predict future behavior in loan repayment based on previous experience with loan users with similar characteristics. If the borrower is similar to “bad” clients (who did not repay the approved loan properly), the application is rejected, and if the borrower is similar to “good” clients (who repaid the approved loan properly), the loan application is approved. The loan applicant’s score is compared with the established cut-off score. If the obtained score is higher than the cut-off score, credit is approved, and if the score is lower than the cut-off, the application is rejected. The cut-off score is crucial for the usefulness of the credit scoring model and mainly relies on the credit decision makers’ attitudes towards risk.
So, there is no optimal cut-off value. “It varies from one environment to another and from one bank to another inside the same country” (Abdou & Pointon, 2011). The final step is to measure the accuracy of the credit scoring model and monitor business performance indicators. “The choice of a statistical model is crucial because it affects all subsequent activities and credit scoring performance” (Bequé & Lessmann, 2017).

The models that financial institutions use help them decide whether or not to grant a loan. As the final decision on approval is binary, this is a binary classification problem. Credit scoring involves “formal statistical methods to classify loan applicants into ‘good’ and ‘bad’ risk classes” (Hand & Henley, 1997). Credit scoring algorithms are basically statistical in nature: they use empirical evidence to formulate predictions about the future. Prediction models assess a continuous variable while classification models predict class membership. In credit scoring, the dependent variable is actually binary, so most algorithms can be considered classification algorithms (Abdou & Pointon, 2011).

Statistical methods such as linear regression and discriminant analysis require the assumption of a linear relationship between variables. In credit scoring, linear regression is applied to the binary classification problem. Discriminant analysis is a simple parametric statistical technique for classifying loans into good and bad. Fisher (1936) suggests the application of discriminant analysis as a classification technique, and in 1941 Durand used discriminant analysis to classify “good” and “bad” car loans, thus beginning the trend of applying statistical models in credit scoring.
Later, Altman (1968) developed a Z-score model with financial indicators, using variables from corporate financial statements as input variables for a discriminant analysis model to predict company bankruptcies. “Discriminant analysis is one of the frequently used techniques in credit scoring” (Abdou & Pointon, 2011). Orgler (1971) applies regression analysis for credit scoring of consumer loans in banks, after using this method somewhat earlier to evaluate the already existing commercial loans (Orgler, 1971). He concludes that “information not included in the loan application form has a greater predictive ability” to assess loan quality in the future than the information included in the form. In 1980, Ohlson proposed the use of logistic regression (LOGIT) as a creditworthiness assessment method in companies (Ohlson, 1980). The outcome variable in logistic regression is dichotomous (outcome 0/1), so this method is suitable for modeling binary outcomes and is widely used in creditworthiness assessment due to its simplicity and transparency (Dastile, Celik, & Potsane, 2020; Abdou & Pointon, 2011). Within traditional methods, logistic regression has become the standard credit scoring model due to its compliance with the Basel II standard (Goh & Lee, 2019).

3. APPLICATION OF MACHINE LEARNING IN CREDIT SCORING

More sophisticated methods of credit scoring that the literature has offered in recent years are machine learning algorithms and data mining methods. The point of using sophisticated techniques is their ability to model extremely complex functions (Abdou & Pointon, 2011). Machine learning models learn from available data, thus allowing the calculation of predictive value. A machine learning algorithm learns a set of rules based on the information available in a training set of examples. Machine learning models have the potential to replace the logistic regression model in credit scoring because they show great prediction accuracy.
However, the inability of certain models to explain their predictions, i.e. lack of transparency, limits their application in regulated financial institutions. The application of machine learning algorithms may include (Dastile, Celik, & Potsane, 2020): “k-Nearest Neighbor (k-NN), Decision Trees (DTs), Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forests (RFs), Boosting, Extreme Gradient Boost (XGBoost), Bagging, Restricted Boltzmann Machines (RBMs), Deep Multi-Layer Perceptron (DMLP), Convolutional Neural Networks (CNNs) and Deep Belief Neural Networks (DBNs)”.

The Decision Tree creates a discriminant function in the form of a tree, from the root to the leaves. Each node represents a logical test of the attribute value, and the leaves denote classes. Input observations are recursively split into sub-branches until the final result (class designation) is reached. Mathematical formulas such as the Gini index (CART) or entropy (in ID3, C4.5, C5, J4.8 decision tree algorithms) are used to determine the splitting threshold (Patil, Aghav, & Sareen, 2016; Bequé & Lessmann, 2017). The tree learns by asking questions that will solve the problem in the fastest and most accurate way. Each time the algorithm is repeated, the attribute value is compared to the threshold. Thresholds determine which attribute should be tested and when tree growth should be stopped (Bequé & Lessmann, 2017). After the training, the tree can predict the outcome if applied to data of the same type and format. Low decision tree accuracy may result from insufficient depth or from noise in the data (overfitting). This classification technique is widely used in credit scoring models (Dastile, Celik, & Potsane, 2020).

Since its introduction into the theory of statistical learning in 1998 (Vapnik, 1998), the support vector machines (SVM) method has been used as a binary classifier in machine learning.
SVM algorithms were introduced in the field of credit scoring by Baesens et al. (2003), whose comparison with other classification algorithms (logistic regression, discriminant analysis, k-nearest neighbor, neural networks, decision tree) indicated their good performance. SVM, as a discriminant model, is based on the margin for classification between classes, i.e. it focuses on finding the boundary that separates two classes with the smallest error. In the SVM model, data is viewed as vectors in n-dimensional space. The focus is on the maximum margin between classes, i.e. the space between the two hyperplanes that separate the data of different classes and pass through one or more support vectors. The separating hyperplane is located farthest from the data and is determined by the position of the data of both classes closest to it. If the classes are denoted by $y=+1$ and $y=-1$, the linear SVM classifier maximizes the margin $2/\|\alpha\|$ between the negative and positive hyperplanes, where

\[ \|\alpha\| = \sqrt{\sum_{i=1}^{m} \alpha_i^2}. \]

The following equation is used to assign a class (Dastile, Celik, & Potsane, 2020):

\[ y = \begin{cases} +1, & \text{if } b + \alpha^T x \geq +1 \\ -1, & \text{if } b + \alpha^T x \leq -1 \end{cases} \tag{1} \]

where $b$ is the bias. The SVM method can also be used for data classes that are not linearly separable. “For nonlinear classification, a kernel trick is used to modify the SVM formulation” (Dastile, Celik, & Potsane, 2020). Using the appropriate kernel function, the example is mapped to a space with a larger number of dimensions and the nonlinear problem is transformed into a linear one.

A neural network consists of several neurons that work in parallel, without centralized control. An artificial neural network mimics the way a biological neural network processes information. Neurons are usually organized in layers. The neural network usually consists of three layers: input, hidden, and output. There are several types of neural networks, but their common components are a set of nodes and connections between nodes.
Nodes represent computational units whose task is to receive inputs, process them and produce an output (Bequé & Lessmann, 2017). First, the input features are passed to the hidden layer, and the hidden layer then applies the corresponding weights before forwarding the information to the output layer. Each layer of the neural network “consists of several elements, i.e. neurons. The number of input neurons depends on the number of predictors, the number of hidden neurons is a setting parameter determined by the analyst, and the number of output neurons is determined by the modeling task itself, e.g. for binary classification it is one” (Bequé & Lessmann, 2017). Artificial neural networks were first applied by Odom and Sharda (1990) in credit scoring. For a given vector of the input attribute $x$, the three-layer neural network calculates the output value $\hat{y}$ as follows (Dastile, Celik, & Potsane, 2020):

\[ \hat{y} = a_2\left( \alpha^{(2)} a_1\left( \alpha^{(1)} x + \alpha_0^{(1)} \right) + \alpha_0^{(2)} \right) \tag{2} \]

where $(\alpha_0^{(1)}, \alpha^{(1)})$ and $(\alpha_0^{(2)}, \alpha^{(2)})$ are weights, and $a_1$ and $a_2$ are the activation functions applied at the hidden and output layers. Neural networks are trained through a training set, and the final decision is made by applying the decision function to $\hat{y}$.

4. ENSEMBLE OF CLASSIFIERS – LITERATURE REVIEW

Elementary classifiers represent a unique set of statistical relationships. An ensemble of classifiers consists of a set of individually trained base classifiers whose decisions are combined in a certain way (weighted voting or unweighted voting) when new examples are classified. An ensemble of classifiers is a “combination of classifiers so that their fusion achieves better performance than stand-alone classifiers” (Nanni & Lumini, 2009). Combining different machine learning algorithms can improve the accuracy of results (Dastile, Celik, & Potsane, 2020).
Ensemble algorithm techniques are used to aggregate the results of “unstable” algorithms in which small changes in the training set lead to large changes in the learned set of rules (e.g. neural networks, decision trees). The application of ensemble learning techniques requires the simultaneous fulfillment of the following assumptions (Pławiak, Abdar, & Acharya, 2019): “a) quality, b) statistical independence (diversity) and c) efficiency (speed)”. Ensemble of classifiers is used to achieve better performance in various research areas such as computer intelligence, statistics, and machine learning (Ren, Zhang, & Suganthan, 2016). Different ensembles of classifiers are used in literature (Pławiak, Abdar, & Acharya, 2019): a) Boosting (AdaBoost), b) Bagging (Bootstrap aggregation), c) Random Forest, d) Stacking (Stacked Generalization) and e) Mixtures of Experts. Boosting is the most commonly used method in the ensemble of classifiers (Dastile, Celik, & Potsane, 2020). It starts with a weak model (for example, a shallow decision tree), and then the models are iteratively evaluated and amplified (Freund & Schapire, 1997). Boosting produces a series of classifiers (Bequé & Lessmann, 2017). Each subsequent classifier focuses on examples that have been misclassified by the previous classifier. Each example in the training set is assigned a weight in accordance with the significance of the example in the set. Examples misclassified by the previous model are assigned a higher weight. After individual classifier learning, the weights are updated on the test set. The accuracy of individual classifiers on a test set is determined by the weight of that classifier in the classification of new examples by applying an ensemble of classifiers. One of the most well-known boosting techniques is Adaptive Boosting (AdaBoost) (Freund & Schapire, 1997). 
The AdaBoost algorithm has one parameter, T, the number of generated classifiers (iterations), and is characterized by simplicity and efficiency. AdaBoost assigns a class to the input attribute vector $x$ as follows:

\[ \hat{y} = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t \phi_t(x) \right) \tag{3} \]

where $\alpha_t$ is the weight of the classifier $\phi_t(x)$.

Bagging generates multiple versions of classifiers that are used as an aggregate predictor through a voting mechanism (Breiman, 1996). Bootstrap aggregation generates the classifiers not by iteratively dividing the data into a training set and a test set, but by random selection with replacement. The training set is formed by successive sampling (with repetition) of data from the initial set. Data never selected form a test set, while the rest is used for training. The process is repeated several times, and the overall score is obtained as the average score on all thus formed sets for verification (Breiman, 1996). Bootstrapping generates K training sets, and then one basic classifier is trained on each of them. The class label is assigned by a majority vote of the K classifiers, as follows (Dastile, Celik, & Potsane, 2020):

\[ y = \underset{y \in \{+1,-1\}}{\arg\max} \sum_{i=1}^{K} \mathbf{1}\left( y = \phi_i(x) \right) \tag{4} \]

where

\[ \mathbf{1}\left( y = \phi_i(x) \right) = \begin{cases} 1, & \text{if } y = \phi_i(x); \\ 0, & \text{if } y \neq \phi_i(x). \end{cases} \tag{5} \]

The Random Forest algorithm consists of several decision trees. The new examples are classified by the voting method based on the decisions of individual trees. Individual trees are not trained on all samples and attributes, but on a certain number of randomly selected attributes and samples from the training dataset. Each decision tree develops on a subset of randomly selected attributes. The best attribute is chosen for the decision tree node. Selecting the right attributes (questions) to be tested in a particular node reduces the entropy, or provides additional information about the sample.
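As a minimal illustration of the voting rules in equations (3) and (4) above, the following Python sketch combines the outputs of already trained base classifiers. The function names, the example weights and the tie-breaking rule (ties resolved in favour of +1) are assumptions made for illustration only; they are not taken from the paper or from any particular library.

```python
from collections import Counter


def adaboost_predict(alphas, votes):
    """Equation (3): sign of the weighted sum of base-classifier votes (+1 or -1).

    Assumption: a weighted sum of exactly zero is mapped to +1.
    """
    score = sum(alpha * vote for alpha, vote in zip(alphas, votes))
    return 1 if score >= 0 else -1


def bagging_predict(votes):
    """Equation (4): the label in {+1, -1} that receives the most votes.

    Assumption: with an equal number of votes, +1 is returned.
    """
    counts = Counter(votes)
    return max((1, -1), key=lambda label: counts[label])


# Hypothetical outputs of three base classifiers for one applicant.
votes = [-1, 1, 1]
alphas = [0.9, 0.3, 0.2]  # hypothetical AdaBoost classifier weights

weighted_label = adaboost_predict(alphas, votes)
majority_label = bagging_predict(votes)
```

In this hypothetical case the two rules disagree: the unweighted majority is +1, while the single heavily weighted classifier pushes the weighted vote to -1, which is the practical difference between equations (3) and (4).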
In a random forest algorithm, multiple decision trees learn from randomly selected data leading to greater tree diversity and depth. The greater depth of the trees makes the random forest algorithm more resistant to underfitting (insufficiently good interpretation of the relationships between variables within the dataset) and overfitting (noise in the data) compared to individual decision trees. Deep Learning, as one of the machine learning fields, is based on a hierarchical architecture that includes multiple layers of nonlinear operations and steps in information processing. Some of the deep machine learning techniques used in credit scoring are (Pławiak, Abdar, & Acharya, 2019): „(a) deep discriminant models such as: Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), as well as Convolutional Neural Networks (CNNs), (b) unsupervised learning (generative models) such as: Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs) as well as Regularized Autoencoders.” Deep learning classifiers are not widely used in credit scoring (Dastile, Celik, & Potsane, 2020). Disadvantages of deep learning are (Pławiak, Abdar, & Acharya, 2019): “(1) computationally complex training, (2) long and inefficient training and (3) the overfitting effect, which prevents its effective practical use”. Predictive accuracy is a basic measure of classification success. Prediction accuracy is the percentage of success in classifying new examples using learned rules. Even small errors in creditworthiness assessment can lead to large losses, so increasing the accuracy of credit scoring is of great importance for the profitability of banks (Pławiak, Abdar, & Acharya, 2019). In this regard, more sophisticated credit scoring models have significant potential (Dastile, Celik, & Potsane, 2020). 
In one of the earliest reviews of statistical methods and data mining methods applied in credit scoring, Hand and Henley (1997) conclude that further development should go towards more complex data mining (DM) models, which later literature reviews confirmed (Sadatrasoul et al., 2013). A comparison of different approaches to credit scoring shows that advanced machine learning-based techniques may have better predictive ability than conventional techniques such as logistic regression and discriminant analysis (Abdou & Pointon, 2011). Nanni and Lumini (2009) investigate several ensembles of classifiers used in credit scoring and bankruptcy prediction and improve the performance obtained using the stand-alone classifiers. Wang et al. (2011) carry out a comparative performance evaluation of “three ensemble methods (Bagging, Boosting, Stacking) of basic classifiers: logistic regression, decision tree, artificial neural network, and support vector machine (SVM)”. The results show that ensemble methods improve performance and that bagging gives better results than boosting. Lou et al. (2017) compare the classification success of deep learning algorithms in credit scoring with widely used models such as logistic regression and SVM, and find that deep learning models have better performance. Li et al. (2017) develop a “model for credit risk assessment using deep neural networks” and show that the proposed algorithm has greater accuracy in credit risk assessment. The results of the simulation by Zhou et al. (2012) show that the extreme learning machine (ELM) is a more suitable approach for credit risk assessment than SVM.
Bequé and Lessmann (2017) investigate the potential application of ELM for credit scoring and compare it with other classifiers (neural networks, k-nearest neighbor, SVM, classification and regression decision tree, logistic regression) within three dimensions: “ease of use, computational complexity and predictive performance”. They conclude that “ELM shows competitive or better results in each dimension of comparison and especially proves high discriminant power, both in isolation and within the ensemble and, therefore, represents a competitive alternative to already established classifiers in the field of credit scoring”. Neagoe et al. (2018) design a credit scoring model using neural network classifiers, namely the Multilayer Perceptron (MLP) approach and a thirteen-layer deep convolutional neural network (DCNN) variant. The obtained results confirm the efficiency of the proposed approach, indicating a significant advantage of DCNN over MLP.

Proceeding from the idea to imitate the work of the human brain in terms of fusion and information flow, Pławiak et al. (2019) create the “Deep genetic cascade ensembles of classifiers (DGCEC) based on the fusion of stratified 10-fold CV method, ensemble learning, deep learning, layered learning and supervised training. The applied model combines three machine learning techniques: evolutionary, ensemble and deep learning.” The solution the authors propose provides a fast and efficient approach to training, which increases the accuracy of creditworthiness assessment. In the DGCEC method, each first-layer classifier is trained to increase the recognition performance of accepted or rejected borrowers based on the pre-processed data on borrowers. In other layers, based on the pre-processed user data and the classifier response from the first and previous layers using deep learning techniques and selection of genetic characteristics, a knowledge extraction process takes place that leads to the final result.
The results show better performance of this approach compared to previously applied approaches in terms of the accuracy of creditworthiness assessment of borrowers in Australia. The highest accuracy of creditworthiness assessment in previously conducted studies is 91.97%, while the method proposed by the authors allows higher prediction accuracy, i.e. 97.39% (Pławiak, Abdar, & Acharya, 2019). Recent literature research shows that ensemble models have better performance than individual classifiers and that deep learning models give better results compared to statistical and traditional machine learning models (Dastile, Celik, & Potsane, 2020).

5. RESEARCH METHODOLOGY

Empirical research includes checking the performance of deep learning algorithms over known credit scoring datasets. In this paper, we use two datasets, the Australian Credit and the German Credit (UCI Machine Learning repository, Asuncion & Newman, 2010). Datasets include a different number of independent variables. Australian credit data consists of 307 cases of creditworthy candidates and 383 cases of candidates to whom credit should not be granted. The German dataset is somewhat more asymmetric, with many more creditworthy examples (700) than those that should not be granted credit (300). The Australian dataset has 14 attributes, while the German has 24 attributes. Both sets have two classes {approved, rejected} and are a good mix of different types of attributes: continuous and nominal. Variables can be grouped into several categories (Bequé & Lessmann, 2017): financial (assets, monthly income, etc.), socio-demographic (age, place of residence, etc.), others (possession of a credit card or a mobile phone).

K-fold cross-validation is used to assess the classification model, which works as follows (Dietterich, 1998):
1. The training dataset is divided into k randomly selected, non-overlapping data subsets of approximately equal size;
2. One subset is used to validate the model trained over the remaining data subsets;
3. This procedure is repeated k times so that each subset is used exactly once for model validation;
4. Performance is assessed for each partition and the average error on all k partitions is reported.

This is one of the most popular techniques for cross-validation and is good at assessing the predictive accuracy of the classification model. In our study, 5-fold cross-validation is used for both credit rating datasets. Also, in order to reduce the dimensionality of the predictor space, principal component analysis (PCA) is used. PCA linearly transforms predictors to remove redundant dimensions and prevent overfitting.

To assess the performance of classification models, several standard indicators are used, which are calculated from the numbers of correct and incorrect predictions: the number of borrowers correctly classified as {approved} (defaults) (True Positives, TP), the number of borrowers incorrectly classified as {approved} (defaults) (False Positives, FP), the number of borrowers correctly classified as {rejected} (non-defaults) (True Negatives, TN), and the number of borrowers incorrectly classified as {rejected} (non-defaults) (False Negatives, FN). The total number of examples is N = TP + FP + FN + TN. The false positive rate (FP) is defined as the share of misclassified loan approval cases. In contrast, the false negative rate (FN) is defined as the share of cases that are refused a loan although they qualify for one.

The set of indicators consists of: PCC (Percentage Correctly Classified), AUC (area under the curve), sensitivity, specificity, precision, G-mean, F-measure and the Matthews correlation coefficient (MCC). The selection of indicators is based on previous research (Oztekin, Al-Ebbini, Sevkli, & Delen, 2018; Kim et al., 2020).
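The four cross-validation steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the fixed random seed are hypothetical, the split is not stratified by class, and the sketch is not the procedure implemented in the Matlab Classification Learner used in this study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Step 1: split the indices 0..n_samples-1 into k shuffled,
    non-overlapping folds of approximately equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th shuffled index yields folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_validation_splits(n_samples, k, seed=0):
    """Steps 2 and 3: yield (train, validation) index lists so that
    each fold is used exactly once for validation."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 5-fold split of a dataset with 690 examples (the size of the Australian dataset).
splits = list(train_validation_splits(690, 5))
fold_sizes = [len(validation) for _, validation in splits]
```

Step 4 would then train a model on each `train` list, evaluate it on the matching `validation` list, and average the resulting error over the five partitions.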
Percentage Correctly Classified (PCC) is the share of cases that the classification model assigns correctly to the two categories {approved, rejected}. It is calculated as PCC = (TP+TN)/(TP+TN+FP+FN). PCC is the average percentage of correctly classified cases and is a measure of correct classification over sets unused for learning (the validation subset in 5-fold cross-validation). Sensitivity/Recall is the ratio of correctly classified cases in the class {approved} to the total number of examples in the class {approved} and is calculated as SEN = TP/(TP+FN). Specificity is the ratio of correctly classified cases in the class {rejected} to the total number of examples in the class {rejected} and is obtained as TN/(TN+FP). Sensitivity and specificity show the accuracy of class-level classifiers. Geometric-Mean (G-mean) is obtained as follows:

\[ G\text{-}mean = \sqrt{ \frac{TN}{FP + TN} \times \frac{TP}{TP + FN} } \tag{6} \]

F-measure is calculated as:

\[ F\text{-}measure = \frac{2 \times Sensitivity \times Precision}{Precision + Sensitivity} \tag{7} \]

where Precision = TP/(TP+FP). G-mean and F-measure are indicators suited to datasets with imbalance between classes.

AUC, or Area Under the Receiver Operating Characteristic Curve, is one of the indicators that illustrates the performance of the classification model. The larger this area, the better the model. AUC can be seen as the model’s ability to distinguish the positive from the negative class.

Matthews correlation coefficient (MCC) is another performance indicator. The MCC value ranges from -1 to 1. Perfect prediction has a value of 1, completely incorrect prediction is -1, while random prediction has a value of 0. The MCC generates a high score only if the model can correctly predict most approved credits and most rejected credits. It is considered one of the best indicators of accuracy in the evaluation of machine learning algorithms.
The formula for calculating the MCC is as follows (Matthews, 1975; Jurman et al., 2012):

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP) \cdot (TP+FN) \cdot (TN+FP) \cdot (TN+FN)}} \qquad (8)$$

The following classifiers are used as benchmarks for the ensemble models: logistic regression and support vector machines (SVM). The SVM models use different kernel variants (Linear, Quadratic, Cubic, Fine Gaussian, Medium Gaussian, Coarse Gaussian). The tested ensemble models use three learning algorithms (Boosted Trees (AdaBoost), Bagged Trees and RUSBoosted Trees) over decision trees (Random Forest techniques). The research is conducted in the Classification Learner application module of the Matlab 2019b software package. All calculations are performed in a Windows 10 environment (AMD Ryzen 7 3700U with 12GB RAM).

6. RESULTS AND DISCUSSION

Tables 1 and 2 show the performance of predictive models for the Australian and German datasets, respectively. The tables show the categorical prediction accuracy and the discriminant ability of the included classifiers. Performance is assessed as average accuracy on the test sets. For each model, AUC (Area Under the Receiver Operating Characteristic Curve), True Positive Classification Rate (TP), False Positive Classification Rate (FP), True Negative Classification Rate (TN), False Negative Classification Rate (FN), and PCC (Percentage Correctly Classified) are shown. These results are the average values over the 5 independent non-overlapping partitions of the dataset used in the 5-fold cross-validation. For both datasets, the approved case rate is higher than the rejection rate for all models tested, meaning that false negative results are less common than false positive results relative to the overall prediction (with the exception of the Fine Gaussian SVM).

It is debatable whether false negative predictions are more serious than false positive ones from the point of view of credit risk. If someone is predicted to be able to repay the loan and it turns out that they are not, that can lead to losses. Conversely, if someone is rejected although they are able to repay the loan, the lender loses the earnings on that loan.

In the Australian dataset, according to the AUC criterion, the best classifier is the Ensemble Bagged Trees model (Table 1) with a value of 0.92 (87.41%). The Fine Gaussian SVM model has the lowest false positive case rate (FP), while the Linear, Medium Gaussian and Coarse Gaussian SVM models have the lowest false negative case rate (FN). According to the criterion of false positive cases (FP), ensemble models show the best results. This means that these models prove to be less risky in terms of incorrectly granted credit. According to the PCC criterion, the Ensemble Bagged Trees model proves to be the best classifier (Table 1). Some SVM methods (Gaussian SVM) show better results than the other ensemble methods, but in general ensemble methods show better results than SVM methods. On average, according to the PCC indicator, ensemble models are better than the other tested models. Our results confirm those of Bequé and Lessmann (2017), but it should be noted that our research relies on the Matlab package while Bequé and Lessmann (2017) use the R programming environment. Our research on the Australian dataset shows that logistic regression is inferior to SVM and ensemble methods.
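The values reported in Tables 1 and 2 are averages over five non-overlapping partitions of each dataset. The sketch below shows one way such partitions can be formed and a per-fold score averaged. It is a plain random split written for illustration and is not the partitioning routine of the Classification Learner; `k_fold_indices`, `cross_validated_mean` and the `score_fold` callback are hypothetical names.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k disjoint validation folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validated_mean(n_samples, score_fold, k=5, seed=0):
    """Average a score over k train/validation splits.

    score_fold(training, validation) is a hypothetical callback that fits a
    model on the training indices and returns its score on the validation ones.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, validation in enumerate(folds):
        # Training data is everything outside the current validation fold.
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(score_fold(training, validation))
    return sum(scores) / k
```

With the 690 cases of the Australian dataset (the row totals of Table 1), each of the five folds holds 138 validation cases and the remaining 552 cases are used for training.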
Table 1 Summary of results of individual classifiers obtained in predicting the Australian credit scoring dataset

| Classification Learner | AUC | TP | FP | TN | FN | PCC |
|---|---|---|---|---|---|---|
| Logistic regression | | | | | | |
| Logistic regression | 0.90 (0.13,0.82) | 266 (87%) | 69 (18%) | 314 (82%) | 41 (13%) | 84.06% |
| Support Vector Machines (Box constraint level, Manual kernel scale) | | | | | | |
| Linear (1,-) | 0.91 (0.07,0.80) | 285 (93%) | 78 (20%) | 305 (80%) | 22 (7%) | 85.51% |
| Quadratic (1,-) | 0.90 (0.16,0.84) | 257 (84%) | 62 (16%) | 321 (84%) | 50 (16%) | 83.77% |
| Cubic (1,-) | 0.90 (0.18,0.84) | 253 (82%) | 60 (16%) | 323 (84%) | 54 (18%) | 83.48% |
| Fine Gaussian (1,0.94) | 0.89 (0.24,0.89) | 233 (76%) | 43 (11%) | 340 (89%) | 74 (24%) | 83.04% |
| Medium Gaussian (1,3.7) | 0.92 (0.07,0.80) | 285 (93%) | 77 (20%) | 306 (80%) | 22 (7%) | 85.65% |
| Coarse Gaussian (1,15) | 0.92 (0.07,0.80) | 285 (93%) | 77 (20%) | 306 (80%) | 22 (7%) | 85.65% |
| Ensemble (Method, Maximum number of splits, Number of learners, Learning rate) | | | | | | |
| Boosted Trees (AdaBoost, 20,30,0.1) | 0.92 (0.20,0.87) | 246 (80%) | 49 (13%) | 334 (87%) | 61 (20%) | 84.06% |
| Bagged Trees (Bag, 689,30,-) | 0.92 (0.12,0.87) | 269 (88%) | 49 (13%) | 334 (87%) | 38 (12%) | 87.39% |
| RUSBoosted Trees (RUSBoost, 20,30,0.1) | 0.91 (0.16,0.86) | 259 (84%) | 52 (14%) | 331 (86%) | 48 (16%) | 85.51% |

Source: Data processed by the author

In the German dataset, according to the AUC criterion, the best classifier is the Ensemble RUSBoosted Trees with a value of 0.76 (70.38%), and logistic regression, Linear SVM and Medium Gaussian SVM are close to it with 0.77 (about 65%) (Table 2). The RUSBoosted Trees model has the lowest rate of false positive cases (FP), while the Gaussian SVM models have the lowest rate of false negative cases (FN). According to the criterion of false positive cases (FP), ensemble models give the best results. According to the PCC criterion, the Medium Gaussian SVM model (Table 2) proves to be the best classifier, with an accuracy of 74.10%.
On average, according to the PCC indicator, ensemble models are equal to SVM models. In the German dataset, logistic regression yields results on a par with SVM and ensemble methods.

Table 2 Summary of results of individual classifiers obtained in predicting the German credit scoring dataset

| Classification Learner | AUC | TP | FP | TN | FN | PCC |
|---|---|---|---|---|---|---|
| Logistic regression | | | | | | |
| Logistic regression | 0.77 (0.54,0.86) | 602 (86.0%) | 163 (54.3%) | 137 (45.7%) | 98 (14.0%) | 73.90% |
| Support Vector Machines (Box constraint level, Manual kernel scale) | | | | | | |
| Linear (1,-) | 0.77 (0.60,0.88) | 616 (88.0%) | 179 (59.7%) | 121 (40.3%) | 84 (12.0%) | 73.70% |
| Quadratic (1,-) | 0.75 (0.57,0.85) | 592 (84.6%) | 171 (57.0%) | 108 (43.0%) | 129 (15.4%) | 70.00% |
| Cubic (1,-) | 0.70 (0.53,0.78) | 547 (78.1%) | 158 (52.7%) | 142 (47.3%) | 153 (21.9%) | 68.90% |
| Fine Gaussian (1,0.94) | 0.71 (0.99,1.0) | 699 (99.9%) | 296 (98.7%) | 4 (1.3%) | 1 (0.1%) | 70.30% |
| Medium Gaussian (1,3.7) | 0.77 (0.68,0.92) | 645 (92.1%) | 204 (68.0%) | 96 (32.0%) | 55 (7.9%) | 74.10% |
| Ensemble (Method, Maximum number of splits, Number of learners, Learning rate) | | | | | | |
| Boosted Trees (AdaBoost, 20,30,0.1) | 0.76 (0.61,0.87) | 608 (86.9%) | 182 (60.7%) | 118 (39.3%) | 92 (13.1%) | 72.60% |
| Bagged Trees (Bag, 689,30,-) | 0.75 (0.58,0.86) | 605 (86.4%) | 173 (57.7%) | 127 (42.3%) | 95 (13.6%) | 73.20% |
| RUSBoosted Trees (RUSBoost, 20,30,0.1) | 0.76 (0.25,0.65) | 458 (65.4%) | 74 (24.7%) | 226 (75.3%) | 242 (34.6%) | 68.40% |

Source: Data processed by the author

Tables 3 and 4 show the performance indicators of the different classifiers for the Australian and German datasets, respectively. The models with the best performance indicators are in bold. All classifiers in both tested datasets record a high value of sensitivity, with large differences in specificity observed in the German dataset. In both datasets, the Fine Gaussian SVM has the highest sensitivity value. In the Australian dataset, SVM algorithms have the highest specificity. In the German dataset, however, the best result belongs to an ensemble, the Ensemble RUSBoosted Trees.
The sensitivity of SVM methods is on average higher than that of ensemble methods in both datasets. However, the specificity of ensemble methods is on average higher than that of SVM in both datasets. In terms of G-mean and F-measure, which show the balance between sensitivity and specificity, ensemble methods on average show slightly better results than SVM for G-mean but lower results for F-measure. In terms of the MCC indicator, the most relevant one for assessing binary classification techniques in machine learning, the best results in both datasets are recorded with ensemble techniques. In the Australian dataset, the MCC of 74.60% for Ensemble Bagged Trees shows that the predicted class and the correct class are highly correlated. In the German dataset, however, the MCC of 37.44% for Ensemble RUSBoosted Trees shows that the predicted class and the correct class are not as highly correlated as in the Australian dataset. Taking all indicators into account, the analysis of the classifiers and their predictive abilities does not identify an individual classifier with high predictive power for all, or at least most, performance indicators.
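The comparison of average MCC between the two families can be reproduced from the MCC columns of Tables 3 and 4. The short Python sketch below does so. Treating "on average" as the unweighted mean over the algorithms of a family is an assumption, and the names `MCC` and `family_means` are illustrative.

```python
from statistics import mean

# MCC values (%) copied from Tables 3 and 4, grouped by classifier family.
MCC = {
    "Australian": {
        "SVM": [72.13, 67.31, 66.63, 65.60, 72.37, 72.37],
        "Ensemble": [67.64, 74.60, 70.70],
    },
    "German": {
        "SVM": [32.16, 21.96, 25.60, 7.73, 30.90],
        "Ensemble": [29.47, 31.71, 37.44],
    },
}


def family_means(table):
    """Unweighted mean MCC per classifier family, rounded to two decimals."""
    return {family: round(mean(values), 2) for family, values in table.items()}


for dataset, table in MCC.items():
    print(dataset, family_means(table))
```

Under this assumption the ensemble mean exceeds the SVM mean in both datasets: about 70.98% against 69.40% for the Australian data and about 32.87% against 23.67% for the German data.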
Table 3 Performance comparison of individual classifiers for the Australian credit scoring dataset

| Method | Sensitivity/Recall | Specificity | G-mean | F-measure | MCC |
|---|---|---|---|---|---|
| Logistic regression | 81.98% | 86.64% | 84.28% | 83.01% | 68.24% |
| Linear SVM | 79.63% | 92.83% | 85.98% | 82.47% | 72.13% |
| Quadratic SVM | 83.81% | 83.71% | 83.76% | 83.79% | 67.31% |
| Cubic SVM | 84.33% | 82.41% | 83.37% | 83.90% | 66.63% |
| Fine Gaussian SVM | 88.77% | 75.90% | 82.08% | 85.81% | 65.60% |
| Medium Gaussian SVM | 79.90% | 92.83% | 86.12% | 82.67% | 72.37% |
| Coarse Gaussian SVM | 79.90% | 92.83% | 86.12% | 82.67% | 72.37% |
| Ensemble Boosted Trees | 87.21% | 80.13% | 83.59% | 85.60% | 67.64% |
| Ensemble Bagged Trees | 87.21% | 87.62% | 87.41% | 87.30% | 74.60% |
| Ensemble RUSBoosted Trees | 86.42% | 84.36% | 85.39% | 85.96% | 70.70% |

Source: Data processed by the author

Table 4 Performance comparison of individual classifiers for the German credit scoring dataset

| Method | Sensitivity/Recall | Specificity | G-mean | F-measure | MCC |
|---|---|---|---|---|---|
| Logistic regression | 86.00% | 45.67% | 62.67% | 79.49% | 34.23% |
| Linear SVM | 88.00% | 40.33% | 59.58% | 80.22% | 32.16% |
| Quadratic SVM | 82.11% | 38.71% | 56.38% | 75.57% | 21.96% |
| Cubic SVM | 78.14% | 47.33% | 60.82% | 73.23% | 25.60% |
| Fine Gaussian SVM | 99.86% | 1.33% | 11.54% | 82.51% | 7.73% |
| Medium Gaussian SVM | 92.14% | 32.00% | 54.30% | 82.14% | 30.90% |
| Ensemble Boosted Trees | 86.86% | 39.33% | 58.45% | 79.09% | 29.47% |
| Ensemble Bagged Trees | 86.43% | 42.33% | 60.49% | 79.27% | 31.71% |
| Ensemble RUSBoosted Trees | 65.43% | 75.33% | 70.21% | 66.88% | 37.44% |

Source: Data processed by the author

7. CONCLUSION

Credit scoring is a widely used technique that helps banks decide whether to grant loans to applicants. Beyond standard statistical decision-making techniques, such as logistic regression or decision trees, credit scoring is a very interesting task for machine learning and artificial intelligence methods. In recent years, machine learning technologies have been developing rapidly and ensemble learning is being studied more and more. Several papers have shown the advantages of deep learning over traditional credit scoring methods. In this paper, we investigated the predictive capabilities of ensemble algorithms over decision trees in credit scoring and compared them with traditional methods: logistic regression and support vector machines (SVM). Ensemble methods are promising classification and prediction techniques and represent an alternative to classical artificial neural networks (ANN). A large number of studies have shown that in credit scoring classification problems, ensemble techniques are better than SVM techniques, as well as traditional techniques such as logistic regression and decision trees. According to each comparison criterion, ensemble methods show better or at least competitive results compared with the other tested predictive techniques.

The performance of different classification models is compared on the Australian and German credit scoring data. The performance of logistic regression (LR), support vector machines (SVM), and deep learning based on decision tree ensembles is analyzed. Several different performance indicators are used. According to the MCC indicator, which is considered to be the most adequate for classification problems, deep learning methods on average prove to be the best. Individually, ensembles with bagged trees work best in the Australian dataset, and ensembles with RUSBoosted trees in the German dataset. Logistic regression performance is relatively poor compared to the ensemble methods and SVM. Logistic regression, as the best of the traditional methods, fails to match the tools of machine learning. The results of this paper confirm previous research on the advantages of machine learning methods over traditional models. They also confirm that the evidence on the advantage of deep learning over SVM methods is divided. According to certain criteria, SVM shows better characteristics compared to ensembles.
Nevertheless, ensemble methods are promising tools and provide potential for future research.

REFERENCES

Abdou, H., & Pointon, J. (2011). Credit Scoring, Statistical Techniques and Evaluation Criteria: A Review of the Literature. Intelligent Systems in Accounting Finance & Management, 18, 59-88.
Altman, E. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589-609.
Asuncion, A., & Newman, D. J. (2010). UCI machine learning repository. School of information and computer science, Retrieved from: http://archive.ics.uci.edu/ml/
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking State-of-the-Art Classification Algorithms for Credit Scoring. The Journal of the Operational Research Society, 54(6), 627-635.
Bequé, A., & Lessmann, S. (2017). Extreme learning machines for credit scoring: An empirical evaluation. Expert Systems With Applications, 86, 42-53.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Dastile, X., Celik, T., & Potsane, M. (2020). Statistical and machine learning models in credit scoring: A systematic literature survey. Applied Soft Computing Journal, 91, 1-21. https://doi.org/10.1016/j.asoc.2020.106263
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning. Neural Computation, 10(7), 1895-1923.
Durand, D. (1941). Risk Elements in Consumer Instalment Financing. New York: National Bureau of Economic Research.
Einav, L., Jenkins, M., & Levin, J. (2013). The impact of credit scoring on consumer lending. The RAND Journal of Economics, 44(2), 249-274.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
Goh, R., & Lee, L. (2019). Credit Scoring: A Review on Support Vector Machines and Metaheuristic Approaches. Advances in Operations Research, 1-30. https://doi.org/10.1155/2019/1974794
Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. Journal of the Royal Statistical Society, 160(3), 523-541.
Hui, L., Li, S., & Zongfang, Z. (2017). The Model and Empirical Research of Application Scoring Based on Data Mining Methods. Procedia Computer Science, 17, 911-918.
Jurman, G. R. S. (2012). A comparison of MCC and CEN error measures in multi-class prediction. PLoS ONE, 7(8), 41882.
Kim, A. Y.-C. (2020). Can deep learning predict risky retail investors? A case study in financial risk behavior forecasting. European Journal of Operational Research, 283(1), 217-234.
Lewis, E. (1992). An Introduction to Credit Scoring. San Rafael: Fair, Isaac and Co., Inc.
Li, Y., Lin, X., Wang, X., Shen, F., & Gong, Z. (2017). Credit Risk Assessment Algorithm Using Deep Neural Networks with Clustering and Merging. 13th International Conference on Computational Intelligence and Security (CIS) (pp. 173-176). Hong Kong: IEEE.
Lou, C., Wu, D., & Wu, D. (2017). A deep learning approach for credit scoring using credit default swaps. Engineering Applications of Artificial Intelligence, 65, 465-470.
Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2), 442-451.
Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 36, 3028-3033.
Neagoe, V., Ciotec, A., & Cucu, G. (2018). Deep Convolutional Neural Networks Versus Multilayer Perceptron for Financial Prediction. 2018 International Conference on Communications (COMM) (pp. 201-206). Bucharest: IEEE.
Odom, M., & Sharda, R. (1990). A neural network model for bankruptcy prediction. 1990 IJCNN International Joint Conference on Neural Networks, 2, pp. 163-168.
Ohlson, J. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18(1), 109-131.
Orgler, Y. (1971). Evaluation of bank consumer loans with credit scoring models. Journal of Bank Research, 2(1), 31-37.
Oztekin, A., Al-Ebbini, L., Sevkli, Z., & Delen, D. (2018). A decision analytic approach to predicting quality of life for lung transplant recipients: A hybrid genetic algorithms-based methodology. European Journal of Operational Research, 266(2), 639-665.
Patil, P. S., Aghav, J. V., & Sareen, V. (2016). An Overview of Classification Algorithms and Ensemble Methods in Personal Credit Scoring. International Journal of Computer Science and Technology, 7(2), 183-188.
Pławiak, P., Abdar, U., & Acharya, R. (2019). Application of new deep genetic cascade ensemble of SVM classifiers to predict the Australian credit scoring. Applied Soft Computing Journal, 84, 105740.
Ren, Y., Zhang, L., & Suganthan, P. (2016). Ensemble Classification and Regression-Recent Developments, Applications and Future Directions. IEEE Computational Intelligence Magazine, 11(1), 41-53.
Sadatrasoul, S. M., Gholamian, M. R., Siami, M., & Hajimohammadi, Z. (2013). Credit scoring in banks and financial institutions via data mining techniques: a literature review. Journal of AI and Data Mining, 1(2), 119-129.
Thomas, L., Crook, J., & Edelman, D. (2002). Credit Scoring and Its Applications, Second Edition. Philadelphia: Society for Industrial and Applied Mathematics.
Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley-Interscience.
Wang, G., Hao, J., Ma, J., & Jiang, H. (2011). A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications, 38(1), 223-230.
Zhou, H., Lan, Y., Soh, Y., Huang, G., & Zhang, R. (2012). Credit risk evaluation with extreme learning machine. IEEE International Conference on Systems, Man, and Cybernetics (SMC), (pp. 1064-1069). Seoul.

CREDIT SCORING WITH ENSEMBLE DEEP LEARNING CLASSIFICATION METHODS: COMPARISON WITH TRADITIONAL METHODS

Credit scoring attracts special attention from financial institutions. In recent years, methods based on deep learning have been of particular interest. In this paper, we compare the performance of ensemble deep learning methods based on decision trees with the best classical method, logistic regression, and with the established benchmark for machine learning methods, support vector machines. Several different algorithms are tested for each method. Different performance indicators are also used. The research is carried out on the standard datasets for this type of classification, the Australian and German datasets. According to the MCC indicator, the ensemble method with boosted decision trees proves to be the best method. Also, on average, ensemble methods prove to be more successful than SVM.

Key words: credit scoring, classifier ensembles, deep learning, support vector machines