INT J COMPUT COMMUN, ISSN 1841-9836, Vol. 7 (2012), No. 4 (November), pp. 733-743

Structural Regular Multiple Criteria Linear Programming for Classification Problem

Z. Qi, Y. Shi

Zhiquan Qi
Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190, China
E-mail: qizhiquan@gucas.ac.cn

Yong Shi
1. Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190, China
2. College of Information Science & Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA
E-mail: yshi@gucas.ac.cn

Abstract: The classification problem has attracted an increasing amount of interest, and various classifiers have been proposed in the last decade, such as ANNs, LDA, and SVM. Regular Multiple Criteria Linear Programming (RMCLP) is an effective classification method, which was proposed by Shi and his colleagues and has been applied to many real-life data mining problems. In this paper, inspired by the application potential of RMCLP, we propose a novel Structural RMCLP (SRMCLP) method for the classification problem. Unlike RMCLP, SRMCLP is sensitive to the structure of the data distribution and can construct more reasonable classifiers by exploiting the prior distribution information within classes. The corresponding optimization problem of SRMCLP can be solved by standard quadratic programming. The effectiveness of the proposed method is demonstrated via experiments on synthetic and publicly available benchmark datasets.

Keywords: classification, RMCLP, structural information of data, SVM

1 Introduction

Over the last decade, researchers have developed a wide range of optimization techniques for the classification problem in data mining and machine learning. The Support Vector Machine (SVM) [1,2] is one of the most popular methods. In fact, applying optimization techniques to classification has a history of over seventy years: Linear Discriminant Analysis (LDA) [3] was first proposed in 1936, and Mangasarian [4] proposed a large margin classifier based on linear programming in the 1960s. From the 1980s to the 1990s, Glover proposed a number of linear programming models for discriminant problems with small sample sizes [5,6]. More recently, Shi and his colleagues [7] extended Glover's method to classification via multiple criteria linear programming (MCLP), and a series of improved algorithms followed [8,9]. These mathematical programming approaches to classification have been applied to many real-world data mining problems, such as credit card portfolio management [11,12], bioinformatics [13], and firm bankruptcy prediction [14].

Recently, exploiting the structural information of data to build better classifiers has become a new research focus, and many new large margin classifiers based on structural information have been proposed. One popular strategy is to use clustering algorithms to extract the structural information embedded within classes [15-17]. The structured large margin machine (SLMM) [15] is a representative work based on this strategy. SLMM first explores the structural information within classes by applying Ward's agglomerative hierarchical clustering method to the input data [18], and then introduces the resulting structure information into the constraints. The model can be solved by a sequential second order cone programming (SOCP).

Experimentally, SLMM is superior to the support vector machine, the minimax probability machine (MPM) [19], and the maxi-min margin machine (M4) [20]. However, solving the involved SOCP problem is harder than solving the quadratic programming problem (QPP) arising in SVM, so SLMM has a higher computational complexity than traditional SVM. Consequently, a structural regularized support vector machine (SRSVM) was proposed by Xue et al. [17]. Unlike SLMM, SRSVM embeds the structural information into the regularization term within the classical framework of SVM, rather than into the constraints, and the corresponding optimization problem can still be solved as a QPP. SRSVM has been shown to be theoretically and empirically better in generalization than SVM and SLMM.

In this paper, inspired by the success of SRSVM and the application potential of RMCLP, we propose a novel Structural RMCLP (SRMCLP) method for the classification problem. Unlike RMCLP, SRMCLP is sensitive to the structure of the data distribution and can construct more reasonable classifiers by exploiting the prior distribution information within classes.

The remaining parts of the paper are organized as follows. Section 2 introduces the basic notions and the formulation of MCLP; Section 3 describes the proposed algorithm in detail; all experimental results are reported in Section 4; conclusions are given in the last section.

2 Background

We give a brief introduction to MCLP in the following. Given the training data

T = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\Re^n \times \mathcal{Y})^l, \qquad (1)

where x_i \in \Re^n, y_i \in \mathcal{Y} = \{1, -1\}, i = 1, \ldots, l, data separation can be achieved by two opposite objectives. The first objective separates the observations by minimizing the sum of the deviations (MSD) among the observations; the second maximizes the minimum distance (MMD) of the observations from the critical value [6]. The overlapping u should be minimized while the distance v should be maximized, but it is difficult for traditional linear programming to optimize MMD and MSD simultaneously. According to the concept of Pareto optimality, we can seek the best trade-off between the two measurements [11,12]. The MCLP model can therefore be described as

\min_u \; e^T u \quad \text{and} \quad \max_v \; e^T v, \qquad (2)
\text{s.t.} \quad (w \cdot x_i) + (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = 1\}, \qquad (3)
\qquad\; (w \cdot x_i) - (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = -1\}, \qquad (4)
\qquad\; u, v \geq 0, \qquad (5)

where e \in R^l is a vector of ones, w and b are unrestricted, u_i is the overlapping, and v_i is the distance from the training sample x_i to the discriminator (w \cdot x_i) = b (the separating hyperplane). By introducing penalty parameters c, d > 0, MCLP has the following version:

\min_{u,v} \; c e^T u - d e^T v, \qquad (6)
\text{s.t.} \quad (w \cdot x_i) + (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = 1\}, \qquad (7)
\qquad\; (w \cdot x_i) - (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = -1\}, \qquad (8)
\qquad\; u, v \geq 0. \qquad (9)
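Since (6)-(9) is a plain linear program, it can be handed to any LP solver. Below is a minimal sketch using SciPy; the variable stacking z = (w, u, v, b), the synthetic data, and the explicit upper bound on v are assumptions of this sketch, not part of the paper. The bound on v is needed because the raw LP can be unbounded, which foreshadows the existence issue addressed by the regularized model below.

```python
import numpy as np
from scipy.optimize import linprog

def mclp_train(X, y, c=1.0, d=0.5, v_cap=10.0):
    """Solve the MCLP model (6)-(9) as a linear program.

    Variables are stacked as z = (w, u, v, b).  The box 0 <= v <= v_cap
    is an extra assumption of this sketch: without it the LP can be
    unbounded, the existence issue that RMCLP fixes below.
    """
    l, n = X.shape
    # Objective (6): c e^T u - d e^T v  (w and b carry zero cost).
    cost = np.concatenate([np.zeros(n), c * np.ones(l), -d * np.ones(l), [0.0]])
    # Constraints (7)-(8), unified as y_i((w . x_i) - b) + u_i - v_i = 0.
    A_eq = np.hstack([y[:, None] * X, np.eye(l), -np.eye(l), -y[:, None]])
    b_eq = np.zeros(l)
    # (9): u, v >= 0; w and b are unrestricted.
    bounds = ([(None, None)] * n + [(0, None)] * l
              + [(0, v_cap)] * l + [(None, None)])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    z = res.x
    return z[:n], z[-1]  # w, b

# Illustrative usage on synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
w, b = mclp_train(X, y)
```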
The geometric meaning of the model is shown in Figure 1.

Figure 1: Geometric meaning of MCLP.

A lot of empirical studies have shown that MCLP is a powerful tool for classification. However, this model is not guaranteed to have a solution for every kind of training sample. To ensure the existence of a solution, Shi et al. recently proposed the RMCLP model by adding two regularization terms, \frac{1}{2}w^T H w and \frac{1}{2}u^T Q u, to MCLP (more theoretical explanation of this model can be found in [8]):

\min_z \; \frac{1}{2}w^T H w + \frac{1}{2}u^T Q u + d e^T u - c e^T v, \qquad (10)
\text{s.t.} \quad (w \cdot x_i) + (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = 1\}, \qquad (11)
\qquad\; (w \cdot x_i) - (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = -1\}, \qquad (12)
\qquad\; u, v \geq 0, \qquad (13)

where z = (w^T, u^T, v^T, b)^T \in R^{n+l+l+1}, and H \in R^{n \times n}, Q \in R^{l \times l} are symmetric positive definite matrices. Obviously, the regularized MCLP is a convex quadratic program.
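A minimal sketch of this QP with the CVXOPT solver follows, assuming the same variable stacking z = (w, u, v, b) as above and the defaults H = I_n, Q = I_l; both choices, as well as the tiny ridge on b added for numerical stability, are assumptions of the sketch rather than parts of the paper.

```python
import numpy as np
from cvxopt import matrix, solvers

def rmclp_train(X, y, H=None, Q=None, c=1.0, d=2.0):
    """Solve RMCLP (10)-(13) as a convex QP with CVXOPT."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    l, n = X.shape
    H = np.eye(n) if H is None else H
    Q = np.eye(l) if Q is None else Q
    m = n + 2 * l + 1
    # Quadratic term: (1/2) w^T H w + (1/2) u^T Q u; v and b enter linearly.
    P = np.zeros((m, m))
    P[:n, :n] = H
    P[n:n + l, n:n + l] = Q
    P[-1, -1] = 1e-6  # tiny ridge on b for numerical stability; SRMCLP
                      # below adds a full (1/2) b^2 term to its objective
    # Linear term: d e^T u - c e^T v.
    q = np.concatenate([np.zeros(n), d * np.ones(l), -c * np.ones(l), [0.0]])
    # Equalities (11)-(12), rewritten as y_i((w . x_i) - b) + u_i - v_i = 0.
    A = np.hstack([y[:, None] * X, np.eye(l), -np.eye(l), -y[:, None]])
    # Inequalities (13): -u <= 0 and -v <= 0.
    G = np.zeros((2 * l, m))
    G[:l, n:n + l] = -np.eye(l)
    G[l:, n + l:n + 2 * l] = -np.eye(l)
    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(np.zeros(2 * l)),
                     matrix(A), matrix(np.zeros(l)))
    z = np.array(sol["x"]).ravel()
    return z[:n], z[-1]  # w, b
```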
Compared with traditional SVM, the RMCLP model is similar in form to the support vector machine in that it minimizes the overlapping of the data. However, RMCLP tries to measure all possible distances v from the training samples x_i to the separating hyperplane, while SVM fixes the distance of the support vectors to 1 (through the bounding planes (w \cdot x) = b \pm 1). Although the interpretation can vary, RMCLP has more control parameters than SVM, which may provide more flexibility for better separation of data within the mathematical programming framework. In addition, unlike SVM, RMCLP uses all the samples to solve the classification problem, which makes RMCLP more robust to outliers.

3 Structural Regular Multiple Criteria Linear Programming for Classification Problem

3.1 Extracting Structural Information within Classes

Following the strategy of SLMM and SRSVM, SRMCLP also has two steps: the first step extracts structural information within classes by a clustering method, and the second step is model learning. In order to isolate the differences between SRMCLP and the other two methods in the second step, we adopt the same clustering method, Ward's linkage clustering (WIL) [15-18], a hierarchical clustering method. A main advantage of WIL is that the clusters it produces are compact and spherical, which provides a meaningful basis for the computation of covariance matrices [15]. Concretely, if S and T are two clusters with means \mu_S and \mu_T, the Ward's linkage W(S,T) between them can be computed as [15]

W(S,T) = \frac{|S| \cdot |T| \cdot \|\mu_S - \mu_T\|^2}{|S| + |T|}. \qquad (14)

Initially, each sample is a cluster of its own; the Ward's linkage of two samples x_i and x_j is W(x_i, x_j) = \|x_i - x_j\|^2 / 2. When two clusters A and B are merged into a new cluster A', the linkage W(A',C) can be conveniently derived from W(A,C), W(B,C) and W(A,B) by [15]

W(A',C) = \frac{(|A|+|C|)W(A,C) + (|B|+|C|)W(B,C) - |C|W(A,B)}{|A|+|B|+|C|}. \qquad (15)

During the hierarchical clustering, the Ward's linkage between the clusters to be merged increases as the number of clusters decreases [15]. A curve relating the merge distance to the number of clusters can be drawn to represent this process, and the optimal number of clusters can be determined by finding the knee point of this curve. Furthermore, WIL can also be extended to the kernel space. More details of WIL can be found in [15].
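This first step can be reproduced with standard hierarchical-clustering tooling. The sketch below uses SciPy's implementation of Ward's method to cluster one class and then computes the covariance matrix of each resulting cluster; fixing the cluster count in advance stands in for the knee-point selection described above and is an assumption of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def class_cluster_covariances(X_class, n_clusters=2):
    """Cluster one class with Ward's linkage and return the per-cluster
    covariance matrices (the Sigma_{P_i} / Sigma_{N_j} of Section 3.2).

    n_clusters is fixed here for simplicity; the paper instead selects it
    from the knee point of the merge-distance curve.
    """
    Z = linkage(X_class, method="ward")  # hierarchical merge tree
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    covs = []
    for k in range(1, n_clusters + 1):
        cluster = X_class[labels == k]
        # rowvar=False: rows are samples; bias=True matches the 1/|P_i|
        # normalization used in Eq. (36) below.
        covs.append(np.cov(cluster, rowvar=False, bias=True))
    return labels, covs
```

The structural regularizer of Section 3.2 is then obtained by summing these per-cluster covariance matrices over both classes.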
3.2 Model Learning

The first step yields C_P clusters in class P and C_N clusters in class N, i.e., P = P_1 \cup \cdots \cup P_i \cup \cdots \cup P_{C_P} and N = N_1 \cup \cdots \cup N_j \cup \cdots \cup N_{C_N}. Starting from optimization problem (10), choosing H and Q to be identity matrices, and introducing \frac{1}{2}b^2 and \frac{1}{2}c_1 w^T \Sigma w into the objective function, SRMCLP can be formulated as

\min_z \; \frac{1}{2}\|w\|^2 + \frac{1}{2}c_1 w^T \Sigma w + \frac{1}{2}\|u\|^2 + \frac{1}{2}b^2 + c_2 e^T u - c_3 e^T v,
\text{s.t.} \quad (w \cdot x_i) + (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = 1\},
\qquad\; (w \cdot x_i) - (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = -1\},
\qquad\; u, v \geq 0, \qquad (16)

where z = (w^T, u^T, v^T, b)^T \in R^{n+l+l+1}, c_1, c_2, c_3 \geq 0 are pre-specified penalty factors, and \Sigma = \Sigma_+ + \Sigma_-, with \Sigma_+ = \Sigma_{P_1} + \cdots + \Sigma_{P_{C_P}} and \Sigma_- = \Sigma_{N_1} + \cdots + \Sigma_{N_{C_N}}, where \Sigma_{P_i} and \Sigma_{N_j} are the covariance matrices of the i-th and j-th clusters of the two classes, i = 1, \ldots, C_P, j = 1, \ldots, C_N. Obviously, SRMCLP is a convex quadratic program. Introducing its Lagrange function

L(w,u,v,b,\alpha,\beta,\eta) = \frac{1}{2}\|w\|^2 + \frac{1}{2}c_1 w^T \Sigma w + \frac{1}{2}\|u\|^2 + \frac{1}{2}b^2 + c_2 e^T u - c_3 e^T v \qquad (17)
\qquad + \sum_{i=1}^{l} \alpha_i \bigl(y_i((w \cdot x_i) - b) + u_i - v_i\bigr) - \sum_{i=1}^{l} \beta_i u_i - \sum_{i=1}^{l} \eta_i v_i, \qquad (18)

where \alpha_i, \beta_i, \eta_i \in R are the Lagrange multipliers, the dual problem of (16) can be formulated as

\max_{w,u,v,b,\alpha,\beta,\eta} \; L(w,u,v,b,\alpha,\beta,\eta),
\text{s.t.} \quad \nabla_{w,u,v,b} L(w,u,v,b,\alpha,\beta,\eta) = 0,
\qquad\; \beta_i, \eta_i \geq 0, \quad i = 1, \ldots, l. \qquad (19)

From (19) we obtain

\nabla_w L = (I + c_1 \Sigma) w + \sum_{i=1}^{l} y_i \alpha_i x_i = 0, \qquad (20)
\nabla_{u_i} L = u_i + c_2 + \alpha_i - \beta_i = 0, \quad i = 1, \ldots, l, \qquad (21)
\nabla_{v_i} L = -c_3 - \alpha_i - \eta_i = 0, \quad i = 1, \ldots, l, \qquad (22)
\nabla_b L = b - \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad (23)

where I is the identity matrix. Substituting these equations into problem (19), the dual problem can be expressed as

\max_{\alpha,u} \; -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j \bigl[x_i^T (2I + c_1\Sigma) x_j\bigr] - \frac{1}{2}\sum_{i=1}^{l} u_i^2, \qquad (24)
\text{s.t.} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad (25)
\qquad\; -c_2 - u_i \leq \alpha_i \leq -c_3, \quad i = 1, \ldots, l. \qquad (26)

Solving this convex quadratic programming problem, we obtain the solution

w = -(I + c_1 \Sigma)^{-1} \sum_{i=1}^{l} y_i \alpha_i x_i, \qquad (27)
b = \sum_{i=1}^{l} y_i \alpha_i. \qquad (28)

The decision function can then be formulated as

f(x) = -\sum_{i=1}^{l} y_i \alpha_i x_i^T (I + c_1 \Sigma)^{-1} x + b. \qquad (30)
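Given multipliers \alpha from a QP solver applied to the dual (24)-(26), equations (27), (28), and (30) translate directly into a few lines of linear algebra. A minimal sketch, assuming \alpha, the data, and the structural matrix \Sigma are already available:

```python
import numpy as np

def srmclp_decision(alpha, X, y, Sigma, c1):
    """Build the linear SRMCLP decision function from dual multipliers.

    Implements Eqs. (27), (28), (30):
    w = -(I + c1*Sigma)^{-1} sum_i y_i alpha_i x_i, b = sum_i y_i alpha_i,
    f(x) = (w . x) + b.  alpha is assumed to come from a QP solver
    applied to the dual (24)-(26).
    """
    n = X.shape[1]
    s = (y * alpha) @ X  # sum_i y_i alpha_i x_i
    w = -np.linalg.solve(np.eye(n) + c1 * Sigma, s)  # Eq. (27)
    b = float(y @ alpha)                             # Eq. (28)

    def f(x):
        return x @ w + b                             # Eq. (30)

    return f, w, b
```

The sign of f(x) then gives the predicted class; the kernel version below replaces the inner products with kernel evaluations.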
Applying the kernel trick, we can also extend the linear SRMCLP to the nonlinear case. Introduce the kernel function K(x, x') = (\Phi(x) \cdot \Phi(x')), where \Phi(\cdot) is a mapping from the input space R^n to a Hilbert space \mathcal{H}:

\Phi: R^n \to \mathcal{H}, \quad x \mapsto \Phi(x). \qquad (31)

Then the optimization problem of SRMCLP in the kernel space can be described as

\max_{\alpha,u} \; -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j \bigl[\Phi(x_i)^T (2I + c_1 \Sigma^\Phi) \Phi(x_j)\bigr] - \frac{1}{2}\sum_{i=1}^{l} u_i^2, \qquad (32)
\text{s.t.} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad (33)
\qquad\; -c_2 - u_i \leq \alpha_i \leq -c_3, \quad i = 1, \ldots, l. \qquad (34)

Since 2\Phi(x_i)^T I \Phi(x_j) = 2K(x_i, x_j), we only need to consider how to compute the kernel matrix c_1 \Phi(x_i)^T \Sigma^\Phi \Phi(x_j). Suppose T_{P_i} \in \Re^{|P_i| \times n} is the data matrix of cluster P_i, whose k-th row is x_k^T, and O_{P_i} \in \Re^{|P_i| \times n} is the mean matrix of cluster P_i, all of whose rows are equal to the cluster mean

\mu_{P_i} = \frac{1}{|P_i|} \sum_{x_k \in P_i} x_k. \qquad (35)

The covariance matrix of cluster P_i in the kernel space can then be expressed as

\Sigma^\Phi_{P_i} = \frac{1}{|P_i|} (\Phi(T_{P_i}) - \Phi(O_{P_i}))^T (\Phi(T_{P_i}) - \Phi(O_{P_i})). \qquad (36)

So we obtain

\Phi(x_i)^T \Sigma^\Phi_+ \Phi(x_j) = \Bigl(\frac{1}{\sqrt{|P_i|}}(\Phi(T_{P_i}) - \Phi(O_{P_i}))\Phi(x_i)\Bigr)^T \Bigl(\frac{1}{\sqrt{|P_i|}}(\Phi(T_{P_i}) - \Phi(O_{P_i}))\Phi(x_j)\Bigr)
= \Bigl(\frac{1}{\sqrt{|P_i|}}(K(T_{P_i}, x_i) - K(O_{P_i}, x_i))\Bigr)^T \Bigl(\frac{1}{\sqrt{|P_i|}}(K(T_{P_i}, x_j) - K(O_{P_i}, x_j))\Bigr). \qquad (37)

Similarly, \Phi(x_i)^T \Sigma^\Phi_- \Phi(x_j) can be computed as

\Phi(x_i)^T \Sigma^\Phi_- \Phi(x_j) = \Bigl(\frac{1}{\sqrt{|N_i|}}(K(T_{N_i}, x_i) - K(O_{N_i}, x_i))\Bigr)^T \Bigl(\frac{1}{\sqrt{|N_i|}}(K(T_{N_i}, x_j) - K(O_{N_i}, x_j))\Bigr), \qquad (38)

where T_{N_i} is the data matrix of cluster N_i and O_{N_i} is the mean matrix of cluster N_i.

SRMCLP has a structure similar to that of RMCLP; in fact, RMCLP is a special case of SRMCLP. Suppose the covariance matrix of each cluster is \Sigma_{P_i} = \Sigma_{N_j} = I, i = 1, \ldots, C_P, j = 1, \ldots, C_N. Taking the linear SRMCLP as an example, the primal optimization problem (16) then becomes

\min_z \; \frac{1}{2}\|w\|^2 + c_1 \frac{C_P + C_N}{2}\|w\|^2 + \frac{1}{2}\|u\|^2 + \frac{1}{2}b^2 + c_2 e^T u - c_3 e^T v,
\text{s.t.} \quad (w \cdot x_i) + (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = 1\},
\qquad\; (w \cdot x_i) - (u_i - v_i) = b, \quad \text{for } \{i \mid y_i = -1\},
\qquad\; u, v \geq 0. \qquad (39)

It is not difficult to see that problem (39) is equivalent to the primal problem (10) of RMCLP with suitable choices of H and Q.

4 Experiments

In this section we compare SRMCLP against RMCLP and SRSVM [16,17] on various datasets.

The testing accuracies of all experiments are computed using standard 10-fold cross validation. The parameters c_1, c_2, c_3 and the RBF kernel parameter \sigma are all selected from the set \{2^i \mid i = -7, \ldots, 7\} by 10-fold cross validation on a tuning set comprising a random 10% of the training data. Once the parameters are selected, the tuning set is returned to the training set to learn the final decision function. The "quadprog" function is used to solve the QP problems in SRMCLP, RMCLP and SRSVM, and the "1 vs r" method [2] is used for multi-class classification. All algorithms are implemented in MATLAB 2010 and run on a machine with an Intel Core i7-2600 CPU and 4 GB of memory.
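This parameter-selection protocol is straightforward to reproduce. The sketch below searches c_1, c_2, c_3 over \{2^i \mid i = -7, \ldots, 7\} on a random 10% tuning split; it is a simplified hold-out variant of the paper's protocol, and train_srmclp and accuracy are hypothetical stand-ins for a trainer built from the QP of Section 3.2 and a scoring routine (the paper's MATLAB code solves the inner QPs with quadprog).

```python
import itertools
import numpy as np

GRID = [2.0 ** i for i in range(-7, 8)]  # {2^i | i = -7, ..., 7}

def select_parameters(X, y, train_srmclp, accuracy, rng=None):
    """Grid-search (c1, c2, c3) on a random 10% tuning split.

    train_srmclp(X, y, c1, c2, c3) -> model and
    accuracy(model, X, y) -> float are hypothetical callables.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.permutation(len(y))
    n_tune = max(1, len(y) // 10)
    tune, rest = idx[:n_tune], idx[n_tune:]
    best, best_acc = None, -np.inf
    for c1, c2, c3 in itertools.product(GRID, GRID, GRID):
        model = train_srmclp(X[rest], y[rest], c1, c2, c3)
        acc = accuracy(model, X[tune], y[tune])
        if acc > best_acc:
            best, best_acc = (c1, c2, c3), acc
    # The tuning set is then returned to the training set and the final
    # decision function is learned with the selected parameters.
    return train_srmclp(X, y, *best), best
```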
4.1 Toy data

In this subsection we use 2-D toy data to illustrate the behavior of SRMCLP. The 2-D toy data is the synthetic XOR dataset [17], a typical linearly nonseparable classification problem, generated from two Gaussian distributions in each class. Concretely, the samples in each class form two clusters, P_1, P_2 and N_1, N_2 (with an equal number of samples per cluster), and each Gaussian distribution contains 100 samples. We use 10%, 20%, 30%, and 50% of the data in each cluster as the training set and the rest for testing. The comparative results of SRMCLP and RMCLP are shown in Figure 2.

Figure 2: The performance of SRMCLP and RMCLP with the RBF kernel. The first and second columns show the results on the training and testing sets, respectively; the rows correspond to the 10%, 20%, 30%, and 50% training sets. The magenta dotted curve and the red solid curve denote the hyperplanes of SRMCLP and RMCLP, respectively.

In the XOR dataset, both the positive class and the negative class have a horizontal and a vertical distribution, and fully exploiting this prior knowledge is a difficult task. From Figure 2, we can see that SRMCLP's discriminant boundaries basically enclose those of RMCLP, which indicates that SRMCLP has better generalization performance than RMCLP. From Figure 3, we can also see that the accuracy gap between the two methods decreases as the number of training samples increases. These results show that SRMCLP can fully exploit the prior structural information to build a more reasonable classifier.

Figure 3: Accuracy of SRMCLP and RMCLP on the XOR dataset.

4.2 UCI datasets

In this subsection we evaluate the methods on UCI datasets [21]. For each dataset, we randomly select the same number of samples from the different classes; 50% of each extracted dataset is used for training and 50% for testing. The results are shown in Table 1.

Table 1: Testing accuracy on UCI datasets (%)

Dataset (size)             SRMCLP        SRSVM         RMCLP
Hepatitis (155x19)         79.91±2.55    79.83±1.27    77.82±4.22
Australian (690x14)        68.18±2.12    69.32±2.31    67.73±1.56
BUPA liver (345x6)         68.12±1.34    68.96±1.09    67.61±2.71
CMC (844x9)                65.71±2.54    65.27±2.35    64.53±3.62
Credit (690x19)            76.36±2.18    76.11±2.61    75.82±2.87
Diabetis (768x8)           63.98±1.90    64.19±2.43    62.44±3.47
Flare-Solar (1066x9)       59.11±2.98    58.43±2.77    57.96±2.51
German (1000x20)           63.91±1.93    63.84±1.88    62.55±2.86
Heart-Statlog (270x14)     76.13±2.50    76.04±2.47    76.21±2.83
Image (2310x18)            83.18±2.57    83.44±1.40    82.64±2.88
Ionosphere (351x34)        76.93±2.61    76.55±2.63    76.43±3.26

From Table 1, we can draw the following conclusions: 1) SRMCLP and SRSVM have better predictive ability than RMCLP in almost all cases, which shows that the prior structural information embedded within classes greatly helps to improve the classification performance of a classifier. 2) SRMCLP is superior to SRSVM in most cases, which shows that SRMCLP is a strongly competitive method.

5 Conclusion

In this paper, we proposed a novel Structural RMCLP (SRMCLP) method for the classification problem. Unlike RMCLP, SRMCLP is sensitive to the structure of the data distribution and can construct more reasonable classifiers by exploiting the prior distribution information within classes. The corresponding optimization problem of SRMCLP can be solved by standard quadratic programming. The effectiveness of the proposed method was demonstrated via experiments on synthetic and publicly available benchmark datasets, which also suggest its potential in decision support applications. In future work, we will apply SRMCLP to other real-world classification problems, such as stock forecasting and credit card analysis, to further test its effectiveness.

6 Acknowledgment

This work has been partially supported by grants from the National Natural Science Foundation of China (No. 70921061, No. 10601064), the CAS/SAFEA International Partnership Program for Creative Research Teams, the Major International (Regional) Joint Research Project (No. 71110107026), and the President Fund of GUCAS.
Bibliography

[1] Vapnik V.N. The Nature of Statistical Learning Theory, 2nd ed. New York: Springer, 2000.

[2] Deng N.Y., Tian Y.J. Support Vector Machines: Theory, Algorithms and Extensions. Science Press, Beijing, 2009.

[3] Fisher R.A. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7(2): 179-188, 1936.

[4] Mangasarian O.L. Generalized Support Vector Machines. In: Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 2000.

[5] Freed N., Glover F. Simple but Powerful Goal Programming Models for Discriminant Problems. European Journal of Operational Research 7: 44-60, 1981.

[6] Freed N., Glover F. Evaluating Alternative Linear Programming Models to Solve the Two-Group Discriminant Problem. Decision Science 17: 151-162, 1986.

[7] Olson D., Shi Y. Introduction to Business Data Mining. McGraw-Hill/Irwin, 2007.

[8] Shi Y., Tian Y.J., Chen X.J., Zhang P. A Regularized Multiple Criteria Linear Program for Classification. In: ICDM Workshops, pp. 253-258, 2007.

[9] Kou G., Shi Y., Wang S.Y. Multiple Criteria Decision Making and Decision Support Systems - Guest Editors' Introduction. Decision Support Systems 51(2): 247-249, 2011.

[10] Zhang D., Tian Y.J., Shi Y. A Regression Method by Multiple Criteria Linear Programming. In: 19th International Conference on Multiple Criteria Decision Making (MCDM), pp. 7-12, 2008.

[11] Shi Y., Wise W., Lou M. Multiple Criteria Decision Making in Credit Card Portfolio Management. In: Multiple Criteria Decision Making in the New Millennium, pp. 427-436, 2001.

[12] Shi Y., Peng Y., Xu W. Data Mining via Multiple Criteria Linear Programming: Applications in Credit Card Portfolio Management. International Journal of Information Technology and Decision Making 1: 131-151, 2002.

[13] Zhang J., Zhuang W., Yan N. Classification of HIV-1 Mediated Neuronal Dendritic and Synaptic Damage Using Multiple Criteria Linear Programming. Neuroinformatics 2: 303-326, 2004.

[14] Kwak W., Shi Y., Eldridge S. Bankruptcy Prediction for Japanese Firms: Using Multiple Criteria Linear Programming Data Mining Approach. International Journal of Data Mining and Business Intelligence, 2006.

[15] Yeung D., Wang D., Ng W., Tsang E., Wang X. Structured Large Margin Machines: Sensitive to Data Distributions. Machine Learning 68(2): 171-200, 2007.

[16] Xue H., Chen S., Yang Q. Structural Support Vector Machine. In: The 15th International Symposium on Neural Networks, pp. 501-511, 2008.

[17] Xue H., Chen S., Yang Q. Structural Regularized Support Vector Machine: A Framework for Structural Large Margin Classifier. IEEE Transactions on Neural Networks 22(4): 573-587, 2011.

[18] Ward J.H. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58(301): 236-244, 1963.

[19] Lanckriet G.R.G., El Ghaoui L., Bhattacharyya C., Jordan M.I. A Robust Minimax Approach to Classification. Journal of Machine Learning Research 3: 555-582, 2002.

[20] Huang K., Yang H., King I., Lyu M. Learning Large Margin Classifiers Locally and Globally. In: The Twenty-First International Conference on Machine Learning, pp. 401-408, 2004.

[21] Murphy P.M., Aha D.W. UCI Machine Learning Repository, 1992.