Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844, Vol. VI (2011), No. 1 (March), pp. 175-186

A New Linear Classifier Based on Combining Supervised and Unsupervised Techniques

L. State, I. Paraschiv-Munteanu

Luminiţa State
University of Piteşti, Faculty of Mathematics and Computer Science
Romania, 110040 Piteşti, 1 Târgu din Vale St.
E-mail: lstate@clicknet.ro

Iuliana Paraschiv-Munteanu
University of Bucharest, Faculty of Mathematics and Computer Science
Romania, 010014 Bucharest, 14 Academiei St.
E-mail: pmiulia@fmi.unibuc.ro

Abstract: The aim of the research reported in this paper is to obtain an alternative approach to using the Support Vector Machine (SVM) in the case of non-linearly separable data, based on the k-means algorithm instead of the standard kernel-based approach. The SVM is a relatively recent concept in machine learning, introduced by Vapnik in 1995. In designing a classifier, two main problems have to be solved: on one hand, the choice of a suitable structure, and on the other hand, the selection of an algorithm for parameter estimation. The algorithm for parameter estimation optimizes a suitably selected cost function with respect to the empirical risk, which is directly related to the representativeness of the available learning sequence. The choice of the structure is made so as to maximize the generalization capacity, that is, to assure good performance in classifying new data coming from the same classes. In solving these problems one has to establish a balance between the accuracy in encoding the learning sequence and the generalization capacity, because minimizing the empirical risk alone usually leads to over-fitting, which degrades the generalization capacity.

Keywords: support vector machine, classification, unsupervised learning, supervised learning, k-means algorithm.

1 Introduction

In addition to their solid mathematical foundation in statistical learning theory, SVMs have demonstrated highly competitive performance in numerous real-world applications, such as bioinformatics, text mining, face recognition, and image processing, which has established SVMs as one of the state-of-the-art tools for machine learning and data mining, along with other soft computing techniques.

In training support vector machines, the decision boundaries are determined directly from the training data, so that the separating margins of the decision boundaries are maximized in the high-dimensional space called the feature space. This learning strategy is based on statistical learning theory and aims at good classification performance on both the training data and unknown data.

The method based on support vectors aims to increase the efficiency in approximating multi-dimensional functions. The basic idea of an SVM approach is twofold: on one hand, it aims to determine a classifier that minimizes the empirical risk, that is, to encode the learning sequence as well as possible with respect to a certain architecture; on the other hand, it aims to improve the generalization capacity by minimizing the generalization error. In the case of non-linearly separable data, the SVM is combined with a kernel-based technique which transforms the data into linearly separable data by mapping the initial data onto a higher-dimensional space of features. This mapping is performed in terms of specially tailored kernels that keep the computations at a reasonable complexity level.
The SVM approach proves useful in classifying linearly separable as well as non-linearly separable data, because the mapping of the initial data onto a higher-dimensional space of features makes these classifiers behave as non-linear classifiers. The basic idea of an SVM approach is to obtain higher-dimensional representations of the initial data by mapping them, using a kernel-based technique, into a feature space such that, for a non-linearly separable learning sequence, its representation in the feature space becomes linearly separable. Given that the representation of the learning sequence in the feature space is linearly separable, several techniques can be applied to determine a separating hyperplane in this space. Obviously, in the case of a linearly separable learning sequence the set of solutions is infinite, different algorithms yielding different hyperplane solutions. Since a solution that keeps all examples as far away as possible assures good generalization capacity, this can be taken as a criterion for selecting a best solution among the available ones.

At first sight, it seems unreasonable to combine a supervised technique with an unsupervised one, mainly because they refer to totally different situations: on one hand, supervised techniques are applied when the data set consists of correctly labeled objects, while on the other hand unsupervised methods deal with unlabeled objects. However, our point is to combine the SVM and k-means algorithms in order to obtain a new design of a linear classifier. A new linear classifier resulting from the combination of a supervised SVM method and the 2-means algorithm is proposed in the paper, and its efficiency is evaluated experimentally in the final part of the paper. The tests were performed on simulated data generated from multi-dimensional normal distributions, yielding linearly separable and non-linearly separable samples, respectively.

2 Supervised Learning Using SVM

Let us assume that the data are represented by examples coming from two categories or classes such that the true provenance class of each example is known. We refer to such a collection of individuals as a supervised learning sequence, represented as
$$S=\left\{(x_i,y_i)\ \middle|\ x_i\in\mathbb{R}^d,\ y_i\in\{-1,1\},\ i=\overline{1,N}\right\}. \qquad (1)$$
The values $1$, $-1$ are taken as labels corresponding to the classes. We say that the data are linearly separable if there exists a linear discriminant function
$$g:\mathbb{R}^d\longrightarrow\mathbb{R},\quad g(u)=b+w_1u_1+\dots+w_du_d, \qquad (2)$$
where $u=(u_1,\dots,u_d)\in\mathbb{R}^d$, such that $y_i\,g(x_i)>0$ for all $(x_i,y_i)\in S$. Denoting by $w=(w_1,\dots,w_d)^T$ the vector whose entries are the coefficients of $g$, we say that $S$ is separated without errors by the hyperplane
$$H_{w,b}:\ w^Tu+b=0, \qquad (3)$$
and $H_{w,b}$ is called a solution of the separating problem, because all examples coming from the class of label $1$ belong to the positive semi-space and all examples coming from the class of label $-1$ belong to the negative semi-space defined by $H_{w,b}$. Obviously, a hyperplane is a solution of the separating problem if and only if the functional margin $\min\left\{y_i\left(w^Tx_i+b\right),\ 1\le i\le N\right\}>0$.

In an SVM-based approach, by imposing that the functional margin equals $1$, the search for a solution leads to a constrained quadratic programming problem on the objective function $\Phi(w)=\frac{1}{2}\|w\|^2$:
$$\begin{cases}\min\ \Phi(w)\\[2pt] y_i\left(w^Tx_i+b\right)\ge 1,\ i=\overline{1,N}.\end{cases} \qquad (4)$$
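As a small numerical illustration of the separability condition and of the functional margin appearing in (4), the following sketch (our own code and naming, not part of the paper) checks whether a candidate hyperplane $H_{w,b}$ separates a labeled sample without errors.

```python
import numpy as np

def functional_margin(X, y, w, b):
    """Smallest value of y_i * (w^T x_i + b) over the labeled sample."""
    return np.min(y * (X @ w + b))

def is_separated_without_errors(X, y, w, b):
    """True iff the hyperplane H_{w,b} separates the sample S without errors."""
    return functional_margin(X, y, w, b) > 0

# Tiny labeled sample in R^2, labels in {-1, +1} (illustrative values only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, 1, 1])
w, b = np.array([1.0, 1.0]), -3.5      # candidate hyperplane w^T u + b = 0
print(is_separated_without_errors(X, y, w, b))   # True for this toy sample
```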
If $w^*$ is a solution of (4), then $H_{w^*,b^*}$ is called an optimal separating hyperplane. The computation of $w^*$ and $b^*$ is carried out by the algorithm SVM1.

Algorithm SVM1 ([9])
Input: the learning sequence $S$.
Step 1. Compute the matrix $D=(d_{ik})$ of entries $d_{ik}=y_iy_k\,x_i^Tx_k$, $i,k=\overline{1,N}$.
Step 2. Solve the constrained optimization problem
$$\begin{cases}\alpha^*=\arg\max\limits_{\alpha\in\mathbb{R}^N}\left(\alpha^T\mathbf{1}-\dfrac{1}{2}\alpha^TD\alpha\right),\\[4pt]\alpha_i\ge 0,\ \forall\,1\le i\le N,\\[2pt]\sum_{i=1}^{N}\alpha_iy_i=0.\end{cases} \qquad (5)$$
If $\alpha_i^*>0$ then $x_i$ is called a support vector.
Step 3. Select two support vectors $x_r$, $x_s$ such that $\alpha_r^*>0$, $\alpha_s^*>0$, $y_r=-1$, $y_s=1$.
Step 4. Compute the parameters $w^*$, $b^*$ of the optimal separating hyperplane and the width of the separating area $\rho(w^*,b^*)$:
$$w^*=\sum_{i=1}^{N}\alpha_i^*y_ix_i,\qquad b^*=-\frac{1}{2}(w^*)^T(x_r+x_s),\qquad \rho(w^*,b^*)=\frac{2}{\|w^*\|}. \qquad (6)$$
Output: $w^*$, $b^*$, $\rho(w^*,b^*)$.

A linearly separable sample is represented in Figure 1a. The straight lines $d_1$, $d_2$, $d_3$ and $d_4$ are solutions of the separating problem for $S$; $d_4$ corresponds to the optimal separating hyperplane. The examples placed at the minimum distance to the optimal separating hyperplane are the support vectors.

In the case of non-linearly separable samples, the idea is to determine a separating hyperplane that minimizes the number of misclassified examples. The problem of finding an optimal hyperplane for non-linearly separable samples has been approached in several ways. The approach introduced by Cortes and Vapnik ([3]) uses the error function
$$\Phi_\sigma(\xi)=\sum_{i=1}^{N}\xi_i^\sigma, \qquad (7)$$
where the slack variables $\xi_i$, $1\le i\le N$, are taken as indicators of the classification errors (see Figure 1b), and $\sigma$ is a positive real number. Optimality is expressed in terms of the objective function $\Phi:\mathbb{R}^d\times\mathbb{R}^N\longrightarrow[0,+\infty)$,
$$\Phi(w,\xi)=\frac{1}{2}\|w\|^2+c\,F\!\left(\sum_{i=1}^{N}\xi_i^\sigma\right), \qquad (8)$$
where $c>0$ is a given constant, $\xi=(\xi_1,\dots,\xi_N)$, and $F$ is a monotone convex function with $F(0)=0$. The idea is to compute a subset of $S$, say $\{(x_{i_1},y_{i_1}),\dots,(x_{i_k},y_{i_k})\}$, by minimizing $\Phi_\sigma(\xi)$, such that there exists an optimal hyperplane for $S\setminus\{(x_{i_1},y_{i_1}),\dots,(x_{i_k},y_{i_k})\}$. Such an optimal hyperplane is referred to as a soft margin hyperplane ([3]).

Figure 1: a) Optimal separating hyperplane; b) Classification errors.

A soft margin hyperplane is a solution of the constrained optimization problem
$$\begin{cases}\arg\min\limits_{w\in\mathbb{R}^d,\,b\in\mathbb{R},\,\xi\in\mathbb{R}^N}\Phi(w,\xi),\\[4pt] y_i\left(w^Tx_i+b\right)\ge 1-\xi_i,\ \forall\,1\le i\le N,\\[2pt]\xi_i\ge 0,\ \forall\,1\le i\le N.\end{cases} \qquad (9)$$
The samples represented in Figure 1b correspond to the non-linearly separable case. A soft margin hyperplane, the separating area, and the slack variables are indicated in Figure 1b. The computation of a soft margin hyperplane is carried out by the algorithm SVM2.

Algorithm SVM2 ([9])
Input: the learning sequence $S$; $c\in(0,\infty)$.
Step 1. Compute the matrix $D=(d_{ik})$ of entries $d_{ik}=y_iy_k\,x_i^Tx_k$, $i,k=\overline{1,N}$.
Step 2. Solve the constrained optimization problem
$$\begin{cases}\alpha^*=\arg\max\limits_{\alpha\in\mathbb{R}^N}\left(\alpha^T\mathbf{1}-\dfrac{1}{2}\alpha^TD\alpha-\dfrac{(\alpha_{\max})^2}{4c}\right),\\[4pt]\alpha_i\ge 0,\ \forall\,1\le i\le N,\\[2pt]\sum_{i=1}^{N}\alpha_iy_i=0,\end{cases} \qquad (10)$$
where $\alpha_{\max}=\max\{\alpha_1,\dots,\alpha_N\}$.
Step 3. Select two support vectors $x_r$, $x_s$ such that $\alpha_r^*>0$, $\alpha_s^*>0$, $y_r=-1$, $y_s=1$.
Step 4. Compute the parameters $w^*$, $b^*$ of the soft margin hyperplane and the width of the separating area $\rho(w^*,b^*)$ according to (6).
Output: $w^*$, $b^*$, $\rho(w^*,b^*)$.
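To make the two algorithms concrete, here is a minimal numerical sketch of the dual problem (5), together with a soft-margin variant. It is not the authors' implementation: the soft-margin case below uses the standard box constraint $0\le\alpha_i\le c$ rather than the $(\alpha_{\max})^2/(4c)$ penalty of SVM2, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, c=None):
    """Sketch of the linear SVM dual.  c=None gives the hard-margin problem (5);
    a finite c gives a standard soft-margin approximation (box constraints)."""
    N = X.shape[0]
    Yx = y[:, None] * X                         # rows y_i * x_i
    D = Yx @ Yx.T                               # D_ik = y_i y_k x_i^T x_k
    fun = lambda a: 0.5 * a @ D @ a - a.sum()   # minimize the negated dual objective
    jac = lambda a: D @ a - np.ones(N)
    cons = {'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y.astype(float)}
    bnds = [(0.0, c)] * N                       # c=None means only alpha_i >= 0
    res = minimize(fun, np.zeros(N), jac=jac, bounds=bnds,
                   constraints=cons, method='SLSQP')
    alpha = res.x
    w = (alpha * y) @ X                         # w* = sum_i alpha_i* y_i x_i
    sv = alpha > 1e-6                           # support vectors (numerical tolerance)
    r = np.where(sv & (y == -1))[0][0]          # x_r with y_r = -1
    s = np.where(sv & (y == 1))[0][0]           # x_s with y_s = +1
    b = -0.5 * w @ (X[r] + X[s])                # as in Step 4 / equation (6)
    rho = 2.0 / np.linalg.norm(w)               # width of the separating area
    return w, b, rho, alpha
```

For linearly separable data, calling `svm_dual(X, y)` with `c=None` reproduces, up to numerical tolerance, the hard-margin solution returned by SVM1.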
3 Unsupervised Learning Using the k-means Method

Center-based clustering algorithms are very efficient for clustering large and high-dimensional databases. They use objective functions to express the quality of any clustering solution, the optimal solution corresponding to a solution of a constrained/unconstrained optimization problem imposed on the particular objective function. Usually the clusters found have convex shapes, and one or more centers are computed for each cluster.

The k-means algorithm was introduced by MacQueen ([8]) for clustering numerical data, each of the produced clusters having a center referred to as the cluster mean. Let $D=\{x_1,\dots,x_N\}\subset\mathbb{R}^d$ be the data and $k$ a given positive integer. The classes of any partition $\{C_1,\dots,C_k\}$ of $D$ are called clusters. For any $\{\mu(C_1),\dots,\mu(C_k)\}\subset\mathbb{R}^d$, where each $\mu(C_i)$ is taken as the center of $C_i$, the inertia momentum is
$$\varepsilon=\sum_{i=1}^{k}\sum_{x\in C_i}d^2\left(x,\mu(C_i)\right), \qquad (11)$$
where $d$ is a suitable distance function on $\mathbb{R}^d$. In the following we take $d$ to be the Euclidean distance, $d(x,y)=\|x-y\|$.

The k-means method proceeds by iteratively allocating the individuals to the nearest clusters and re-computing the centers so as to minimize the inertia momentum; the computation ends when no significant changes of the centers, of the value of the inertia momentum, or of the cluster memberships of the individuals are obtained. The k-means algorithm can thus be treated as an optimization problem where the goal is to minimize a given objective function under certain constraints. Let $\mathcal{C}$ be the set of all subsets of $\mathbb{R}^d$ of cardinality $k$; any particular $Q=\{q_1,\dots,q_k\}\in\mathcal{C}$ is a possible set of cluster centers. Any partition of $D$ into $k$ classes can obviously be represented by an $N\times k$ matrix $W=(w_{il})$ where
$$\text{(i)}\ w_{il}\in\{0,1\},\ i=\overline{1,N},\ l=\overline{1,k};\qquad \text{(ii)}\ \sum_{l=1}^{k}w_{il}=1,\ i=\overline{1,N}. \qquad (12)$$

The k-means algorithm can be formulated as the constrained optimization problem on the objective function $P(W,Q)=\sum_{i=1}^{N}\sum_{l=1}^{k}w_{il}\left\|x_i-q_l\right\|^2$ as follows:
$$\begin{cases}\min\limits_{W\in\mathcal{M}_{N\times k}(\mathbb{R}),\,Q\in\mathcal{C}}P(W,Q)\\[4pt] w_{il}\in\{0,1\},\ i=\overline{1,N},\ l=\overline{1,k},\\[2pt]\sum_{l=1}^{k}w_{il}=1,\ i=\overline{1,N},\\[2pt] Q=\{q_1,\dots,q_k\}.\end{cases} \qquad (13)$$

The "hard" problem (13) can be solved by decomposing it into two simpler problems P1 and P2 and then iteratively solving them, where
P1. Fix $Q=\hat Q\in\mathcal{C}$ and solve the reduced constrained optimization problem for $P\left(W,\hat Q\right)$.
P2. Fix $W=\hat W\in\mathcal{M}_{N\times k}(\mathbb{R})$ and solve the reduced unconstrained optimization problem for $P\left(\hat W,Q\right)$.
The solutions of these problems can be derived by straightforward computations, and they are given by the following theorems.

Theorem 1. For any fixed set of centers $\hat Q=\{\hat q_1,\dots,\hat q_k\}$, the function $P\left(W,\hat Q\right)$ is minimized at $W^{(0)}=\left(w^{(0)}_{il}\right)$ if and only if $W^{(0)}$ satisfies the conditions
$$w^{(0)}_{il}=0 \iff \|x_i-\hat q_l\|>\min_{1\le t\le k}\|x_i-\hat q_t\|,\qquad w^{(0)}_{il}=1 \implies \|x_i-\hat q_l\|=\min_{1\le t\le k}\|x_i-\hat q_t\|,\qquad \sum_{l=1}^{k}w^{(0)}_{il}=1,$$
for any $i=\overline{1,N}$, $l=\overline{1,k}$.

Note that, in general, for a given $\hat Q$ there may be more than one solution, because some individuals $x_i$ may be at minimum distance to more than one center of $\hat Q$.

Theorem 2. For any fixed $\hat W$ satisfying the constraints of (13), the function $P\left(\hat W,Q\right)$ is minimized at the unique point $Q^{(0)}=\left\{q^{(0)}_1,\dots,q^{(0)}_k\right\}$ if and only if
$$q^{(0)}_l=\left(\sum_{i=1}^{N}\hat w_{il}x_i\right)\Big/\left(\sum_{i=1}^{N}\hat w_{il}\right),\quad l=\overline{1,k}.$$

The scheme of the k-means algorithm, viewed as a search method for solving the optimization problem (13), is the following.

Algorithm k-MOP
Input: $D$ - the data set; $k$ - the pre-specified number of clusters; $d$ - the data dimensionality; $T$ - threshold on the maximum number of iterations.
Initializations: $Q^{(0)}$, $t \leftarrow 0$;
solve $P\left(W,Q^{(0)}\right)$ and get $W^{(0)}$;
$sw \leftarrow$ false
repeat
  $\hat W \leftarrow W^{(t)}$; solve $P\left(\hat W,Q\right)$ and get $Q^{(t+1)}$
  if $P\left(\hat W,Q^{(t)}\right)=P\left(\hat W,Q^{(t+1)}\right)$ then
    $sw \leftarrow$ true; output $\left(\hat W,Q^{(t+1)}\right)$
  else
    $\hat Q \leftarrow Q^{(t+1)}$; solve $P\left(W^{(t)},\hat Q\right)$ and get $W^{(t+1)}$
    if $P\left(W^{(t)},\hat Q\right)=P\left(W^{(t+1)},\hat Q\right)$ then
      $sw \leftarrow$ true; output $\left(W^{(t+1)},\hat Q\right)$
    endif
  endif
  $t \leftarrow t+1$
until $sw$ or $t>T$.
Output: $\hat W$, $\hat Q$.

Note that the computational complexity of the algorithm k-MOP is $O(Nkd)$ per iteration. The sequence of values $P\left(W^{(t)},Q^{(t)}\right)$, where $W^{(t)}$, $Q^{(t)}$ are computed by k-MOP, is strictly decreasing; therefore the algorithm converges to a local minimum of the objective function.
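A minimal sketch of the alternating scheme P1/P2 behind k-MOP is given below (our own code, not the authors'). For compactness, the membership matrix $W$ is represented implicitly by a label vector, and the iteration stops when neither the assignments nor the centers change, which is equivalent to the objective value no longer decreasing.

```python
import numpy as np

def k_mop(D, k, T=100, seed=0):
    """Alternating minimization for problem (13): D is an (N, d) data matrix,
    k the number of clusters, T a cap on the number of iterations."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    Q = D[rng.choice(N, size=k, replace=False)]            # initial centers Q^(0)
    labels = np.zeros(N, dtype=int)
    for _ in range(T):
        # P1: with Q fixed, assign each x_i to its nearest center (Theorem 1).
        dist = np.linalg.norm(D[:, None, :] - Q[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        # P2: with W fixed, replace each center by its cluster mean (Theorem 2).
        new_Q = np.array([D[new_labels == l].mean(axis=0) if np.any(new_labels == l)
                          else Q[l] for l in range(k)])
        if np.array_equal(new_labels, labels) and np.allclose(new_Q, Q):
            break                                            # P(W, Q) stopped decreasing
        labels, Q = new_labels, new_Q
    return labels, Q
```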
4 The Combined Separating Technique Based on SVM and the k-means Algorithm

At first sight, it seems unreasonable to combine a supervised technique with an unsupervised one, mainly because they refer to totally different situations: on one hand, supervised techniques are applied when the data set consists of correctly labeled objects, while on the other hand unsupervised methods deal with unlabeled objects. However, our point is to combine the SVM and the k-means algorithm in order to obtain a new design of a linear classifier. The aim of the experimental analysis is to evaluate the performance of the linear classifier resulting from the combination of the supervised SVM method and the 2-means algorithm.

The method can be applied to data that are either linearly separable or non-linearly separable. Obviously, in the case of non-linearly separable data the classification cannot be performed without errors, and in this case the number of misclassified examples is the most reasonable criterion for performance evaluation. Of particular importance is the case of linearly separable data; here the performance is evaluated in terms of both the misclassified examples and the generalization capacity, expressed by the width of the separating area. In real-life situations it is usually very difficult, or even impossible, to establish whether the data represent a linearly or a non-linearly separable set. Using the SVM1 approach we can identify which case a given data set belongs to. For linearly separable data, SVM1 computes a separating hyperplane that is optimal from the point of view of the generalization capacity. In the case of non-linearly separable data, SVM2 computes a linear classifier that minimizes the number of misclassified examples.

A series of developments are based on non-linear transforms represented by kernel functions whose ranges are high-dimensional spaces. The increase of dimensionality and the suitable choice of the kernel aim to transform a non-linearly separable problem into a linearly separable one. The computational complexity of kernel-based approaches is significantly larger; therefore, in case the performance of the algorithm SVM1 proves reasonably good, it can be taken as an alternative to a kernel-based SVM. We perform a comparative analysis on data consisting of examples generated from two-dimensional Gaussian distributions.

In the case of a non-linearly separable data set, using the k-means algorithm we obtain a system of pairwise disjoint clusters, together with the set of their centers, representing a local minimum point of the criterion (13), the clusters being linearly separable when $k=2$. Consequently, the SVM1 algorithm computes a linear classifier that separates the resulting clusters without errors.

Our procedure is described as follows.
Input: $S$ - the learning sequence.
Step 1. Compute the matrix $D=(d_{ik})$ of entries $d_{ik}=y_iy_k\,x_i^Tx_k$, $i,k=\overline{1,N}$; $sh \leftarrow$ true.
Step 2. If the constrained optimization problem (5) does not have a solution then
  $sh \leftarrow$ false;
  input $c\in(0,\infty)$ for the soft margin hyperplane;
  solve the constrained optimization problem (10)
endif
Step 3. Select $x_r$, $x_s$ such that $\alpha_r^*>0$, $\alpha_s^*>0$, $y_r=-1$, $y_s=1$; compute the parameters $w^*$, $b^*$ of the separating hyperplane and the width of the separating area $\rho(w^*,b^*)$ according to (6).
Step 4. If not $sh$ then
  compute nr_err1 = the number of examples incorrectly classified;
  compute err1 = the classification error
endif
Step 5. Split the set $D=\left\{x_i\mid x_i\in\mathbb{R}^d,\ i=\overline{1,N}\right\}$ into two clusters $C_1$ and $C_2$ using the 2-means algorithm; label the data belonging to $C_1$ by $y_i'=1$ and the data belonging to $C_2$ by $y_i'=-1$.
Step 6. Apply the algorithm SVM1 to $S'=\left\{(x_i,y_i')\mid x_i\in\mathbb{R}^d,\ y_i'\in\{-1,1\},\ i=\overline{1,N}\right\}$ and obtain the parameters of the optimal separating hyperplane $w_1^*$, $b_1^*$, $\rho(w_1^*,b_1^*)$; compute nr_err2 = the number of data incorrectly classified by the 2-means algorithm; compute err2 = the classification error resulting from the 2-means splitting.
Output: $w^*$, $b^*$, $\rho(w^*,b^*)$, nr_err1, err1; $w_1^*$, $b_1^*$, $\rho(w_1^*,b_1^*)$, nr_err2, err2.
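A compact sketch of this procedure, reusing the `svm_dual` and `k_mop` sketches given earlier, is shown next. It is our own reading of the procedure, not the authors' code: for brevity the first stage always uses the soft-margin variant instead of first testing whether (5) is solvable, and the count `nr_err2` reflects one reasonable interpretation of "data incorrectly classified by the 2-means algorithm" (clusters are unlabeled, so the better of the two label matchings is taken).

```python
import numpy as np

def combined_classifier(X, y, c=70.0):
    """Sketch of the combined SVM + 2-means procedure of Section 4."""
    # Steps 1-4: train a (soft-margin) linear SVM on the labeled data and
    # count the examples it places on the wrong side of the hyperplane.
    w, b, rho, _ = svm_dual(X, y, c=c)
    nr_err1 = int(np.sum(y * (X @ w + b) <= 0))
    # Step 5: ignore the labels and split the data into two clusters.
    labels, _ = k_mop(X, k=2)
    y_prime = np.where(labels == 0, 1, -1)       # cluster-to-label assignment is arbitrary
    # Step 6: hard-margin SVM on the relabeled (linearly separable) clusters.
    w1, b1, rho1, _ = svm_dual(X, y_prime, c=None)
    # Misclassifications of the 2-means split, up to relabeling of the clusters.
    nr_err2 = int(min(np.sum(y_prime != y), np.sum(y_prime == y)))
    return (w, b, rho, nr_err1), (w1, b1, rho1, nr_err2)
```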
5 Comparative Analysis and Experimental Results

The experimental analysis is based on a long series of tests performed on linearly/non-linearly separable simulated data of different volumes. The analysis aims to derive conclusions concerning:
1. The statistical properties (the empirical means, covariance matrices, eigenvalues, eigenvectors) of the clusters computed by the 2-means algorithm, as compared to their counterparts corresponding to the true distributions they come from.
2. The comparison of the performances of the linear classifier resulting from the combination of SVM and the 2-means algorithm described in Section 4 and of SVM2, in terms of the empirical error.
3. The influence of the sample sizes on the performance of the procedure described in Section 4.
4. The quality of the cluster characterization in terms of the principal directions given by a system of unit orthogonal eigenvectors of the sample covariance and empirical covariance matrices of the computed clusters. The analysis aimed to derive conclusions concerning the contribution of each principal direction; for this reason, some tests were performed on data whose first principal component is strongly dominant, and on data whose principal directions are of equal importance, respectively.

The tests were performed on data generated from two-dimensional normal distributions $\mathcal{N}(\mu_i,\Sigma_i)$, $i=1,2$, of volumes $N_1$ and $N_2$, respectively. The sample means and covariance matrices are denoted by $\hat\mu_i$, $\hat\Sigma_i$, $i=1,2$. The centers and the empirical covariance matrices corresponding to the clusters computed by the 2-means algorithm are denoted by $\overline\mu_i$, $\overline\Sigma_i$, $i=1,2$. We denote by $Z_i$, $\hat Z_i$, $\overline Z_i$, $i=1,2$, orthogonal matrices having as columns unit eigenvectors of $\Sigma_i$, $\hat\Sigma_i$, $\overline\Sigma_i$, $i=1,2$, respectively.

Test 1: $N_1=N_2=50$,
$$\mu_1=\begin{pmatrix}1\\1\end{pmatrix},\ \Sigma_1=\begin{pmatrix}1&0\\0&0.25\end{pmatrix},\qquad \mu_2=\begin{pmatrix}2\\3\end{pmatrix},\ \Sigma_2=\begin{pmatrix}0.5&0\\0&0.5\end{pmatrix}.$$
The matrices $Z_1$, $Z_2$ and their eigenvalues are
$$\lambda^{(1)}_1=0.25,\ \lambda^{(1)}_2=1,\ Z_1=\begin{pmatrix}0&1\\1&0\end{pmatrix},\qquad \lambda^{(2)}_1=0.5,\ \lambda^{(2)}_2=0.5,\ Z_2=\begin{pmatrix}1&0\\0&1\end{pmatrix}.$$
The data set is non-linearly separable and it is represented in Figure 2i)a. In this case we get
$$\hat\mu_1=\begin{pmatrix}0.92\\1.0004\end{pmatrix},\ \hat\Sigma_1=\begin{pmatrix}0.85&0.08\\0.08&0.25\end{pmatrix},\qquad \hat\mu_2=\begin{pmatrix}1.98\\2.87\end{pmatrix},\ \hat\Sigma_2=\begin{pmatrix}0.44&0.09\\0.09&0.63\end{pmatrix},$$
the matrices $\hat Z_1$, $\hat Z_2$ and their eigenvalues being
$$\hat\lambda^{(1)}_1=0.24,\ \hat\lambda^{(1)}_2=0.86,\ \hat Z_1=\begin{pmatrix}0.14&-0.98\\-0.98&-0.14\end{pmatrix},\qquad \hat\lambda^{(2)}_1=0.40,\ \hat\lambda^{(2)}_2=0.67,\ \hat Z_2=\begin{pmatrix}-0.92&0.38\\0.38&0.92\end{pmatrix}.$$
Using SVM2 with $c=70$ we obtain the classification error class_error = 14.70, the number of misclassified samples n_errors = 13, and the width of the separating area $\rho=0.61$. The value of the error coefficient, defined as the ratio between the number of misclassified samples and the total volume of the data, is c_error = 0.13. The soft margin line $d_1$ is represented in Figure 2i)b.

Figure 2: i) The classification of the data set in Test 1; ii) The classification of the data set in Test 2.

By applying the 2-means algorithm we obtain clusters whose sample means and covariance matrices are
$$\overline\mu_1=\begin{pmatrix}0.88\\1.06\end{pmatrix},\ \overline\Sigma_1=\begin{pmatrix}0.64&0.05\\0.05&0.30\end{pmatrix},\qquad \overline\mu_2=\begin{pmatrix}2.13\\2.96\end{pmatrix},\ \overline\Sigma_2=\begin{pmatrix}0.41&-0.06\\-0.06&0.56\end{pmatrix}.$$
The matrices $\overline Z_1$, $\overline Z_2$ and their eigenvalues are
$$\overline\lambda^{(1)}_1=0.29,\ \overline\lambda^{(1)}_2=0.65,\ \overline Z_1=\begin{pmatrix}0.14&-0.98\\-0.98&-0.14\end{pmatrix},\qquad \overline\lambda^{(2)}_1=0.39,\ \overline\lambda^{(2)}_2=0.58,\ \overline Z_2=\begin{pmatrix}-0.92&-0.37\\-0.37&0.92\end{pmatrix}.$$
The number of misclassified samples is 10, and the clusters are represented in Figure 2i)c. Note that the computed centers and clusters are not influenced by the choice of the initial centers; the clusters computed by the 2-means algorithm for randomly selected initial centers are represented in Figure 2i)c. The separating line $d_2$, resulting from applying the SVM1 algorithm to the data represented by the clusters computed by the 2-means algorithm, is shown in Figure 2i)d.

Test 2: $N_1=100$, $N_2=200$,
$$\mu_1=\begin{pmatrix}1\\1\end{pmatrix},\ \Sigma_1=\begin{pmatrix}1&0\\0&0.25\end{pmatrix},\qquad \mu_2=\begin{pmatrix}2\\3\end{pmatrix},\ \Sigma_2=\begin{pmatrix}1&0\\0&0.25\end{pmatrix},$$
$$\lambda^{(1)}_1=0.25,\ \lambda^{(1)}_2=1,\ Z_1=\begin{pmatrix}0&1\\1&0\end{pmatrix},\qquad \lambda^{(2)}_1=0.25,\ \lambda^{(2)}_2=1,\ Z_2=\begin{pmatrix}0&1\\1&0\end{pmatrix},$$
$$\hat\mu_1=\begin{pmatrix}1.12\\0.92\end{pmatrix},\ \hat\Sigma_1=\begin{pmatrix}1.35&0.04\\0.04&0.26\end{pmatrix},\qquad \hat\mu_2=\begin{pmatrix}2.01\\3.00\end{pmatrix},\ \hat\Sigma_2=\begin{pmatrix}0.86&0.05\\0.05&0.25\end{pmatrix},$$
$$\hat\lambda^{(1)}_1=0.26,\ \hat\lambda^{(1)}_2=1.35,\ \hat Z_1=\begin{pmatrix}0.03&-0.99\\-0.99&-0.03\end{pmatrix},\qquad \hat\lambda^{(2)}_1=0.25,\ \hat\lambda^{(2)}_2=0.87,\ \hat Z_2=\begin{pmatrix}0.09&-0.99\\-0.99&-0.09\end{pmatrix}.$$
The data set is non-linearly separable and it is represented in Figure 2ii)a. Applying SVM2 with $c=5$ we obtain the soft margin line $d_1$ represented in Figure 2ii)b, and class_error = 19.12, n_errors = 13, $\rho=0.25$, c_error = 0.043. The clusters computed by the 2-means algorithm are represented in Figure 2ii)c and their statistical characteristics are
$$\overline\mu_1=\begin{pmatrix}0.96\\1.004\end{pmatrix},\ \overline\Sigma_1=\begin{pmatrix}1.19&-0.10\\-0.10&0.38\end{pmatrix},\qquad \overline\mu_2=\begin{pmatrix}2.10\\3.007\end{pmatrix},\ \overline\Sigma_2=\begin{pmatrix}0.76&-0.02\\-0.02&0.28\end{pmatrix},$$
$$\overline\lambda^{(1)}_1=0.37,\ \overline\lambda^{(1)}_2=1.20,\ \overline Z_1=\begin{pmatrix}-0.12&-0.99\\-0.99&0.12\end{pmatrix},\qquad \overline\lambda^{(2)}_1=0.27,\ \overline\lambda^{(2)}_2=0.76,\ \overline Z_2=\begin{pmatrix}-0.05&-0.99\\-0.99&0.05\end{pmatrix}.$$
In this case the number of misclassified samples is 18. Note that the initial choice of the centers does not significantly influence the computed centers and clusters; for instance, Figure 2ii)c shows the clusters obtained with randomly selected initial centers. The separating line $d_2$, computed by the algorithm SVM1 applied to the data represented by these clusters, is shown in Figure 2ii)d.
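The test configurations of this section can be reproduced along the following lines (a sketch under our own naming; since the random seed is not specified in the paper, the exact sample statistics will differ from the reported values).

```python
import numpy as np

def make_test_data(mu1, sigma1, n1, mu2, sigma2, n2, seed=0):
    """Two samples drawn from two-dimensional normal distributions N(mu_i, Sigma_i),
    labeled -1 and +1 respectively."""
    rng = np.random.default_rng(seed)
    X1 = rng.multivariate_normal(mu1, sigma1, size=n1)
    X2 = rng.multivariate_normal(mu2, sigma2, size=n2)
    X = np.vstack([X1, X2])
    y = np.concatenate([-np.ones(n1), np.ones(n2)])
    return X, y

def sample_statistics(X):
    """Empirical mean, covariance matrix, eigenvalues and unit eigenvectors:
    the quantities reported as hat-mu, hat-Sigma, hat-lambda and hat-Z in the tests."""
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False)
    eigvals, Z = np.linalg.eigh(sigma_hat)      # columns of Z are unit eigenvectors
    return mu_hat, sigma_hat, eigvals, Z

# Example: the configuration of Test 1.
X, y = make_test_data([1, 1], [[1, 0], [0, 0.25]], 50,
                      [2, 3], [[0.5, 0], [0, 0.5]], 50)
print(sample_statistics(X[:50]))                # statistics of the first sample
```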
Test 3: $N_1=N_2=50$,
$$\mu_1=\begin{pmatrix}1\\1\end{pmatrix},\ \Sigma_1=\begin{pmatrix}1&0\\0&0.25\end{pmatrix},\qquad \mu_2=\begin{pmatrix}3\\4\end{pmatrix},\ \Sigma_2=\begin{pmatrix}0.5&0\\0&0.5\end{pmatrix},$$
$$\lambda^{(1)}_1=0.25,\ \lambda^{(1)}_2=1,\ Z_1=\begin{pmatrix}0&1\\1&0\end{pmatrix},\qquad \lambda^{(2)}_1=0.5,\ \lambda^{(2)}_2=0.5,\ Z_2=\begin{pmatrix}1&0\\0&1\end{pmatrix},$$
$$\hat\mu_1=\begin{pmatrix}0.76\\1.008\end{pmatrix},\ \hat\Sigma_1=\begin{pmatrix}1.17&-0.06\\-0.06&0.21\end{pmatrix},\qquad \hat\mu_2=\begin{pmatrix}2.87\\4.03\end{pmatrix},\ \hat\Sigma_2=\begin{pmatrix}0.56&0.009\\0.009&0.31\end{pmatrix},$$
$$\hat\lambda^{(1)}_1=0.214,\ \hat\lambda^{(1)}_2=1.180,\ \hat Z_1=\begin{pmatrix}-0.07&-0.99\\-0.99&0.07\end{pmatrix},\qquad \hat\lambda^{(2)}_1=0.31,\ \hat\lambda^{(2)}_2=0.56,\ \hat Z_2=\begin{pmatrix}0.03&-0.99\\-0.99&-0.03\end{pmatrix}.$$
The data set is linearly separable and it is represented in Figure 3i)a. The separating line $d_1$ computed by the SVM1 algorithm is represented in Figure 3i)b, the value of the resulting margin being $\rho=1.196429$.

Figure 3: i) The classification of the data set in Test 3; ii) The classification of the data set in Test 4.

The clusters computed by the 2-means algorithm are represented in Figure 3i)c; they coincide with the classes of the initial data set whatever the initial choice of the centers is. Therefore the statistical characteristics are
$$\overline\mu_1=\hat\mu_1,\ \overline\Sigma_1=\hat\Sigma_1,\ \overline\mu_2=\hat\mu_2,\ \overline\Sigma_2=\hat\Sigma_2,\qquad \overline\lambda^{(1)}_1=\hat\lambda^{(1)}_1,\ \overline\lambda^{(1)}_2=\hat\lambda^{(1)}_2,\ \overline Z_1=\hat Z_1,\qquad \overline\lambda^{(2)}_1=\hat\lambda^{(2)}_1,\ \overline\lambda^{(2)}_2=\hat\lambda^{(2)}_2,\ \overline Z_2=\hat Z_2,$$
and the separating line $d_2$ computed by the algorithm SVM1, represented in Figure 3i)d, coincides with $d_1$.

Test 4: $N_1=100$, $N_2=150$,
$$\mu_1=\begin{pmatrix}1\\1\end{pmatrix},\ \Sigma_1=\begin{pmatrix}1&0\\0&0.25\end{pmatrix},\qquad \mu_2=\begin{pmatrix}3\\4\end{pmatrix},\ \Sigma_2=\begin{pmatrix}0.5&0\\0&0.5\end{pmatrix},$$
$$\lambda^{(1)}_1=0.25,\ \lambda^{(1)}_2=1,\ Z_1=\begin{pmatrix}0&1\\1&0\end{pmatrix},\qquad \lambda^{(2)}_1=0.5,\ \lambda^{(2)}_2=0.5,\ Z_2=\begin{pmatrix}1&0\\0&1\end{pmatrix},$$
$$\hat\mu_1=\begin{pmatrix}1.22\\1.03\end{pmatrix},\ \hat\Sigma_1=\begin{pmatrix}1.04&-0.03\\-0.03&0.24\end{pmatrix},\qquad \hat\mu_2=\begin{pmatrix}2.98\\3.99\end{pmatrix},\ \hat\Sigma_2=\begin{pmatrix}0.48&-0.01\\-0.01&0.43\end{pmatrix},$$
$$\hat\lambda^{(1)}_1=0.24,\ \hat\lambda^{(1)}_2=1.04,\ \hat Z_1=\begin{pmatrix}-0.04&-0.99\\-0.99&0.04\end{pmatrix},\qquad \hat\lambda^{(2)}_1=0.42,\ \hat\lambda^{(2)}_2=0.49,\ \hat Z_2=\begin{pmatrix}-0.27&-0.96\\-0.96&0.27\end{pmatrix}.$$
The data set is linearly separable and it is represented in Figure 3ii)a. Applying SVM1 we obtain the separating line $d_1$ represented in Figure 3ii)b, with $\rho=0.552508$. The clusters computed by the 2-means algorithm are represented in Figure 3ii)c and their statistical characteristics are
$$\overline\mu_1=\begin{pmatrix}1.20\\1.04\end{pmatrix},\ \overline\Sigma_1=\begin{pmatrix}0.98&-0.04\\-0.04&0.26\end{pmatrix},\qquad \overline\mu_2=\begin{pmatrix}3.00\\3.98\end{pmatrix},\ \overline\Sigma_2=\begin{pmatrix}0.48&-0.04\\-0.04&0.45\end{pmatrix},$$
$$\overline\lambda^{(1)}_1=0.26,\ \overline\lambda^{(1)}_2=0.98,\ \overline Z_1=\begin{pmatrix}-0.05&-0.99\\-0.99&0.05\end{pmatrix},\qquad \overline\lambda^{(2)}_1=0.42,\ \overline\lambda^{(2)}_2=0.51,\ \overline Z_2=\begin{pmatrix}-0.60&-0.79\\-0.79&0.60\end{pmatrix}.$$
In this case the number of misclassified samples is 2, the initial centers being randomly selected. The separating line $d_2$, computed by the algorithm SVM1 applied to the data represented by these clusters, is shown in Figure 3ii)d.

6 Conclusions and Future Work

Although combining a supervised technique with an unsupervised one may seem meaningless, mainly because they refer to totally different situations, the combined methodology obtained by putting together the k-means and SVM methods yielded an improved classifier. The experimental results point out the good performance of the proposed method from the point of view of both accuracy and computational complexity. We are optimistic that research aiming to obtain refined methods by combining supervised, unsupervised and semi-supervised techniques has good chances to provide a class of new powerful classification schemes.
A series of tests has already been performed on different types of classifiers obtained by combining PCA (Principal Component Analysis), ICA (Independent Component Analysis) and kernel-based SVMs, the results being quite encouraging.

Bibliography

[1] S. Abe, Support Vector Machines for Pattern Classification, Springer-Verlag, 2005.
[2] C.J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, pp. 121-167, 1998.
[3] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning, 20(3):273-297, 1995.
[4] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[5] G. Gan, C. Ma, J. Wu, Data Clustering: Theory, Algorithms and Applications, SIAM, 2007.
[6] S.R. Gunn, Support Vector Machines for Classification and Regression, University of Southampton, Technical Report, 1998.
[7] L. State, I. Paraschiv-Munteanu, Introducere în teoria statistică a recunoaşterii formelor, Editura Universităţii din Piteşti, 2009.
[8] J.B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, pp. 281-297, 1967.
[9] I. Paraschiv-Munteanu, Support Vector Machine in solving pattern recognition tasks, Proceedings of the First Doctoral Student Workshop in Computer Science, University of Piteşti, May 2009.
[10] I. Paraschiv-Munteanu, Theoretical approach in performance evaluation of a classification system, Scientific Bulletin of the University of Piteşti, Series Mathematics and Informatics, No. 14, pp. 203-218, 2008.
[11] N. Popescu-Bodorin, Fast K-Means Image Quantization Algorithm and Its Application to Iris Segmentation, Scientific Bulletin of the University of Piteşti, Series Mathematics and Informatics, No. 14, 2008.
[12] R. Stoean, C. Stoean, M. Preuss, D. Dumitrescu, Evolutionary multi-class support vector machines for classification, International Journal of Computers Communications & Control, Vol. 1, Supplement S, pp. 423-428, 2006.
[13] V.N. Vapnik, The Nature of Statistical Learning Theory, New York, Springer-Verlag, 1995.
[14] V.N. Vapnik, Statistical Learning Theory, New York, Wiley-Interscience, 1998.
[15] R. Xu, D.C. Wunsch II, Clustering, Wiley & Sons, 2009.