INT J COMPUT COMMUN, ISSN 1841-9836 8(5):699-707, October, 2013.

Feature Clustering based MIM for a New Feature Extraction Method

S. El Ferchichi, S. Zidi, K. Laabidi, M. Ksouri, S. Maouche

Sabra El Ferchichi*
1. University of Tunis EL Manar, National Engineering School of Tunis, Tunisia, BP 37, LE BELVEDERE 1002, TUNIS
2. Lille1 University, Science and Technology, France, Cité Scientifique, 59655 Villeneuve d'Ascq Cedex
*Corresponding author: sabra.elferchichi@enit.rnu.tn

Salah Zidi, Salah Maouche
Lille1 University, Science and Technology, France, Cité Scientifique, 59655 Villeneuve d'Ascq Cedex
salah.zidi@univ-lille1.fr, salah.maouche@univ-lille1.fr

Kaouther Laabidi, Moufida Ksouri
University of Tunis EL Manar, National Engineering School of Tunis, Tunisia, BP 37, LE BELVEDERE 1002, TUNIS
kaouther.laabidi@enit.rnu.tn, moufida.ksouri@enit.rnu.tn

Abstract: In this paper, a new unsupervised Feature Extraction approach is presented, based on a feature clustering algorithm. Applying a divisive clustering algorithm, the method searches for a compression of the information contained in the original set of features. It investigates the use of Mutual Information Maximization (MIM) to find an appropriate transformation of the clustered features. Experiments on UCI datasets show that the proposed method often outperforms the conventional unsupervised methods PCA and ICA in terms of classification accuracy.

Keywords: feature extraction, Mutual Information Maximization (MIM), similarity measure, clustering.

1 Introduction

The capabilities of a classifier are ultimately limited by the quality of the features in each input vector. Using a large number of features can be wasteful of both computational and memory resources. Additionally, irrelevant and redundant features complicate the learning process and can lead to inaccurate prediction.
Although those features may contain enough information about the output class, they cannot predict the output label correctly because of the large dimension of the feature space and the reduced number of collected instances. It is important to note that, for the classifier, it becomes more difficult to determine the inherent relation between the features and the class distribution [9]. This problem is commonly referred to as the curse of dimensionality [6]. A reduction of the feature space dimensionality is often necessary to alleviate this problem. To address this issue, two different approaches exist: Feature Selection, which consists in selecting only the attributes that are relevant according to a pre-defined criterion [10]; and Feature Extraction, which transforms the original set of features and constructs a new one, more compact and more useful for classification [20]. Feature Extraction methods like PCA [15] and ICA [12], and Feature Selection methods, try either to find new statistically independent directions or to eliminate the redundant features entirely. An alternative approach is to gather "similar" features into a much smaller number of feature-clusters, and use them to re-describe the data. Consequently, the potential information contained in these features can be preserved while the size of the feature space is reduced and good performance is maintained. The crucial step in such a procedure is the characterization of the "similarity" between features. Recently, the use of clustering has been investigated for the extraction of features. The applicability of this approach was proven in the case of text classification problems [17], [1] and protein sequence analysis [4]. For each application domain, a specific functional similarity measure was determined. In this work, we develop a new unsupervised Feature Extraction method.
It is based on the use of a clustering technique combined with Mutual Information Maximization (MIM) to perform feature clustering. Our main interest is to reveal the underlying structure of the feature space without any prior information about probability density functions or the class distribution of the data. Usually, in a high-dimensional space there are many features that have similar tendencies along the dataset: they describe similar variations of monotonicity (increasing or decreasing trend). Such features provide related discriminative information for the learning process. Hence, an analysis of the variations of the monotonicity of each feature vector along the dataset can lead us to a form of redundancy in the data. By using trend analysis, each feature is fully described by its signature, which is statistically distinguishable from random behavior. Intuitively, once the groups of similar features have been settled, feature extraction can be realized through a linear or nonlinear transformation that determines a representative feature for each feature-cluster. At the same time, the extraction has to preserve the main characteristics of each feature-cluster and to incorporate them into the new representative feature. Therefore, each feature has to be highly correlated with its corresponding group center. To satisfy this objective, a reliable measure of dependency between each feature-cluster and its corresponding centroid, together with a search strategy, is needed. Within this context, MI is a suitable dependency measure [19] for our problem: it quantifies the amount of information that the center carries about the feature-cluster, and it can detect either a linear or a nonlinear relationship between two random variables [18], [16]. The MI measure has been exploited in feature extraction and selection methods, but in a supervised fashion [11], [20], [13] and [2]. In section 2, the Feature Extraction based Clustering Method (FEMC) is briefly reviewed.
In section 3, we focus on the formulation of the feature-cluster transformation, based on MI maximization (MIM). In section 4, the performance of the proposed Feature Extraction based Clustering Method (FEMC) is analyzed and discussed through multiple classification problems. In section 5, we offer some conclusions and suggestions for future work.

2 Feature Extraction Method based Clustering

The feature extraction method FEMC was recently proposed by [7] for pattern classification problems. It aims to obtain better generalization capabilities than existing methods. It performs feature extraction without presuming any knowledge about the data structure or about the instances' classes [7]. As stated before, features that behave the same along the data may contain the same information. Grouping those features and transforming them guarantees no loss of information, better classification accuracy and a reduced dimension. Hence, we focus on identifying features that are "similar" in their tendencies along the dataset. The analysis of the features' monotonicity reveals a form of redundancy between these features. A clustering technique was used to identify complex relationships between features and to discover the inherent data structure [8], [21]. A k-means algorithm based on a new similarity measure was performed, in which the similarity between feature vectors is identified by analyzing their tendencies. This new measure was designed to overcome the limitations of the Euclidean distance usually used in clustering algorithms. In fact, a trend is a piece of semi-quantitative information describing the evolution of the qualitative state of a variable in a time interval, using a set of symbols such as {increasing, decreasing, steady} [5]. It has been used with success for process monitoring and diagnosis [5].
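The trend coding and trend-based distance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and toy data are our own, and the distance is computed from occurrence counts of each trend symbol, following the VDM-inspired description in [7].

```python
import numpy as np

def trend_code(feature):
    """Code the tendency of a feature along the dataset as the sign of
    its first-order difference: +1 increasing, -1 decreasing, 0 steady."""
    return np.sign(np.diff(feature))

def trend_distance(u, v):
    """VDM-inspired distance: square root of the summed absolute
    differences between the occurrence counts of each trend symbol
    in the two coded feature vectors."""
    cu, cv = trend_code(u), trend_code(v)
    total = sum(abs(np.sum(cu == s) - np.sum(cv == s)) for s in (-1, 0, 1))
    return np.sqrt(total)

# Toy data: features f0 and f1 share the same tendency, f2 opposes it.
f0 = np.array([1.0, 2.0, 3.0, 2.5])
f1 = np.array([2.0, 4.0, 6.0, 5.0])
f2 = np.array([5.0, 4.0, 3.0, 3.5])

print(trend_distance(f0, f1))   # identical trend signatures -> 0.0
print(trend_distance(f0, f2))   # opposite tendencies -> positive distance
```

Note that, as the text observes, this distance depends only on the distribution of trend symbols, so features with proportional but differently scaled values (such as f0 and f1) are identified as similar even though their Euclidean distance is large.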
The feature extraction procedure proposed in [7] starts by computing the first-order derivative of each feature vector and coding its sign (0, 1 or −1) at each sample point. After coding each trend, the difference in tendency between each pair of vectors is computed. The distance is expressed as the square root of the sum of the absolute differences between the occurrences of each trend value for the two feature vectors; this was inspired by the Value Difference Metric (VDM) [14]. Thus, the location of a feature vector within the feature space is not defined directly by the values of its components, but by the conditional distributions of the extracted trend in each component. Furthermore, the similarity measure is not affected by the ordering of the samples.

3 MIM for Feature Extraction based on Clustering

We consider {x_1, x_2, ..., x_L} the D-dimensional original dataset, composed of D features v_j. A clustering technique is performed on the feature space to construct d < D feature-clusters. We have to define an appropriate transformation for each feature-cluster in order to obtain a representative feature g_k, defined by equation (1):

g_k = f(C_k) = \sum_{h=1}^{n_k} \sum_{i=1}^{L} w_i v_h , (1)

where C_k is the cluster composed of the n_k feature vectors v_h, and w_i is the weight attributed to each component of the feature vector v_h. The transform w_i has to preserve any linear or nonlinear dependency between the features v_h \in C_k and their centroid g_k. MI is an appropriate measure of dependency, so the optimal transform W^* has to maximize the MI between {V_k, g_k}, where V_k is the matrix containing the n_k feature vectors belonging to the cluster C_k and W is the vector containing the n_k weights w_i.

3.1 Mutual Information

Information theory provides the possibility to measure information with MI [9], [20]. Let p(x) and p(y) be the probability density functions (pdfs) of the random vectors X and Y, and p(x, y) their joint pdf.
The MI between the discrete random vectors X and Y is defined as:

I(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} , (2)

where \mathcal{X} and \mathcal{Y} are the corresponding alphabets of X and Y. If the MI between the two random vectors is large, the two vectors are closely related; if the MI is zero, the two random vectors are independent. The MI of continuous random variables is defined as follows:

I(X, Y) = \int \int p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy . (3)

Determining the pdfs (p(x, y), p(x), p(y)) and performing the integrations is very complicated. Consequently, the continuous input feature space is divided into several discrete partitions, and the MI is then calculated using its expression for discrete random variables. However, the inherent error introduced by the quantization process poses a problem. The Parzen window method is therefore used to estimate the pdfs of the continuous random variables [19]. The method places a kernel function on top of each sample and evaluates the density as a sum of the kernels. Given a set of n N-dimensional training vectors {x_1, ..., x_n}, the pdf estimated by the Parzen window method is expressed by:

\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \Phi(x - x_i, \sigma I) , (4)

where \Phi(\cdot) is the Gaussian window function given by

\Phi(x, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right) , (5)

where \Sigma is the covariance matrix of an N-dimensional random vector Z.

Quadratic Mutual Information: when the aim is not to compute an accurate value of the entropy of a particular distribution, but rather to find a distribution that maximizes or minimizes the entropy given some constraints, a large number of alternative entropy measures can be used [19]. One of these leads to the following divergence between two continuous densities:

D(f, g) = \int_x (f(x) - g(x))^2 \, dx .
(6)

Since MI is expressed as the divergence between the joint density and the product of the marginals, we can insert it into relation (6); in this way, the quadratic MI measure between two continuous variables X_1 and X_2 can be derived:

I(X_1, X_2) = \int \int (p(x_1, x_2) - p(x_1) p(x_2))^2 \, dx_1 \, dx_2 . (7)

3.2 Problem Formulation

As stated before, our objective is to realize an appropriate transformation on each feature-cluster. Each cluster's center is usually computed as the bary-center derived from equation (1), where W_j = \frac{1}{n_j} \mathbf{1}. We look for a more appropriate transformation W^*, since the center is the representative feature of its cluster and will be used as a new feature. Since our objective is to maximize the MI between each cluster C_j and its corresponding center g_j, we define the transformation f applied to each feature vector v_i \in C_j by g_j = f(W, v_i), which maximizes I(g_j, V_j) (the MI between the v_i \in C_j and g_j), as described in figure 1.

Figure 1: Feature Extraction procedure

By using (7) we obtain:

I(g_j, V_j) = \int \int (p(g_j, v) - p(g_j) p(v))^2 \, dg_j \, dv . (8)

We have to develop p(g_j, v) in order to compute (8). Since the center g_j = f(W, v_i) belongs to the cluster C_j (g_j \in C_j), we get the final set of features S_{final} = V_j \cup \{g_j\}, and p(g_j, v) is expressed by:

p(g_j, v) = p(g_j) p(g_j / v) = p(g_j) (p(v) - p(g_j)) . (9)

Inserting (9) into equation (8) gives:

I(g_j, V_j) = \int \int p(g_j)^4 \, dg_j \, dv . (10)

We use the Parzen window estimator to determine p(g_j). By using equation (5) for the obtained set of n + 1 features S_f = V_j \cup \{g_j\}, the density p(g_j, v) can be expressed by:

p(g_j, v) = \frac{1}{n + 1} \sum_{i=1}^{n+1} \Phi(g_j - v_i, \sigma I) . (11)

We know that \int Z_c N_x(\mu_c, \Sigma_c) \, dx = Z_c. Henceforth, the MI becomes:

I(g, V) = \int_g \int_v \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} \sum_{k=1}^{n+1} \sum_{l=1}^{n+1} \prod_{s=1}^{4} \Phi_g(\mu_s, \Sigma_s) \, dg \, dv (12)

= \int_g \int_v \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} \sum_{k=1}^{n+1} \sum_{l=1}^{n+1} z \, \Phi_g(\mu, \Sigma) \, dg \, dv (13)

= \int_v \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} \sum_{k=1}^{n+1} \sum_{l=1}^{n+1} z \, dv .
(14)

where z = \frac{|2\pi\Sigma_d|^{1/2}}{\prod_{s=1}^{4} |2\pi\Sigma_d|^{1/2}} \prod a
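The quantities developed in this section can be illustrated numerically. The sketch below estimates a pdf with the Parzen window of equations (4)-(5), using an isotropic Gaussian kernel with \Sigma = \sigma^2 I, and plugs Parzen estimates of the joint and marginal densities into the quadratic MI of equation (7) on a discretized grid. This is our own one-dimensional plug-in approximation, not the authors' closed-form derivation; the sample data, grid resolution and bandwidth \sigma are illustrative choices.

```python
import numpy as np

def parzen(x, samples, sigma):
    """Parzen-window density estimate at scalar x with an isotropic
    Gaussian kernel (equations (4)-(5) with Sigma = sigma^2 * I)."""
    norm = np.sqrt(2.0 * np.pi) * sigma
    return np.sum(np.exp(-0.5 * ((x - samples) / sigma) ** 2)) / (len(samples) * norm)

def quadratic_mi(x, y, sigma=0.3, grid=30):
    """Grid approximation of the quadratic MI of equation (7): the
    integral of (p(x,y) - p(x)p(y))^2, with Parzen estimates of the
    joint and marginal densities."""
    gx = np.linspace(x.min(), x.max(), grid)
    gy = np.linspace(y.min(), y.max(), grid)
    dx, dy = gx[1] - gx[0], gy[1] - gy[0]
    px = np.array([parzen(a, x, sigma) for a in gx])
    py = np.array([parzen(b, y, sigma) for b in gy])
    # Two-dimensional Parzen joint density: product Gaussian kernel
    # centred on each (x_i, y_i) sample pair.
    norm = 2.0 * np.pi * sigma ** 2 * len(x)
    pxy = np.array([[np.sum(np.exp(-0.5 * ((a - x) ** 2 + (b - y) ** 2)
                                   / sigma ** 2)) / norm
                     for b in gy] for a in gx])
    return float(((pxy - np.outer(px, py)) ** 2).sum() * dx * dy)

rng = np.random.default_rng(0)
x = rng.normal(size=400)
mi_dep = quadratic_mi(x, x + 0.1 * rng.normal(size=400))   # dependent pair
mi_ind = quadratic_mi(x, rng.permutation(x))               # shuffled pair
print(mi_dep > mi_ind)
```

As expected from the discussion of equation (7), the quadratic MI of a strongly dependent pair is markedly larger than that of a decorrelated (shuffled) pair, since the joint density then departs from the product of the marginals.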