Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844
Vol. IV (2009), No. 2, pp. 104-117

A Novel Fuzzy ARTMAP Architecture with Adaptive Feature Weights based on Onicescu's Informational Energy

Răzvan Andonie, Lucian Mircea Sasu, Angel Caţaron

Răzvan Andonie
Computer Science Department, Central Washington University, Ellensburg, USA
and Department of Electronics and Computers, Transylvania University of Braşov, Romania
E-mail: andonie@cwu.edu

Angel Caţaron
Department of Electronics and Computers, Transylvania University of Braşov, Romania
E-mail: cataron@vega.unitbv.ro

Lucian Mircea Sasu
Applied Informatics Department, Transylvania University of Braşov, Romania
E-mail: lmsasu@unitbv.ro

Abstract: Fuzzy ARTMAP with Relevance factor (FAMR) is a Fuzzy ARTMAP (FAM) neural architecture with the following property: each training pair has a relevance factor assigned to it, proportional to the importance of that pair during the learning phase. Using a relevance factor adds more flexibility to the training phase, allowing sample pairs to be ranked according to the confidence we have in the information source or in the pattern itself. We introduce a novel FAMR architecture: FAMR with Feature Weighting (FAMRFW). In the first stage, the training data features are weighted. In our experiments, we use a feature weighting method based on Onicescu's informational energy (IE). In the second stage, the obtained weights are used to improve FAMRFW training. The effect of this approach is that category dimensions in the direction of relevant features are decreased, whereas category dimensions in the direction of non-relevant features are increased. Experimental results, performed on several benchmarks, show that feature weighting can improve the classification performance of the general FAMR algorithm.

Keywords: Fuzzy ARTMAP, feature weighting, LVQ, Onicescu's informational energy.

1 Introduction

The FAM architecture is based upon the adaptive resonance theory (ART) developed by Carpenter and Grossberg [7]. FAM neural networks can analyze and classify noisy information with fuzzy logic, and can avoid the plasticity-stability dilemma of other neural architectures. The FAM paradigm is prolific and there are many variations of the initial model of Carpenter et al. [7]: ART-EMAP [9], dARTMAP [8], Boosted ARTMAP [27], Fuzzy ARTVar [12], Gaussian ARTMAP [28], PROBART [21], PFAM [20], Ordered FAM [11], and µARTMAP [14]. The FAM model has been incorporated in the MIT Lincoln Lab system for data mining of geospatial images because of its computational capabilities for incremental learning, fast stable learning, and visualization [25].

One way to improve the FAM algorithm is to generalize the distance measure between vectors [10]. Based on this principle, we introduced in previous work [2] a novel FAM architecture with distance measure generalization: FAM with Feature Weighting (FAMFW). Feature weighting is a feature importance ranking algorithm where weights, not only ranks, are obtained. In our approach, training data feature weights were first generated. Next, these weights were used by the FAMFW network, generalizing the distance measure. Potentially, any feature weighting method can be used, and this makes the FAMFW very general. Feature weighting can be achieved, for example, by LVQ-type methods.
Several such techniques have been introduced recently. These methods combine LVQ classification with feature weighting. In one of these approaches, RLVQ (Relevance LVQ), feature weights are determined in order to generalize the LVQ distance function [16]. A modification of the RLVQ model, GRLVQ (Generalized RLVQ), was proposed in [18]. The SRNG (Supervised Relevance Neural Gas) algorithm [17] combines the NG (Neural Gas) algorithm [22] and the GRLVQ. NG [22] is a neural model applied to the task of vector quantization by using a neighborhood cooperation scheme and a soft-max adaptation rule, similar to the Kohonen feature map.

In [1], we introduced the Energy Supervised Relevance Neural Gas (ESRNG) feature weighting algorithm. The ESRNG is based on the SRNG model. It maximizes Onicescu's IE as a criterion for computing the weights of the input features. The ESRNG is the feature weighting algorithm we used in [2], in combination with our FAMFW algorithm.

FAMR is a FAM incremental learning system introduced in our previous work [4]. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The FAMR has been successfully applied to classification, probability estimation, and function approximation. In FAMR, the relevance factor of a training pair may be user-defined or computed, and is proportional to the importance of the respective pair in the learning process.

In the present paper, we focus on the FAMR neural network, the ESRNG feature weighting algorithm, and the distance measure generalization principle. We contribute the following:

1. We introduce a novel FAMR architecture with distance measure generalization: FAMR with Feature Weighting (FAMRFW), adapting the FAMFW model to the FAMR case.

2. Compared to [2], we include new experiments on standard benchmarks.

We first introduce the basic FAM and FAMR notations (Section 2) and the ESRNG feature weighting algorithm (Section 3). In Section 4, we describe the new FAMRFW algorithm, which uses a weighted distance measure. Section 5 contains experimental results obtained with the FAMRFW method. Section 6 contains the final remarks.

2 A brief description of the FAMR

We summarize the standard FAM architecture and the FAMR learning mechanism, which differentiates it from the standard FAM.

2.1 The FAM architecture

A detailed FAM description can be found in the seminal paper of Carpenter et al. [7]; more simplified presentations are given in [26] and [19].

[Figure 1: Fuzzy ARTMAP architecture [7].]

The FAM architecture consists of a pair of fuzzy ART modules, ARTa and ARTb, connected by an inter-ART module called Mapfield (see Fig. 1). ARTa and ARTb are used for coding the input and output patterns, respectively, and Mapfield allows mapping between inputs and outputs.

The ARTa module contains the input layer $F_1^a$ and the competitive layer $F_2^a$. A preprocessing layer $F_0^a$ is also added before $F_1^a$. Analogous layers appear in ARTb. The initial input vectors have the form $a = (a_1, \ldots, a_n) \in [0,1]^n$. A data preprocessing technique called complement coding is performed by the $F_0^a$ layer in order to avoid node proliferation. Each input vector $a$ produces the normalized vector $A = (a, 1-a)$, whose $L_1$ norm is constant: $|A| = n$. Let $M_a$ be the number of nodes in $F_1^a$ and $N_a$ be the number of nodes in $F_2^a$. Due to the preprocessing step, $M_a = 2n$. The weight vector between $F_1^a$ and $F_2^a$ is $w^a$.
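As a small illustration of the complement-coding step (a sketch of ours, not taken from [7]; the function name complement_code is our own), the following Python fragment builds $A = (a, 1-a)$ and checks that its $L_1$ norm equals $n$:

    import numpy as np

    def complement_code(a):
        # Complement coding performed by the F0^a layer: A = (a, 1 - a).
        a = np.asarray(a, dtype=float)
        return np.concatenate([a, 1.0 - a])

    # The L1 norm of A is always n, regardless of the input vector a:
    a = np.array([0.2, 0.9, 0.5])          # n = 3
    A = complement_code(a)
    print(A)                               # [0.2 0.9 0.5 0.8 0.1 0.5]
    print(np.abs(A).sum())                 # 3.0, i.e., |A| = n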
Each $F_2^a$ node represents a class of inputs grouped together, denoted as a category. Each $F_2^a$ category has its own set of adaptive weights, stored in the form of a vector $w_j^a$, $j = 1, \ldots, N_a$, whose geometrical interpretation is a hyper-rectangle inside the unit box. Similar notations are used for the ARTb module. For a classification problem, the class index is the same as the category number in $F_2^b$, thus ARTb can be substituted with a vector.

The Mapfield module allows FAM to perform associations between ARTa and ARTb categories. The number of nodes in Mapfield is equal to the number of nodes in $F_2^b$. Each node $j$ from $F_2^a$ is linked to each node from $F_2^b$ via a weight vector $w_j^{ab}$.

The learning algorithm is sketched below. For each training pattern, the vigilance parameter $\rho_a$ is set equal to its baseline value, and no node is inhibited. For each (preprocessed) input $A$, a fuzzy choice function is used to get the response of each $F_2^a$ category:

$$T_j(A) = \frac{|A \wedge w_j^a|}{\alpha_a + |w_j^a|}, \qquad j = 1, \ldots, N_a \qquad (1)$$

Let $J$ be the node with the highest value computed as in (1). If the resonance condition (2) is not fulfilled,

$$\rho(A, w_J^a) = \frac{|A \wedge w_J^a|}{|A|} \geq \rho_a, \qquad (2)$$

then the $J$th node is inhibited, such that it does not participate in further competitions for this pattern, and a new search for a resonant category is performed. This might lead to the creation of a new category in ARTa.

A similar process occurs in ARTb; let $K$ be the winning node from ARTb. The $F_2^b$ output vector is set to:

$$y_k^b = \begin{cases} 1, & \text{if } k = K \\ 0, & \text{otherwise} \end{cases} \qquad k = 1, \ldots, N_b \qquad (3)$$

An output vector $x^{ab}$ is formed in Mapfield: $x^{ab} = y^b \wedge w_J^{ab}$. A Mapfield vigilance test controls the match between the predicted vector $x^{ab}$ and the target vector $y^b$:

$$\frac{|x^{ab}|}{|y^b|} \geq \rho_{ab} \qquad (4)$$

where $\rho_{ab} \in [0,1]$ is a Mapfield vigilance parameter. If the test (4) is not passed, then a sequence of steps called match tracking is initiated (the vigilance parameter $\rho_a$ is increased and a new resonant category is sought in ARTa); otherwise, learning occurs in ARTa, ARTb, and Mapfield:

$$w_J^{a(new)} = \beta_a \left( A \wedge w_J^{a(old)} \right) + (1 - \beta_a) w_J^{a(old)} \qquad (5)$$

(and the analogous update in ARTb), and $w_{Jk}^{ab} = \delta_{kK}$, where $\delta_{ij}$ is Kronecker's delta. With respect to $\beta_a$, there are two learning modes: i) fast learning, with $\beta_a = 1$ for the entire training process; and ii) fast-commit slow-recode learning, which corresponds to setting $\beta_a = 1$ when creating a new node and $\beta_a < 1$ for subsequent learning.

2.2 The FAMR learning mechanism

The main difference between the FAMR and the original FAM is the updating scheme of the $w_{jk}^{ab}$ weights. The FAMR uses the following iterative updating [4]:

$$w_{jk}^{ab(new)} = \begin{cases} w_{jk}^{ab(old)} & \text{if } j \neq J \\[4pt] w_{JK}^{ab(old)} + \dfrac{q_t}{Q_J^{new}} \left( 1 - w_{JK}^{ab(old)} \right) & \text{if } j = J,\; k = K \\[4pt] w_{Jk}^{ab(old)} \left( 1 - \dfrac{q_t}{Q_J^{new}} \right) & \text{if } j = J,\; k \neq K \end{cases} \qquad (6)$$

where $q_t$ is the relevance assigned to the $t$th input pattern ($t = 1, 2, \ldots$), and $Q_J^{new} = Q_J^{old} + q_t$. The relevance $q_t$ is a real, positive, finite number directly proportional to the importance of the experiment considered at step $t$. This $w_{jk}^{ab}$ approximation is a correct biased estimator of the posterior probability $P(k|j)$, the probability of selecting the $k$th ARTb category after having selected the $j$th ARTa category [4].

Let $Q$ be the vector $[Q_1, \ldots, Q_{N_a}]$; initially, each $Q_j$ ($1 \leq j \leq N_a$) has the same initial value $q_0$. $N_a$ and $N_b$ are the numbers of categories in ARTa and ARTb, respectively; both are initialized at 0.
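To make the update (6) concrete, here is a minimal Python sketch of one Mapfield update step (our own illustration, not code from [4]; all names are ours). It also shows that each row of $w^{ab}$ remains a probability distribution over the ARTb categories:

    import numpy as np

    def famr_mapfield_update(w_ab, Q, J, K, q_t):
        # One FAMR Mapfield update, following Eq. (6).
        # w_ab : (Na, Nb) array; row j estimates P(k | j) and sums to 1.
        # Q    : (Na,) array of accumulated relevances Q_j.
        # J, K : winning ARTa and ARTb categories; q_t : current relevance.
        Q[J] += q_t                  # Q_J^new = Q_J^old + q_t
        r = q_t / Q[J]
        w_ab[J] *= (1.0 - r)         # shrink every component of row J ...
        w_ab[J, K] += r              # ... and move the freed mass toward the winner K
        return w_ab, Q

    w_ab = np.full((2, 3), 1.0 / 3.0)    # two ARTa categories, three ARTb categories
    Q = np.full(2, 0.1)                  # q0 = 0.1 for every ARTa category
    w_ab, Q = famr_mapfield_update(w_ab, Q, J=0, K=2, q_t=1.0)
    print(w_ab[0], w_ab[0].sum())        # row 0 still sums to 1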
For incremental learning of one training pair, the FAMR Mapfield learning scheme is described by Algorithm 1. The vigilance test is:

$$N_b \, w_{JK}^{ab} \geq \rho_{ab} \qquad (7)$$

For a clearer presentation, and in order not to create confusion between vector relevances and feature weights, we assume in all the following experiments that relevances are set to a constant positive value. Since we do not actually use relevances, is this FAMR equivalent to the standard FAM model, as introduced in [7]? The answer is no, because, unlike the standard FAM: i) the FAMR accepts one-to-many relationships; and ii) the FAMR is a conditional probability estimator, with an estimated convergence rate computed in [4].

Algorithm 1: The t-th iteration of the FAMR Mapfield algorithm [4].

Step 1. Accept the $t$th vector pair $(a, b)$ with relevance factor $q_t$.

Step 2. Find a resonant category in ARTb or create a new one.
if $|b \wedge w_k^b|/|b| < \rho_b$ for all $k = 1, \ldots, N_b$ then
    $N_b = N_b + 1$  {add a new category to ARTb}
    $K = N_b$
    if $N_b > 1$ then
        $w_{jK}^{ab} = \dfrac{q_0}{N_b Q_j}$, for $j = 1, \ldots, N_a$  {append a new component to $w_j^{ab}$}
        $w_{jk}^{ab} = w_{jk}^{ab} - \dfrac{w_{jK}^{ab}}{N_b - 1}$, for $k = 1, \ldots, K-1$; $j = 1, \ldots, N_a$  {normalize}
    end if
else
    Let $K$ be the index of the ARTb category passing the resonance condition and with maximum activation function.
end if

Step 3. Find a resonant category in ARTa or create a new one.
if $|a \wedge w_j^a|/|a| < \rho_a$ for all $j = 1, \ldots, N_a$ then
    $N_a = N_a + 1$  {add a new category to ARTa}
    $J = N_a$
    $Q_J = q_0$  {append a new component to $Q$}
    $w_{Jk}^{ab} = 1/N_b$, for $k = 1, \ldots, N_b$  {append a new row to $w^{ab}$}
else
    Let $J$ be the index of the ARTa category passing the resonance condition and with maximum activation function.
end if

Step 4. $J$, $K$ are winners or newly added nodes. Check whether match tracking applies.
if vigilance test (7) is passed then  {learn in Mapfield}
    $Q_J = Q_J + q_t$
    $w_{JK}^{ab} = w_{JK}^{ab} + \dfrac{q_t}{Q_J} (1 - w_{JK}^{ab})$
    $w_{Jk}^{ab} = w_{Jk}^{ab} \left( 1 - \dfrac{q_t}{Q_J} \right)$, for $k = 1, \ldots, N_b$, $k \neq K$
else
    perform match tracking and restart from Step 3
end if

3 The ESRNG feature weighting algorithm

We use the ESRNG feature weighting algorithm to compute the feature weights used in the generalized distance measure of the FAMRFW. Details of the ESRNG algorithm can be found in [1]. It is based on Onicescu's IE and approximates the unilateral dependency of random variables using the Parzen windows approximation. Before outlining the principal steps of the ESRNG method, we review the basic properties of the IE.

3.1 Onicescu's informational energy

For a discrete random variable $X$ with probabilities $p_k$, the IE was introduced in 1966 by Octav Onicescu [24] as

$$E(X) = \sum_{k=1}^{n} p_k^2.$$

For a continuous random variable $Y$, the IE was defined by Silviu Guiaşu [15]:

$$E(Y) = \int_{-\infty}^{+\infty} p^2(y) \, dy,$$

where $p(y)$ is the probability density function. For a continuous random variable $Y$ and a discrete random variable $C$, the conditional IE is defined as:

$$E(Y|C) = \int_y \sum_{m=1}^{M} p(c_m) \, p^2(y|c_m) \, dy.$$

In order to study the interaction between two random variables $X$ and $Y$, the following measure of unilateral dependency was introduced by Andonie et al. [3]:

$$o(Y, X) = E(Y|X) - E(Y),$$

with the following properties:

1. $o$ is not symmetrical with respect to its arguments;

2. $o(Y, X) \geq 0$, and the equality holds iff $Y$ and $X$ are independent;

3. $o(Y, X) \leq 1 - E(Y)$, and the equality holds iff $Y$ is completely dependent on $X$.
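As a small numerical illustration of these definitions (ours, not part of the original paper), the following Python sketch computes the discrete analogue of $E$ and of $o(Y, X)$ from a joint probability table and shows properties 1-2 on two toy distributions; the function names are our own:

    import numpy as np

    def informational_energy(p):
        # Onicescu's IE of a discrete distribution: E = sum_k p_k^2.
        p = np.asarray(p, dtype=float)
        return float(np.sum(p ** 2))

    def unilateral_dependency(joint):
        # Discrete analogue of o(Y, X) = E(Y|X) - E(Y).
        # joint[i, j] = P(Y = y_i, X = x_j).
        joint = np.asarray(joint, dtype=float)
        p_x = joint.sum(axis=0)          # marginal of X
        p_y = joint.sum(axis=1)          # marginal of Y
        e_y_given_x = sum(p_x[j] * informational_energy(joint[:, j] / p_x[j])
                          for j in range(joint.shape[1]) if p_x[j] > 0)
        return e_y_given_x - informational_energy(p_y)

    # Property 2: o = 0 for independent variables, o > 0 otherwise.
    indep = np.outer([0.5, 0.5], [0.3, 0.7])         # product of the marginals
    dep = np.array([[0.45, 0.05],
                    [0.05, 0.45]])
    print(unilateral_dependency(indep))              # ~ 0.0
    print(unilateral_dependency(dep))                # 0.32 > 0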
This measure quantifies the unilateral dependence characterizing $Y$ with respect to $X$, and corresponds to the amount of information held by $X$ about $Y$.

3.2 The feature weighting procedure

ESRNG is an online algorithm which adapts a set of LVQ reference vectors by minimizing the quantization error. At each iteration, it also adapts the feature weights of the input vectors. The core of the method is the maximization of the $o(Y, C)$ measure. To connect an input vector $x_i$ with its class $j$, represented by the vector $w_j$, we use a simple transform. We consider a continuous random variable $Y$ with samples $y_i = \lambda I (x_i - w_j)$, $i = 1, \ldots, N$, where:

• $\lambda$ is the vector of feature weights;

• $x_i$, $i = 1, \ldots, N$, are the training vectors, each of them belonging to one of the classes $c_1, c_2, \ldots, c_M$;

• $w_j$, $j = 1, \ldots, P$, are the LVQ-determined class prototypes.

Assuming that the $M$ class labels are samples of a discrete random variable denoted by $C$, we can use gradient ascent to iteratively update the feature weights so as to maximize $o(Y, C)$:

$$\lambda^{(t+1)} = \lambda^{(t)} + \alpha \sum_{i=1}^{N} \frac{\partial o(Y, C)}{\partial y_i} I (x_i - w_j).$$

From the definition of $o(Y, X)$, we obtain:

$$o(Y, C) = E(Y|C) - E(Y) = \sum_{p=1}^{M} \frac{1}{p(c_p)} \int_y p^2(y, c_p) \, dy - \int_y p^2(y) \, dy. \qquad (8)$$

This expression involves a considerable computational effort. Therefore, we approximate the probability densities in the integrals using the Parzen windows estimation method. The multidimensional Gaussian kernel is [13]:

$$G(y, \sigma^2 I) = \frac{1}{(2\pi)^{d/2} \sigma^d} \, e^{-\frac{y^t y}{2\sigma^2}} \qquad (9)$$

where $d$ is the dimension of the definition space of the kernel, $I$ is the identity matrix, and $\sigma^2 I$ is the covariance matrix. We approximate the probability density $p(y)$ by replacing each data sample $y_i$ with a Gaussian kernel and averaging the obtained values:

$$p(y) = \frac{1}{N} \sum_{i=1}^{N} G(y - y_i, \sigma^2 I).$$

We denote by $M_p$ the number of training samples from class $c_p$. We have:

$$\int_y p^2(y, c_p) \, dy = \frac{1}{N^2} \sum_{k=1}^{M_p} \sum_{l=1}^{M_p} G(y_{pk} - y_{pl}, 2\sigma^2 I)$$

and

$$\int_y p^2(y) \, dy = \frac{1}{N^2} \sum_{k=1}^{N} \sum_{l=1}^{N} G(y_k - y_l, 2\sigma^2 I),$$

where $y_{pk}$, $y_{pl}$ are two training samples from class $c_p$, whereas $y_k$, $y_l$ represent two training samples from any class. Equation (8) can be rewritten accordingly, and we obtain the final ESRNG update formula for the feature weights:

$$\lambda^{(t+1)} = \lambda^{(t)} - \alpha \frac{1}{4\sigma^2} \, G(y_1 - y_2, 2\sigma^2 I) \, (y_2 - y_1) I \, (x_1 - w_{j(1)} - x_2 + w_{j(2)}),$$

where $w_{j(1)}$ and $w_{j(2)}$ are the closest prototypes to $x_1$ and $x_2$, respectively.

The ESRNG algorithm has the following general steps:

1. Update the reference vectors using the SRNG scheme.

2. Update the feature weights.

3. Repeat Steps 1 and 2 for all training set samples.

This algorithm uses a generalized Euclidean distance. The updating formula for the reference vectors can be found in [1]; we will not explicitly use this formula in the present paper. The ESRNG algorithm generates a numeric value for each input feature, quantifying its importance in the classification task: the most relevant feature receives the highest numeric value. We use these values as feature weights in the FAMRFW algorithm.

4 FAMRFW – a novel neural model

The FAMRFW is a FAMR architecture with a generalized distance measure. For an ARTa category $w_j$, we define its size $s(w_j)$:

$$s(w_j) = n - |w_j| \qquad (10)$$

and the distance to a normalized input $A$:

$$dis(A, w_j) = |w_j| - |A \wedge w_j| = \sum_{i=1}^{n} d_{ji}, \qquad (11)$$

where $(d_{j1}, \ldots, d_{jn}) = w_j - A \wedge w_j$.
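The following Python sketch (ours; the function names and example values are our own, for illustration only) computes the category size (10) and the distance (11) for a complement-coded input and an ARTa weight vector:

    import numpy as np

    def category_size(w, n):
        # s(w_j) = n - |w_j|, Eq. (10); |.| is the L1 norm, n the input dimension.
        return n - np.abs(w).sum()

    def dis(A, w):
        # dis(A, w_j) = |w_j| - |A ^ w_j|, Eq. (11), with ^ the component-wise minimum.
        d = w - np.minimum(A, w)         # (d_j1, ..., d_jn) = w_j - A ^ w_j
        return d.sum()

    # One-dimensional input a = 0.4, complement coded; category hyper-rectangle [0.3, 0.5]:
    A = np.array([0.4, 0.6])             # complement coding of a = 0.4
    w = np.array([0.3, 0.5])             # stored as (u, 1 - v) for the rectangle [u, v] = [0.3, 0.5]
    print(category_size(w, n=1))         # approx. 0.2, the side length of [0.3, 0.5]
    print(dis(A, w))                     # 0.0: a lies inside the category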
In [10] it is shown that:

$$T_j(A) = \frac{n - s(w_j) - dis(A, w_j)}{n - s(w_j) + \alpha_a} \qquad (12)$$

$$\rho(A, w_j^a) = \frac{n - s(w_j) - dis(A, w_j)}{n} \qquad (13)$$

A generalization of $dis(A, w_j)$ is the weighted distance:

$$dis(A, w_j; \lambda) = \sum_{i=1}^{n} \lambda_i d_{ji}, \qquad (14)$$

where $\lambda = (\lambda_1, \ldots, \lambda_n)$, and $\lambda_i \in [0, n]$ is the weight associated with the $i$th feature. We impose the constraint $|\lambda| = n$. For $\lambda_1 = \cdots = \lambda_n = 1$ we obtain, in particular, the FAMR.

Charalampidis et al. [10] used the following weighted distance:

$$dis(x, w_j | \lambda, ref) = \sum_{i=1}^{n} \frac{(1 - \lambda) l_j^{ref} + \lambda}{(1 - \lambda) l_{ji} + \lambda} \, d_{ji}, \qquad (15)$$

where $l_j^{ref}$ is a function of the side lengths of category $j$'s hyper-rectangle, and $\lambda$ is a scalar in $[0, 1]$. In our case, the function $dis(A, w_j; \lambda)$ does not depend on the sides of the category created during learning, but on the computed feature weights. This makes our approach very different from the one in [10].

The effect of using the distance $dis(A, w_j; \lambda)$ for a bidimensional category is depicted in Fig. 2(a). The hexagonal shapes represent the points situated at constant distance from the category. These shapes are flattened in the direction of the feature with a larger weight and elongated in the direction of the feature with a smaller weight. This is in accordance with the following intuition: the category dimension in the direction of a relevant feature should be smaller than the category dimension in the direction of a non-relevant feature. Hence, we may expect more categories to cover the relevant directions than the non-relevant ones.

[Figure 2: Geometric interpretation of constant distance when using dis(A, w_j; λ) for bidimensional patterns. (a) Bounds for constant weighted distance dis(A, w_j; λ) for various values of λ; the rectangle in the middle represents a category. (b) Bounds for constant distance dis(A, w_j; λ) for a null feature weight; the rectangle in the middle represents the category.]

For a null feature weight (Fig. 2(b)), the bounds are reduced to parallel lines on both sides of the rectangle representing the category. In this extreme case, the discriminative distance is the one along the remaining feature dimension. This is another major difference between our approach and the one in [10], where, when using the function $dis(x, w_j | \lambda, ref)$, the contours of constant weighted distance lie inside some limiting hexagons. In our method, the contour is insensitive to the actual value of the feature with null weight.

5 Experimental results

We test the FAMRFW on several standard classification tasks, all from the UCI Machine Learning Repository [5]. The experiments are performed on the FAMR and the FAMRFW architectures. The two FAMRFW stages are: i) the λ feature weights are obtained by the ESRNG algorithm; ii) these weights are used both for training and testing the FAMR.

A nice feature of the FAM architectures and of the ESRNG algorithm is the on-line (incremental) learning capability, i.e., the training set is processed only once. This type of learning is especially useful when dealing with very large datasets, since it can significantly reduce the computational overhead. For FAMR training and for both FAMRFW stages we use on-line learning.

5.1 Methodology

For each experiment, we use three-way data splits (i.e., the available dataset is divided into training, validation, and test sets) and random subsampling. Random subsampling is a faster, simplified version of k-fold cross-validation:

1. The dataset is randomized.
2. The first 60% of the dataset is used for training and the next 20% for validation (i.e., for tuning the model parameters). The parameters ρa, ρab ∈ {0, 0.1, . . . , 0.9} and βa ∈ {0, 0.1, . . . , 1} are optimized using a simple grid search. The goal is to allow both fast learning and fast-commit slow-recode learning. The optimal parameter values are the ones producing the highest PCC and the lowest number of ARTa categories.

3. The network with the optimal parameters is trained on the joint training + validation data.

4. The last 20% of the dataset is used for testing. As a result, the percentage of correct classification (PCC) and the number of generated ARTa categories are computed.

5. This procedure is repeated six times.

The ρa value, optimized during training/validation, controls the number of generated ARTa categories. After training/validation, this number does not change. For ρa > 0, some test vectors may be rejected (i.e., not classified). In all our experiments, after the ARTa categories were generated, we set ρa = 0 for testing. This has the following positive effects:

• All test vectors are necessarily classified.

• We obtain experimentally better classification results, both for the FAMR and the FAMRFW, compared to the ones obtained with the optimized ρa values. This is shown in Table 1, for all considered classification tasks. The feature weight values used in the FAMRFW are the ones mentioned in the following sections.

Table 1: Average PCC test set results using the optimized ρa (computed in the validation phase) vs. using ρa = 0.

                      FAMR                        FAMRFW
                      optimized ρa   ρa = 0       optimized ρa   ρa = 0
  Breast cancer       86.54%         91.22%       91.22%         91.22%
  Balance scale       75.86%         76.53%       75.92%         78.13%
  Wine recognition    83.33%         84.72%       83.79%         89.35%
  Ionosphere          85.44%         88.96%       85.91%         89.43%

5.2 Breast cancer classification

This dataset (formally called Wisconsin Diagnostic Breast Cancer) includes 569 instances. The instances are described by 30 real attributes. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

The feature weights generated for the FAMRFW are: [0.784, 0.816, 0.795, 2.847, 0.784, 0.784, 0.784, 0.784, 0.784, 0.784, 0.784, 0.784, 0.785, 0.808, 0.784, 0.784, 0.784, 0.784, 0.784, 0.784, 0.784, 0.829, 0.828, 5.047, 0.784, 0.784, 0.784, 0.784, 0.784, 0.784].

In Table 2, we observe that the average PCC for the FAMR and the FAMRFW is the same, but the FAMRFW uses far fewer ARTa categories than the FAMR.

Table 2: Classification performance for the Breast Cancer Problem.

  Test      FAMR                               FAMRFW
  no.       No. of ARTa categories   PCC       No. of ARTa categories   PCC
  1         61                       93.85%    24                       87.71%
  2         7                        90.35%    7                        93.85%
  3         10                       95.61%    8                        91.22%
  4         39                       85.08%    6                        88.59%
  5         6                        92.98%    6                        94.73%
  6         6                        89.47%    5                        91.22%
  Average   21.5                     91.22%    9.33                     91.22%

5.3 Balance scale classification

This dataset was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is to compare (left-distance * left-weight) and (right-distance * right-weight): the greater product gives the class, and if they are equal, the scale is balanced. The set contains 625 patterns, with an uneven distribution of the three classes; each input pattern has 4 features. The ESRNG generated feature weights are λ = [1.002, 1.113, 0.827, 1.058].
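For illustration, the labeling rule described above can be written as the following short Python sketch (ours, not part of the original paper); the attribute values in this dataset are integers from 1 to 5, which gives the 625 = 5^4 patterns:

    def balance_scale_class(left_weight, left_dist, right_weight, right_dist):
        # Ground-truth rule of the Balance Scale data: compare the two torques.
        left = left_weight * left_dist
        right = right_weight * right_dist
        if left > right:
            return "L"       # tips to the left
        if right > left:
            return "R"       # tips to the right
        return "B"           # balanced

    print(balance_scale_class(3, 4, 2, 5))   # "L": 12 > 10
    print(balance_scale_class(2, 2, 4, 1))   # "B": 4 == 4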
The FAMRFW has better classification accuracy and fewer ARTa categories than the FAMR (Table 3).

Table 3: Classification performance for the Balance Scale Problem.

  Test      FAMR                               FAMRFW
  no.       No. of ARTa categories   PCC       No. of ARTa categories   PCC
  1         95                       74.4%     53                       75.2%
  2         70                       80.0%     39                       80.0%
  3         22                       78.4%     54                       81.6%
  4         75                       75.2%     44                       85.6%
  5         125                      71.2%     69                       72.0%
  6         62                       80.0%     107                      74.4%
  Average   74.83                    76.53%    61                       78.13%

5.4 Wine recognition

The Wine recognition data are the results of a chemical analysis of wines grown in the same region in Italy, but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the 3 types of wines. The dataset contains 178 instances.

The ESRNG algorithm produced the weights λ = [0.900, 0.757, 0.659, 1.668, 2.349, 0.702, 1.028, 0.668, 0.774, 0.874, 0.666, 0.701, 1.253]. The FAMRFW classification results are better, with fewer generated ARTa categories (Table 4).

Table 4: Classification performance for the Wine Recognition Problem.

  Test      FAMR                               FAMRFW
  no.       No. of ARTa categories   PCC       No. of ARTa categories   PCC
  1         10                       88.88%    6                        86.11%
  2         15                       97.22%    10                       97.22%
  3         32                       69.44%    11                       86.11%
  4         17                       83.33%    11                       86.11%
  5         55                       80.55%    39                       94.44%
  6         12                       88.88%    8                        86.11%
  Average   23.5                     84.71%    14.16                    89.35%

5.5 Ionosphere

This binary classification problem starts from collected radar data. The data come from 16 high-frequency antennas targeting the free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those passing through the ionosphere. There are 351 instances, and each input pattern has 34 features.

The ESRNG generated λ vector is: [0.551, 0.520, 1.179, 1.168, 1.301, 1.180, 0.940, 1.272, 1.024, 0.903, 0.843, 0.976, 0.870, 0.844, 0.807, 0.877, 0.893, 1.012, 0.994, 1.012, 0.964, 1.061, 1.029, 1.227, 0.978, 1.020, 0.943, 1.027, 1.087, 1.032, 0.978, 1.117, 0.999, 1.374]. On average, the FAMRFW produced far fewer ARTa categories than the FAMR. This time, the difference in PCC between the two models is small, with the FAMRFW slightly ahead on average (Table 5).

Table 5: Classification performance for the Ionosphere Problem.

  Test      FAMR                               FAMRFW
  no.       No. of ARTa categories   PCC       No. of ARTa categories   PCC
  1         28                       81.69%    8                        90.14%
  2         20                       81.69%    8                        85.91%
  3         17                       91.54%    7                        83.09%
  4         9                        94.36%    8                        88.73%
  5         5                        90.14%    5                        94.36%
  6         9                        94.36%    5                        94.36%
  Average   14.66                    88.96%    6.83                     89.43%

6 Conclusions

According to our experiments, using the feature relevances and the generalized distance measure may improve the classification accuracy of the FAMR algorithm. In addition, the FAMRFW uses fewer ARTa categories, which is an important factor. The number of categories controls the generalization capability and the computational complexity of a FAM architecture. Generalization is a trade-off between overfitting and underfitting the training data. It is good to minimize the number of categories, as long as this does not decrease the classification accuracy too much.

The ESRNG feature weighting algorithm can be replaced by other weighting methods. We have not tested the function approximation capability of the FAMRFW neural network because the ESRNG weighting algorithm is presently restricted to classification tasks. LVQ methods can be extended to function approximation [23], and we plan to adapt the ESRNG algorithm in this sense. This would enable us to test the FAMRFW + ESRNG procedure on standard function approximation and prediction benchmarks.
Our approach is at the intersection of two major computational paradigms:

1. Carpenter and Grossberg's adaptive resonance theory, an advanced distributed model where parallelism is intrinsic to the problem, not just a means of speeding up computation [6].

2. Onicescu's informational energy and the unilateral dependency measure. To the best of our knowledge, we are the only ones using Onicescu's energy in neural processing systems.

Bibliography

[1] R. Andonie and A. Caţaron. Feature ranking using supervised neural gas and informational energy. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2005), Montreal, Canada, July 31 - August 4, 2005.

[2] R. Andonie, A. Caţaron, and L. Sasu. Fuzzy ARTMAP with feature weighting. In Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2008), Innsbruck, Austria, February 11-13, 2008, 91–96.

[3] R. Andonie and F. Petrescu. Interacting systems and informational energy. Foundation of Control Engineering, 11, 1986, 53–59.

[4] R. Andonie and L. Sasu. Fuzzy ARTMAP with input relevances. IEEE Transactions on Neural Networks, 17, 2006, 929–941.

[5] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, 2007. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html

[6] I. Dziţac and B. E. Bărbat. Artificial intelligence + distributed systems = agents. International Journal of Computers, Communications and Control, 4, 2009, 17–26.

[7] G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D. B. Rosen. Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 1992, 698–713.

[8] G. A. Carpenter, B. L. Milenova, and B. W. Noeske. Distributed ARTMAP: A neural network for fast distributed supervised learning. Neural Networks, 11, 1998, 793–813.

[9] G. A. Carpenter and W. Ross. ART-EMAP: A neural network architecture for learning and prediction by evidence accumulation. IEEE Transactions on Neural Networks, 6, 1995, 805–818.

[10] D. Charalampidis, G. Anagnostopoulos, M. Georgiopoulos, and T. Kasparis. Fuzzy ART and Fuzzy ARTMAP with adaptively weighted distances. In Proceedings of the SPIE, Applications and Science of Computational Intelligence, Aerosense, 2002.

[11] I. Dagher, M. Georgiopoulos, G. L. Heileman, and G. Bebis. An ordering algorithm for pattern presentation in Fuzzy ARTMAP that tends to improve generalization performance. IEEE Transactions on Neural Networks, 10, 1999, 768–778.

[12] I. Dagher, M. Georgiopoulos, G. L. Heileman, and G. Bebis. Fuzzy ARTVar: An improved fuzzy ARTMAP algorithm. In Proceedings of the IEEE World Congress on Computational Intelligence (WCCI'98), Anchorage, 1998, 1688–1693.

[13] J. C. Principe et al. Information-theoretic learning. In S. Haykin, editor, Unsupervised Adaptive Filtering. Wiley, New York, 2000.

[14] E. Gomez-Sanchez, Y. A. Dimitriadis, J. M. Cano-Izquierdo, and J. Lopez-Coronado. µARTMAP: Use of mutual information for category reduction in fuzzy ARTMAP. IEEE Transactions on Neural Networks, 13, 2002, 58–69.

[15] S. Guiaşu. Information Theory with Applications. McGraw Hill, New York, 1977.

[16] B. Hammer, D. Schunk, T. Bojer, and T. K. von Toschanowitz. Relevance determination in learning vector quantization. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN 2001), Bruges, Belgium, 2001, 271–276.
[17] B. Hammer, M. Strickert, and T. Villmann. Supervised neural gas with general similarity measure. Neural Processing Letters, 21, 2005, 21–44.

[18] B. Hammer and T. Villmann. Generalized relevance learning vector quantization. Neural Networks, 15, 2002, 1059–1068.

[19] C. P. Lim and R. Harrison. ART-Based Autonomous Learning Systems: Part I - Architectures and Algorithms. In L. C. Jain, B. Lazzerini, and U. Halici, editors, Innovations in ART Neural Networks. Springer, 2000.

[20] C. P. Lim and R. F. Harrison. An incremental adaptive network for on-line supervised learning and probability estimation. Neural Networks, 10, 1997, 925–939.

[21] S. Marriott and R. F. Harrison. A modified fuzzy ARTMAP architecture for the approximation of noisy mappings. Neural Networks, 8, 1995, 619–641.

[22] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. Neural-gas network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4, 1993, 558–569.

[23] S. Min-Kyu, J. Murata, and K. Hirasawa. Function approximation using LVQ and fuzzy sets. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Tucson, AZ, 2001, 1442–1447.

[24] O. Onicescu. Théorie de l'information. Énergie informationnelle. C. R. Acad. Sci. Paris, Ser. A-B, 263, 1966, 841–842.

[25] O. Parsons and G. A. Carpenter. ARTMAP neural networks for information fusion and data mining: map production and target recognition methodologies. Neural Networks, 16, 2003, 1075–1089.

[26] M.-T. Vakil-Baghmisheh and N. Pavešić. A Fast Simplified Fuzzy ARTMAP Network. Neural Processing Letters, 17, 2003, 273–316.

[27] S. J. Verzi, G. L. Heileman, M. Georgiopoulos, and M. J. Healy. Boosted ARTMAP. In Proceedings of the IEEE World Congress on Computational Intelligence (WCCI'98), 1998, 396–400.

[28] J. Williamson. Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9, 1996, 881–897.