INT J COMPUT COMMUN, ISSN 1841-9836, 7(5):957-967, December, 2012.

Function Approximation with ARTMAP Architectures

L.M. Sasu, R. Andonie

Lucian M. Sasu
1. Transilvania University of Braşov, Mathematics and Computers Department, Romania, 500091 Braşov, Iuliu Maniu, 50, lmsasu@unitbv.ro
2. Siemens Corporate Technology, Romania, 500096 Braşov, 15 Noiembrie, 46, E-mail: lucian.sasu@siemens.com

Răzvan Andonie
1. Computer Science Department, Central Washington University, 400 East University Way, Ellensburg, WA 98926, USA
2. Transilvania University of Braşov, Electronics and Computers Department, Romania, 500024 Braşov, Politehnicii, 1, E-mail: andonie@cwu.edu

Abstract: We analyze the function approximation (regression) capability of Fuzzy ARTMAP (FAM) architectures - well-known incremental learning neural networks. We focus especially on the universal approximation property. In our experiments, we compare the regression performance of FAM networks with that of other standard neural models. This is the first time that ARTMAP regression is overviewed, both from theoretical and practical points of view.

Keywords: fuzzy ARTMAP, universal approximation, regression.

1 Introduction

The approximation of functions that are known only at a certain number of discrete points is a classical application of neural networks. Almost all approximation schemes can be mapped into some kind of network that can be dubbed a "neural network" [1]. A neural network has the universal approximation property if it can approximate with arbitrary accuracy an arbitrary function from a certain set of functions (usually the set of continuous functions) on a compact domain. The drawback is that such an approximation may need an unbounded number of "building blocks" (i.e., fuzzy sets or hidden neurons) to achieve the prescribed accuracy. Therefore it is reasonable to make a trade-off between accuracy and the number of building blocks, by determining the functional relationship between them.
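For concreteness, the universal approximation property invoked throughout this paper can be stated formally as follows. This is a standard formulation in our own wording; $\mathcal{F}$ denotes the family of functions computable by the network architecture and $K$ a compact domain:

$$\forall f \in C(K),\ \forall \varepsilon > 0,\ \exists g \in \mathcal{F}: \quad \sup_{x \in K} \| f(x) - g(x) \| < \varepsilon$$

In other words, the network-computable functions are dense in $C(K)$ with respect to the uniform norm.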
Historically, of fundamental importance was the discovery [2] that a classical mathematical result of Kolmogorov (1957) is actually a statement that for any continuous mapping $f : [0,1]^n \subset \Re^n \rightarrow \Re^m$ there must exist a three-layered feedforward neural network of continuous-type neurons that implements $f$ exactly. This existence result was the first step. Cybenko [3] showed that any continuous function defined on a compact subset of $\Re^n$ can be approximated to any desired degree of accuracy by a feedforward neural network with one hidden layer using sigmoidal nonlinearities. Many other papers have investigated the approximation capability of three-layered networks in various ways. In addition to sigmoid functions, more general functions can be used as activation functions of universal approximator feedforward networks [4]. Girosi and Poggio proved that radial basis function (RBF) networks also have the universal approximation property [1]. Hartman et al. [5] proved that a one hidden layer neural network with Gaussian hidden nodes is a universal approximator for real-valued maps defined on convex, compact sets of $\Re^n$. Additional related papers are [6] and [7].

The Fuzzy ARTMAP (FAM) family of neural networks is one of the best known incremental learning systems. There are many variations of the initial FAM model of Carpenter et al. [8], including Gaussian ARTMAP (GAM) [9], PROBART [10], FAMR [11], GART [12], [13], and AppART [14].

Compared to FAM classification, the function approximation (regression) capability of FAM has been less frequently addressed. Our goal here is to discuss the regression capability of different FAM architectures. The FAM maps subsets of $\Re^n$ to $\Re^m$, accepting both binary and analogue inputs in the form of pattern pairs. The initial FAM, PROBART, and the FAMR architectures have been used for incremental regression estimation. Since the initial FAM was proved to be a universal approximator [15], it is reasonable to believe that members of the FAM family may also have the universal approximation capability. However, since some of the FAM variations are quite different from the initial FAM, each model should be considered individually.

The Bayesian theory allows for the elaboration of general neural network training methods [16]. Recently, Vigdor and Lerner have combined the Bayesian theory and the FAM, introducing the Bayesian ARTMAP (BA) [17]. Like the GAM and the GART networks, during training the BA uses Gaussian categories and FAM competitive learning. However, the BA prediction phase is very different from the FAM competitive algorithm, being a Bayesian approach. Vigdor and Lerner have compared the BA with the FAM with respect to classification accuracy, learning curves, number of categories, sensitivity to class overlapping, and risk. Generally, the BA outperformed the FAM in classification tasks. Prior to our contribution, the BA regression capability had not been discussed or tested.

Our paper is the first overview of both theoretical and practical aspects of FAM regression, considering several major FAM architectures: the initial FAM of Carpenter et al., PROBART, FAMR, BA, GAM, and AppART. We discuss the universal approximation capabilities of these FAM models. In our experiments, we compare the regression performance of FAM networks with standard neural networks: the Multi Layer Perceptron (MLP), RBF, the General Regression Neural Network (GRNN), and FasBack. Section 2 reviews the main notations and paradigms of FAM. In Section 3, we discuss the universal approximation capability of the following FAM architectures: the original FAM, PROBART, FAMR, BA, and AppART. We synthesize our comparative experiments in Section 4. Section 5 contains the final remarks.

2 Fuzzy ARTMAP

A FAM consists of a pair of fuzzy ART modules, $ART_a$ and $ART_b$, connected by an inter-ART module called Mapfield, $F^{ab}$. $ART_a$ contains a preprocessing layer $F^a_0$, an input (or short-term memory) layer $F^a_1$, and a competitive layer $F^a_2$. The following notations apply: $M_a$ is the number of nodes in $F^a_1$, $N_a$ is the number of nodes in $F^a_2$, and $w^a$ is the weight vector between $F^a_1$ and $F^a_2$. We say that a node - also called a category - from $F^a_2$ is uncommitted if it has not yet learned an input pattern, and committed otherwise. Analogous layers and notations are used in $ART_b$. Each node $j$ from $F^a_2$ is linked to each node from $F^b_2$ via a weight vector $w^{ab}_j$ from $F^{ab}$, the $j$th row of the matrix $w^{ab}$, $1 \le j \le N_a$. All weights are initialized to 1.

All input vectors are complement-coded by the $F^a_0$ layer in order to avoid category proliferation [8], [18], [19]: the input vector $a = (a_1, \dots, a_n) \in [0,1]^n$ produces the normalized vector $A = (a_1, \dots, a_n, 1 - a_1, \dots, 1 - a_n)$. During pattern processing, the operator $\wedge$ is the fuzzy AND operator defined as $(p \wedge q)_i = \min(p_i, q_i)$, where $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$. $|\cdot|$ denotes the $L_1$ norm.
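To fix the notation, here is a small NumPy sketch (our own, for illustration only) of complement coding, the fuzzy AND operator, and the $L_1$ norm used throughout the paper; the function names are not taken from the cited works.

import numpy as np

def complement_code(a):
    """Complement coding: a in [0, 1]^n -> A = (a_1, ..., a_n, 1 - a_1, ..., 1 - a_n)."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

def fuzzy_and(p, q):
    """Component-wise fuzzy AND: (p ^ q)_i = min(p_i, q_i)."""
    return np.minimum(p, q)

def l1_norm(p):
    """L1 norm |p|; the components handled here are non-negative."""
    return float(np.sum(np.abs(p)))

# A complement-coded vector always has L1 norm equal to n,
# which is what prevents category proliferation.
A = complement_code([0.2, 0.7, 0.5])
assert abs(l1_norm(A) - 3) < 1e-12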
Before learning a normalized input vector $A$, the vigilance parameter $\rho_a$ is reset to its baseline value $\bar{\rho}_a$ and no input category is inhibited; all categories compete for the current input pattern. A fuzzy choice function is computed for every $ART_a$ category: $T_j(A) = \frac{|A \wedge w^a_j|}{\alpha_a + |w^a_j|}$, for $1 \le j \le N_a$. The non-inhibited node of index $J$ having the maximum fuzzy choice function value is then checked against the resonance condition, i.e. whether the input is similar enough to the winner's prototype: $|A \wedge w^a_J| / |A| \ge \rho_a$. If this condition is not fulfilled, then the node with index $J$ is inhibited and another non-inhibited node maximizing the fuzzy choice function is considered as above. If no such node exists, a new node with index $J$ is created to represent the input vector.

In parallel, a similar step is performed in the $ART_b$ module; we obtain the output vector $y^b = (\delta_{iK})_{1 \le i \le N_b}$, where $K$ is the index of the output winner node ($1 \le K \le N_b$) and $\delta_{ij}$ is Kronecker's delta. If input node $J$ is newly added, then we associate it with the current output: $w^{ab}_{Jk} = \delta_{kK}$, and this association becomes permanent. Each time input node $J$ is activated, it predicts as output value the only index $k$ for which $w^{ab}_{Jk} = 1$. If node $J$ is not new, then we check whether its predicted value is $K$. If the prediction is incorrect, a new process (called match tracking) is triggered, solely in $ART_a$. Otherwise, learning occurs in both $ART_a$ and $ART_b$:

$w^{a(new)}_J = \beta_a \left( A \wedge w^{a(old)}_J \right) + (1 - \beta_a)\, w^{a(old)}_J$   (1)

where $\beta_a \in (0, 1]$ is the learning rate parameter. A similar learning step takes place in $ART_b$. Match tracking raises the $\rho_a$ threshold for the current input pattern: $\rho_a = \delta + |A \wedge w^a_J|/|A|$. If $\rho_a > 1$ then the current input pattern is rejected; otherwise, the search for an appropriate input category continues, as described above.

For each $F^a_2$ category we have the following geometrical interpretation. Node $w^a_j$ is a hyperrectangle $R_j$ inside the $n$-dimensional hypercube, having size $n - |w^a_j|$ [8]. Learning, as in equation (1), is equivalent to expanding the hyperrectangle towards the current input pattern, unless this pattern is already in $R_j$. If $\beta_a = 1$, then $R_j$ expands to $R_j \oplus a$, the minimal hyperrectangle containing both $R_j$ and the input pattern $a$. A similar geometrical interpretation applies to $ART_b$.
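To make the above dynamics concrete, the following Python sketch implements a single $ART_a$ category search with the fuzzy choice function, the resonance test, and the learning rule of equation (1). It is a minimal illustration written by us, not the authors' code; it omits the $ART_b$ module, the Mapfield, and match tracking.

import numpy as np

def art_a_step(A, weights, rho_a=0.75, alpha_a=0.001, beta_a=1.0):
    """One ART_a category search/learning step for a complement-coded input A.

    weights: list of category prototype vectors w_j (same length as A).
    Returns the index of the resonating (or newly created) category.
    """
    inhibited = set()
    while True:
        # Fuzzy choice function T_j(A) = |A ^ w_j| / (alpha_a + |w_j|)
        scores = [
            -np.inf if j in inhibited
            else np.minimum(A, w).sum() / (alpha_a + w.sum())
            for j, w in enumerate(weights)
        ]
        if not weights or max(scores) == -np.inf:
            # No admissible category: recruit a new one, committed to A
            # (fast commitment, equivalent to learning A from the all-ones prototype).
            weights.append(A.copy())
            return len(weights) - 1
        J = int(np.argmax(scores))
        # Resonance test: |A ^ w_J| / |A| >= rho_a
        if np.minimum(A, weights[J]).sum() / A.sum() >= rho_a:
            # Learning rule (1): w_J <- beta_a (A ^ w_J) + (1 - beta_a) w_J
            weights[J] = beta_a * np.minimum(A, weights[J]) + (1 - beta_a) * weights[J]
            return J
        inhibited.add(J)  # resonance failed: inhibit J and search again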
3 FAM Architectures used in Regression

3.1 The initial FAM for regression

The FAM regression capability was first tested by Carpenter et al. for univariate real functions [8]. Input categories were considered to predict not real values, but real intervals. The experiments targeted the geometry of the predicted output intervals and the number of resulting categories for various values of $\rho_b$. For the test set, the authors counted the matchings between predicted output categories and actual output values. A matching between $f(a)$ and the predicted output category (a rectangle) $R^b_K$ was established if the size of $R^b_K \oplus f(a)$ did not exceed $(1 - \rho_b)$. As expected, the number of matchings increased with $\rho_b$.

Verzi et al. [15] proved that a slightly modified FAM version can be used to universally approximate any measurable function in $L^p([0,1])$. More specifically, given $1 \le p < \infty$, for every $f \in L^p([0,1])$, $f \ge 0$, a sequence of FAM-computable functions $s_n$ with the following property was determined: the functions $s_n$ approximate $f$ in the limit and are dense in $L^p([0,1])$. One can extend this result to the initial FAM.

3.2 PROBART for function approximation

PROBART is a modification of FAM motivated by empirical findings on the operational characteristics of FAM under certain conditions [10]. The authors replaced the FAM Mapfield update rule by

$w^{ab}_J = \begin{cases} y^b + w^{ab}_J & \text{if the } J\text{-th } F^a_2 \text{ node is active and } F^b_2 \text{ is active} \\ w^{ab}_J & \text{if the } J\text{-th } F^a_2 \text{ node is active and } F^b_2 \text{ is inactive} \end{cases}$   (2)

Thus, $w^{ab}_{jk}$ indicates the number of associations between the $j$-th $ART_a$ node and the $k$-th $ART_b$ node. Initially, $w^{ab}_{jk} = 0$, i.e. no association has been made yet. There is no match tracking phase.

The predicted value for an input pattern activating the $J$th $ART_a$ category is

$\mu_{Jl} = \frac{1}{|w^{ab}_J|} \sum_{k=1}^{N_b} \epsilon_{kl} w^{ab}_{Jk}, \quad 1 \le l \le M_b$   (3)

where $\mu_{Jl}$ is the expected value of the $l$-th component of the predicted output pattern associated with the current input pattern, $|w^{ab}_J|$ is the total number of associations between the $J$-th $ART_a$ category and the categories from $ART_b$, and $\epsilon_{kl}$ is associated with the $k$th $ART_b$ category. Specifically, for PROBART the authors considered $\epsilon_{kl}$ to be the $l$th component of the $k$th $ART_b$ category exemplar. Only the first $m$ components of each output category $w^b_k$ are meaningful for computing the prediction corresponding to the current input pattern. Equation (3) can be written as $\mu_{Jl} = \sum_{k=1}^{N_b} \epsilon_{kl} p_{Jk}$, where $p_{Jk}$ is the empirically estimated association probability between the $J$th $ART_a$ category and the $k$th $ART_b$ category: $p_{Jk} = w^{ab}_{Jk} / |w^{ab}_J|$.

3.3 The FAMR Model for Function Approximation

The FAMR (Fuzzy ARTMAP with Relevance factor), a version of the FAM, has a novel learning mechanism. We review here the basic FAMR notations (details in [11]) and discuss its function approximation capabilities. The main difference between the FAMR and the initial FAM is the update method of the $w^{ab}_{jk}$ weights. The FAMR uses the following updating formula [11]:

$w^{ab(new)}_{jk} = \begin{cases} w^{ab(old)}_{jk} & \text{if } j \ne J \\ w^{ab(old)}_{JK} + \frac{q_t}{Q^{new}_J}\left(1 - w^{ab(old)}_{JK}\right) & \text{if } j = J,\ k = K \\ w^{ab(old)}_{Jk}\left(1 - \frac{q_t}{Q^{new}_J}\right) & \text{if } j = J,\ k \ne K \end{cases}$   (4)

where $q_t$ is the relevance assigned to the $t$-th input pattern ($t = 1, 2, \dots$) and $Q^{new}_J = Q^{old}_J + q_t$. The relevance $q_t$ is a real positive finite number directly proportional to the importance of the experiment considered at step $t$. Initially, each $Q_j$ ($1 \le j \le N_a$) has the same initial value $q_0$.

To maintain the stochastic nature of each $w^{ab}_j$ row in Mapfield, we modified the Mapfield dynamics: when a new input category is created, a new row filled with $1/N_b$ is added to $w^{ab}$; when a new $ART_b$ category indexed by $K$ is added, each existing input category is linked to it by $w^{ab}_{jK} = \frac{q_0}{N_b Q_j}$, and the rest of the elements $w^{ab}_{jk}$ are decreased by $\frac{w^{ab}_{jK}}{N_b - 1}$, for $1 \le j \le N_a$, $1 \le k \le N_b$, $k \ne K$. The update in eq. (4) preserves the stochastic property of each row. Finally, the vigilance test is changed to: $N_b\, w^{ab}_{JK} \ge \rho_{ab}$. According to [11], this $w^{ab}_{jk}$ approximation is a correct, biased estimator of the posterior probability $P(k|j)$, the probability of selecting the $k$-th $ART_b$ category after having selected the $j$-th $ART_a$ category.

To estimate the corresponding output value for a given input pattern, the FAMR uses the same formula as in eq. (3), but in this case $\epsilon_k$ contains the coordinates of the $k$th $ART_b$ category centroid. During the FAMR training process, the $l$-th component of the centroid of the winning output category $K$ can be updated by Kohonen's learning rule: $\epsilon^{b(new)}_{Kl} = \epsilon^{b(old)}_{Kl} + (b_l - \epsilon^{b(old)}_{Kl})/size^b_K$. This rule incorporates an idea from [20]. The value $size^b_K$ is the number of output vectors of the $K$-th $ART_b$ category, and $b_l$ is the $l$-th component of $b$, the output vector of the current training pair $(a, b)$.
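As an illustration of the prediction step shared by PROBART and the FAMR (equation (3)), the following sketch computes the expected output from the Mapfield row of the winning input category. It assumes the counting Mapfield of equation (2); for the FAMR, the row would already be stochastic. The function name and array layout are ours, introduced only for illustration.

import numpy as np

def probart_predict(w_ab_J, epsilon):
    """Expected output for the winning ART_a category J (equation (3)).

    w_ab_J : Mapfield row of category J; w_ab_J[k] counts co-activations
             with the k-th ART_b category (PROBART counting rule, equation (2)).
    epsilon: (N_b x M_b) matrix; epsilon[k, l] is the l-th component of the
             k-th ART_b category exemplar (or centroid, for the FAMR).
    """
    p_J = w_ab_J / w_ab_J.sum()   # empirical association probabilities p_Jk
    return p_J @ epsilon          # mu_Jl = sum_k epsilon_kl * p_Jk

# Example with 3 output categories and a scalar output (M_b = 1):
w_ab_J = np.array([5.0, 1.0, 0.0])
epsilon = np.array([[0.2], [0.8], [0.5]])
print(probart_predict(w_ab_J, epsilon))   # 0.3: average dominated by the first exemplar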
3.4 The Bayesian ARTMAP Function Approximation Algorithm

In the BA, in contrast to the FAM, $w^a_j$ is not a weight vector (a prototype), but simply a category label. Also, the ART categories are Gaussians, similar to the GAM. Each BA category $j$ is characterized by the $n$-dimensional mean vector $\hat{\mu}^a_j$, the $n \times n$ covariance matrix $\hat{\Sigma}^a_j$, and the count $n^a_j$ of training patterns clustered to category $j$. Analogous notations appear in $ART_b$, where the vectors are $m$-dimensional. The associations between input and output categories are stored inside the Mapfield module, as in PROBART, and one can approximate the conditional probability $P(w^b_k | w^a_j)$ as $\hat{P}(w^b_k | w^a_j) = w^{ab}_{jk} / \sum_{l=1}^{N_b} w^{ab}_{jl}$.

The following description uses $ART_a$ notations; analogous notations are used for $ART_b$. All existing $ART_a$ categories compete to represent the current input pattern. The posterior probability of category $j$ given input $a$ is estimated according to Bayes' theorem:

$\hat{P}(w^a_j | a) = \frac{\hat{p}(a | w^a_j)\, \hat{P}(w^a_j)}{\sum_{i=1}^{N_a} \hat{p}(a | w^a_i)\, \hat{P}(w^a_i)}$   (5)

where $\hat{P}(w^a_j)$ is the estimated prior probability of the $j$-th $ART_a$ category, $\hat{P}(w^a_j) = n^a_j / \sum_{i=1}^{N_a} n^a_i$. The conditional probability $p(a | w^a_j)$ is estimated using all patterns already associated with the Gaussian category $w^a_j$:

$\hat{p}(a | w^a_j) = \frac{1}{(2\pi)^{n/2} |\hat{\Sigma}^a_j|^{1/2}} \exp\left\{ -\frac{1}{2} (a - \hat{\mu}^a_j)^t (\hat{\Sigma}^a_j)^{-1} (a - \hat{\mu}^a_j) \right\}$   (6)

During the category choice step in $ART_a$, the winning category $J$ is the one maximizing the posterior probability $\hat{P}(w^a_j | a)$. The following vigilance test is performed: $S^a_J \le S^a_{MAX}$, where $S^a_J = |\hat{\Sigma}^a_J|$ is the hyper-volume of the winning category and $S^a_{MAX}$ is an upper bound threshold. While processing a training pattern, $S^a_{MAX}$ may decrease from its initial value. In contrast, $S^b_{MAX}$ remains unchanged. Every newly recruited category inside the $ART_a$ ($ART_b$) module is centered in the current pattern and has its initial covariance matrix set to $\lambda (S^a_{MAX})^{1/n} \cdot I_n$ (and $\lambda (S^b_{MAX})^{1/m} \cdot I_m$, respectively), where $\lambda$ is a small positive constant. This is done when none of the categories fulfills the vigilance test. Adding a new input (output) category triggers the addition of a new zero-filled row (column) to the association matrix $w^{ab}$.

If the connection strength $\hat{P}(w^b_K | w^a_J)$ between the winning categories $w^a_J$ and $w^b_K$ is below a fixed threshold $P_{min}$, then $S^a_{MAX}$ is slightly decreased below the current winner input category's $S_J$, and the quest for another input category continues. Otherwise, if the current winner input category was not newly added while processing the current pattern, $ART_a$ learns the current pattern:

$\hat{\mu}^a_{J(new)} = \frac{n^a_J}{n^a_J + 1} \hat{\mu}^a_{J(old)} + \frac{1}{n^a_J + 1} a$   (7)

$\hat{\Sigma}^a_{J(new)} = \frac{n^a_J}{n^a_J + 1} \hat{\Sigma}^a_{J(old)} + \frac{1}{n^a_J + 1} (a - \hat{\mu}^a_{J(new)})(a - \hat{\mu}^a_{J(new)})^t * I_n$   (8)

$n^a_{J(new)} = n^a_{J(old)} + 1$   (9)

Unless $w^b_K$ is a category newly added for the current training pattern, an analogous learning process takes place in $ART_b$. Finally, the Mapfield association counter $w^{ab}_{JK}$ is updated.

After learning, the BA can be used for prediction. We estimate the probabilistic association of an output category $w^b_k$ with an input test pattern $a$:

$\hat{P}(w^b_k | a) = \frac{\sum_{j=1}^{N_a} \hat{P}(w^b_k | w^a_j)\, \hat{p}(a | w^a_j)\, \hat{P}(w^a_j)}{\sum_{l=1}^{N_b} \sum_{j=1}^{N_a} \hat{P}(w^b_l | w^a_j)\, \hat{p}(a | w^a_j)\, \hat{P}(w^a_j)}$   (10)

As in [21], we assume the conditional independence of activating categories $w^b_k$ and $w^a_j$, given the input pattern $a$. For function approximation, the following average formula is used:

$\hat{f}(a) = \sum_{k=1}^{N_b} \hat{P}(w^b_k | a) \cdot \hat{\mu}^b_k$   (11)
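The BA prediction phase of equations (10) and (11) can be sketched as below. For simplicity we use SciPy's multivariate normal density for $\hat{p}(a|w^a_j)$ and assume that the counters, means, covariances, and the Mapfield matrix have already been learned. This is our own illustrative code, not the implementation used in the experiments.

import numpy as np
from scipy.stats import multivariate_normal

def ba_predict(a, mu_a, sigma_a, n_a, w_ab, mu_b):
    """Bayesian ARTMAP regression estimate f_hat(a), equations (10)-(11).

    mu_a   : (N_a, n)    means of the ART_a Gaussian categories
    sigma_a: (N_a, n, n) covariance matrices of the ART_a categories
    n_a    : (N_a,)      pattern counts per ART_a category
    w_ab   : (N_a, N_b)  Mapfield association counters
    mu_b   : (N_b, m)    means of the ART_b categories
    """
    prior = n_a / n_a.sum()                                   # P_hat(w_a_j)
    lik = np.array([multivariate_normal.pdf(a, mean=mu_a[j], cov=sigma_a[j])
                    for j in range(len(n_a))])                # p_hat(a | w_a_j)
    p_b_given_a = w_ab / w_ab.sum(axis=1, keepdims=True)      # P_hat(w_b_k | w_a_j)
    joint = (lik * prior)[:, None] * p_b_given_a              # numerator terms of (10)
    post_b = joint.sum(axis=0) / joint.sum()                  # P_hat(w_b_k | a)
    return post_b @ mu_b                                      # f_hat(a), equation (11)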
Since, under certain mild conditions on the kernel function, RBF networks are universal approximators [1], [5], [6], [7], and the FAM also has the universal approximation capability [15], it looks natural for the BA, which is essentially a FAM architecture with Gaussian categories, to be a universal approximator. However, this statement cannot be directly deduced from the RBF and FAM results. This was a good reason for us to prove the following theoretical result [22]:

Theorem 1. The BA is a universal approximator on a compact set $X \subset \Re^n$.

3.5 AppART: Hybrid Stable Learning for Universal Function Approximation

AppART [14] is an ART-based neural network model that incrementally approximates continuous-valued multidimensional functions through a higher-order Nadaraya-Watson regression. An input pattern $x$ is fed forward from the input layer $F_1$ to the $F_2$ layer. The $F_2$ layer consists of $N$ categories, modeling a local density of the input space using Gaussian receptive fields with mean $\mu_j$ and standard deviation $\sigma_j$. A match criterion is used to detect whether the current learning pattern activates an existing $F_2$ category or a new one should be added. The match function is:

$G_j = \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{x_i - \mu_{ji}}{\sigma_{ji}} \right)^2 \right), \quad 1 \le j \le N$   (12)

If all $G_j$ values are below the threshold $\rho_{F_2}$, a new node is recruited to represent the current input pattern. Otherwise, the input strength of each $F_2$ node is computed as $g_j = I(G_j > \rho_{F_2}) \cdot \left(\eta_j G_j / \prod_{i=1}^{n} \sigma_{ji}\right)$, where $\eta_j$ is a measure of the prior activation probability of the $j$th category and $I$ is the binary indicator function: $I(P) = 1$ iff $P$ is true. The activation values $v_j$ of the $F_2$ nodes are obtained by normalizing $g_j$. One can use $v_j$ as an approximation of the posterior probability $P(j|x)$ of category $j$ given the input pattern $x$.

The P and O layers together compute the prediction of the network. In the P layer, there are $m + 1$ nodes whose corresponding values are computed as: $a_k = \sum_{j=1}^{N} \alpha_{kj} v_j$ ($1 \le k \le m$) and $b = \sum_{j=1}^{N} \beta_j v_j$, where $\alpha_{kj}$ and $\beta_j$ are weights connecting each $F_2$ category to each node in the P layer. Each $\alpha_{kj}$ is the sum of values of output feature $k$, learned when the $j$th $F_2$ node was active. $\beta_j$ counts how many patterns the $j$th $F_2$ category has learned. The output layer O has $m$ output nodes, whose predictions are $o_k = I(b > 0) \cdot a_k / b$.

Incorrect predictions are detected by comparing a threshold $\rho_O$ with the degree of closeness between the prediction of the network and the desired output. If an incorrect prediction is produced, a match tracking mechanism (similar to the one in FAM) is triggered. This might produce a new $F_2$ node or find a more suitable node for the current input pattern. The learning process updates $\mu_j$, $\sigma_j$, $\eta_j$, $\alpha_j$ and $\beta_j$:

$\eta_j(t + 1) = \eta_j(t) + v_j$, $\quad \mu_{ji}(t + 1) = (1 - \eta_j^{-1} v_j)\mu_{ji}(t) + \eta_j^{-1} v_j x_i$

$\lambda_{ji}(t + 1) = (1 - \eta_j^{-1} v_j)\lambda_{ji}(t) + \eta_j^{-1} v_j x_i^2$, $\quad \sigma_{ji}(t + 1) = \sqrt{\lambda_{ji}(t + 1) - \mu_{ji}(t + 1)^2}$

$\alpha_{kj}(t + 1) = \alpha_{kj}(t) + \epsilon^{-1} v_j y_k$, $\quad \beta_j(t + 1) = \beta_j(t) + \epsilon^{-1} v_j$

A common value $\gamma_i = \gamma_{common}$ may be used for the standard deviation of all input features. An important theoretical result for AppART is [14]:

Theorem 2. AppART with $\rho_{F_2} = 0$, $\rho_O = 0$ and $\gamma_i = \gamma_{common}$, $1 \le i \le n$, behaves as a GRNN.

Since the GRNN can be viewed as a normalized RBF expansion, one can transitively apply to AppART two important properties of RBF networks: the universal approximation and the best approximation properties [1].
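The AppART forward pass described above can be summarized by the following sketch of the $F_2$ activation (equation (12)) and the P/O-layer prediction; node recruitment and match tracking are left out. The variable names follow the text, but the code itself is only our illustration under these simplifying assumptions.

import numpy as np

def appart_predict(x, mu, sigma, eta, alpha, beta, rho_F2=0.0):
    """AppART prediction o_k for input x (F2 activation + P/O layers).

    mu, sigma: (N, n) Gaussian receptive field parameters of the F2 nodes
    eta      : (N,)   prior activation measures
    alpha    : (m, N) P-layer weights (sums of learned output values)
    beta     : (N,)   P-layer counters
    """
    # Match function G_j, equation (12)
    G = np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2, axis=1))
    # Input strengths g_j = I(G_j > rho_F2) * eta_j * G_j / prod_i sigma_ji
    g = (G > rho_F2) * eta * G / np.prod(sigma, axis=1)
    v = g / g.sum() if g.sum() > 0 else g        # normalized activations v_j
    a = alpha @ v                                # a_k = sum_j alpha_kj v_j
    b = beta @ v                                 # b   = sum_j beta_j v_j
    return a / b if b > 0 else np.zeros_like(a)  # o_k = I(b > 0) a_k / b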
4 Experimental Results

For the first test, we consider the function [10] $f : [0, 1] \rightarrow [0, 1]$ defined by $f(x) = \left(10 + \sum_{t=1}^{7} \sin(10 t x)\right)/20$. We use independent, randomly generated datasets for training, validation and testing, consisting of 800, 200, and 1000 patterns, respectively. Each training pattern is an $(x, f(x))$ input-output pair. The testing set was not used in the training phase, but only to assess generalization performance.

The BA parameters $S^a_{MAX}$, $S^b_{MAX}$, and $P_{min}$ are optimized on the validation set by trial and error, for $S^a_{MAX}, S^b_{MAX} \in \{10^{-3}, 5 \cdot 10^{-4}, 10^{-4}, 5 \cdot 10^{-5}, 10^{-5}, 5 \cdot 10^{-6}\}$ and $P_{min} \in \{0, 0.1, \dots, 0.9\}$. The BA with optimized parameters (i.e., generating the lowest RMSE on the validation set) was trained on the training+validation dataset. The generalization performance of the trained BA was assessed on the testing set in two ways (see Table 1):

1. "BA(1)" corresponds to a BA network with an unbounded number of categories.

2. For "BA(2)", we considered only BA models with a number of input categories similar to that of PROBART.

             No. of ART_a categories   No. of ART_b categories    RMSE
  FAM                 312                       53               0.0074
  PROBART             110                       53               0.0169
  BA(1)               185.6                     57.8             0.0076
  BA(2)               111.0                     35.8             0.0106

Table 1: FAM, PROBART, and BA performance for regression on data generated by function f.

The RMSE for BA(1) and BA(2) was averaged over five different runs, each time using randomly generated training, validation, and test sets. The results for PROBART and FAM are from [10]. The FAM in our experiments is Carpenter's initial FAM version. The BA(1) results are very similar to the FAM results, but with a considerably smaller number of input categories. On average, BA(2) produced one more input category than PROBART, while improving the RMSE by 40.23%. It is quite difficult to directly compare the resulting BA(2) and FAM, since BA(2) has 64.42% fewer input categories than the FAM. Considering both the RMSE score and the number of input categories, we may conclude that, for this experiment, the BA performs better than the FAM and PROBART.

In the second test, we use the fifth-order chirp function [14]: $g(x) = 0.5 + 0.5 \sin(40 \pi x^5)$. Marti et al. have experimentally compared the function approximation performance of the following neural models [14]: AppART, Multi Layer Perceptron (MLP), RBF, General Regression Neural Network (GRNN), FAM, GAM, PROBART, and FasBack [23]. The reported score was the mean squared error (MSE). The authors ran the training algorithms for several epochs. The data set consisted of 10000 points $x \in [0, 1]$, of which 70% were used for training and the rest for testing. The cited paper does not fully describe the parameter values used for each of the networks.

In our experiment, we partition a dataset of 10000 patterns into a 4000-pattern training set, a 3000-pattern validation set, and a 3000-pattern testing set. We perform a trial and error search for $S^a_{MAX}, S^b_{MAX} \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and $P_{min} \in \{0, 0.1, \dots, 0.9\}$. The values producing the best MSE on the validation set are used to train the BA on the train+validation dataset, and the testing set MSE is reported. The above procedure is repeated five times, for randomly generated datasets. We only use single epoch training.

Table 2 contains the results for MLP, RBF, GRNN, FAM, GAM, PROBART, FasBack, AppART and BA. For the first eight neural networks the results are from [14].

  Model      MSE      Training epochs
  MLP        0.4362   30000+
  RBF        0.2701   10000
  GRNN       0.1540   150
  FAM        0.1802   140
  GAM        0.1521   45
  PROBART    0.1435   50
  FasBack    0.0915   10000
  AppART     0.0803   30
  BA         0.0086   1

Table 2: BA vs. other neural networks: generalization performance for data generated by function g.

BA produces a very good MSE score for this regression task, most likely due to the optimized parameter values obtained by trial and error. Comparing the BA MSE score, obtained by single epoch training, with those reported in [14], where multi-epoch training was used, we can state that the BA clearly performs better.
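The trial-and-error parameter selection used in both experiments amounts to a grid search on the validation set; a minimal sketch is given below. Here train_ba and rmse are placeholders for a BA training routine and the error measure (they are not part of the paper), and the grids match the first experiment.

from itertools import product

def select_ba_parameters(train_set, valid_set, train_ba, rmse):
    """Grid search for S^a_MAX, S^b_MAX and P_min on a validation set.

    train_ba(train_set, s_a_max, s_b_max, p_min) -> trained model (placeholder)
    rmse(model, valid_set) -> validation error (placeholder)
    """
    s_grid = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6]
    p_grid = [k / 10 for k in range(10)]          # 0, 0.1, ..., 0.9
    best = None
    for s_a, s_b, p_min in product(s_grid, s_grid, p_grid):
        model = train_ba(train_set, s_a, s_b, p_min)
        err = rmse(model, valid_set)
        if best is None or err < best[0]:
            best = (err, s_a, s_b, p_min)
    return best   # lowest validation RMSE and the corresponding parameters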
5 Conclusions

Theoretical universal approximation results were obtained for several FAM architectures:

• Explicit results were obtained for a slight variation of the initial FAM and for the BA.

• Implicit results were derived by association with other networks: FAMR, PROBART, and AppART.

The result showing FAM networks to be universal approximators is an important fact in establishing the utility of FAM architectures. A learning algorithm which is known to be a universal approximator can be applied to a large class of interesting problems with the confidence that a solution is at least theoretically available. Experimentally, FAM architectures performed well compared to other neural function approximators.

The FAM model, as well as other universal approximators, suffers from the curse of dimensionality, as defined by Bellman [24]: an exponentially large number of ART categories may be required to reach a final solution. Therefore, the universal approximation capability of a network is generally an existential result, not a constructive procedure to obtain a guaranteed compact network approximation of an arbitrary function. An important problem we have not addressed here is that of determining the network parameters so that a prescribed degree of approximation is achieved (see [25]).

The FAM and its offspring are incremental learning models. Therefore, they may be used for fast approximation of massive streaming input data. This may be a serious plus when compared to other neural predictors. How could a neural posterior probability estimator, like the BA, be used in risk assessment and decision theory? One possibility would be to combine the inferred posterior probabilities with a loss function, as suggested for a more general framework in [26]. This way, we could obtain an incremental learning risk assessment tool capable of quickly processing large amounts of data.

Bibliography

[1] Girosi, F.; Poggio, T. (1989); Networks and the Best Approximation Property, Biological Cybernetics, 63: 169-176.

[2] Hecht-Nielsen, R. (1987); Kolmogorov's mapping neural network existence theorem, Proceedings of the IEEE First Annual International Conference on Neural Networks, 3: III-11-III-14.

[3] Cybenko, G. (1992); Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems, 5(4): 455-455.

[4] Chen, T.; Chen, H. (1995); Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Transactions on Neural Networks, 6(4): 911-917.

[5] Hartman, E.; Keeler, J.D.; Kowalski, J.M. (1990); Layered neural networks with Gaussian hidden units as universal approximations, Neural Computation, 2(2): 210-215.

[6] Park, J.; Sandberg, I.W. (1991); Universal approximation using radial-basis-function networks, Neural Computation, 3(2): 246-257.

[7] Park, J.; Sandberg, I.W. (1993); Approximation and radial-basis-function networks, Neural Computation, 5(2): 305-316.
[8] Carpenter, G.A.; Grossberg, S.; Markuzon, N.; Reynolds, J.H.; Rosen, D.B. (1992); Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps, IEEE Transactions on Neural Networks, 3(5): 698-713.

[9] Williamson, J. (1996); Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps, Neural Networks, 9: 881-897.

[10] Marriott, S.; Harrison, R.F. (1995); A modified fuzzy ARTMAP architecture for the approximation of noisy mappings, Neural Networks, 8(4): 619-641.

[11] Andonie, R.; Sasu, L. (2006); Fuzzy ARTMAP with Input Relevances, IEEE Transactions on Neural Networks, 17: 929-941.

[12] Yap, K.S.; Lim, C.P.; Abidi, I.Z. (2008); A Hybrid ART-GRNN Online Learning Neural Network With a ε-Insensitive Loss Function, IEEE Transactions on Neural Networks, 19: 1641-1646.

[13] Yap, K.S.; Lim, C.P.; Junita, M.S. (2010); An enhanced generalized adaptive resonance theory neural network and its application to medical pattern classification, Journal of Intelligent & Fuzzy Systems, 21: 65-78.

[14] Marti, L.; Policriti, A.; Garcia, L. (2002); AppART: An ART Hybrid Stable Learning Neural Network for Universal Function Approximation, Hybrid Information Systems, First International Workshop on Hybrid Intelligent Systems, Adelaide, Australia, December 11-12, 2001, Proceedings, 93-119.

[15] Verzi, S.J.; Heileman, G.L.; Georgiopoulos, M.; Anagnostopoulos, G.C. (2003); Universal Approximation with Fuzzy ART and Fuzzy ARTMAP, Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2003), 3: 1987-1992.

[16] MacKay, D.J.C. (1996); Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks, Computation in Neural Systems, 6: 469-505.

[17] Vigdor, B.; Lerner, B. (2007); The Bayesian ARTMAP, IEEE Transactions on Neural Networks, 18: 1628-1644.

[18] Moore, B. (1988); ART1 and Pattern Clustering, Proceedings of the 1988 Connectionist Models Summer School, 174-185.

[19] Carpenter, G.A.; Grossberg, S.; Rosen, D.B. (1991); Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system, Neural Networks, 4: 759-771.

[20] Lim, C.P.; Harrison, R.F. (1997); An Incremental Adaptive Network for On-line Supervised Learning and Probability Estimation, Neural Networks, 10(5): 925-939.

[21] Lerner, B.; Guterman, H. (2008); Advanced Developments and Applications of the Fuzzy ARTMAP Neural Network in Pattern Classification, Computational Intelligence Paradigms - Studies in Computational Intelligence, Springer, 137: 77-107.

[22] Sasu, L.; Andonie, R. (2012); The Bayesian ARTMAP for Regression, under review.

[23] Izquierdo, J.M.C.; Dimitriadis, Y.A.; Coronado, J.L. (1997); FasBack: matching-error based learning for automatic generation of fuzzy logic systems, Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, 3: 1561-1566.

[24] Bellman, R.E. (1961); Adaptive control processes: a guided tour, Rand Corporation Research Studies.

[25] Andonie, R. (1997); The Psychological Limits of Neural Computation, Dealing with Complexity: A Neural Network Approach, 252-263.

[26] Duda, R.O.; Hart, P.E.; Stork, D.G. (2000); Pattern Classification, 2nd edition.