Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844
Vol. V (2010), No. 3, pp. 280-291

Extreme Data Mining: Inference from Small Datasets

R. Andonie

Răzvan Andonie
Computer Science Department
Central Washington University, Ellensburg, USA
and
Department of Electronics and Computers
Transylvania University of Braşov, Romania
E-mail: andonie@cwu.edu

Abstract: Neural networks have been applied successfully in many fields. However, satisfactory results can usually be obtained only under large-sample conditions. With small training sets, performance may degrade, or the learning task may not be accomplished at all. This deficiency severely limits the applications of neural networks. The main reason small datasets cannot provide enough information is that there are gaps between samples; even the domain of the samples cannot be ensured. Several computational intelligence techniques have been proposed to overcome the limits of learning from small datasets. We have the following goals:
i. To discuss the meaning of "small" in the context of inferring from small datasets.
ii. To overview computational intelligence solutions for this problem.
iii. To illustrate the introduced concepts with a real-life application.

1 Introduction

Small dataset conditions exist in many applications, such as disease diagnosis, fault diagnosis or deficiency detection in biology and biotechnology, mechanics, flexible manufacturing system scheduling, drug design, and short-term load forecasting (an activity conducted on a daily basis by electrical utilities). In this section, we describe a computational chemistry problem, review a class of neural networks to be used, and summarize our previous work in this area.

1.1 A Real-World Problem: Assisting Drug Discovery

Current treatments for HIV/AIDS consist of co-administering a protease inhibitor and two reverse transcriptase inhibitors (usually referred to as combination therapy). This therapy is effective in reducing viremia to very low levels; however, in 30-50% of patients it is ineffective due to resistance development, often caused by viral mutations. Due to resistance and poor bioavailability profiles (bioavailability is the rate at which the drug reaches the systemic circulation), as well as toxicity associated with these therapies, there is an urgent need for more efficient drug design. We focus on inhibitors of the HIV-1 protease enzyme, using the IC50 value as the prediction target. A detailed description of the problem, from a computational chemistry point of view, can be found in our papers [1-3]. The IC50 value represents the concentration of a compound required to reduce enzyme activity by 50%. A low IC50 value indicates good inhibitory activity.

The available dataset consists of 196 compounds with experimentally determined IC50 values. Twenty of these molecules are used as an external test set after the training is completed. The remaining 176 molecules are used for training and cross-validation. Our practical goal is to predict the (unknown) IC50 values for 26 novel compounds which are candidates for HIV-1 protease inhibitors. We use two IC50 prediction accuracy measures: the RMSE (Root Mean Squared Error) and the Symmetric Mean Absolute Percentage Error (sMAPE).
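The two error measures can be written compactly in code. The sketch below is only illustrative: this excerpt does not spell out the exact sMAPE variant used in the paper, so the mean-of-absolute-values denominator, the function names, and the sample IC50 numbers are assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between measured and predicted IC50 values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, in percent.

    Assumes the common definition with the mean of |measured| and |predicted|
    in the denominator; the paper's exact variant is not given in this excerpt.
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

# Hypothetical usage on a few IC50 predictions (values are illustrative only):
measured  = [0.45, 1.20, 0.08, 3.50]
predicted = [0.50, 0.95, 0.10, 2.80]
print(f"RMSE  = {rmse(measured, predicted):.3f}")
print(f"sMAPE = {smape(measured, predicted):.1f}%")
```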
The easiest way to represent a molecule is by a vector of features (molecular descriptors), which may be both topological indices and physico-chemical properties. The resulting features may be numerous and inter-correlated. Using the complete set of descriptors may lead to overfitting if the set is too large compared to the size of the training set. We select 35 molecular descriptors based on their contribution to characterizing the molecular entity.

Although biological activity data have been obtained for many more chemical structures at various pharmaceutical companies and academic laboratories, they are not available in the public domain. Actually, most classical studies for a specific enzyme system have been performed on small datasets, due to the limited number of experimentally determined biological activity values in the public domain. The dimensionality (the number of physico-chemical features) characterizing these molecules is relatively high. Our dataset shares these undesired characteristics: it is small, with relatively many features, and highly overlapping.

1.2 Prerequisites: FAMR for IC50 prediction

The FAMR is a Fuzzy ARTMAP (FAM) incremental learning system used for classification, probability estimation, and function approximation. We review the basic FAMR notation. Details can be found in [4].

A FAM consists of a pair of fuzzy ART modules, ARTa and ARTb, connected by an inter-ART module called Mapfield. The fuzzy ARTa module contains the input layer, $F_1^a$, and the competitive layer, $F_2^a$ [5]. A preprocessing layer, $F_0^a$, is also added before $F_1^a$. The ART modules create stable recognition categories in response to arbitrary sequences of input patterns. The ARTa and ARTb vigilance parameters, $\rho_a$ and $\rho_b$, control the matching mechanism inside the modules.

During learning, the Mapfield weights are updated: the strength of the weight projecting from the selected ARTa category to the correct ARTb category is increased, while the strengths of the weights to the other ARTb categories are decreased. A Mapfield vigilance parameter $\rho_{ab}$ calibrates the degree of predictive mismatch necessary to trigger the search for a different ARTa category. If the weight projecting from the active ARTa category through the Mapfield to the active ARTb category is smaller than $\rho_{ab}$ (vigilance test), then the system responds to the unexpected outcome through the so-called match tracking. This triggers an ARTa search for a new input category. After choosing an ARTa category whose prediction of the correct ARTb category is strong enough, match tracking is disengaged and the network is said to be in a resonance state. In this case, Mapfield learns by updating the weights $w_{jk}^{ab}$ of the associations between each $j$-th ARTa category and each $k$-th ARTb category. The FAMR uses the following iterative updating scheme:

$$
w_{jk}^{ab(new)} =
\begin{cases}
w_{jk}^{ab(old)} & \text{if } j \neq J \\[4pt]
w_{JK}^{ab(old)} + \dfrac{q_t}{Q_J^{new}} \left( 1 - w_{JK}^{ab(old)} \right) & \text{if } j = J,\ k = K \\[4pt]
w_{Jk}^{ab(old)} \left( 1 - \dfrac{q_t}{Q_J^{new}} \right) & \text{if } j = J,\ k \neq K
\end{cases}
\tag{1}
$$

where $q_t$ is the relevance assigned to the $t$-th input pattern ($t = 1, 2, \dots$) and $Q_J^{new} = Q_J^{old} + q_t$. The relevance $q_t$ is a real positive finite number directly proportional to the importance of the experiment considered at step $t$. This $w_{jk}^{ab}$ approximation is a correct biased estimator of the posterior probability $P(k|j)$, the probability of selecting the $k$-th ARTb category after having selected the $j$-th ARTa category.

FAM (and FAMR) networks map subsets of $R^n$ to $R^m$ and can be used for function approximation. The FAM has been proven to be a universal function approximator [6]. We use the FAMR to predict functions that are known only at a certain number of points. More specifically, we predict IC50 values.
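A minimal sketch of how the update scheme (1) could be coded is given below. The array names `w_ab` and `Q`, the function signature, and the in-place NumPy formulation are assumptions for illustration, not the authors' implementation. It exploits the fact that the $k = K$ branch equals $w_{JK}^{ab(old)}(1 - q_t/Q_J^{new}) + q_t/Q_J^{new}$, so rescaling row $J$ and then adding the ratio to entry $(J, K)$ reproduces both row-$J$ branches.

```python
import numpy as np

def famr_mapfield_update(w_ab, Q, J, K, q_t):
    """One FAMR Mapfield update per Eq. (1), sketched from the text.

    w_ab : (n_a, n_b) float array of association weights w_jk^{ab}
    Q    : length-n_a float array of accumulated relevances Q_j
    J, K : indices of the active ARTa category and correct ARTb category
    q_t  : relevance of the t-th training pattern (positive, finite)
    Rows j != J are left unchanged; row J is shifted toward category K.
    """
    Q_new = Q[J] + q_t                # Q_J^{new} = Q_J^{old} + q_t
    ratio = q_t / Q_new
    w_ab[J, :] *= (1.0 - ratio)       # shrink w_Jk for all k ...
    w_ab[J, K] += ratio               # ... then w_JK becomes w_JK + ratio*(1 - w_JK)
    Q[J] = Q_new
    return w_ab, Q

# Hypothetical toy usage: 3 ARTa categories, 2 ARTb categories
w = np.full((3, 2), 0.5)
Q = np.zeros(3)
w, Q = famr_mapfield_update(w, Q, J=0, K=1, q_t=2.0)
```

One consequence of this form is that if row $J$ sums to one before the update, it still sums to one afterwards, which is consistent with reading $w_{jk}^{ab}$ as an estimate of $P(k|j)$.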
1.3 Our previous work

The present paper is based on a sequence of results, each describing new computational intelligence tools for biological activity (IC50) prediction. In [7], we investigated the use of a fuzzy neural network (FNN) for IC50 prediction. In [1] and [2], we improved this model by adding a two-stage Genetic Algorithm (GA) optimizer: the first stage for selecting the best subset of features and the second for optimizing the FNN parameters. We will refer to this GA-optimized FNN as FS-GA-FNN.

In [8] we also focused on the IC50 prediction task, using the FAMR model. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The prediction method consists of two stages. First, GA optimization incorporating cross-validation is used to modify the training dataset. This modification consists of finding the best relevances for the data, according to some fitness criterion. The fitness criterion measures the FAMR IC50 prediction accuracy for a given training/validation dataset with given relevances. In stage two, the final FAMR is obtained by training it on the dataset with optimized relevances. In other words, stage one improves the generalization capability of the FAMR which will be obtained in stage two. We will refer to this model with GA-optimized relevances as GA-FAMR.

We compared the GA-FAMR and the Ordered FAMR (a FAMR algorithm which optimizes the order of training data presentation) in [9]. Both methods compensate for insufficient training data by additional optimizations. A trade-off between computational overhead and generalization capability is obtained. Recently, we performed rule extraction from the trained FAMR model [10]. We post-processed the set of generated rules in order to improve generalization. We eliminated overfitting by heuristic generalization of rules and by adding new rules. This method proved to be efficient for small training sets.

The present paper results from several invited talks [9, 11, 12]. In Section 2, we discuss the capability of neural networks to infer from rare samples. Section 3 describes two methods for neural training on small datasets. After presenting and discussing experimental results in Section 4, we conclude with our final remarks (Section 5).

2 Neural Networks Trained on Small Datasets

We aim to discuss the difficulties of inferring a Neural Network (NN) from small, or non-representative, training sets. We will look closer at the overfitting and generalization aspects of the network. But first, we need to define formally what we understand by a "small training set".

2.1 What is "small"?

In many multivariable classification or regression (e.g., estimation or forecasting) problems we have a training set $T_p = (x_i, t_i)$ of $p$ pairs of input/output vectors $x \in$