CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016
A publication of The Italian Association of Chemical Engineering
Online at www.aidic.it/cet
Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian
Copyright © 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216

The Application of High-dimensional Data Classification by Random Forest Based on the Hadoop Cloud Computing Platform

Chong Li
Information Engineering School, Chongqing Vocational Institute of Engineering, Jiangjin, Chongqing, China, 402260
cqlichong@163.com

High-dimensional data presents a number of difficulties, such as sparse features, repeated features and high computational complexity. The Random Forest algorithm is an ensemble classification method composed of numerous weak classifiers. It can overcome several practical problems, such as small sample size, over-learning, nonlinearity, the curse of dimensionality and local minima, and it has good application prospects in the field of high-dimensional data classification. In order to improve classification accuracy and computational efficiency, a novel classification method based on the Hadoop cloud computing platform is proposed. First, the Bagging algorithm is applied to the data set to obtain different data subsets. Second, the Random Forest is built by training the decision trees under the MapReduce architecture. Finally, the data sets are classified by the Random Forest. In our experiments, three high-dimensional data sets are used as test subjects. The experimental results show that the classification accuracy of the proposed method is higher than that of stand-alone Random Forest, and the computational efficiency is improved significantly.

1. Introduction

Ensemble classifier methods have developed rapidly in recent decades, since their performance is generally superior to that of single classifier methods (Zhang, 2010; Sortiris, 2011), and they have become one of the main research domains in machine learning and pattern recognition. The main process of constructing an ensemble classifier system is to train base classifiers on different subsets of the same training data set, and then to combine the outputs of these base classifiers (Valiant et al., 1984). In research on ensemble classifier methods, the accuracy of the base classifiers and the diversity between different base classifiers are the two main factors: higher individual accuracy and higher diversity lead to better classification performance of the ensemble system. Among ensemble learning methods, the most popular algorithms are Bagging (Breiman, 1996), AdaBoost (Freund et al., 1995), Random Subspace (Ho et al., 1998), Random Forest (Breiman, 2001) and Rotation Forest (Juan et al., 2006).

When ensemble classifier methods are applied to the high-dimensional data classification task, there are additional problems to deal with. The common factors of an ensemble classifier method are the accuracy of the base classifiers, the diversity between each pair of base classifiers, the number of base classifiers to be combined, the strategy for combining the base classifiers, and so on. In addition, high-dimensional data classification raises the following issues. 1) The high dimension of the data. The dimensions of the data sets in our experiments range from 250 to 10000, and the results show that the accuracy of the base classifiers and the diversity between them decrease as the dimension increases. 2) Computational complexity.
As the dimension of the data increases, so does the computational complexity, so the main remedies are feature extraction and combining fewer base classifiers.

Random Forest is a very popular ensemble classifier method; its main idea is to construct a strong classifier by combining multiple weak classifiers. However, when it is applied to high-dimensional data classification, its performance degrades rapidly. In this work, the Random Forest algorithm is deployed on the Hadoop cloud computing platform to realize parallel computing, which in theory increases the computational speed. Random Forest on the Hadoop platform improves both the computational efficiency and the accuracy of high-dimensional data classification. The MapReduce architecture is one of the most classical components of the Hadoop platform and is an embodiment of "divide and conquer". In the Map stage, a complex problem is decomposed into a number of sub-problems that are easy to solve; once the sub-problems are solved, the original complex problem is solved easily. In the Reduce stage, the decision trees are combined into the Random Forest that performs the data classification. Finally, experiments are carried out on three data sets from UCI, and the results show that Random Forest on the Hadoop platform achieves higher classification accuracy than stand-alone Random Forest, while the computational speed increases by almost 40% compared with stand-alone Random Forest.

2. Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation (Grangier, 2008). Users can build a distributed computing platform without knowing the details of the underlying distributed layer. The difference between Hadoop and other parallel programming approaches is that Hadoop has a distributed file system, HDFS, to support its data processing: during computation, each task can work on locally stored data files, avoiding the transmission of data between nodes. The basic architecture of Hadoop is composed of the distributed file system (HDFS) and the MapReduce computing framework.

2.1 HDFS

HDFS is the distributed storage system of the Hadoop platform; it has high fault tolerance and can be deployed on low-cost hardware devices (Yang et al., 2012; Alham et al., 2011). HDFS is suitable for applications with large data sets and provides high data throughput. HDFS has a master/slave structure, which consists of a name node and a number of data nodes. The name node acts as a server that stores the metadata of the file system, managing the file system namespace and the clients' access to files. The name node does not store the actual data, because the data files are divided into file blocks that are stored on different data nodes according to the placement rules.

2.2 MapReduce

MapReduce is the parallel computing framework of the Hadoop platform, and its inspiration comes from the Map and Reduce functions of functional programming languages (Dean et al., 2008; Ranger et al., 2007). When dealing with unordered data, the features of each record are extracted by the Map function, and the resulting <key, value> pairs are produced as the input of the next stage, as the sketch below illustrates.
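To make the <key, value> flow concrete, the following is a minimal, illustrative Hadoop Mapper written against the standard org.apache.hadoop.mapreduce API. The class name FeatureMapper and the token-counting logic are chosen for this sketch only and are not taken from this paper: each call to map() reads one text record and emits <key, value> pairs, here a token as the key and a count of one as the value.

```java
// Minimal illustrative Hadoop Mapper: one text record in, <token, 1> pairs out.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FeatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Treat each whitespace-separated token of the record as a feature key.
        for (String token : record.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            outKey.set(token);
            context.write(outKey, ONE);   // emit the <key, value> pair
        }
    }
}
```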
In the Shuffle stage, all values that belong to the same key are sent to the same Reduce function for processing. In general, the MapReduce framework is divided into Map tasks and Reduce tasks. Multiple data segments are processed by Map functions on multiple machines, the local results are merged by the combiner function, and the data is then distributed to the different Reduce functions as <key, value> pairs for processing. When all the Map and Reduce jobs are completed, the output files of the Reduce functions can be fed into another MapReduce program for further combination, and the final result is written to HDFS. The MapReduce programming model provides a highly abstract programming mechanism: programmers only need to specify what to compute, because data synchronization, data distribution, data communication and other details are handled by the underlying framework, so the programmer can concentrate on designing the algorithm from the nature of the problem. In the MapReduce platform, the data set of a job is usually divided into several independent data blocks, and each <key, value> pair is mapped into a new <key, value> pair by the parallel Map processing, which serves as the input of Reduce. The calculation process of MapReduce is shown in Figure 1, which includes the input stage, Map stage, Reduce stage and output stage.

Figure 1: The flow chart of MapReduce

3. Random Forest

Random Forest is an ensemble classifier proposed by Breiman in 2001. It adopts the classification and regression tree (CART) as the weak classifier, and the Bagging method is used to generate different training sets (Breiman, 1984). In the construction of a single decision tree, the split attribute at each internal node is chosen from a randomly selected subset of features. The combination of the Bagging method, the CART algorithm and the random selection of features enables Random Forest to tolerate noise better and to achieve better classification performance.

3.1 Decision Tree

The base classifier of Random Forest is the decision tree, a tree-structured model that encodes the decision logic. A leaf node of the decision tree represents a classification result, and a non-leaf node is a decision node (Quinlan, 1986). Classification with a decision tree consists of three parts. First, a decision tree is constructed by determining, at each non-leaf node, how the training data set is split. Then, the paths from the root node to the leaf nodes are analysed to derive the classification rules. Finally, the test data set is classified by these rules. A decision tree outputs only one classification result for each sample, because there is only one path from the root node to any leaf node. In general, the decision tree is regarded as a single classifier.

3.2 Bagging

The Bagging algorithm was proposed by Breiman in 1996; it generates different component classifiers from different training samples (Breiman, 1996). Bagging is based on the bootstrap sampling method: the training sample set Tset is randomly selected, with replacement, from the original sample set S, and the size of Tset is the same as that of S. Assuming that N is the number of samples in the original sample set S, when the Bootstrap method is used to sample from S, the probability that a given sample in S is never extracted is (1-1/N)^N. If N is very large, (1-1/N)^N ≈ 0.368, which means that nearly 37% of the samples of the original set S do not appear in Tset; the sketch below checks this value numerically.
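As a quick numerical check of the (1-1/N)^N ≈ 0.368 claim, the following self-contained sketch (the sample size N = 10000 and the random seed are chosen only for illustration) draws a bootstrap sample of size N with replacement and measures the fraction of original samples that are never drawn.

```java
// Empirical check of the out-of-bag fraction of bootstrap sampling.
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class BootstrapDemo {
    public static void main(String[] args) {
        int n = 10_000;                       // size of the original sample set S
        Random rng = new Random(42);

        Set<Integer> drawn = new HashSet<>();
        for (int i = 0; i < n; i++) {
            drawn.add(rng.nextInt(n));        // sample with replacement
        }

        double outOfBag = 1.0 - (double) drawn.size() / n;
        System.out.printf("empirical out-of-bag fraction: %.3f%n", outOfBag);
        System.out.printf("theoretical (1 - 1/N)^N      : %.3f%n",
                Math.pow(1.0 - 1.0 / n, n));  // tends to 1/e ≈ 0.368 for large N
    }
}
```

For large N the empirical out-of-bag fraction converges to 1/e ≈ 0.368, matching the figure of roughly 37% quoted above.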
By constructing different training sets, the diversity between the classifiers is increased, which is the reason why the Bagging method can improve the generalization ability of the combined classifiers. Through T rounds of training, a classifier sequence {c1, c2, ..., cT} is obtained. The final classification result is obtained from this classifier sequence by simple majority voting or simple averaging.

3.3 The implementation of Random Forest

Random Forest is based on a collection of T decision tree classifiers {h(x, Θk), k=1,2,…,T}, where x is the input and the Θk are independent and identically distributed random vectors. For each input x, each decision tree casts one vote. The idea of Random Forest classification is as follows. Assume that the training sample set D contains ND samples; a new set of ND samples is selected from D by the Bootstrap method. Then F features are selected from the features of this set to construct a decision tree, and by repeating this process T decision tree classifiers are generated. Finally, the decision trees are combined to form the Random Forest. The key steps of the Random Forest algorithm are as follows. 1) Get the training data sets: ND samples are selected from D by the Bootstrap method, and this sampling is repeated T times (this procedure is called Bagging), so the training sets Di (i=1,2,…,T) are obtained. 2) Construct the decision trees: for each training set, k features are selected at random from the K available features to construct the decision tree, where k < K. A self-contained sketch of these two preparation steps is given below.
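The following is a minimal sketch of steps 1) and 2) above; tree induction itself is omitted, and the constants ND, K, T and the choice k = sqrt(K) are illustrative assumptions rather than values from this paper. The program draws T bootstrap index sets from D and selects a random subset of k < K features for each tree.

```java
// Data-preparation side of Random Forest: T bootstrap samples plus a random
// feature subset per tree. Tree growing on each pair is left out of this sketch.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ForestPreparation {
    public static void main(String[] args) {
        int nd = 1000;        // ND: number of samples in the original set D
        int bigK = 250;       // K: total number of features
        int t = 10;           // T: number of trees in the forest
        Random rng = new Random(7);

        int k = (int) Math.max(1, Math.sqrt(bigK));   // features per tree, k < K

        for (int tree = 0; tree < t; tree++) {
            // Step 1 (Bagging): bootstrap sample Di, drawn with replacement from D.
            int[] sampleIdx = new int[nd];
            for (int i = 0; i < nd; i++) {
                sampleIdx[i] = rng.nextInt(nd);
            }

            // Step 2: random subset of k feature indices for this tree.
            List<Integer> features = new ArrayList<>();
            for (int f = 0; f < bigK; f++) {
                features.add(f);
            }
            Collections.shuffle(features, rng);
            List<Integer> chosen = features.subList(0, k);

            System.out.printf("tree %d: %d samples, first features %s%n",
                    tree, sampleIdx.length, chosen.subList(0, Math.min(5, k)));
            // A CART tree would now be grown on (sampleIdx, chosen) and added to the forest.
        }
    }
}
```

In the MapReduce deployment described earlier, each such bootstrap sample and feature subset would be handed to a Map task that grows one CART tree, and the Reduce stage would collect the trees into the final forest.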