INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, 11(4):567-579, August 2016.

An Abnormal Network Traffic Detection Algorithm Based on Big Data Analysis

H.P. Yao, Y.Q. Liu, C. Fang

Haipeng Yao*
1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, No 10, Xitucheng Road, Haidian District, Beijing, PRC
2. Beijing Advanced Innovation Center for Future Internet Technology, Beijing University of Technology, 100 Ping Le Yuan, Chaoyang District, Beijing, PRC
*Corresponding author: yaohaipeng@bupt.edu.cn

Yiqing Liu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, No 10, Xitucheng Road, Haidian District, Beijing, PRC
colin617@qq.com

Chao Fang
1. Beijing Advanced Innovation Center for Future Internet Technology, Beijing University of Technology, 100 Ping Le Yuan, Chaoyang District, Beijing, PRC
2. College of Electronic Information and Control Engineering, Beijing University of Technology, 100 Ping Le Yuan, Chaoyang District, Beijing, PRC
fangchao.bupt@gmail.com

Abstract: Anomaly detection is an important way to analyze and detect malicious behavior in networks. How to detect anomalous network flows effectively under the pressure of big data is an important problem that has attracted more and more researchers' attention. In this paper, we propose a new detection model based on big data analysis, which can avoid the influence of shifts in network traffic distribution, increase detection accuracy, and reduce the false negative rate. Simulation results show that, compared with the k-means, decision tree, and random forest algorithms, the proposed model performs much better, achieving a detection rate of 95.4% on normal data, 98.6% on DoS attacks, 93.9% on Probe attacks, 56.1% on U2R attacks, and 77.2% on R2L attacks.
Keywords: anomaly traffic detection, big data, k-means, decision tree, random forest.

Copyright © 2006-2016 by CCC Publications

1 Introduction

With the improvement of networking, storage, computation, and transmission, the Internet interacts with people more closely than ever before. While the Internet makes our lives more convenient, it also brings potential risks; for example, malicious attacks involving user privacy and security have become more and more frequent. The changing ways in which people use the Internet pose a new challenge for traditional abnormal network event detection techniques, and it is harder for researchers to become aware of new kinds of attacks.

To address these problems, several abnormal network traffic detection methods have been proposed. Traditional detection methods can be classified into two categories [1-3]: misuse detection and anomaly detection. The two approaches have their own pros and cons. Misuse detection has high accuracy but requires knowledge of known attacks. Anomaly detection does not need such knowledge, but it cannot categorize the type of attack and its accuracy is lower. For example, Hari Om [4] designed a hybrid anomaly detection system combining k-means, k-nearest neighbor, and Naive Bayes methods.

However, the explosive increase of network traffic has directly or indirectly pushed the Internet into the big data era, which makes anomaly traffic detection harder to deal with because of the high computation volume and the constant changes of network data distribution caused by big data [5-8]. Because network data are generated quickly, the volumes of normal and abnormal traffic come to differ greatly, and the distribution of the data changes. Moreover, with big data, the difference between normal traffic and abnormal traffic keeps increasing.
These changes make traditional methods unable to detect abnormal traffic effectively. Therefore, to increase detection accuracy and avoid the loss caused by false negatives, we propose a novel model based on big data analytics, which can avoid the influence of shifts in network traffic distribution, increase detection accuracy, and reduce the false negative rate. The core of the proposed model is not simply a combination of traditional detection methods, but a novel detection model designed for big data. In the simulation, we use the k-means, decision tree, and random forest algorithms as baselines to verify the effectiveness of our model. Simulation results show that the proposed model performs much better, achieving a detection rate of 95.4% on normal data, 98.6% on DoS attacks, 93.9% on Probe attacks, 56.1% on U2R attacks, and 77.2% on R2L attacks.

The rest of this paper is organized as follows. Section 2 presents related work. The system model is given in Section 3. Simulation results are presented and discussed in Section 4. Finally, we conclude the study in Section 4.3.

2 Related work

2.1 k-means

k-means is a classic clustering algorithm [9,10], which uses a simple iterative procedure to cluster a data set into a given number of categories. The number of clusters is commonly denoted by K. The four steps of k-means are:

1. Initialization: randomly select K data points from the data set as the centers of the K clusters;
2. Assignment: assign each point in the data set to the nearest center;
3. Update: recalculate each center as the mean of all points assigned to its cluster;
4. Repeat: repeat the assignment and update steps until no center changes in a round, at which point the clustering has converged.

k-means requires the number of clusters K to be specified, and if K is not chosen properly, the clustering result will be poor.
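The four steps above can be sketched with scikit-learn (an assumed library choice; the feature values below are toy 2-D points standing in for traffic feature vectors, not the paper's data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points: two dense regions that should become two clusters.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

# K must be chosen up front; here K=2 matches the toy data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # one mean point per cluster (the "update" step result)
print(km.labels_)           # cluster index assigned to each input point
```

With a poorly chosen K (say K=5 here), the same call would fragment each dense region into several clusters, which illustrates why the choice of K matters.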
Choosing a proper number of clusters is therefore crucial to the result of k-means. Another disadvantage of k-means is that it can only use Euclidean distance. Although Euclidean distance is convenient to compute, it cannot take differences between features into account: it treats all features the same, which in practice can lead to poor performance. Nevertheless, k-means has its own advantages when dealing with big data:

1. k-means is simple. The worst-case time complexity is O(n^(dk+1) log n), and it runs fast when the number of clusters k and the number of features d are small;
2. k-means scales well to big data sets and has high performance.

2.2 Decision tree

The decision tree [9] is a common algorithm in machine learning. A complete decision tree is composed of three kinds of elements:

1. Decision nodes, indicating which feature is used in a split;
2. Chance nodes, indicating the possible values of each feature;
3. Leaf nodes, indicating which category a record belongs to.

Two steps are needed to use a decision tree:

1. Tree generation: generate a tree from the training set. This requires determining which feature to use at each split and which category each leaf represents.
2. Classification: classify a new record starting from the root of the tree, comparing the record with each decision node and moving to the corresponding branch. This process repeats until the record reaches a leaf node, whose category becomes the record's predicted category.

Quinlan proposed the well-known C4.5 decision tree algorithm in [11]. Its main idea is to generate the decision tree from root to leaves so as to reduce the level of uncertainty at each step. Gain ratio is the index C4.5 uses to select features.
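The generate-then-classify procedure above can be sketched with scikit-learn (an assumed library; note that scikit-learn implements CART rather than C4.5, and the records below are toy values, not the paper's traffic features):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy records: [duration, bytes_sent] with labels 0 = normal, 1 = abnormal.
X_train = [[1, 10], [2, 12], [1, 11], [90, 900], [85, 950], [95, 870]]
y_train = [0, 0, 0, 1, 1, 1]

# Step 1, tree generation: learn split features and leaf categories.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2, classification: a new record is routed from the root to a leaf.
print(tree.predict([[88, 910]]))
```

The record [88, 910] falls on the "abnormal" side of the learned split, so it receives the label of that leaf.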
Let A_k be a feature in the feature set and T the training set. The information gain is defined as

    Gain(T, A_k) = Info(T) - Info_{A_k}(T)    (1)

where

    Info(T) = - sum_{i=1..n} ( freq(c_i, T) / |T| ) log2( freq(c_i, T) / |T| )    (2)

    Info_{A_k}(T) = sum_{a_k in D(A_k)} ( |T^{A_k}_{a_k}| / |T| ) Info(T^{A_k}_{a_k})    (3)

Here freq(c_i, T) is the number of records in T that belong to class c_i, T^{A_k}_{a_k} is the subset of T in which feature A_k takes the value a_k, and D(A_k) is the domain of A_k. The split information is defined as

    SplitInfo(T, A_k) = - sum_{a_k in D(A_k)} ( |T^{A_k}_{a_k}| / |T| ) log2( |T^{A_k}_{a_k}| / |T| )    (4)

and the gain ratio is

    GainRatio(T, A_k) = Gain(T, A_k) / SplitInfo(T, A_k)    (5)

The advantages of the decision tree are:

1. The generated tree is easy to build and easy to explain;
2. It performs well on large data sets.

2.3 Random forest

The random forest algorithm [9,12] is a classification algorithm built from multiple decision trees: each tree casts a vote, and the result is the class with the most votes. When generating a single decision tree, feature selection and pruning can be used to avoid overfitting, but when the number of features is large these problems can hardly be avoided. Because a random forest consists of multiple decision trees, it can effectively avoid them. Random forests have the following advantages:

1. They can be used in various situations with high classification accuracy;
2. They effectively support many-feature situations without explicit feature selection;
3. They can report the importance distribution of the features.

3 System model

Influenced by big data, the distribution of network data is gradually changing. This paper tries to solve the problems caused by the increasing difference between normal traffic and abnormal traffic. We therefore propose a new abnormal traffic detection model based on big data analysis, which consists of three sub-models.
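A minimal sketch of how three such sub-models might be wired together, using random forest for both selectors as in one of the configurations evaluated later (the class and method names here are illustrative assumptions, not the paper's code):

```python
from sklearn.ensemble import RandomForestClassifier

class AnomalyDetectionPipeline:
    def __init__(self):
        # Sub-model 1, normal traffic selection: binary normal-vs-abnormal classifier.
        self.normal_selector = RandomForestClassifier(random_state=0)
        # Sub-model 2, abnormal traffic selection: trained on attack records only,
        # predicts a specific attack category (e.g. DoS, Probe, U2R, R2L).
        self.attack_selector = RandomForestClassifier(random_state=0)

    def fit(self, X, y_binary, X_attacks, y_attack_type):
        self.normal_selector.fit(X, y_binary)              # labels: 'normal'/'abnormal'
        self.attack_selector.fit(X_attacks, y_attack_type)  # attack-only labels

    def predict(self, X):
        # Sub-model 3, mixed compensation: the normal set overrides the attack label,
        # so a record is reported as an attack only if also flagged abnormal.
        is_normal = self.normal_selector.predict(X)
        attack_type = self.attack_selector.predict(X)
        return ['normal' if n == 'normal' else a
                for n, a in zip(is_normal, attack_type)]
```

The compensation step in predict() mirrors the rule formalized in Section 3: the normal set produced by the first sub-model compensates the attack labels produced by the second.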
3.1 Normal traffic selection model

The normal traffic selection model uses classification and clustering algorithms to distinguish normal from anomalous behavior, without involving specific anomaly categories. This model includes two stages:

1. Training stage: the model is trained on data labeled as normal or abnormal, and the trained model is applied in the test stage.
2. Test stage: the test stage is similar to detection in practice. Using unlabeled data, the model classifies traffic as normal or abnormal and labels it accordingly.

The normal traffic selection model uses the k-means clustering algorithm, KNN, and the decision tree and random forest classification algorithms. Traditionally, the number of categories must be set before using k-means, but this number is not known in advance. To distinguish normal from abnormal behavior, the normal traffic selection model therefore uses k-means as follows. In the training stage, the labeled data are split into normal and abnormal subsets, and k-means is run on each subset separately instead of clustering all the data at once, yielding a set of centers for each subset. Then, using these centers, a KNN classifier labels the test data. The decision tree and random forest classifiers are trained directly on the labeled normal and abnormal data.

3.2 Abnormal traffic selection model

The purpose of the abnormal traffic selection model is to avoid the influence caused by normal traffic being far more plentiful than abnormal traffic. This model classifies anomalous traffic into specific categories and also includes two stages:

1. Training stage: this stage uses only abnormal data to train the classification model, each record being labeled with its specific attack group. Classification algorithms learn the classification rules.
2. Test stage: the test stage is similar to detection in practice, using unlabeled data (including normal behavior data).
The classification model classifies anomalous traffic into specific categories according to the learned rules and gives a specific label to every record.

Table 1: Distribution of the KDDCUP99 data set

Data set                        | Normal | DoS    | Probe | R2L   | U2R
10 percent of training data set | 97278  | 391458 | 4107  | 1126  | 52
test data set                   | 60593  | 229853 | 4166  | 16189 | 228

The abnormal traffic selection model uses the decision tree and random forest classification algorithms. The abnormal traffic selection model and the normal traffic selection model are independent, with no order of priority in either the training or the test stage.

3.3 Mixed compensation model

The mixed compensation model combines the results from the normal traffic selection model and the abnormal traffic selection model to produce the final result. Although the abnormal traffic selection model is more effective because it is not influenced by normal traffic data, this same characteristic gives it a high false negative rate. Therefore, the normal set N produced by the normal traffic selection model is used to compensate the abnormal set A = {A_1, A_2, ..., A_k} produced by the abnormal traffic selection model, where A_i, i in [1, k], denotes a specific attack category. If c denotes a detection result, the compensation rule is:

    if c in A_i and c in N, then c in N;
    if c in A_i and c not in N, then c in A_i.    (6)

4 Simulation results and discussions

Before using the three sub-models of the anomaly detection model based on big data analysis, the data set needs to be preprocessed and labeled for training. It should be noted that selecting features correctly is a good way to reduce dimensionality and increase running efficiency. In the simulation, three different algorithms are used to verify the validity of the proposed model.

4.1 Data set

In the simulation, we use the KDDCUP99 [13] data set to test our model. The KDDCUP99 data set is widely used for testing abnormal detection models and was obtained and processed from the 1998 DARPA intrusion detection evaluation [14].
The KDDCUP99 data set has 41 features sorted into three groups: basic features, content features, and time-based features [15]. The distribution of the data set is shown in Table 1: the full training data set has about 5 million records, the 10-percent training subset has 494021 records, and the test data set has 311029 records. Every record is labeled as normal or abnormal, and abnormal records fall into four groups: DoS, U2R, R2L, and Probe. From Table 1 we can see that the class distributions of the training and test data sets differ greatly (for example, R2L has 1126 training records but 16189 test records). Therefore, this data set can be used to test the performance of the proposed model under different circumstances.

4.2 Simulation results

As shown in Table 2, we ran eight experiments with the model based on big data analysis, and three control experiments using k-means, decision tree, or random forest alone. In the control groups, the classification model is trained on the whole training data set with five categories and then classifies the test data into those five categories. A further control group is the winner of KDDCUP99.

Table 2: Number of experiments

No. | Normal traffic selection model | Abnormal traffic selection model
1   | k-means1*                      | Random Forest
2   | k-means1*                      | Decision Tree
3   | k-means2*                      | Random Forest
4   | k-means2*                      | Decision Tree
5   | Decision Tree                  | Decision Tree
6   | Decision Tree                  | Random Forest
7   | Random Forest                  | Decision Tree
8   | Random Forest                  | Random Forest

No. of control group | Algorithm
9                    | k-means
10                   | Decision Tree
11                   | Random Forest
12                   | Winner of KDDCUP99

*note: In the normal traffic selection model, k-means1 uses 4 clusters for normal traffic and 30 for abnormal traffic, while k-means2 uses 100 clusters for normal traffic and 300 for abnormal traffic.

In the simulation, prediction accuracy is used as the metric of detection effect, as shown in Table 3. In addition, we adopt a ranking-and-grading scheme for every type, as follows.

Table 3: Prediction accuracy
No. | Experiment                   | Normal | DoS   | Probe | U2R   | R2L
1   | k-means1+Random Forest       | 0.632  | 0.814 | 0.939 | 0.561 | 0.679
2   | k-means1+Decision Tree       | 0.656  | 0.791 | 0.878 | 0.500 | 0.772
3   | k-means2+Random Forest       | 0.945  | 0.983 | 0.910 | 0.513 | 0.510
4   | k-means2+Decision Tree       | 0.946  | 0.979 | 0.852 | 0.500 | 0.504
5   | Decision Tree+Decision Tree  | 0.951  | 0.984 | 0.829 | 0.500 | 0.512
6   | Decision Tree+Random Forest  | 0.951  | 0.986 | 0.831 | 0.550 | 0.517
7   | Random Forest+Decision Tree  | 0.954  | 0.980 | 0.861 | 0.500 | 0.521
8   | Random Forest+Random Forest  | 0.952  | 0.985 | 0.872 | 0.520 | 0.510
9   | k-means                      | 0.938  | 0.968 | 0.785 | 0.500 | 0.528
10  | Decision Tree                | 0.951  | 0.983 | 0.793 | 0.500 | 0.500
11  | Random Forest                | 0.952  | 0.985 | 0.875 | 0.522 | 0.507
12  | Winner of KDDCUP99           | 0.995  | 0.971 | 0.833 | 0.132 | 0.084

For example, all experiments are sorted by prediction accuracy on normal data; the first gets 1 point, the second 2 points, and so on. Finally, the sum of the grades over the five groups is the final grade.

As shown in Table 4, the experiment groups and the winner of KDDCUP99 are sorted by final grade. While the latter has a high detection rate on normal data, for the four attack types the model based on big data analysis gives better results than the winner of KDDCUP99, whose algorithm is the C5 decision tree [16-19]. The winner's training data differ slightly from ours, so to evaluate the detection effect of the proposed model we also ran three control experiments with the same training and test data, using k-means, decision tree, or random forest alone; these experiments are numbered 9, 10, and 11. The ranking shows that the detection effect of the proposed model is better than using each algorithm alone, as shown in Table 5.

We now discuss the experimental results, comparing No.8 with No.11, No.7 with No.5, and No.3 with No.4.

Discussion of No.8 and No.11

The scores of the top three are the same. Judging No.8 and No.11 by final grade, the detection results of the two experiments are almost identical, and both use the random forest algorithm.
But there are two differences:

Table 4: Comparison with the winner of KDDCUP99

No. | Experiment                  | Normal | DoS | Probe | U2R | R2L | Final Score | Rank
8   | Random Forest+Random Forest | 3      | 2   | 2     | 2   | 4   | 13          | 1
6   | Decision Tree+Random Forest | 4      | 1   | 6     | 1   | 2   | 14          | 2
7   | Random Forest+Decision Tree | 2      | 5   | 3     | 4   | 1   | 15          | 3
2   | k-means2+Random Forest      | 7      | 4   | 1     | 3   | 5   | 20          | 4
5   | Decision Tree+Decision Tree | 4      | 3   | 7     | 6   | 3   | 23          | 5
4   | k-means2+Decision Tree      | 6      | 6   | 4     | 5   | 6   | 27          | 6
12  | Winner of KDDCUP99          | 1      | 7   | 5     | 7   | 7   | 27          | 6

Table 5: Comparison with the control groups

No. | Experiment                  | Normal | DoS | Probe | U2R | R2L | Final Score | Rank
6   | Decision Tree+Random Forest | 4      | 1   | 6     | 1   | 3   | 15          | 1
8   | Random Forest+Random Forest | 2      | 2   | 3     | 3   | 5   | 15          | 1
11  | Random Forest               | 2      | 2   | 2     | 2   | 7   | 15          | 1
7   | Random Forest+Decision Tree | 1      | 7   | 4     | 5   | 2   | 19          | 4
3   | k-means2+Random Forest      | 8      | 5   | 1     | 4   | 5   | 23          | 5
5   | Decision Tree+Decision Tree | 4      | 4   | 7     | 5   | 4   | 24          | 6
10  | Decision Tree               | 4      | 5   | 8     | 5   | 9   | 31          | 7
9   | k-means                     | 9      | 9   | 9     | 5   | 1   | 33          | 8
4   | k-means2+Decision Tree      | 7      | 8   | 5     | 5   | 8   | 33          | 8

1. The importance of the variables used in classification is different;
2. No.8 has a lower false negative rate.

• Importance of variables

As shown in Fig. 1, the variables chosen by the random forest in No.8 and No.11 are different. The random forest algorithm can output the importance of each variable, measured by the Gini index [9]; Fig. 1 plots the top 20 variables for each model, with higher values indicating more important variables. In No.8, the ranking of variables differs between the normal traffic selection model and the abnormal traffic selection model, which means that the variables useful for predicting normal versus abnormal differ from those useful for predicting a specific attack. The variable choice in No.11 is influenced by both tasks at once and therefore produces a compromise, which is why the predictions of the model in No.11 deviate.

• Comparison of false negative rates

To evaluate the effect of predicting abnormal behavior, the false negative rate is used as an important index; it measures how many attack events are missed.
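The per-class false negative rate can be computed from a confusion matrix as follows (a small sketch; the matrix layout mirrors Table 6, rows = prediction, columns = actual, and the numbers are a toy 3-class example, not the paper's results):

```python
import numpy as np

# Toy confusion matrix: rows = predicted class, columns = actual class.
cm = np.array([[50,  4,  1],
               [ 2, 40,  3],
               [ 0,  6, 44]])

def false_negative_rates(cm):
    # For each actual class (a column), the false negatives are all records
    # of that class predicted as some other class, i.e. everything in the
    # column except the diagonal entry.
    actual_totals = cm.sum(axis=0)
    correct = np.diag(cm)
    return 1 - correct / actual_totals

print(false_negative_rates(cm))
```

Applying the same formula to the columns of Table 6 reproduces the false-negative rows shown there (e.g. for DoS in No.11: 1 - 223814/229853 ≈ 0.0263).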
Table 6 shows the confusion matrices of experiments No.8 and No.11 when using random forest; rows give the predicted class and columns the actual class. The false negative rate of No.8 on normal traffic is very low, but it is high for the U2R and R2L types. In No.8, the false negative rate of the normal selection model on normal traffic is low, and, without the influence of normal training data, the false negative rates of the abnormal selection model on the four specific attack types are lower than those of No.11.

Figure 1: Importance of variables in random forest: (a) No.11; (b) No.8, normal traffic selection model; (c) No.8, abnormal traffic selection model.

Table 6: Confusion matrices

No.11:
Prediction     | Normal  | DoS      | Probe   | U2R     | R2L
Normal         | 60287   | 5967     | 847     | 159     | 15839
DoS            | 69      | 223814   | 191     | 8       | 0
Probe          | 233     | 72       | 3128    | 50      | 104
U2R            | 1       | 0        | 0       | 10      | 5
R2L            | 3       | 0        | 0       | 1       | 241
False negative | 0.00505 | 0.026273 | 0.24916 | 0.95614 | 0.985113

No.8, normal traffic selection model:
Prediction     | Normal   | Abnormal
Normal         | 60289    | 22853
Abnormal       | 304      | 227583
False negative | 0.005017 | 0.091253

No.8, abnormal traffic selection model:
Prediction     | DoS      | Probe   | U2R      | R2L
DoS            | 229231   | 769     | 20       | 4693
Probe          | 297      | 3393    | 135      | 5646
U2R            | 0        | 0       | 39       | 32
R2L            | 325      | 4       | 34       | 5818
False negative | 0.002706 | 0.18555 | 0.828947 | 0.64062

Discussion of No.5 and No.7

No.5 and No.7 can be compared with No.6 and No.8 respectively, since each pair uses the same algorithm in the normal traffic selection model; the ranks are lower when the decision tree is used in the abnormal traffic selection model. Table 7 is the confusion matrix of the abnormal traffic selection model with the decision tree algorithm.

Table 7: Confusion matrix of the abnormal traffic selection model with decision tree

Prediction | DoS    | Probe | U2R | R2L
DoS        | 227792 | 589   | 34  | 6245
Probe      | 1434   | 3192  | 20  | 283
U2R        | 0      | 0     | 0   | 0
R2L        | 627    | 385   | 174 | 9661

It shows that U2R cannot be detected at all and that the false negative rate of R2L is higher. To find the reason, the classification tree is examined in Fig.
2, where the classification model prefers DoS and Probe attacks, then R2L attacks, and has no leaf for U2R attacks at all. The distribution of the training data, shown in Fig. 3, explains this phenomenon: when generating a decision tree, the information gain favors classes that have more samples. Therefore, if the numbers of training records per group differ greatly, no efficient classification model can be obtained for the small-sample classes. Conversely, when the numbers of training records are comparatively balanced, the classification result is better, as in No.6, where the normal traffic selection model uses the decision tree.

Figure 2: Classification tree of the abnormal traffic selection model.

Figure 3: Distribution of the training data.

Discussion of No.3 and No.4

No.3 and No.4 use k-means in the normal traffic selection model to choose clustering centers. Table 8 shows the final prediction accuracies of No.3 and No.4. Because the final results are lower than those of the normal traffic selection model or the abnormal traffic selection model alone, we find that the problem is caused by using k-means in the normal selection model.

Table 8: Accuracy of No.3 and No.4

No.  | Model                      | Algorithm     | Accuracy
No.3 | Normal traffic selection   | k-means       | 0.926
     | Abnormal traffic selection | Random Forest | 0.948
     | Mixed compensation model   |               | 0.923
No.4 | Normal traffic selection   | k-means       | 0.925
     | Abnormal traffic selection | Decision Tree | 0.961
     | Mixed compensation model   |               | 0.918

Table 9: Confusion matrices of the normal traffic selection model in No.3 and No.4

No.  | Prediction | Normal | Abnormal
No.3 | Normal     | 59189  | 21663
     | Abnormal   | 1404   | 228773
No.4 | Normal     | 59428  | 22221
     | Abnormal   | 1165   | 228215

Table 9 shows the confusion matrices of the normal traffic selection model in No.3 and No.4. Many abnormal records are predicted as normal, which causes a high false negative rate.
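The k-means-based normal traffic selection step — separate centers per class, then nearest-center labeling — can be sketched as follows (an illustrative structure, not the paper's code; the cluster counts k_normal and k_abnormal are the tunable parameters discussed here, with the k-means1 values 4 and 30 as defaults):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centers(X_normal, X_abnormal, k_normal=4, k_abnormal=30):
    # Cluster each class separately instead of clustering all data at once.
    kn = min(k_normal, len(X_normal))
    ka = min(k_abnormal, len(X_abnormal))
    c_n = KMeans(n_clusters=kn, n_init=10, random_state=0).fit(X_normal).cluster_centers_
    c_a = KMeans(n_clusters=ka, n_init=10, random_state=0).fit(X_abnormal).cluster_centers_
    return c_n, c_a

def predict_normal(X, c_n, c_a):
    # Nearest-center rule: a record is normal iff its closest center is a normal one.
    X = np.asarray(X, dtype=float)
    d_n = np.linalg.norm(X[:, None, :] - c_n[None, :, :], axis=2).min(axis=1)
    d_a = np.linalg.norm(X[:, None, :] - c_a[None, :, :], axis=2).min(axis=1)
    return d_n < d_a  # True = classified as normal
```

Lowering k_normal and raising the relative number of abnormal centers makes the normal model stricter: fewer records fall closest to a normal center, which reduces the false negative rate at the cost of some overall accuracy.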
Therefore, many abnormal records predicted by the abnormal traffic selection model will be regarded as normal after the mixed compensation model. Nowadays, many novel attacks are unknown to researchers, and many attacks are disguised as normal traffic; a high false negative rate is thus very dangerous and does not fit the proposed model. Because the effect of k-means correlates strongly with the number of centers chosen for clustering, we can tune the strength of the clustering and lower the false negative rate to establish a strict normal selection model. In No.3 and No.4, the numbers of centers for normal traffic and attacks are 100 and 300, respectively; although this achieves good overall accuracy, its false negative rate is higher than that of the other models. However, according to Table 10, by choosing 4 and 30 centers in No.1 and No.2, the false negative rate is lower, only the four kinds of attacks are classified, and a strict normal detection model is established. By adjusting the parameters and reducing the false negative rate in No.1 and No.2, the rank increases rapidly compared with No.3 and No.4. In particular, when k-means is combined with random forest, the accuracy on Probe, U2R, and R2L attacks is very high. We can therefore conclude that by adjusting the parameters of k-means, the strength of abnormal traffic detection can be controlled through the strength of normal traffic identification.

4.3 Summary

Based on the results analyzed above and summarized in Table 11, the following conclusions can be drawn:

1. The random forest classification algorithm can adapt to changes in the distribution of network data, and using it within the proposed model can reduce the false negative rate.

2. If the numbers of training records in the different groups differ greatly, the classification model built by the decision tree will prefer the attack types that have more training data, so we should avoid using the decision tree in the abnormal traffic selection model.
However, in the normal traffic selection model the difference between the groups is comparatively small; in this situation, using the decision tree yields a classification model quickly, and the results have higher accuracy.

3. More and more unknown abnormal events will appear in the future. To avoid the loss caused by false negative predictions, we can change the number of clusters in the k-means-based normal traffic selection model to reduce the false negative rate and increase the accuracy of detecting abnormal events.

Table 10: Results of the experiments

No. | Experiment                  | DoS | Probe | U2R | R2L | Final | Rank
1   | k-means1+Random Forest      | 10  | 1     | 1   | 2   | 14    | 1
6   | Decision Tree+Random Forest | 1   | 8     | 2   | 5   | 16    | 2
8   | Random Forest+Random Forest | 2   | 5     | 4   | 7   | 18    | 3
11  | Random Forest               | 2   | 4     | 3   | 9   | 18    | 3
3   | k-means2+Random Forest      | 5   | 2     | 5   | 7   | 19    | 5
2   | k-means1+Decision Tree      | 11  | 3     | 6   | 1   | 21    | 6
7   | Random Forest+Decision Tree | 7   | 6     | 6   | 4   | 23    | 7
5   | Decision Tree+Decision Tree | 4   | 9     | 6   | 6   | 25    | 8
9   | k-means                     | 9   | 11    | 6   | 3   | 29    | 9
4   | k-means2+Decision Tree      | 8   | 7     | 6   | 10  | 31    | 10
10  | Decision Tree               | 5   | 10    | 6   | 11  | 32    | 11

Table 11: Summary of recommended models

                                 | Model 1       | Model 2       | Model 3
Normal traffic selection model   | k-means1      | Decision Tree | Random Forest
Abnormal traffic selection model | Random Forest | Random Forest | Random Forest

Conclusion

With the change in the distribution of network data, traditional anomaly traffic detection techniques no longer fit the situation. To solve this problem, we propose an anomaly traffic detection model based on big data analysis. Simulation results show that the proposed model achieves a detection rate of 95.4% on normal data, 98.6% on DoS attacks, 93.9% on Probe attacks, 56.1% on U2R attacks, and 77.2% on R2L attacks. The model can therefore increase the accuracy of detecting attack behavior and reduce the false negative rate.

Acknowledgment

This work was supported by NSFC (61471056) and the China Jiangsu Future Internet Research Fund (BY2013095-3-1, BY2013095-3-03).

Bibliography

[1] Patcha, A.; Park, J.M.
(2007); An overview of anomaly detection techniques: Existing solutions and latest technological trends, Computer Networks, ISSN 1389-1286, 51(12): 3448-3470.

[2] Lazarevic, A.; Kumar, V.; Srivastava, J. (2005); Intrusion detection: A survey, Managing Cyber Threats, ISSN 0924-6703, 5: 19-78.

[3] Axelsson, S. (1998); Research in intrusion-detection systems: a survey, Department of Computer Engineering, Chalmers University of Technology, Goteborg, Sweden, Technical Report 98-17.

[4] Om, H.; Kundu, A. (2012); A hybrid system for reducing the false alarm rate of anomaly intrusion detection system, IEEE 1st International Conference on Recent Advances in Information Technology (RAIT), ISBN 978-1-4577-0694-3, 131-136.

[5] Kaisler, S. et al (2013); Big data: Issues and challenges moving forward, IEEE 46th Hawaii International Conference on System Sciences (HICSS), ISSN 1530-1605, 995-1004.

[6] Michael, K.; Miller, K.W. (2013); Big Data: New Opportunities and New Challenges, Computer, ISSN 0018-9162, 46(6): 22-24.

[7] Russom, P. et al (2011); Big Data Analytics, TDWI Best Practices Report, Fourth Quarter.

[8] Fan, W.; Bifet, A. (2013); Mining big data: current status, and forecast to the future, ACM SIGKDD Explorations Newsletter, ISSN 1931-0145, 14(2): 1-5.

[9] James, G. et al (2013); An Introduction to Statistical Learning, Springer, ISSN 1431-875X.

[10] Guan, Y.; Ghorbani, A.A.; Belacel, N. (2003); Y-means: A clustering method for intrusion detection, IEEE Canadian Conference on Electrical and Computer Engineering, ISSN 0840-7789, 2: 1083-1086.

[11] Quinlan, J.R. (1993); C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., ISBN 1558602402.

[12] Elbasiony, R.M. et al (2013); A hybrid network intrusion detection framework based on random forests and weighted k-means, Ain Shams Engineering Journal, ISSN 2090-4479, 4(4): 753-762.
[13] KDD Cup 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, May 2015.

[14] Lippmann, R.P. et al (2000); Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation, IEEE Proceedings of DARPA Information Survivability Conference and Exposition (DISCEX), ISBN 0-7695-0490-6, 2: 12-26.

[15] Tavallaee, M. et al (2009); A detailed analysis of the KDD CUP 99 data set, Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications (CISDA), ISBN 978-1-4244-3763-4, 1-6.

[16] Pfahringer, B. (2000); Winning the KDD99 classification cup: bagged boosting, ACM SIGKDD Explorations Newsletter, ISSN 1931-0145, 1(2): 65-66.

[17] Yu, G.D. et al (2014); Multi-objective rescheduling model for product collaborative design considering disturbance, International Journal of Simulation Modelling, ISSN 1726-4529, 13(4): 472-484.

[18] Gusel, L.R. et al (2015); Genetic based approach to predicting the elongation of drawn alloy, International Journal of Simulation Modelling, ISSN 1726-4529, 14(1): 39-47.

[19] Prasad, K. et al (2016); A knowledge-based system for end mill selection, Advances in Production Engineering & Management, ISSN 1856-6250, 11(1): 15-28.