Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844
Vol. V (2010), No. 3, pp. 351-361

Improving a SVM Meta-classifier for Text Documents by using Naive Bayes

D. Morariu, R. Creţulescu, L. Vinţan

Daniel Morariu, Radu Creţulescu, Lucian Vinţan
"Lucian Blaga" University of Sibiu
Engineering Faculty, Computer Science Department
E. Cioran Street, No. 4, 550025 Sibiu, ROMANIA
E-mail: {daniel.morariu,radu.kretzulescu,lucian.vintan}@ulbsibiu.ro

Abstract: Text categorization is the problem of classifying text documents into a set of predefined classes. In this paper we investigate two approaches: a) developing a classifier for text documents based on Naive Bayes theory, and b) integrating this classifier into a meta-classifier in order to increase the classification accuracy. The basic idea is to learn a meta-classifier that optimally selects the best component classifier for each data point. The experimental results show that combining classifiers can significantly improve the classification accuracy and that our improved meta-classification strategy gives better results than each individual classifier. For Reuters2000 text documents we obtained classification accuracies of up to 93.87%.

Keywords: Meta-classification, Support Vector Machine, Naive Bayes, Text document, Performance Evaluation

1 Introduction

While more and more textual information is available online, effective retrieval is difficult without good indexing and summarization of document content. Document categorization is one solution to this problem. The task of document categorization is to assign a user-defined categorical label to a given document. In recent years a growing number of categorization methods and machine learning techniques have been developed and applied in different contexts.

Documents are typically represented as vectors in a feature space. Each word in the vocabulary is represented as a separate dimension, and the number of occurrences of a word in a document gives the value of the corresponding component of the document's vector.

In this paper we investigate some strategies for combining classifiers in order to improve the classification accuracy. We used classifiers based on Support Vector Machine (SVM) techniques and on Naive Bayes theory, respectively. Both are relatively insensitive to the growing dimensionality of the feature space and have proven effective in many classification tasks. The SVM classifiers are based on learning with kernels and support vectors.

We combine multiple classifiers hoping that the classification accuracy can be improved without a significant increase in response time. Instead of building a single highly accurate specialized classifier with much time and effort, we build and combine several simpler classifiers. Several combination schemes have been described in [2] and [6]. A usual approach is to build individual classifiers and later combine their judgments to make the final decision. Another approach, less commonly used because it suffers from the "curse of dimensionality" [5], is to concatenate the features produced by each classifier into a longer feature vector and use it for the final decision. Either way, meta-classification is effective only if the classifiers' synergism can be exploited. In previous studies the combination strategies were usually ad hoc, implementing schemes such as majority vote, linear combination, winner-take-all [2], or Bagging and AdaBoost [16].
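To make the combination idea concrete, the following sketch trains an SVM and a Naive Bayes classifier on bag-of-words vectors and combines their predictions by a simple majority vote. It is only an illustrative sketch: scikit-learn, the toy corpus and all identifiers are our assumptions here, not the implementation or data used in this paper.

```python
# Illustrative sketch only: scikit-learn and the toy corpus are assumptions,
# not the implementation or data used in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

# Tiny toy corpus: each document is labelled with one of two topics.
docs = [
    "stocks fell sharply on the market today",
    "the central bank raised interest rates",
    "the team won the championship game",
    "the striker scored twice in the final match",
]
labels = ["economy", "economy", "sport", "sport"]

# Bag-of-words representation: one dimension per vocabulary word, the value
# being the number of occurrences of that word in the document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Two simple component classifiers combined by (hard) majority vote.
meta = VotingClassifier(
    estimators=[("svm", SVC(kernel="linear")), ("nb", MultinomialNB())],
    voting="hard",
)
meta.fit(X, labels)

print(meta.predict(vectorizer.transform(["interest rates and the stock market"])))
```

With only two components a hard vote breaks ties arbitrarily; the meta-classifier studied in this paper goes further and learns which component classifier to trust for each data point.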
Also, some rather complex strategies have been suggested; for example, in [4] a meta-classification strategy using SVM [15] is presented and compared with probability-based strategies.

Sections 2 and 3 contain prerequisites for the work presented in this paper. In section 4 we present the methodology used for our experiments. Section 5 presents the experimental framework and section 6 presents the main results of our experiments. The last section discusses the most important results and proposes some further work.

2 Support Vector Machine

The Support Vector Machine (SVM) is a classification technique based on statistical learning theory [13], [15] that has been applied with great success to many challenging non-linear classification problems and large data sets. The SVM algorithm finds a hyperplane that optimally splits the training set. The optimal hyperplane is distinguished by the maximum margin of separation between all training points and the hyperplane. Looking at a two-dimensional problem, we actually want to find the line that "best" separates the points of the positive class from the points of the negative class. The hyperplane is characterized by a decision function of the form:

f(x) = \mathrm{sign}\left(\langle \vec{w}, \Phi(x) \rangle + b\right) \qquad (1)

where $\vec{w}$ is the weight vector, orthogonal to the hyperplane, $b$ is a scalar that represents the hyperplane's margin, $x$ is the current sample being tested, $\Phi(x)$ is a function that transforms the input data into a higher-dimensional feature space, $\langle\cdot,\cdot\rangle$ denotes the dot product, and sign is the sign function. If $\vec{w}$ has unit length, then $\langle \vec{w}, \Phi(x) \rangle$ is the length of $\Phi(x)$ along the direction of $\vec{w}$; in general $\vec{w}$ is scaled by its norm $\|\vec{w}\|$. In the training phase the algorithm needs to find the normal vector $\vec{w}$ that leads to the largest margin $b$ of the hyperplane.

3 Naive Bayes

The Bayes classifier uses Bayes' theorem, which basically computes prior probabilities for a given class based on the probability of a given term belonging to the specified class. The classifier computes the probability that a document belongs to a given class.

Bayesian theory works as a framework for making decisions under uncertainty - a probabilistic approach to inference [4] - and is particularly suited when the dimensionality of the input data is high. Bayes theorized that the probability of future events could be calculated by determining their earlier frequency. Bayes' theorem states that:

P(Y = y_i \mid X = x_k) = \frac{P(Y = y_i)\, P(X = x_k \mid Y = y_i)}{P(X = x_k)} \qquad (2)

where:
P(Y = y_i) - prior probability of hypothesis Y (prior);
P(X = x_k) - prior probability of training data X (evidence);
P(X = x_k \mid Y = y_i) - probability of X given Y (likelihood);
P(Y = y_i \mid X = x_k) - probability of Y given X (posterior probability).

The Naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that, given the target value of the instance, the probability of observing the conjunction y_1, y_2, ..., y_n is just the product of the probabilities of the individual attributes:

c_{map} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(y_i \mid c)
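As a concrete illustration of the decision rule above, the sketch below estimates the class priors P(c) and the word likelihoods P(y_i | c) from word counts and selects the class with the highest posterior, computed in log space. It is a minimal sketch under our own assumptions (Laplace smoothing, whitespace tokenization); it is not the authors' implementation.

```python
# Minimal multinomial Naive Bayes sketch. Laplace smoothing and whitespace
# tokenization are our assumptions, not necessarily the authors' choices.
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """Collect class frequencies (for P(c)) and per-class word counts (for P(y_i | c))."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, c in zip(docs, labels):
        for word in doc.lower().split():
            word_counts[c][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(doc, class_counts, word_counts, vocab):
    """Return c_map = argmax_c P(c) * prod_i P(y_i | c), evaluated in log space."""
    n_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)       # log P(c)
        total = sum(word_counts[c].values())
        for word in doc.lower().split():
            # Laplace-smoothed likelihood P(y_i | c)
            score += math.log((word_counts[c][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage
docs = ["rates rise again", "market falls", "team wins final", "great goal scored"]
labels = ["economy", "economy", "sport", "sport"]
model = train(docs, labels)
print(classify("the market rates", *model))
```

Summing logarithms instead of multiplying the raw probabilities avoids numerical underflow when the product in the decision rule runs over many attributes.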