QUESTION CATEGORIZATION USING LEXICAL FEATURE IN OPINI.ID

Christian Eka Saputra1, Derwin Suhartono2, and Rini Wongso3
1,2,3Computer Science Department, School of Computer Science, Bina Nusantara University
Jln. K. H. Syahdan No. 9, Jakarta Barat 11480, Indonesia
1christian.saputra001@binus.ac.id; 2dsuhartono@binus.edu; 3rwongso@binus.edu

Received: 22nd September 2017 / Revised: 28th November 2017 / Accepted: 4th December 2017

Abstract - This research aimed to categorize questions posted on Opini.id. N-gram and Bag of Concepts (BOC) were used as lexical features and combined with Naïve Bayes, Support Vector Machine (SVM), and J48 Tree as classification methods. The experiments used data from an online media portal to categorize questions posted by users. The best accuracy, 96.5%, is obtained by combining the Bigram Trigram Keyword (BTK) features with the J48 Tree classifier. Meanwhile, the combinations of Unigram Bigram (UB) and Unigram Bigram Keyword (UBK) with attribute selection in WEKA achieve an accuracy of 95.94% with SVM as the classifier.

Keywords: text classification, Bag of Concepts, Naïve Bayes, Support Vector Machine (SVM), J48 Tree

I. INTRODUCTION

In the modern era, the amount of information in the form of text and multimedia (sound, images, video, etc.) has increased massively. As text is the most fundamental form of data, researchers have applied it to many tasks, such as question answering systems (Jovita et al., 2015), argumentation classification (Desilia et al., 2017), and recommender systems (Gunawan, Tania, & Suhartono, 2016). This growth has also led to information overload: information that seems meaningless now may become useful in the future. One of the best responses is to categorize or classify the information. In text categorization, the feature extraction method and the machine learning algorithm strongly affect categorization accuracy.

According to Ikonomakis, Kotsiantis, and Tampakas (2005), one solution to the problem of massive information is automatic text classification. It is needed because the volume and variety of text and multimedia information have grown massively, and the data are often unstructured, which makes them less useful if not treated properly. From a business standpoint, well-grouped information can serve as a benchmark and a good guideline for setting a company's policies or changing business processes to answer the public's needs. If the information is grouped well, decision making can produce the best solutions; for example, business leaders can obtain the information relevant to their needs if news is classified properly.

In automatic text classification, classifiers such as Naïve Bayes, Support Vector Machine (SVM), and Decision Tree can be used. Many researchers in computational linguistics implement these algorithms because they perform well. Besides the classifier, features are very important for describing the characteristics of information from various viewpoints. Structural, lexical, syntactic, and contextual features have been defined as groups of features for specific tasks (Stab & Gurevych, 2014). One feature group that is quite successful in describing the meaning of a sentence is the lexical feature.
A lexical feature is a representation of previously defined indicators associated with words, lexemes, and vocabulary; each word is treated independently of other words. Examples of lexical features are N-grams (unigram, bigram, trigram), Bag of Words (BOW), and Bag of Concepts (BOC).

Several studies have addressed the automation of text classification using lexical features, comparing the accuracy obtained with different feature representations. Rahmoun and Elberrichi (2007) combined N-grams, BOW, and Bag of Stemmed Words, using test corpora drawn from two data sources, Reuters and Newsgroups. Their results show that N-grams give the best representation for classifying a text, compared with representations such as BOW or Bag of Stemmed Words.

Wei et al. (2008) used N-gram features on Mandarin text, with a large corpus from TanCorp consisting of more than 14,000 texts divided into 12 classes. They noted that the advantages of N-grams were that no word segmentation was needed and that no special techniques or dictionaries were required for the implementation. Their experiments concluded that the bigram is the best single feature for Mandarin. The combination of 1-, 2-, 3-, and 4-grams gave the best overall result, followed by 1- and 2-grams together and then 2-grams alone; the worst feature was 1-grams alone. Mandarin words mostly consist of only one or two characters, while some Chinese scientific names consist of more characters, which is why the combination of N-grams gave a good result in the text classification process.

Sahlgren and Coster (2004) used a new approach to feature representation in text categorization. They used BOC, which combines words with similar meanings, and compared it with BOW, which only counts the frequency of occurrence of each word in a document. Their BOC, or concept-based, representation is considered more efficient and fast, and it does not require additional external resources. Random indexing is also used in the implementation of BOC to accelerate the assignment of values in the vector space model, since BOC is otherwise computationally expensive. The experiment concluded that for a small number of documents, BOW (82.77%) with a linear kernel and TF-IDF gave a better result than BOC (82.29%) with a polynomial kernel and TF-IDF. However, on a large collection such as REUTERS-21578, BOC performed better (88.74%) than BOW (88.09%) (Sahlgren & Coster, 2004).

Other researchers have classified biomedical literature using supervised learning, in which the classifier is trained on samples with known categories and the resulting data model is applied to other documents that serve as test data. That research compares the advantages and disadvantages of BOW and BOC in transforming features into the vector space model. The researchers agree that BOW suffers from highly sparse data and produces high-dimensional representations; therefore, they use BOC in the feature transformation process.
This concept captures units of meaning, that is, the unity of various meanings (Garcia, Rodriguez, & Anido, 2015). Moreover, classifying information is very useful for facilitating the formation of new meaning and for turning data into representations that will be useful in the future.

According to Movementi (2015), Opini.id, a popular news portal in Indonesia, is a combination of social media and news portal. It facilitates Indonesians and their communities in giving and sharing opinions, with the mission of supporting public opinion in developing Indonesia. In this portal, Indonesians are free to share their thoughts and opinions, so a lot of information can be obtained through it. However, because so many opinions are posted on Opini.id, it is difficult for administrators to categorize them manually in order to manage or analyze the data, and manual work can lead to incorrect labeling. Based on this problem, this research aims to categorize questions posted on Opini.id (which uses the Indonesian language). The researchers use the lexical features N-gram and BOC with Naïve Bayes, SVM, and Decision Tree (J48 Tree) as classifiers, implemented with the Java WEKA API.

II. METHODS

In this research, the questions posted on Opini.id are processed for classification using lexical features and three classifiers: Naïve Bayes, SVM, and Decision Tree (J48 Tree). There are two main phases in the automation of text classification: data preprocessing and data processing.

The first phase is data preprocessing, which is divided into three main processes: streaming data, stemming, and stopword removal. Figure 1 describes the steps in data preprocessing.

Figure 1 Data Preprocessing Steps

Streaming data is the initial process of obtaining the data used in the text classification process. The data source is the internal data of Opini.id, which consists of 12,700 rows in a single database and has been divided into ten main categories: business, technology, sports, health, travel, politics, celebrities, lifestyle, art, and education.

The second step of data preprocessing is stemming, which stems all training data using the algorithm proposed by Nazief and Adriani (1996). The algorithm works on the morphological rules of Indonesian words: it removes the prefix and suffix of a word and consults a previously provided database of stem words to decide the stemming result. Because the algorithm relies on the stem word database, the more complete the list of words provided, the higher the resulting accuracy.

The stemming process in this research is enhanced with a stemmed-word storage mechanism. Words that have passed through stemming are stored in a database, which reduces long computations by checking for and reusing existing results from the stemmed-word database instead of repeating the stemming process from the beginning. This enhancement makes the stemming step considerably faster and more efficient, so that effort can be directed to the primary process: the creation of training data and the data model for text classification. In this research, stemmed words are stored in the database as a reference for subsequent stemming, so base words are taken from the database for words that have been stemmed before.
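The following is a minimal sketch of this caching mechanism, not the authors' published code. The `Stemmer` interface and a `NaziefAdrianiStemmer` implementation behind it are assumptions for illustration, and an in-memory map stands in for the database table described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of the stemmed-word cache described above.
public class CachingStemmer {

    // Hypothetical interface; an implementation of the Nazief-Adriani
    // algorithm would sit behind it.
    interface Stemmer {
        String stem(String word);
    }

    private final Stemmer delegate;

    // In the research the cache is a database table; an in-memory map is
    // used here so the sketch stays self-contained.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public CachingStemmer(Stemmer delegate) {
        this.delegate = delegate;
    }

    public String stem(String word) {
        // Reuse the stored result if the word was stemmed before; otherwise
        // run the full algorithm once and remember the answer.
        return cache.computeIfAbsent(word.toLowerCase(), delegate::stem);
    }
}
```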
The last step of data preprocessing is stopword removal. It removes certain word types such as conjunctions ("dan", "atau", and others) and interjections ("duh", "wow", "wah", and others). The stopword list is taken from previously published research. Stopword removal reduces noise in the data and improves the computation. In this process, a statement like "Ternyata memang Andi suka sepakbola sejak lama" (literally: evidently, Andi has loved football for a long time) becomes "Andi suka sepakbola" (literally: Andi loves football). Stopword removal aims to improve the accuracy of the automatic classification process: since the data will be grouped into predetermined categories, all words with no connection to those categories are eliminated from the corpus. This reduces the sparseness of the word matrices used by the machine learning algorithms applied afterwards.

The next phase after data preprocessing is data processing, described in Figure 2.

Figure 2 Data Processing Process

Feature extraction is the process of obtaining a representation of the data by extracting its important characteristics. It needs predefined categories of features so that the extracted features meet the actual needs. The lexical features used in this research are N-gram and BOC.

An N-gram is a combination of words that can be obtained by slicing a longer string. The defining characteristic of an N-gram is that it is a contiguous sequence of items: phonemes, syllables, letters, or words (Permadi, 2008). According to Hanafi, Whidiana, and Dayawati (2009), N-gram is a simple method for categorizing texts or documents, with the advantage of not being too sensitive to misspellings. However, with N-grams the feature matrices become very large. N-grams can be character-based as well as word-based. The N in N-gram indicates the size: unigram (N=1), bigram (N=2), and trigram (N=3). Examples are given in Table 1.

Table 1 N-Gram Implementation Examples in the Indonesian Language

Word-based, for the string "Andi suka bermain sepakbola di lapangan Senayan" (literally: Andi loves to play football at the Senayan stadium):
  Unigram: Andi, suka, bermain, sepakbola, di, lapangan, Senayan
  Bigram: Andi suka, suka bermain, bermain sepakbola, sepakbola di, di lapangan, lapangan Senayan
  Trigram: Andi suka bermain, suka bermain sepakbola, bermain sepakbola di, sepakbola di lapangan, di lapangan Senayan

Character-based, for the string "Pemerintah" (literally: government):
  Unigram: p, e, m, e, r, i, n, t, a, h
  Bigram: pe, em, me, er, ri, in, nt, ta, ah
  Trigram: pem, eme, mer, eri, rin, int, nta, tah

In this research, N-grams are constructed from the internal data of Opini.id: the data are retrieved from a database, a function forms the N-grams (unigram, bigram, and trigram), and the results are saved in a new data table. A sketch of this construction is given below, after the BOC description.

The BOC feature representation is a newer development of the earlier transformation concept, BOW. It focuses on the meaning a word carries: it combines words in a document that share the same meaning. This feature transformation can be implemented by summing all the existing vector values, where the values are based on the counts of words in each document.
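To make both feature constructions concrete, the following is a minimal Java sketch, not the authors' actual implementation: it builds word-based N-grams from a token list and a simple BOC-style vector by summing word occurrences per concept group. The whitespace-tokenized input and the hand-built concept lexicon are assumptions for illustration; in this research the concepts are derived from Opini.id's internal category data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LexicalFeatures {

    // Word-based N-grams: every contiguous run of n tokens.
    public static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    // BOC-style vector: for each concept (a group of words with related
    // meaning), sum the occurrences of its member words in the document.
    public static Map<String, Integer> bocVector(List<String> tokens,
            Map<String, List<String>> conceptLexicon) {
        Map<String, Integer> vector = new HashMap<>();
        for (Map.Entry<String, List<String>> concept : conceptLexicon.entrySet()) {
            int count = 0;
            for (String token : tokens) {
                if (concept.getValue().contains(token)) {
                    count++;
                }
            }
            vector.put(concept.getKey(), count);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Whitespace tokenization is an assumption of this sketch.
        List<String> tokens = List.of("andi", "suka", "bermain", "sepakbola");
        System.out.println(ngrams(tokens, 2));
        // -> [andi suka, suka bermain, bermain sepakbola]

        // A tiny hand-built concept lexicon (hypothetical).
        Map<String, List<String>> lexicon = Map.of(
                "olahraga", List.of("sepakbola", "bermain", "bola"));
        System.out.println(bocVector(tokens, lexicon)); // -> {olahraga=2}
    }
}
```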
According to Täckström (2005), the BOC concept has proven to work well in information retrieval systems, even though it is quite simple. However, it shares a disadvantage with BOW: the relevance or contextual relationships of a word are not taken into account at all. According to Sahlgren and Coster (2004), BOC can improve the performance of a classifier; in their research, the SVM score increased from 88.74% to 88.99%. Despite the admittedly small difference, they argued that it is not negligible, as it is consistent with previous findings.

In this research, BOC is implemented by collecting all internal data of Opini.id. The data are passed to a Java function that splits the statements into words, which are then grouped according to the categories of Opini.id's internal data.

The next step is data modeling. The purpose of the data model is to serve as benchmark data for new, unlabeled data, so that the new data can be classified with specific category labels.

III. RESULTS AND DISCUSSIONS

The method proposed in this research is implemented using the WEKA API. WEKA is a collection of machine learning algorithms for data mining tasks. As mentioned previously, this API is used within the Java Spring framework. The process of building the data model starts with instance and attribute initialization. The term instance refers to the format or template required before WEKA functions can be applied; instances accommodate the training and testing data sets. Attributes are used as feature parameters in the categorization process. Figure 3 describes the process.

Figure 3 Build Data Model Process

WEKA functions and methods support various data types; in this research, numeric attributes (representing numbers) and a class attribute (representing the selected category) are used. The dataset container is initialized using the DenseInstance function. An example of a dataset instance is 1, 2, 1, 3, 2, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, Bisnis. The data are then inserted into WEKA Instances to create the data model, after which the data are built and ready for classification. A minimal sketch of this step, under stated assumptions, follows.
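The sketch below shows this data-model construction with the WEKA Java API. The attribute names, the two category labels, and the two feature rows are hypothetical stand-ins; the real feature vectors in this research have 40 numeric attributes and ten categories, as in the example instance above.

```java
import java.util.ArrayList;
import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class BuildDataModel {
    public static void main(String[] args) throws Exception {
        // Three numeric feature attributes (hypothetical names; the real
        // vectors in this research have 40 numeric attributes).
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("f1"));
        attrs.add(new Attribute("f2"));
        attrs.add(new Attribute("f3"));

        // The class attribute holds the category labels (two of the ten shown).
        ArrayList<String> labels = new ArrayList<>();
        labels.add("Bisnis");
        labels.add("Teknologi");
        attrs.add(new Attribute("category", labels));

        Instances data = new Instances("opini", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // Each row: feature values followed by the class label index.
        double[] row1 = {1, 2, 1, labels.indexOf("Bisnis")};
        double[] row2 = {0, 3, 4, labels.indexOf("Teknologi")};
        data.add(new DenseInstance(1.0, row1));
        data.add(new DenseInstance(1.0, row2));

        // The data model is now built and ready for a classifier.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```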
In this research, data classification is done using three classifiers, Naïve Bayes, Decision Tree (J48 Tree), and SVM, with ten categories: "Bisnis", "Teknologi", "Olahraga", "Kesehatan", "Wisata", "Politik", "Selebritas", "Gaya Hidup", "Seni", and "Edukasi".

The Naïve Bayes classifier is based on the Bayes rule of conditional probability. It makes use of all the attributes contained in the data and analyzes them individually, as if they were equally important and independent of each other (Wongso et al., 2017).

A Decision Tree is a predictive machine learning model that decides the target value of a new sample based on the attribute values of the available data. The internal nodes of a Decision Tree denote the different attributes, the branches between the nodes give the possible values these attributes may take in the observed samples, and the terminal nodes hold the final value (the classification) of the dependent variable. As described by Ozer (2008), C4.5 is an algorithm for generating a Decision Tree, developed by Ross Quinlan. J48 Tree is the open-source Java implementation of the C4.5 algorithm in the WEKA data mining tool. The algorithm induces Decision Trees for classification and uses reduced-error pruning (Ozer, 2008). The WEKA tool provides many options associated with tree pruning; in case of potential overfitting, pruning can be used to make the tree more precise. Otherwise, the classification proceeds recursively until every single leaf is pure, so that the classification of the training data is as perfect as possible. The algorithm generates the rules by which the particular identity of the data is determined. The objective is to generalize the Decision Tree progressively until it reaches an equilibrium of flexibility and accuracy (Kaur & Chhabra, 2014).

According to Nugroho, Witarto, and Handoko (2003), SVM is a method used for pattern recognition. It is a machine learning algorithm based on structural risk minimization, which searches for a hyperplane separating two classes of data in an input space. The hyperplane is measured by a margin, or distance; the patterns nearest to the borderline of the hyperplane are called support vectors.

Classification is carried out after the data model has been formed from the trained data. It begins by loading the data into instances for categorization, as described in Figure 4. Then, categorization is performed based on the existing data model, and the predicted category can be obtained along with the possible value for each data point. The whole process is described in Figure 5.

Figure 4 Data Modelling Process

Figure 5 Data Classification Process

The data model is tested using cross-validation with n = 10 for the Naïve Bayes, SVM, and J48 Tree classifiers; the results for the feature combinations are given in Table 2. The highest accuracy, 96.5%, is obtained with the Bigram Trigram Keyword (BTK) features and the J48 Tree classifier, which outperforms the other classifiers and feature combinations. The next highest accuracy, 96.41%, only slightly lower, is obtained with Bigram Keyword (BK) and J48 Tree. Meanwhile, the worst result is obtained with the Unigram Keyword (UK) feature and Naïve Bayes as the classifier, at only 41.79%, far below the average. The UK feature consistently gives the worst result among the features, with 51.92% accuracy using J48 Tree and 62.72% using SVM.

Next, another experiment uses the feature combinations with attribute selection and the classifiers Naïve Bayes, SVM, and J48 Tree; the results are shown in Table 3. A sketch of this evaluation setup, under stated assumptions, follows.
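The research does not state which WEKA attribute-selection configuration was used, so the evaluator and search method below (CfsSubsetEval with BestFirst, WEKA's defaults for this meta-classifier) are assumptions. Given an `Instances` object built as sketched earlier, the 10-fold cross-validation used for Tables 2 and 3 might look like this:

```java
import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class EvaluateClassifiers {

    // 10-fold cross-validation accuracy, as reported in Tables 2 and 3.
    static double accuracy(Classifier cls, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cls, data, 10, new Random(1));
        return eval.pctCorrect();
    }

    public static void evaluateAll(Instances data) throws Exception {
        System.out.println("NaiveBayes: " + accuracy(new NaiveBayes(), data));
        System.out.println("J48:        " + accuracy(new J48(), data));
        System.out.println("SMO (SVM):  " + accuracy(new SMO(), data));

        // Attribute selection wrapped around the SVM. CfsSubsetEval and
        // BestFirst are assumptions here; the research does not name its
        // attribute-selection configuration.
        AttributeSelectedClassifier selected = new AttributeSelectedClassifier();
        selected.setEvaluator(new CfsSubsetEval());
        selected.setSearch(new BestFirst());
        selected.setClassifier(new SMO());
        System.out.println("SMO + attribute selection: "
                + accuracy(selected, data));
    }
}
```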
Table 2 Experiment Using N-gram Features (accuracy in %)

Feature Combination   Naïve Bayes   J48 Tree   SVM
UB                    80.39         96.04      93.37
UT                    80.81         91.3       88.89
UK                    41.79         51.92      62.72
UBT                   89.23         96.21      94.84
UBTK                  79.79         91.02      86.18
UBK                   89.48         96.04      92.7
BT                    90.12         96.21      95.83
BK                    90.43         96.41      95
BTK                   90.26         96.5       95.07
TK                    82.36         91.86      90.71
UTK                   81.32         91.04      88.6

Description:
UB = Unigram Bigram
UT = Unigram Trigram
UK = Unigram Keyword
UBT = Unigram Bigram Trigram
UBTK = Unigram Bigram Trigram Keyword
UBK = Unigram Bigram Keyword
BT = Bigram Trigram
BK = Bigram Keyword
BTK = Bigram Trigram Keyword
TK = Trigram Keyword
UTK = Unigram Trigram Keyword

Table 3 Experiment with N-gram and Attribute Selection (accuracy in %)

Feature Combination   Naïve Bayes   J48 Tree   SVM
UB                    89.56         95.78      95.94
UT                    82.09         91.09      91.28
UK                    44.35         44.48      53.66
UBT                   88.8          95.92      95.88
UBTK                  82.1          91.1       91.3
UBK                   89.56         95.78      95.94
BT                    88.8          95.92      95.88
BK                    88.28         94.44      94.76
BTK                   88.8          95.91      95.88
TK                    82.09         91.09      91.28
UTK                   82.1          91.1       91.28

According to the results shown in Table 3, the best result, an accuracy of 95.94%, is obtained with the Unigram Bigram (UB) and Unigram Bigram Keyword (UBK) features and the SVM classifier, followed by UBT with J48 Tree at 95.92% and BTK with J48 Tree at 95.91%. The worst result is again achieved with the UK feature and the Naïve Bayes classifier, at only 44.35%, followed by J48 Tree (44.48%) and SVM (53.66%).

IV. CONCLUSIONS

The experiments compare the automatic classification process using combinations of the lexical features unigram, bigram, trigram, and the keywords of each category in the data modeling, evaluated with 10-fold cross-validation. They show that the combination of bigram, trigram, and keyword gives the highest accuracy, 96.5%, with J48 Tree. Moreover, an experiment combining the lexical features with attribute selection is carried out to find out which features most significantly affect the automatic categorization of questions. It shows that the combinations UB and UBK with the SVM classifier provide a higher accuracy than the others, at 95.94%.

For further utilization of these findings, other features, such as structural and contextual features, could be attached to the current features. The good results of the lexical features indicate that classification and categorization problems should not ignore the importance of lexical knowledge in texts. With larger data for this task, deep learning would also be interesting to experiment with; given the characteristics of the data (text), Long Short-Term Memory (LSTM) would be a good fit for this question categorization task.

REFERENCES

Desilia, Y., Utami, V. T., Arta, C., & Suhartono, D. (2017). An attempt to combine features in classifying argument components in persuasive essays. In 17th Workshop on Computational Models of Natural Argument (CMNA). London, United Kingdom.

Garcia, M. M., Rodriguez, R. P., & Anido, L. (2015). Bag of concepts document representation for textual news classification. International Journal of Computational Linguistics and Applications, 6(1), 173-188.
Gunawan, A. A. S., Tania, & Suhartono, D. (2016). Recommender system for product offering by personalized email. In 1st International Workshop on Big Data and Information Security (IWBIS). Jakarta, Indonesia.

Hanafi, A., Whidiana, R., & Dayawati, R. N. (2009). Pengenalan bahasa suku bangsa Indonesia berbasis teks menggunakan metode N-Gram [Text-based recognition of Indonesian ethnic languages using the N-Gram method] (Undergraduate thesis). Bandung: Telkom University.

Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). Text classification using machine learning techniques. WSEAS Transactions on Computers, 4(8), 966-974.

Jovita, Linda, Hartawan, A., & Suhartono, D. (2015). Using vector space model in question answering system. Procedia Computer Science, 59, 305-311.

Kaur, G., & Chhabra, A. (2014). Improved J48 classification algorithm for the prediction of diabetes. International Journal of Computer Applications, 98(22), 13-17.

Movementi, S. (2015). Opini.id unggulkan fitur polling [Opini.id touts its polling feature]. Retrieved from https://tekno.tempo.co/read/news/2015/02/26/072645583/opini-id-unggulkan-fitur-polling

Nazief, B., & Adriani, M. (1996). Confix-stripping: Approach to stemming algorithm for Bahasa Indonesia. Jakarta: Faculty of Computer Science, University of Indonesia.

Nugroho, A. S., Witarto, A. B., & Handoko, D. (2003). Application of support vector machine in bioinformatics. In Indonesian Scientific Meeting in Gifu, Central Japan.

Ozer, P. (2008). Data mining algorithm for classification (Bachelor thesis). Radboud University Nijmegen.

Permadi, Y. (2008). Kategorisasi teks menggunakan N-Gram untuk dokumen berbahasa Indonesia [Text categorization using N-Grams for Indonesian-language documents] (Undergraduate thesis). Bogor: Institut Pertanian Bogor.

Rahmoun, A., & Elberrichi, Z. (2007). Experimenting N-Grams in text categorization. The International Arab Journal of Information Technology, 4(4), 377-385.

Sahlgren, M., & Coster, R. (2004). Using bag-of-concepts to improve the performance of support vector machines in text categorization. In Proceedings of the 20th International Conference on Computational Linguistics (Article No. 487).

Stab, C., & Gurevych, I. (2014). Identifying argumentative discourse structures in persuasive essays. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Täckström, O. (2005). An evaluation of bag-of-concepts representations in automatic text classification (Master's thesis). Sweden: Royal Institute of Technology.

Wei, Z., Miao, D., Chauchat, J. H., & Zhong, C. (2008). Feature selection on Chinese text classification using character N-grams. In International Conference on Rough Sets and Knowledge Technology (pp. 500-507). Springer.

Wongso, R., Luwinda, F., Trisnajaya, B., Rusli, O., & Rudy. (2017). News article text classification in Indonesian language. In The 2nd International Conference on Computer Science and Computational Intelligence (ICCSCI 2017) (pp. 137-143). Elsevier.