Big Data Sentimental Analysis Using Document to Vector and Optimized Support Vector Machine

Sozan Abdulla Mahmood, Qani Qabil Qasim
Department of Computer Science, University of Sulaimani, Sulaymaniyah, Iraq

ABSTRACT

With the rapid evolution of the internet, using social media networks such as Twitter, Facebook, and Tumblr has become so common that they have made a great impact on every aspect of human life. Twitter is one of the most popular micro-blogging social media platforms that allows people to share their emotions in short texts about a variety of topics, such as a company's products, people, politics, and services. Analyzing sentiment is possible because emotions and reviews on different topics are shared every second, which makes social media a useful source of information in fields such as business, politics, applications, and services. The Twitter Application Programming Interface (Twitter-API), an interface between developers and Twitter, allows developers to search for tweets based on a desired keyword using secret keys and tokens. In this work, the Twitter-API was used to download the most recent tweets about four keywords, namely, (Trump, Bitcoin, IoT, and Toyota), with a different number of tweets per keyword. VADER, a lexicon rule-based method, was used to categorize the downloaded tweets into "Positive" and "Negative" based on their polarity; the tweets were then stored in a MongoDB database for the subsequent processing steps. After pre-processing, the hold-out technique was used to split each dataset into 80% "training set" and the remaining 20% "testing set." After that, a deep-learning-based Document to Vector model was used for feature extraction. To perform the classification task, a Radial Basis Function kernel-based support vector machine (RBF-SVM) was used. The accuracy of RBF-SVM depends mainly on the values of the hyperplane "soft margin" penalty "C" and the "gamma" (γ) parameters. The main goal of this work is to select the best values for those parameters in order to improve the accuracy of the RBF-SVM classifier. The objective of this study is to show the impact of four meta-heuristic optimizer algorithms, namely, particle swarm optimizer (PSO), modified PSO (MPSO), grey wolf optimizer (GWO), and a hybrid of PSO-GWO, in improving SVM classification accuracy by selecting the best values for those parameters. To the best of our knowledge, hybrid PSO-GWO has never been used in SVM optimization. The results show that these optimizers have a significant impact on increasing SVM accuracy. The best accuracy of the model with traditional SVM was 87.885%. After optimization, the highest accuracy obtained with GWO is 91.053%, while the best accuracies of PSO, hybrid PSO-GWO, and MPSO are 90.736%, 90.657%, and 90.557%, respectively.
Index Terms: Document to Vector, Grey Wolf Optimizer, Particle Swarm Optimizer, Hybrid Particle Swarm Optimizer-Grey Wolf Optimizer, Opinion Mining, Radial Basis Function Kernel-based Support Vector Machine, Sentiment Analysis, Support Vector Machine Optimization, Twitter Application Programming Interface

Corresponding author's e-mail: Qani Qabil Qasim, Department of Computer Science, University of Sulaimani, Sulaymaniyah, Iraq. E-mail: qani.qabil@gmail.com
Received: 23-10-2019; Accepted: 06-01-2020; Published: 13-02-2020
DOI: 10.21928/uhdjst.v4n1y2020.pp18-28; E-ISSN: 2521-4217; P-ISSN: 2521-4209
Copyright © 2020 Mahmood and Qasim. This is an open access article distributed under the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (CC BY-NC-ND 4.0).

1. INTRODUCTION

Nowadays, the use of the internet has become inseparable from our daily routines. Social media networks such as Facebook and Twitter have been developed to give people an easy way to share their viewpoints about any product or service in the form of short texts. This makes them rich sources of data that can be valuable for various organizations and companies seeking their fans' or customers' opinions about their products and services. Besides companies, well-known people such as politicians and athletes may also want to exploit those opinions and attitudes to help them make better decisions in the future. However, the diversity and sparsity of the data make it impossible for humans to analyze it manually. Here, machine learning and automation can take part in solving this big-data problem, and sentiment analysis (SA) or opinion mining techniques can be used [1]. SA refers to the task of finding the opinions of authors about specific entities expressed in written text [2].

In recent years, Twitter has become one of the most popular social media and micro-blogging platforms, offering users a convenient way to write and share their thoughts about anything within a 280-character limit (called tweets). Twitter is used extensively as a micro-blogging service worldwide. Tweets contain misspellings, slang, and symbolic forms of words, which poses a major challenge for conventional natural language processing or machine learning systems applied to tweets [3].

A sentiment analyzer model can be built with three main approaches: a lexicon-based approach, a machine-learning-based approach, and a hybrid of both. The machine learning approach is one of the most popular techniques and is widely used to build automated classification models with the help of algorithms such as the support vector machine (SVM) and Naïve Bayes (NB), owing to their ability to handle large amounts of data [4].

In this study, we propose a technique to improve SVM performance for SA by implementing four different meta-heuristic optimizers, namely, particle swarm optimizer (PSO), modified PSO (MPSO), grey wolf optimizer (GWO), and a hybrid of PSO-GWO. The sentiment classification goes through four phases: data collection, data pre-processing, feature extraction, and classification. In the first phase, the Twitter Application Programming Interface (Twitter-API) enables developers to collect tweets about any keyword they desire; this is followed by a pre-processing phase to remove the least informative data such as URLs, hashtags, numbers, and so on, as sketched below.
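The paper does not reproduce its cleaning code, so the following is a minimal Python sketch of this pre-processing step; the `clean_tweet` helper and its exact rules are illustrative assumptions, not the authors' implementation.

```python
import re

def clean_tweet(text: str) -> str:
    """Remove the least informative parts of a raw tweet (URLs, user
    mentions, hashtag symbols, numbers) and normalize whitespace.
    Illustrative only; the paper's exact cleaning rules may differ."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # user mentions
    text = re.sub(r"#", " ", text)                      # hashtag symbol (keep the word)
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip().lower()

print(clean_tweet("Check #Bitcoin at https://example.com, up 12% today! @trader"))
# -> "check bitcoin at up % today!"
```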
In the third phase, the Document to Vector (Doc2Vec) approach is used to vectorize the cleaned text, that is, to produce a numerical representation of it. Finally, PSO, MPSO, GWO, and hybrid PSO-GWO are used to select the best parameters for the SVM classifier, which classifies the features generated in the previous step.

The rest of the paper is structured as follows: Section 2 discusses previous related work in this field, Section 3 describes the materials and methods used in this work, Section 4 describes the problem statement, Section 5 illustrates the proposed system model and the methodology for analyzing the datasets, Section 6 presents and discusses the results obtained from the model, and finally, Section 7 states the conclusion and future work.

2. RELATED WORK

Much research has been conducted in the field of SA. Researchers have proposed different solutions to different issues of SA, aiming to improve the performance of classification models, enhance topic-specific corpora, and reduce feature-set size to shrink the execution time and space usage of algorithms, using a variety of techniques.

Das et al. [5] review the basic stages to be considered in SA, such as pre-processing, feature extraction/selection, and representation, along with data-driven techniques in this field such as SVM and NB; they demonstrate how these techniques work and describe the metrics (precision, recall, F1-score, and accuracy) used to evaluate model efficiency. They concluded that all SA tasks are challenging and that each stage needs different techniques.

Naz et al. [6] illustrate the impact of different feature-weighting schemes, such as term frequency (TF), TF-inverse document frequency (TF-IDF), and binary occurrence (BO), for extracting features from tweets, along with different n-gram ranges (unigram, bigram, trigram, and their combinations); the features extracted from the SemEval2016 dataset were then fed to an SVM for training. The best result they achieved was 79.6% for TF-IDF with the unigram range. They also used the sentiment score vector package to calculate tweet scores in positive and negative form to improve the performance of the SVM; across the different weighting schemes and n-gram ranges, the highest accuracy achieved with SVC was 81.0% for BO with the unigram range.

Seth et al. [7] proposed a hybrid technique for improving the efficiency and reliability of their model by merging SVM with a decision tree. The model classifies tweets with SVM and an AdaBoost decision tree individually; a hybrid technique is then applied by feeding the outputs obtained from the two above-mentioned algorithms as input to a decision tree. Finally, they compared traditional techniques to the proposed model and obtained an accuracy of 84%, while the prior accuracies were 82% and 67%.

Sharma and Kumari [8] applied SVM to find the polarity, positive or negative, of review texts for four smartphone products. Before applying SVM, they used part-of-speech (POS) tagging on the tokens and then clustered the TF-IDF features to find more appropriate centroids. The accuracy of the model was evaluated with the precision, recall, F-score, and accuracy metrics and compared to previous studies on the same datasets in which no POS tagging or clustering was performed.
They obtained an accuracy of 90.99%, while the best accuracy in previous studies was 88.5%.

Rajput and Dubey [9] made a comparative study between two supervised classification algorithms, NB and SVM, for binary classification of customer reviews about six Indian stock markets. The results show that SVM provides better accuracy, 81.647%, while the NB accuracy was 78.469%.

Rane and Kumar [10] worked on a dataset of six major US airlines to perform multi-class (positive, negative, and neutral) SA. The Doc2Vec deep learning approach was used to represent the tweets as vectors for a phrase-level analysis, along with seven supervised machine learning algorithms (decision tree, random forest, SVM, K-nearest neighbors, logistic regression, Gaussian NB, and AdaBoost). Each classifier was trained with 80% of the data, tested on the remaining 20%, and evaluated with the precision, recall, and F1-score metrics. They concluded that ensemble approaches such as AdaBoost, which combine several other classifiers to form one strong classifier, perform much better. The maximum achieved accuracy was 84.5%.

Shuai et al. [11] carried out binary SA on Chinese hotel reviews using the Doc2Vec feature extraction technique with SVM, logistic regression, and NB as classifiers. After comparing the classification algorithms based on precision, recall rate, and F-measure, SVM achieved the best results in their experiment: 79.5%, 87.92%, and 81.16% on the three metrics, respectively.

Bindal and Chatterjee [3] described a two-step method (lexicon-based sentiment scoring in conjunction with SVM) in which point-wise mutual information is utilized to calculate the sentiment of tweets. They also discussed the efficacy of several linguistic features, such as POS tags and higher-order n-grams (uni + bi-gram, uni + bi + tri-gram), in sentiment mining. Their proposed scheme had a better average F-score than commonly used one-step methods such as the lexicon, NB, maximum entropy, and SVM classifiers; for the unigram range, lexicon-SVM outperformed the other classification methods with an F-score of 84.39%, while the other methods' F-scores were 82.44%, 81.85%, 80.18%, and 83.56%, respectively.

Mukwazvure and Supreethi [12] used a hybrid technique that involves a lexicon-based approach for detecting the polarity of news comments in the technology, politics, and business domains. The outcome of the lexicon-based step was then fed to train two supervised machine learning algorithms: SVM and K-nearest neighbor (kNN) classifiers. Experimental results revealed that SVM performed better than kNN, scoring 73.6%, 61.38%, and 58.00%, while the kNN results were 74.24%, 56.27%, and 55.58%.

Flores et al. [13] made a comparative analysis of the SVM algorithm (sequential minimal optimization) with the synthetic minority over-sampling technique (SMOTE) and the Naive Bayes multinomial (NBM) algorithm with SMOTE for the classification of two SA datasets gathered by students of the University of San Carlos. The outcomes showed that, with 10-fold cross-validation, SA on their datasets performed better than with a 70:30 split. The accuracies of NBM with SMOTE were 72.33% and 78.02%, and those of SVM with SMOTE were 83.16% and 82.22%.

3. MATERIALS AND METHODS

3.1. VADER

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a lexicon- and rule-based SA tool that was developed by Hutto and Gilbert [14] in 2014. It is specifically attuned to calculating the sentiment scores of texts expressed on social media. VADER uses a sentiment lexicon, a list of lexical features (e.g., words) that are generally labeled according to their semantic orientation as either positive or negative, in combination with grammatical rules. VADER not only reports positivity and negativity scores but also tells us how positive or negative a sentiment is. VADER produces four sentiment metrics from these word ratings: the first three, positive, neutral, and negative, represent the proportion of the text that falls into those categories, and the final metric is the compound score, which is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive). According to their experiment, it is more effective than other existing lexicon-based approaches, for example, SentiWordNet.
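To make the four metrics concrete, here is a minimal sketch using the open-source `vaderSentiment` package; the 0.05 compound-score threshold for splitting tweets into "Positive" and "Negative" is a common convention and our assumption, not a rule stated by the authors.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores() returns the four VADER metrics described above:
# 'pos', 'neu', 'neg' (proportions of the text) and 'compound' in [-1, +1].
scores = analyzer.polarity_scores("The new update is absolutely great!")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}

# Labeling a tweet by its compound score (threshold is an assumption):
label = "Positive" if scores["compound"] >= 0.05 else "Negative"
```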
3.2. Word Embedding and Doc2Vec

Word embedding, also known as Word2Vec, is a technique for representing each word as a unique vector that takes the word's semantic meaning into consideration. It contrasts with bag of words, one of the most common techniques for the numerical representation of text, which converts words into a fixed-length feature vector but has some shortcomings: it does not consider the ordering of words, it ignores their semantics (for example, "powerful," "strong," and "Paris" are evaluated as equally related), and it generates a high-dimensional feature set that needs a lot of memory space [15].

In the Word2Vec approach, each word is mapped to a vector in a predefined vector space. These vectors are learned using neural networks; the learning process can be done with a neural network model or by an unsupervised process involving document statistics. Word2Vec can be implemented in two different architectures. The first is the continuous bag of words (CBoW), shown in Fig. 1, which is designed to predict the current word given the surrounding history and future words as input. The second is the skip-gram (SG), which maximizes the probability of the surrounding words given the current word [15], [16].

Fig. 1. Continuous bag of words and skip-gram.

Doc2Vec, also called paragraph vector (PV), is a Word2Vec-based learning approach that converts an entire paragraph into a unique vector, represented by a column in a matrix D, while every word is mapped to a unique vector represented by a column in a matrix W. The word and paragraph vectors are then concatenated to predict the next word. The CBoW and SG methods have been adapted for Doc2Vec into two methods, namely, the distributed bag of words version of PV (PV-DBOW) and the distributed memory version of PV (PV-DM) [10], as shown in Figs. 2 and 3. The DBOW model ignores the context words in the input but forces the model to predict words randomly sampled from the paragraph in the output [15]. In the DM model, to predict the next word in a context, the paragraph and word vectors are either averaged, which is called DM mean (DMM), or concatenated, which is called DM concatenation (DMC) [15].

Fig. 2. Distributed bag of words of paragraph vector. Fig. 3. Distributed memory of paragraph vector.
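The paper does not list its Doc2Vec code, but the PV-DM and PV-DBOW variants described above map directly onto Gensim's `Doc2Vec` class; the following is a minimal sketch in which the tiny corpus and all hyperparameter values are illustrative assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each cleaned, tokenized tweet becomes a TaggedDocument with a unique tag.
tweets = [["bitcoin", "price", "is", "rising"],
          ["terrible", "service", "never", "again"]]
corpus = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tweets)]

# dm=1 selects PV-DM; dm_mean=1 averages the context vectors (DMM),
# while dm_concat=1 would concatenate them instead (DMC).
# dm=0 selects PV-DBOW. Sizes and epochs here are illustrative.
model = Doc2Vec(vector_size=100, window=5, min_count=1, dm=1, dm_mean=1,
                epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a fixed-length feature vector for an unseen tweet.
vec = model.infer_vector(["bitcoin", "is", "falling"])
```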
3.3. PSO Algorithm

PSO is a meta-heuristic algorithm developed by Kennedy and Eberhart in 1995 to optimize numeric problems iteratively. PSO simulates the behavior of animal groups searching for food, especially bird flocking or fish schooling. PSO starts with a randomly distributed group of agents, called particles, in a search space; every particle has its own velocity [17]. Each particle keeps track of two "best" achieved positions: the first is its own best (local best) position, referred to as "pbest," and the second is the global best position, referred to as "gbest." At each time step, the particles move toward "pbest" and "gbest" based on a new velocity, some constant coefficient parameters c_1, c_2, and w (the inertia weight), and two random numbers.

In D-dimensional space, the PSO algorithm can be described as follows: X_i = (x_{i1}, x_{i2}, x_{i3}, ..., x_{iD}) represents the current position of particle i, V_i = (v_{i1}, v_{i2}, v_{i3}, ..., v_{iD}) refers to its velocity, the local best position is denoted Pbest_i = (p_{i1}, p_{i2}, p_{i3}, ..., p_{iD}), and the global best position of all particles is Gbest = (p_{g1}, p_{g2}, p_{g3}, ..., p_{gD}). At every iteration, each particle changes its position according to the new velocity:

v_i^{t+1} = w \cdot v_i^t + c_1 r_1 (pbest_i^t - x_i^t) + c_2 r_2 (gbest^t - x_i^t)    (1)

In this study, instead of multiplying w by only the current velocity, we multiply the entire sum, including the particle-best and group-best terms, by w. The formulated equation is:

v_i^{t+1} = w \left[ v_i^t + c_1 r_1 (pbest_i^t - x_i^t) + c_2 r_2 (gbest^t - x_i^t) \right]    (2)

x_i^{t+1} = x_i^t + v_i^{t+1}    (3)

where i refers to a particle, pbest_i and gbest are the particle's best position and the group's best position, w is the inertia weight, and c_1 and c_2 are acceleration coefficients. r_1 and r_2 are two random numbers in the range (0, 1); v_i^t is the current velocity and v_i^{t+1} is the new velocity at the next time step or iteration. Furthermore, x_i^t is the current particle position and x_i^{t+1} is the new particle position.

The pseudocode of PSO is:

Initialize number of particles (n_particles), D, n_iterations, c1, c2, and w
For each particle i in n_particles
    Initialize Xi, Vi
End for
Repeat
    For each particle i in n_particles do
        If f(Xi) < f(Pbest_i)
            Pbest_i = Xi
        End if
        If f(Pbest_i) < f(Gbest)
            Gbest = Pbest_i
        End if
    End for
    For each particle i in n_particles do
        For each dimension d in D
            Update velocity according to equation (1) for PSO and equation (2) for MPSO
            Update position according to equation (3)
        End for
    End for
    iteration = iteration + 1
Until iteration > n_iterations
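To make the connection to the classifier concrete, here is a compact Python sketch of PSO (equations 1 and 3, with the MPSO variant of equation 2 noted in a comment) searching for the RBF-SVM C and gamma values; the parameter bounds, swarm settings, and 3-fold cross-validation fitness are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(pos, X, y):
    """Negative cross-validated accuracy of an RBF-SVM at (C, gamma) = pos,
    so that minimizing the fitness maximizes accuracy."""
    C, gamma = pos
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return -cross_val_score(clf, X, y, cv=3).mean()

def pso_svm(X, y, n_particles=10, n_iters=30, w=0.7, c1=1.5, c2=1.5):
    lb, ub = np.array([0.1, 1e-4]), np.array([100.0, 1.0])  # assumed (C, gamma) bounds
    dim = 2
    pos = lb + np.random.rand(n_particles, dim) * (ub - lb)
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_f = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(n_iters):
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # eq. (1)
        # MPSO (eq. 2) would instead be:
        # vel = w * (vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos))
        pos = np.clip(pos + vel, lb, ub)                                   # eq. (3)
        f = np.array([fitness(p, X, y) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest  # best (C, gamma) found
```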
3.4. GWO Algorithm

The GWO algorithm is another type of swarm intelligence algorithm, proposed by Mirjalili et al. in 2014 [18], that mimics the leadership hierarchy and hunting mechanism of grey wolves in nature. Four types of grey wolves, alpha, beta, delta, and omega, are employed to simulate the leadership hierarchy. Furthermore, the three main steps of hunting, searching for prey, encircling prey, and attacking prey, are implemented.

3.4.1. Social hierarchy
The social hierarchy in this algorithm consists of four groups of wolves, namely, alpha (α), beta (β), delta (δ), and the rest, called omega (ω). In the GWO algorithm, the hunting (optimization) process is guided by α, β, and δ; the ω wolves follow these three wolves [18].

3.4.2. Encircling prey
Encircling prey means that grey wolves surround the prey during the hunt; the following mathematical equations describe the encircling behavior [18]:

\vec{D} = |\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)|    (4)

\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}    (5)

where t indicates the current iteration, \vec{A} and \vec{C} are coefficient vectors, \vec{X}_p is the position vector of the prey, and \vec{X} indicates the position vector of a grey wolf. The vectors \vec{A} and \vec{C} are calculated as:

\vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}    (6)

\vec{C} = 2\vec{r}_2    (7)

where \vec{a} is linearly decreased from 2 to 0 throughout the iterations and \vec{r}_1, \vec{r}_2 are two random vectors in the range (0, 1).

3.4.3. Hunting
Grey wolves can identify the location of prey and encircle it. The hunt is usually guided by the alpha; sometimes the beta and delta might also get involved. The alpha (best candidate solution), beta, and delta have better knowledge about the potential location of prey, so the first three best solutions are kept and the positions of the search agents are updated according to the positions of these best search agents, based on the following formulas [18]:

\vec{D}_\alpha = |\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}|,  \vec{D}_\beta = |\vec{C}_2 \cdot \vec{X}_\beta - \vec{X}|,  \vec{D}_\delta = |\vec{C}_3 \cdot \vec{X}_\delta - \vec{X}|    (8)

\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha,  \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta,  \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta    (9)

\vec{X}(t+1) = (\vec{X}_1 + \vec{X}_2 + \vec{X}_3) / 3    (10)

The pseudocode of GWO is as follows:

Initialize the grey wolf population Xi (i = 1, 2, ..., n)
Initialize a, A, and C
Calculate the fitness of each search agent
Xα = the best search agent
Xβ = the second-best search agent
Xδ = the third-best search agent
While (t < max number of iterations)
    For each search agent
        Update the position of the current search agent according to equations (8)-(10)
    End for
    a = 2 - t * (2 / max_iterations)
    Calculate A and C using equations (6) and (7)
    Calculate the fitness of all search agents
    Update Xα, Xβ, and Xδ
    t = t + 1
End while
Return Xα

3.5. Hybrid PSO-GWO

In hybrid PSO-GWO, the positions of the first three agents are updated in the search space as in equation (8), except that, instead of using the usual formulas, the exploration and exploitation of the grey wolves in the search space are controlled by an inertia constant [19]. The modified set of governing equations is as follows:

\vec{D}_\alpha = |\vec{C}_1 \cdot \vec{X}_\alpha - w \cdot \vec{X}|,  \vec{D}_\beta = |\vec{C}_2 \cdot \vec{X}_\beta - w \cdot \vec{X}|,  \vec{D}_\delta = |\vec{C}_3 \cdot \vec{X}_\delta - w \cdot \vec{X}|    (11)

where c_1, c_2, c_3, and w are constants. To combine the PSO and GWO variants, the velocity and position update equations are calculated as follows:

v_i^{t+1} = w \left[ v_i^t + c_1 r_1 (x_1 - x_i^t) + c_2 r_2 (x_2 - x_i^t) + c_3 r_3 (x_3 - x_i^t) \right]    (12)

x_i^{t+1} = x_i^t + v_i^{t+1}    (13)

The pseudocode of hybrid PSO-GWO is as follows:

Initialize c1, c2, c3, t = 0, w = 0.5 + r/2
velocity = random(searchAgentsNo, dim)
position = random(searchAgentsNo, dim) * (ub - lb) + lb
While (t
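The pseudocode above breaks off in the source, so the following Python sketch reconstructs a single position update of hybrid PSO-GWO from equations (6), (7), and (11)-(13); the function name, the fixed inertia value, and the bounds handling are our assumptions.

```python
import numpy as np

def hybrid_pso_gwo_step(pos, vel, leaders, t, max_iter,
                        c=(0.5, 0.5, 0.5), w=0.75, lb=0.0, ub=1.0):
    """One iteration of hybrid PSO-GWO following eqs. (6), (7), (11)-(13).
    pos, vel: (n_agents, dim) arrays; leaders: the alpha, beta, and delta
    positions (the three best agents found so far). In the paper w is
    initialized as 0.5 + r/2; a fixed value is used here for brevity."""
    n, dim = pos.shape
    a = 2 - t * (2 / max_iter)                   # 'a' decreases linearly from 2 to 0
    cand = []
    for leader in leaders:
        A = 2 * a * np.random.rand(n, dim) - a   # eq. (6)
        C = 2 * np.random.rand(n, dim)           # eq. (7)
        D = np.abs(C * leader - w * pos)         # eq. (11): inertia-weighted distance
        cand.append(leader - A * D)              # candidate positions x1, x2, x3
    x1, x2, x3 = cand
    r1, r2, r3 = np.random.rand(3, n, dim)
    vel = w * (vel + c[0] * r1 * (x1 - pos)
                   + c[1] * r2 * (x2 - pos)
                   + c[2] * r3 * (x3 - pos))     # eq. (12)
    pos = np.clip(pos + vel, lb, ub)             # eq. (13)
    return pos, vel
```

Iterating this step while re-ranking the agents by fitness after each update, exactly as in the GWO loop above, yields the full hybrid optimizer.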