J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 Journal of the Nigerian Society of Physical Sciences Performance Study of N-grams in the Analysis of Sentiments O. E. Ojo∗, A. Gelbukh, H. Calvo, O. O. Adebanji aInstituto Politécnico Nacional, Natural Language and Text Processing Laboratory, Centro de Investigacion en Computación, CDMX, Mexico Abstract In this work, a study investigation was carried out using n-grams to classify sentiments with different machine learning and deep learning methods. We used this approach, which combines existing techniques, with the problem of predicting sequence tags to understand the advantages and problems confronted with using unigrams, bigrams and trigrams to analyse economic texts. Our study aims to fill the gap by evaluating the performance of these n-grams features on different texts in the economic domain using nine sentiment analysis techniques and found more insights. We show that by comparing the performance of these features on different datasets and using multiple learning techniques, we extracted useful intelligence. The evaluation involves assessing the precision, recall, f1-score and accuracy of the function output of the several machine learning algorithms proposed. The methods were tested using Amazon, IMDB, Reuters, and Yelp economic review datasets and our comprehensive experiment shows the effectiveness of n-grams in the analysis of sentiments. DOI:10.46481/jnsps.2021.201 Keywords: n-grams, machine learning, deep learning, sentiment analysis Article History : Received: 3 July 2021 Received in revised form: 13 September 2021 Accepted for publication: 14 September 2021 Published: 29 November 2021 c©2021 Journal of the Nigerian Society of Physical Sciences. All rights reserved. Communicated by: J. Ndam 1. Introduction Machine learning and deep learning architectures are the bane of many Natural Language Processing (NLP) research works. To address a variety of tasks, including sentiment analysis, sev- eral machine learning and deep learning architectures have been proposed. Uncommon and unfamiliar words used in informa- tion and knowledge exchange can influence different aspects of life including marketing, education, governance, etc. As an in- tegral part of the internet, the digital media platforms facilitates meaningful information and knowledge exchange with a list of other network users. Data collection and reviews, with diverse ∗Corresponding author tel. no: +525560590794 Email addresses: olumideoea@gmail.com (O. E. Ojo ), gelbukh@gelbukh.com (A. Gelbukh), hcalvo@cic.ipn.mx (H. Calvo), olaronke.oluwayemisi@gmail.com (O. O. Adebanji) views and opinions about events, is gaining more impact and fast becoming an attraction for researchers and generating sig- nificant computational challenges. Effective wide-ranging min- ing of information from text helps to discover useful knowledge of vital significance. Computers can detect, interpret and pro- duce the sentiments (or tags) of a text, thereby improving gov- ernment and private companies’ operations, recognizing possi- ble threats, minimizing crime, and improving public services. The main objective of this research is to observe the efficacy of unigrams, bigrams and trigrams as characteristics of a word sequence and to predict a tag for sentiment classification. The target is to extract words or phrases behind the tags and to use the machine learning and deep learning methods to classify the data whilst measuring the accuracy of the classification. Using n-grams to study the opinion of people, we will be able to see their strength by tagging them. As they contribute to conceptual 477 O. E. Ojo et al. / J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 478 characterization, we use machine learning and deep learning models on the text. In machine learning, two major techniques are adopted: su- pervised learning [1, 2, 3, 4] and unsupervised methods [5]. Supervised methods have a training dataset with manually de- fined tags, and they learn the characteristics that match the tags from the training data. Gelbukh and Kolesnikova [2, 6] de- veloped methods that allowed the automatic sorting of word combinations into pre-established categories relating to auto- mated collocation classification. Gambino and Calvo [7] con- sidered the use of text-learned opinions using NLP techniques to distinguish tags. On the other hand, unsupervised systems are more flexible across different kinds of texts and domains. Different machine learning algorithms have been used in past works [6, 2, 4], and also neural network models [1] to gain the knowledge of how to predict the sentiments of text [3, 8]. Pre-trained models are very helpful in classifying text and other NLP activities. To extract the keywords behind the text’s feelings, the mod- els will be pre-trained on the training data and the accuracy of prediction of the models will be measured, registered and com- pared. The remaining part of the paper is organized as follows: Section 2 deals with the background and relevant works, Sec- tion 3 describes the approach used in this study, while Section 4 shows the features we experimented. Sections 5 and 6 provides the information about the machine learning and deep learning algorithms used and our experimental findings with the discus- sion of results in section 7. Section 8 gives conclusion about the work. 2. Background and Related Work Different works have been carried out in the field of senti- ment analysis [1, 7, 3, 4]. The social media and other digital media platforms, as an accepted means of communication, has flourished thereby aiding intelligence gathering and informa- tion dissemination. Machine learning techniques have shown good results in analysing sentiments in text [6, 2, 4] and other tasks such as part of speech recognition (PoS) [9], named en- tity recognition (NER) [10], etc. Linear statistical models, such as random-field (CRF) and Hidden Markov (HMM) fields, are NLP approaches used for sequence tagging with a long history of excellent performance. However, adapting these models to new tasks in new domains or languages is challenging. The combination of categorical grammar, annotation, ac- quisition of lexicons and semantic networks was used by Pekka et al. [11] to analyze the feelings of the text and to define the tags of the text. They investigated how the overall phrase struc- tured data and domain-specific language usage could aid in the detection of semantic orientations in financial and economic news. In [12], the use of syntactic n-grams (Sn-grams) to incor- porate syntactic knowledge into machine learning algorithms proved successful. Sn-grams were utilized as a baseline for authorship identification, replacing standard n-grams of words, POS tags, and characters. GBC KNN LRM NBA DTC XGB RFC MLP SVM models 70 75 80 85 % Ac cu ra cy unigram bigram trigram Figure 1. Distribution of n-grams accuracy scores in the Models for the first dataset DTC GBC KNN RFC XGB NBA MLP LRM SVM models 50 55 60 65 70 75 80 85 % Ac cu ra cy unigram bigram trigram Figure 2. Distribution of n-grams accuracy scores in the Models for the second dataset [13] have used text-CNN for extraction of text features with LSTM architecture in addition to the unimodal input functions. In a multimodal sentiment analysis task, they explored and an- alyzed the performance of three deep-learning-based architec- tures and recorded their results. On a social media data baseline, [14] explored the efficiency of deep neural network models of different complexity based on character n-grams. The training was done with augmented data and pseudo-labeled samples, and the accuracy result was enhanced. [15] also used classifiers to predict sentence tags using an objective function to infer similarity between sentences. A new objective function was used to train many classifiers to make predictions at the instance level, promoting smoothness of in- ferred instance-level labels while keeping group-level label con- straints in place. 478 O. E. Ojo et al. / J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 479 Table 1. Precision, Recall, f1 score and accuracy of the classifiers trained on the first dataset. Model n-grams Precision Recall F1 score Accuracy Logistic Regression Unigram 0.85 0.69 0.72 80.5% Bigram 0.81 0.59 0.58 74.4% Trigram 0.86 0.57 0.54 73.6% Support Vector Machine Unigram 0.88 0.81 0.83 87.1% Bigram 0.79 0.70 0.72 79.4% Trigram 0.78 0.64 0.65 76.4% Naive Bayes Unigram 0.84 0.72 0.75 81.7% Bigram 0.85 0.54 0.48 71.6% Trigram 0.85 0.51 0.43 69.8% Gradient Boosting Unigram 0.35 0.50 0.41 69.3% Bigram 0.35 0.50 0.41 69.3% Trigram 0.35 0.50 0.41 69.3% Decision Tree Unigram 0.79 0.77 0.78 81.7% Bigram 0.66 0.64 0.65 71.6% Trigram 0.67 0.61 0.61 72.1% Random Forest Unigram 0.90 0.77 0.80 85.5% Bigram 0.79 0.64 0.66 76.9% Trigram 0.82 0.60 0.59 74.9% K-Nearest Neighbors Unigram 0.75 0.72 0.73 78.4% Bigram 0.70 0.65 0.66 74.6% Trigram 0.62 0.63 0.62 66.2% XGBoost Unigram 0.85 0.75 0.77 83.3% Bigram 0.80 0.61 0.62 75.6% Trigram 0.75 0.58 0.56 73.1% Multi-Layer Perceptron Unigram 0.84 0.81 0.82 85.5% Bigram 0.78 0.74 0.76 80.7% Trigram 0.76 0.62 0.62 75.1% DTC GBC XGB KNN RFC LRM MLP SVM NBA models 50 55 60 65 70 75 80 % Ac cu ra cy unigram bigram trigram Figure 3. Distribution of n-grams accuracy scores in the Models for the third dataset 3. Approach To recognize patterns and regularities in data, the machine learning and deep learning algorithms use learned patterns to predict new observations. We pre-processed the text data be- fore we applied the different learning algorithms on the text GBC DTC RFC XGB NBA KNN LRM SVM MLP models 50 55 60 65 70 75 80 % Ac cu ra cy unigram bigram trigram Figure 4. Distribution of n-grams accuracy scores in the Models for the fourth dataset data. We tokenized the data into words and n-grams and gen- erated a vocabulary of all the special n-grams that occurred in the document. Using the term frequency-inverse document fre- quency (tf-idf) technique, data features were rescaled. We used supervised learning and we compared the results. For our work, we chose to filter out uncommon, non-informative 479 O. E. Ojo et al. / J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 480 Table 2. Precision, Recall, f1 score and accuracy of the classifiers trained on the second dataset. Model n-grams Precision Recall F1 score Accuracy Logistic Regression Unigram 0.86 0.85 0.85 85.5% Bigram 0.76 0.75 0.75 75.5% Trigram 0.63 0.57 0.51 57.0% Support Vector Machine Unigram 0.87 0.86 0.86 86.5% Bigram 0.75 0.75 0.75 75.0% Trigram 0.63 0.57 0.51 57.0% Naive Bayes Unigram 0.86 0.84 0.84 84.5% Bigram 0.73 0.71 0.70 70.5% Trigram 0.65 0.57 0.51 57.0% Gradient Boosting Unigram 0.79 0.79 0.78 78.5% Bigram 0.73 0.71 0.70 71.0% Trigram 0.72 0.58 0.50 58.0% Decision Tree Unigram 0.78 0.78 0.78 78.0% Bigram 0.68 0.67 0.66 66.5% Trigram 0.73 0.56 0.46 56.0% Random Forest Unigram 0.82 0.82 0.82 82.0% Bigram 0.75 0.73 0.72 73.0% Trigram 0.69 0.55 0.45 55.0% K-Nearest Neighbors Unigram 0.80 0.79 0.78 78.5% Bigram 0.76 0.55 0.43 54.5% Trigram 0.53 0.51 0.41 51.0% XGBoost Unigram 0.82 0.82 0.82 82.0% Bigram 0.71 0.64 0.61 64.0% Trigram 0.75 0.51 0.36 51.0% Multi-Layer Perceptron Unigram 0.85 0.84 0.84 84.5% Bigram 0.75 0.74 0.74 74.5% Trigram 0.64 0.57 0.52 57.5% content by extracting n-grams to make the algorithms more intelligent. We identified some common techniques used in recent studies [6, 2, 4, 16], namely Decision Tree Classifier (DTC), Gradient Boosting Classifier (GBC), Naive Bayes Al- gorithm (NBA) and Random Forest Classifier (RFC). Others are K-Nearest Neighbors (KNN), eXtreme Gradient Boosting (XGB), Support Vector Machines (SVM), Logistic Regression Model (LRM) and the Multi-Layer Perceptron (MLP) Classi- fier. For sentiment analysis, these models have been extensively tested, and provided accurate results when working with vari- ous dataset types. The words were patterned for parsing, such that every n-gram consists of n terms and are tagged accord- ingly. The accuracy of these methods often differ widely in validation, ranging from using small samples to a wide array of tagged data. 4. Experiments The data used for this analysis consist of a collection of four related economic and financial market reviews selected from multiple texts that have been tagged with positive, negative and neutral classes. These four datasets, extracted from different digital media platforms, have been selected because they con- tain explicit economic sentiments from which the machine and deep learning algorithms can learn. We have used the Reuters dataset in Pekka et al.[11], containing subjective sentences from economic review, and the IMDb, Amazon, and Yelp datasets in Kotzias et al.[15], which contains text sentences from reviews of products, movies, and restaurants. The first dataset contains reviews and tags for products sold on amazon.com while the second dataset contains the sentiment dataset for IMDb movie reviews. The third and fourth datasets have a collection of texts about economic and restaurant reviews respectively. The text were splitted into training and testing data. Using the train- ing set, the machine and deep learning algorithms were trained to understand, extract and evaluate subjective information from the data with n-grams as features. Basically, after fitting the training data to the models, we used the various models to predict the tags of the test data. Us- ing the training set to train the algorithm, we translated the data into numeric form, while the test set was used to evaluate the performance of the machine and deep learning models. The machine and deep learning algorithms learnt from the training data, passing the features and tags as parameters. The models predicted the outcomes, while the precision, accuracy and f1 score were obtained using the n-gram features within the model. To keep a list of the word vectors, we transformed the text array into a TF-IDF function matrix and a vocabulary was created. A 480 O. E. Ojo et al. / J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 481 Table 3. Precision, Recall, f1 score and accuracy of the classifiers trained on the third dataset. Model n-grams Precision Recall F1 score Accuracy Logistic Regression Unigram 0.80 0.80 0.80 80.0% Bigram 0.66 0.66 0.66 66.0% Trigram 0.55 0.53 0.48 53.0% Support Vector Machine Unigram 0.82 0.82 0.82 82.0% Bigram 0.67 0.67 0.66 66.5% Trigram 0.54 0.53 0.48 52.5% Naive Bayes Unigram 0.84 0.82 0.82 82.5% Bigram 0.70 0.62 0.59 62.5% Trigram 0.60 0.53 0.43 53.0% Gradient Boosting Unigram 0.64 0.64 0.63 63.5% Bigram 0.56 0.55 0.54 55.0% Trigram 0.54 0.52 0.44 52.0% Decision Tree Unigram 0.63 0.63 0.63 63.0% Bigram 0.60 0.59 0.58 59.0% Trigram 0.57 0.52 0.42 52.0% Random Forest Unigram 0.75 0.74 0.74 74.5% Bigram 0.61 0.59 0.59 59.5% Trigram 0.56 0.54 0.48 53.5% K-Nearest Neighbors Unigram 0.73 0.73 0.73 73.0% Bigram 0.69 0.69 0.68 68.5% Trigram 0.51 0.51 0.46 50.5% XGBoost Unigram 0.72 0.71 0.71 71.5% Bigram 0.51 0.51 0.47 51.0% Trigram 0.48 0.49 0.38 49.5% Multi-Layer Perceptron Unigram 0.81 0.81 0.81 81.0% Bigram 0.66 0.66 0.65 65.5% Trigram 0.52 0.51 0.39 50.5% machine and/or deep learning algorithm can then directly be used on the encoded vectors. The classification and evaluation of the different meanings of the text was carried out and we compared them to each other. The n-grams offered an indication of the words that could affect the tags of the text. We extracted the n-gram distribution such as unigrams, bigrams, and trigrams for use in the differ- ent models, thereby making the learning algorithms more intel- ligent for proper prediction. We applied the machine and deep learning algorithm on the text for classification and the accuracy score for all models used in the experiment were calculated. 5. Results and Discussion In this study, we present a performance review of special n-gram based evaluation of a sequence labeling task using dif- ferent learning algorithm. We introduced machine learning and deep learning techniques to analyze the sentiments in the data for better and faster decision making, and we were able to com- pare the output of the techniques implemented, thus adding to the state-of-the-art literature on tasks of sentiment analysis. These algorithms were applied on the datasets to predict the tags and to classify it accordingly using the n-grams features. The performance of the n-grams in the different machine and deep learning approach was calculated using the overall accuracy measurement. For a comparative performance evalu- ation of each system in terms of predicting the tags correctly, we present the results for the nine methods used for precision, recall, accuracy and F1-score calculation. Tables 1-4 shows the model classification values of the models and Figures 1–4 de- picts the distribution of the values on a line graph. The macro-averaged f1, recall, precision and accuracy scores of the various models used are shown in Tables 1-4. The find- ings indicate that the SVM and the MLP models generally im- proved the effectiveness of the classification. The results also reveals that the DTC, GBC, KNN and the XGB failed to per- form well in the classification task. In the comparative analysis, using the different methods of machine learning and n-gram ap- proaches on the datasets, results were better compared and the effectiveness of the n-gram features were recorded. The n-gram features gave a very good performance for all learning algorithms with the unigrams performing better than the bigrams and trigrams in the classification task. The SVM, LRM, RFC, NBA and the MLP models are the most reliable for all of the n-gram features. In the first dataset (see Figure 1), RFC, MLP, SVM had maximum scores among the models. The SVM, LRM, and MLP models gave the highest output for all n-gram functions on the second dataset (see Figure 2). On the third dataset, NBA, SVM and MLP are with the highest results 481 O. E. Ojo et al. / J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 482 Table 4. Precision, Recall, f1 score and accuracy of the classifiers trained on the fourth dataset. Model n-grams Precision Recall F1 score Accuracy Logistic Regression Unigram 0.80 0.80 0.79 79.5% Bigram 0.68 0.68 0.67 67.5% Trigram 0.57 0.53 0.45 53.0% Support Vector Machine Unigram 0.81 0.81 0.80 80.5% Bigram 0.69 0.69 0.69 69.0% Trigram 0.59 0.54 0.46 54.0% Naive Bayes Unigram 0.75 0.75 0.75 75.0% Bigram 0.70 0.64 0.61 64.0% Trigram 0.63 0.51 0.36 51.0% Gradient Boosting Unigram 0.64 0.64 0.63 63.5% Bigram 0.60 0.59 0.59 59.5% Trigram 0.50 0.50 0.37 50.0% Decision Tree Unigram 0.66 0.66 0.66 66.0% Bigram 0.59 0.58 0.57 58.0% Trigram 0.57 0.51 0.38 51.0% Random Forest Unigram 0.72 0.72 0.72 72.0% Bigram 0.64 0.63 0.63 63.0% Trigram 0.55 0.52 0.41 51.5% K-Nearest Neighbors Unigram 0.77 0.76 0.76 76.0% Bigram 0.69 0.67 0.66 66.5% Trigram 0.45 0.48 0.38 48.5% XGBoost Unigram 0.73 0.72 0.72 72.5% Bigram 0.59 0.56 0.53 56.5% Trigram 0.55 0.52 0.41 51.5% Multi-Layer Perceptron Unigram 0.81 0.81 0.81 81.0% Bigram 0.68 0.68 0.68 68.0% Trigram 0.59 0.54 0.45 53.5% (see Figure 4) while on the fourth dataset, SVM, LRM and MLP performed best on the test dataset (see Figure 4). The models used with the feature classification techniques shows the effec- tiveness of n-grams for sentiments tagging and the most reliable methods of classification. 6. Conclusion An important problem in the analysis of sentiments is be- ing able to determine the contextual labels or tags of words and phrases. We addressed this problem in this study by success- fully introducing various machine learning and deep learning approaches to produce the labels or tags of economics and fi- nancial reviews text using n-grams as features. Modeling was performed using different pre-processing techniques in texts, converting the text into vectors, and applying various machine learning and deep learning techniques on the different datasets. The use of multiple classifiers in this analysis led to a better evaluation efficiency than any individual classifier. The findings recorded in this study suggests that the support vector machine and multi-layer perceptron neural networks were the best op- tions for achieving successful results, because they efficiently and effectively classify the sentiment tags behind the sentence in the text. The unigram model, which is an n-gram analysis representation at low level, has a greater predictive potential compared to the bigram and trigram models. While high-level n-gram representations account for the complexities of the hu- man language, their use in predicting consumers’ choices is less efficient than low-level n-gram representations in these eco- nomic reviews. Acknowledgments The work was done with partial support from the Mexi- can Government through the grant A1-S-47854 of the CONA- CYT, Mexico and grants 20211784, 20211884, and 20211178 of the Secretarı́a de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank the CONA- CYT for the computing resources brought to them through the Plataforma de Aprendizaje Profundo para Tecnologı́as del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico. The authors also wish to thank Malo Pekka and his colleagues for sharing the Financial Phrasebank Dataset [11] and the Uni- versity of California, Irvine machine learning repository of databases for the datasets. 482 O. E. Ojo et al. / J. Nig. Soc. Phys. Sci. 3 (2021) 477–483 483 References [1] H. Gómez-Adorno, I. Markov, G. Sidorov, J. Posadas-Durán, M. A. Sanchez-Perez, & L. Chanona-Hernandez, “Improving feature repre- sentation based on a neural network for author profiling in social me- dia texts”, Computational Intelligence and Neuroscience 2016 (2016) 1638936. [2] O. Kolesnikova & A. Gelbukh, “A study of lexical function detection with word2vec and supervised machine learning”, Journal of Intelligent & Fuzzy Systems 39 (2020) 1993. [3] S. Poria, E. Cambria, & A. Gelbukh, “Aspect extraction for opinion min- ing with a deep convolutional neural network”, Knowledge-Based Sys- tem 108 (2016) 42. [4] O. E. Ojo, A. Gelbukh, H. Calvo, O. O. Adebanji, & G. Sidorov, “Senti- ment detection in economics texts”, Advances in Computational Intelli- gence @ MICAI 2020 12469 (2020) 271. [5] T. Lugo-Garcia,A. Gelbukh, & G. Sidorov, “Unsupervised learning of word combinations for syntactic disambiguation”, Avances en la Ciencia de la Computación. Proceedings of the Workshop on Human Language Technologies at the 5th Mexican International Conference on Computer Science, ENC-2004 (2004) 311. [6] A. Gelbukh & O. Kolesnikova, “Supervised machine learning for predict- ing the meaning of verb-noun combinations in Spanish” MICAI 2010. Lecture Notes in Artificial Intelligence 6438 (2010) 196 [7] O. Juárez Gambino & H. Calvo, “Predicting emotional reactions to news articles in social networks”, Computer Speech & Language 58 (2019) 280. [8] H. Gómez-Adorno, R. Fuentes-Alba, I. Markov, G. Sidorov & A. Gel- bukh, “A convolutional neural network approach for gender and language variety identification”, Journal of Intelligent & Fuzzy Systems 36 (2019) 4845. [9] P. Pakray, A. Pal, G. Majumder, & A. Gelbukh, “Resource building and parts-of-speech (pos) tagging for the mizo language”, 14th Mexican In- ternational Conference on Artificial Intelligence, MICAI 2015 (2015) 3. [10] S. N. Galicia-Haro, A. Gelbukh, & I. A. Bolshakov, “Identification of composite named entities in a spanish textual database”, 9th Interna- tional Conference on Applications of Natural Languages to Information Systems, Salford, UK 3136 (2004) 395. [11] M. Pekka, A. Sinha, P. Korhonen, J. Wallenius, & P. Takala, “Good debt or bad debt: Detecting semantic orientations in economic texts”, Journal of the Association for Information Science and Technology 65 (2014) 782. [12] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh & L. Chanona- Hernández, “Syntactic n-grams as machine learning features for natural language processing”, Expert Systems with Applications 41 (2014) 853. [13] S. Poria, N. Majumder, D. Hazarika, E. Cambria, A. Gelbukh & A. Hus- sain, “Multimodal sentiment analysis: Addressing key issues and setting up the baselines”, IEEE Intelligent Systems 33 (2018) 17. [14] S. T. Aroyehun & A. Gelbukh, “Aggression detection in social media: Using deep neural networks, data augmentation, and pseudo labeling”, Proceedings of the First Workshop on Trolling, Aggression and Cyber- bullying (TRAC-1) Santa Fe, USA (2018) 90. [15] D. Kotzias, M. Denil, N. de Freitas & P. Smyth, “From group to individual labels using deep features”, KDD 2015 (2015) 597. [16] V. Athanasiou & M. Maragoudakis, “A novel, gradient boosting frame- work for sentiment analysis in languages where NLP resources are not plentiful: A case study for modern Greek”, Algorithms 10 (2017) 34. 483