Comparative analysis of Mechanisms for Categorization and Moderation of User Generated Text Contents on a Social E-Governance Forum 78 Mathematical and Software Engineering, Vol. 3, No. 1 (2017), 78-86. Varεpsilon Ltd, http://varepsilon.com Comparative analysis of Mechanisms for Categorization and Moderation of User Generated Text Contents on a Social E-Governance Forum Imeobong Frank Inyang, Simeon Ozuomba*, and Chinedu Pascal Ezenkwu *Corresponding Author: simeonoz@yahoo.com Electrical/Electronic and Computer Engineering Dept., University of Uyo, Uyo, Nigeria Abstract This paper presents a comparative analysis of two mechanisms for an automated categorization and moderation of User Generated Text Contents (UGTCs) on a social e-governance forum. Posts on the forum are categorized into “relevant”, “irrelevant but interesting” and “must be removed”. Relevant posts are those posts that are capable of supporting government decisions; irrelevant but interesting category consists of posts that are not relevant but can entertain or enlighten other users; must be removed posts consists of abusive or obscene posts. Two classifiers, Support Vector Machine (SVM) with One-Vs-The-Rest technique and Multinomial Naive Bayes were trained, evaluated and compared using Scikit-learn. The results show that SVM with an accuracy score of 96% on test set performs better than Naive Bayes with 88.6% accuracy score on the same test set. Keywords: Moderation; Ranking; UGC; UGTC; web 2.0; Sentiment analysis; Social e-governance. 1. Introduction Growing computerization and increasing Internet connectivity have encouraged the use of Information and Communication Technology (ICT) in the coordination and facilitation of several businesses. Moreover, the application of social network technologies for the purpose of improving governance has generated interest recently due to the emergence of web 2.0. In this paper, this has been referred to as social e-governance. The essence of social e-governance forums is that they encourage candid opinions from the citizens thereby promoting people-oriented decisions by the government. In view of this, there is a need for mechanisms that can be used to categorize and moderate users’ posts to ensure that only relevant or interesting posts are allowed on the platform. Moderation is the process of reviewing a UGC and taking decision on whether to delete it or allow it to be accessed by other users. Moderation can be an automated moderation, using computer applications and algorithms; community moderation, which leverages the online community to self-moderate contents and human moderation, in which there is a dedicated staff acting as a moderator. Three moderation approaches include – pre-moderation, reactive moderation 79 and post-moderation. Unlike in post-moderation where posts are allowed to appear online before moderation, in pre-moderation, all posts are moderated before they appear online. This moderation approach requires more prompt response; as such the best method for pre-moderation is the automated moderation. Human moderation cannot provide 24 hours 7 weeks moderation because posts submitted overnight or in the weekend may not be moderated until the next working days. Moreover, community moderation warrants other users to access posts and react accordingly. This, in other words, is a reactive moderation. Reactive moderation is a variant of post-moderation whereby the online community, instead of a dedicated individual, carryout the function of moderation. The danger with post-moderation is that the post might already have a negative impact on the online community before it is deleted; as such, post-moderation is not encouraged where the risk associated with publishing inappropriate contents is high. Furthermore, community moderation is prone to Sybil or one-man-crowd attack, whereby a user creates multiple accounts or sockpuppets in order to influence votes on posts in an online community. In view of this, automated moderation is indispensable, since it does not give room to Sybil attack, being human independent. There are several sentiment analytic techniques employed to automate the process of UGC moderation on online communities. Some popular machine learning classifiers used in sentiment analysis include Naive Bayes classifier, SVM, decision tree, random forest and so on. In this paper, mechanism for automated moderation of an e-governance forum is presented. The paper considered the performances of two classifiers, which are SVM and Naive Bayes classifiers. The classifiers are trained and evaluated using text corpus generated by a group of three hundred (300) students on a locally hosted e-governance forum. Each student was encouraged to generate at least eight different texts. The texts are to belong to “relevant”, “irrelevant but interesting” and “must be removed” categories. Summarily, the text corpus used for the training and evaluation of the classifiers contains a total of two thousand and twenty (2020) texts. 730 of the texts belong in the relevant category; 653 belong in the irrelevant but interesting category while 637 belong in the must be removed category. Using this text corpus, Support Vector Machine (SVM) with One-Vs-The-Rest technique and Multinomial Naive Bayes were trained using Scikit-learn. SVM proved better than Naive Bayes for the e-governance system. Subsequent sections of the paper include literature review, methodology, results and conclusion. 2. Review of Relevant Literatures 2.1. E-governance According to Keohane and Nye [1], “Governance implies the processes and institutions, both formal and informal, that guide and restrain the collective activities of a group. Government is the subset that acts with authority and creates formal obligations. Governance need not necessarily be conducted exclusively by governments. Private firms, associations of firms, nongovernmental organizations (NGOs), and associations of NGOs all engage in it, often in association with governmental bodies, to create governance; sometimes without governmental authority.” In Kettl [2] view, "Governance is a way of describing the links between government and its broader environment - political, social, and administrative." With the revolutionary changes that ICTs are bringing to our global society, governments worldwide continue to develop more sophisticated ways to digitize its routines and practices so that they can offer the public access to government services in more effective and efficient ways. The delivery 80 of government services and information to the public using ICT is referred to as e-governance [3]. The UNESCO define e-governance as “the public sector’s use of information and communication technologies with the aim of improving information and service delivery, encouraging citizen participation in the decision-making process and making government more accountable, transparent and effective. E-governance involves new styles of leadership, new ways of debating and deciding policy and investment, new ways of accessing education, new ways of listening to citizens and new ways of organizing and delivering information and services. E-governance is generally considered as a wider concept than e-government, since it can bring about a change in the way citizens relate to governments and to each other. E-governance can bring forth new concepts of citizenship, both in terms of citizen needs and responsibilities. Its objective is to engage, enable and empower the citizen” [4]. Social networks provide the technological platform for individuals to connect, produce and share content online [5]. Web 2.0 has changed the one-way notion of traditional e-governance, whereby information only flows from government to the citizens. Nowadays, there is a need for government to access firsthand information from the citizens, so as to encourage grassroots development and targeted governance. The use of social networks as a tool to facilitate e-governance has been referred to this paper as social e-governance. 2.2. Moderation in Social Networks According to Ochoa and Duval [6] “UGC is becoming the most popular and valuable information available on the WWW”. The explosive growth of UGC has stimulated interests in moderation on social networks. Khadilkar, Pai, and Ghadiali [7] observed that “4.1 million minutes of video are uploaded to YouTube everyday … six billion images per month are uploaded to Facebook … 40% of images and 80% of videos [created]are inappropriate for business. UGC comes in different forms, including short-text content family such as tweets and forum comments; long-text posts on blogs and profiles; and multimedia material such as images, audio, video and applications”. Moderation is the review of user generated content and the decision to publish, edit or delete the content or at times to engage with the online community [8]. Interactive advertising bureau Australia [9] opined that all stakeholders have a role in managing user comments on the web, as follows – “ Users should think about the appropriateness of their content before they post it and take responsibility for their comments; Platforms should remove comments reported to them which are illegal or violate their terms and conditions and empower organizations using their platforms with tools to assist them in moderating their properties; The community should report comments that violate applicable rules; and Organizations should engage in responsible moderation of user comments posted to their social media channels”. Maintaining a content is a foundation of a healthy and flourishing community platform. In order to maintain this quality, the community platform needs governance. Governance of a web community can be understood as steering and coordinating the activities of community members. Moderation is extremely important in social networking systems, sorting good from bad content and helping readers to find useful information. Khadilkar et al [7] stated that moderation can be automated moderation; community moderation and human moderation. Automated content moderation has grown into a discipline that requires expertise in pattern detection and labelling, the less downstream volume and analysis [7]. These automated moderation techniques are embodied under the subject of sentiment analysis. According to Liu [10] “sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, 81 appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes”. Most machine learning algorithms are often used for sentiment analysis. The following section reviews Naive Bayes algorithms and SVM. 2.3. Naive Bayes Algorithm Naive Bayes is a family of probabilistic classifiers that leverages the Bayes’ theorem with strong independence assumptions among the features. Naive Bayes has been well-applied in text categorization. An important advantage of naive bayes is that a small number of training data is sufficient to estimate the parameters necessary for out-of-sample classifications [11]. Given a class variable � and a dependent feature vector � through��, Bayes’ theorem states the following relationship: ���|��,……. . ,��� = �� �����,……..,��| �����,……..,��� (1) Introducing the Naive Bayes independence assumption that ����|�,��, …,����,����, . . , ��� = ����|�� for all i, the equation (1) is simplified to equation (2); ���|��,……. . ,��� = �� �∏ ����| � ���� ����,……..,��� (2) ����,……. . , ���is a normaliser and it is constant given the input. Naive bayes uses Maximum a posterior (MAP) decision rule in choosing the hypothesis that is most probable. Naive Bayes classifier uses the classification rule; �� � ������ ����∏ ����|������ (3) Based on the distributions of features, Naive Bayes classifier can be Gaussian, Bernoulli or multinomial. Gaussian Naive Bayes is used when dealing with continuous data with the assumption that the features are distributed according to Gaussian distribution. ����|�� � �� !"#$ exp �( )���*#+ $ "#$ � (4) The parameters , and - are estimated using maximum likelihood. Bernoulli Naive Bayes is for data that is distributed according to Bernoulli distributions. In the case of text classification using multivariate event model, word occurrence vectors, rather than word count vectors, are often used to train and use the classifier. Multinomial Naive Bayes is used for multinomially distributed data. It is uses word count vectors instead of word training vectors in training and using the classifier. The distribution is parameterized by vectors . � �. �, …,. �� for each class �, where / is the number of features (in text classification, the size of the vocabulary) and . � is the probability ����|�� of feature 0 appearing in a sample belonging to class �. The parameter . is estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting: .1 � 2#�� 32#� 3� (5) Where, 4 �=∑ ���∈7 is the number of times feature appears in a sample of class � in the training set 8, and 4 = ∑ 4 �|7|��� is the total count of all features for class �. The smoothing priors ∝ : 0 accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting < � 1 is called Laplace smoothing, while < > 1 is called Lidstone smoothing. 82 2.4. Support Vector Machine (SVM) SVM constructs a hyper-plane or a set of hyper-planes in a high dimensional space for the purpose of classification, regression or outline detection. It chooses the hyper-plane that has the largest distance to the nearest data points of any class so as to lower the generalization error of the classifier. Given training vectors ��?ℝA, 0 = 1,…,/,in two classes and a vector �?{1,−1}�, SVM solves the following primal problem: minG,H,I � J7J + L ∑ M� � ��� Subject to ���J 7N���� + O� ≥ 1-M M ≥ 0,0,……/ (6) Its dual is min∝ � ∝7 P ∝ − Q7 ∝ Subject to �7 ∝ = 0 0 ≤∝≤ L,0 = 1,…,/ (7) Where Q is a vector of all ones, C>0 is the upper bound, P is an /O�/ positive semi-definite matrix. P�S = ���ST)��,�S+; where, T)��, �S+ = N���� 7N)�S+is the kernel. The function N implicitly maps the training vectors to higher dimensional space. The decision function is given as (∑ �� ∝� T)��,�S+ � ��� + O). 3. Methodology Figure 1 presents the research process flow. The text corpus comprises 2020 texts generated by 300 university students on a locally hosted e-governance forum. Summarily, the text corpus used for the training and evaluation of the classifiers contains a total of two thousand and twenty (2020) texts. 730 of the texts belong in the relevant category; 653 belong in the irrelevant but interesting category while 637 belong in the must be removed category. The texts were labeled accordingly for supervised learning. Feature Extraction: This is the process of converting the texts in the corpus into numerical features compatible with machine learning techniques. The processes of feature extraction include lower casing; removal of stop words from each text in the text corpus; removal of non-word and word stemming; Lower casing – The entire texts in the corpus are converted to lower case so as to ignore capitalization. Removal of stop words- In python, NLTK library can be used to import stop words in different languages. Using this library, stop words in English language were imported and removed from each text in the text corpus. Removal of words that occur too rarely in the corpus – To avoid over-fitting of the training set, words which occur less than 100 times in the corpus are removed. Removal of non-words - All non-words including punctuations are removed. White spaces such due to tabs, spaces, newlines, etc. are trimmed to single space character. Word Stemming – Words are reduced to their stem forms. For examples, words like discounted and discounting are replaced with discount. Words like include, includes, included and including are reduced to includ. This is achieved in Python using a stemmer function present in NLTK library. 83 Figure 1: Flow Diagram for the Research process Bag-of-words representation - A bag-of-word representation is the representation of a corpus of text documents in a matrix with one row per document and one column per token occurring in the corpus. The texts in the corpus are represented as numerical feature vectors with a fixed size rather than the raw text documents with variable lengths. Scikit-learn has functionalities for building the bag of words. The strings are 1. OBTAIN INPUT TEXT CORPUS 2. FEATURE EXTRACTION 4. DATA SEGMENTATION 5a. TRAINING SET 5b. TEST SET 6. TEXT LEARNING ℋ[ . ] 7. CLASSIFIER 8. PERFORMANCE SCORE 9. GOOD ? TUNE PARAMETERS OF THE LEARNING ALGORITHM NO YES 10. DEPLOY CLASSIFIER IN DEVELOPMENT OF THE APPLICATION 3. PRE-PROCESSING 84 tokenized using white spaces as separators. Integer indexes are given to each possible token. The occurrences of tokens in each text document are counted. Each individual token occurrence frequency is treated as a feature. The vector of all the token frequencies for a given document is considered a multivariate sample. In Scikit-learn, the CountVectorizer function is designed for this purpose. Pre-processing: The features were scaled to lie between 0 and 1. This was achieved in Scikit-learn using MinMaxScaler function present in the preprocessing library of Scikit-learn. Data Segmentation: The data is randomly split such that 80% were used for training while 20% were used for training. The essence of this is to ensure that each classifier is validated with out-of-sample inputs, as such, this is a better proof of the system’s generalization performance. 4. System Implementation 4.1. Training of Classifiers The two classifiers, Naive Bayes and SVM, considered in this paper, were trained on the text corpus. The classifiers were implemented using Scikit-learn. Scikit-learn involves 4-step modelling pattern. In step one the relevant classes are imported. Step two involves the instantiation of the estimator, in which the hyper-parameters can as well be specified or left as defaults. In step three the model is fitted with data and step four is to apply the fitted model on the test set. For example, the 4-step modelling pattern of Scikit –learn for the Multinomial Naive Bayes is shown in the appendix. The SVM classifier was also implemented using the same 4-step modelling pattern in Scikit-learn. The SVC class was used due to its ability to implement multiclass classification on a dataset. The default hyper-parameters were used without tuning. 4.2. Performance Scores of Classifiers Each of the classifiers were evaluated using the accuracy_score function in the accuracy library in Scikit-learn. The result shows that SVM classifier had a 96% out-of-sample performance while Naive Bayes had an 88.6% out-of-sample performance. Figure 2 and 3 show the implementations of the Naïve Bayes and SVM classifiers in Scikit-learn. Figure 2: Multinomial Naïve Bayes implementation in Scikit-learn 85 Figure 3: One-Vs-all SVM implementation in Scikit-learn 4.3. Development of the Application The social e-governance application was developed following an evolutionary software development process model. The process involves the system analysis, design, implementation, testing and deployment. The system was implemented with Python as the scripting language and deployed locally on Google App Engine for demonstration and testing. SVM classifier was used for the users’ posts moderation and categorization. Figure 4 shows the screenshot of the system. Figure 4: Screenshot of the system 5. Conclusion In this paper, two classifiers, Naive Bayes and SVM were compared for UGTCs moderation and categorization using Scikit-learn. The result shows that SVM classifier had a 96% out-of-sample performance while Naive Bayes had an 88.6% out-of-sample performance. The social e-governance application was developed using python as scripting language. The SVM classifier was employed for the users’ posts moderation 86 and categorization. Furthermore, the application was deployed locally on Google App Engine for demonstration and testing. References [1] Keohane, R.O., & Nye, J.S.Jr. (2002). Governance in a globalization world. Power and governance in a partially globalized world, 193-218. [2] Kettl, D.F. (2015). The transformation of governance: Public administration for the twenty-first century. JHU Press. [3] OJO, J. S. (2014). E-governance: An imperative for sustainable grass root development in Nigeria. Journal of Public Administration and Policy Research, 6(4), 77-89. [4] Palvia, S.C.J., & Sharma, S.S. (2007). E-Government and E-Governance: Definitions/Domain Framework and Status around the World. In International Conference on E-governance., 5 International Conference on EGovernance, Foundations of E-Government, 1-12. [5] Cvijikj, I.P. and Michahelles, F. (2012) Understanding the user generated content and interactions on a Facebook brand page, Int. J. Social and Humanistic Computing, Vol. 2, No. 1-2, 118–140. [6] Ochoa, X., Duval, E. (2008). Quantitative analysis of user-generated content on the web. Proceedings of WebEvolve2008: web science workshop at WWW2008, 1-8. [7] Khadilkar, A., Pai, T., Ghadiali, S. (2012). How to De-Risk the Creation and Moderation of User-Generated Content, Available at : http://www.cognizant.ch/InsightsWhitepapers/ How-to-De-Risk-the-Creation-and-Moderation-of-User-Generated-Content.pdf. Accessed on: 10th October 2016. [8] ABC Managing Director (2011). Moderating User Generated Content, 9, Available at: http://about.abc.net.au/wp-content/uploads/2012/06/GNModerationINS.pdf. Accessed on: 10th October 2016. [9] Interactive advertising bureau Australia (2013) Best Practice for User CommentModeration: Including commentary for organisations using social media platforms. Available at: https://www.iabaustralia.com.au/uploads/uploads/2013-09/ 1380477600_b054b0ef30db4de990bd1527ed6758e4.pdf, Accessed on: 10th October 2016. [10] Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167. [11] Kumar, S. A., &Vijayalakshmi, M. N. (2012). Inference of Naïve Baye’s Technique on Student Assessment Data. In Global Trends in Information Systems and Software Applications, Volume 270 of the series Communications in Computer and Information Science, 186-191. Copyright © 2017 Imeobong Frank Inyang, Simeon Ozuomba, and Chinedu Pascal Ezenkwu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.