Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2590-2594 2590 www.etasr.com Shaikh: Keyword Detection Techniques: A Comprehensive Study

Keyword Detection Techniques: A Comprehensive Study

Zaffar Ahmed Shaikh
Faculty of Computer Science & Information Technology, Benazir Bhutto Shaheed University Lyari, Karachi, Pakistan
zashaikh@bbsul.edu.pk

Abstract—Automatic identification of influential segments from large amounts of data is an important part of topic detection and tracking (TDT). This can be done through keyword identification using collocation techniques, word co-occurrence networks, topic modeling, and other machine learning techniques. This paper reviews extant literature on existing traditional TDT approaches to the automatic identification of influential segments in the keyword detection task. The keyword detection techniques currently used by researchers are discussed, inferences are drawn about their advantages and disadvantages over previous studies, and the analysis results are provided in tabular form. The aim is to derive useful insights and give future directions for automatic, unsupervised, and language-independent research. Although keyword detection has been widely explored, there is still considerable scope and need for identifying topics from uncertain user-generated data.

Keywords-keyword detection; information retrieval; topic detection; machine learning; comprehensive study

I. INTRODUCTION

Keyword extraction using manual methods is slow, expensive, and error-prone [1]. In recent years, many automatic bursty keyword extraction techniques have been proposed to extract keywords from large amounts of data.
These keywords help identify themes and influential segments and support the semantic web and other natural language processing applications [2, 3]. The automatic keyword detection research area is related to the topic detection and tracking (TDT) domain, which was proposed in [4]. Various applications use keyword extraction techniques for web search, report generation, and cataloguing [5]. The area aims to identify the most useful terms and involves many sub-processes. Documents arrive in MS Word, HTML, or PDF formats. Initially, the documents are pre-processed to remove redundant and unimportant information [6, 7]. The data is then processed through different keyword extraction approaches, including statistical, linguistic, machine learning, network-based, and topic modelling approaches [8, 9].

In the statistical approach, term frequency-inverse document frequency (Tf-Idf) is the most widely used technique for keyword extraction. Researchers use Tf-Idf to give a document a score with respect to some query; the score changes when the query is changed or updated, and without a query there is no score [10]. Recently, many new techniques have been developed for statistical keyword extraction [11], including PageRank, LexRank, etc. PageRank assigns a score to a document based on the documents it links to and the documents that link to it. It is a global ranking scheme [10]; therefore, unlike Tf-Idf, the PageRank score does not change with the query used. As observed, the PageRank and LexRank algorithms perform better than Tf-Idf.

In the linguistic approach, automatic keyword identification relies on semantic resemblance [12]. In the machine learning approach, keyword extraction is treated as a classification task [13]. Different lexical resources, including WordNet, SentiNet, and ConceptNet, are used in keyword extraction techniques.
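As a concrete illustration of the Tf-Idf scoring described above, the following sketch computes tf * log(N/df) weights over a toy corpus. The corpus and the unsmoothed formulation are illustrative assumptions, not the exact scheme of any surveyed system; production implementations typically add smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute Tf-Idf weights for every term in every document.

    docs: list of token lists. Returns one {term: weight} dict per
    document, using the basic formulation tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "keyword extraction from text data".split(),
    "topic detection from text streams".split(),
    "keyword detection in streams".split(),
]
scores = tf_idf(docs)
# Terms unique to one document (e.g. "extraction") score higher than
# terms shared across documents (e.g. "keyword").
```

A term occurring in every document would receive weight 0 under this formulation, which is why rare, document-specific terms surface as keywords.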
In network-based algorithms, the nature and semantics of word co-occurrence networks are studied to identify important terms: nodes represent words and edge weights represent co-occurrence frequency [14]. Many useful insights for identifying influential segments and keywords have been obtained from these algorithms. Topic modelling techniques were popularized in [15], whose authors introduced the Latent Dirichlet Allocation technique, used to identify which document relates to which topic and to what extent [16]. This has been further improved by the Hierarchical Dirichlet Process, the Pachinko Allocation Model, Relational Topic Modeling, Conditional Topic Random Fields, and, recently, the Hierarchical Pitman-Yor-Dirichlet Language Model and the Graph Topic Model [17]. Although keyword extraction is an important area of research that has received much attention from researchers and practitioners, no state-of-the-art keyword extraction method has emerged comparable to those of many other core natural language processing tasks [18]. This paper reviews existing traditional keyword extraction techniques and analyzes them to derive useful insights and give future directions for automatic, unsupervised, and language-independent research.

II. RELATED WORK

Authors in [19] developed a tool called 'Keyword Extractor' for automatic extraction of the terms that most closely match experts' preferences. Their study was related to brain research, which involves worldwide collaborations and exchange of information among neuroinformatics centers and portal sites. The main objectives of their study were the efficient use of resources and the improvement of the quality of brain research. Each center and site developed its own set of keywords for classification of the main text and the resources.
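The word co-occurrence representation described at the start of this section, with words as nodes and co-occurrence frequencies as edge weights, can be sketched as follows. The sliding-window size, the toy corpus, and the weighted-degree ranking are illustrative assumptions rather than the method of any particular surveyed paper.

```python
from collections import defaultdict

def cooccurrence_graph(sentences, window=2):
    """Build a word co-occurrence network: nodes are words, edge
    weights count how often two words appear within `window` tokens."""
    edges = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for u in tokens[i + 1 : i + 1 + window]:
                if u != w:
                    edges[tuple(sorted((w, u)))] += 1
    return edges

def rank_by_degree(edges):
    """Score each node by its weighted degree (sum of incident edge
    weights), a simple centrality usable as a keyword signal."""
    degree = defaultdict(int)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    return sorted(degree, key=degree.get, reverse=True)

sents = [
    "graph based keyword extraction".split(),
    "keyword extraction from graph models".split(),
]
edges = cooccurrence_graph(sents)
ranking = rank_by_degree(edges)
# Frequently co-occurring content words rise to the top of the ranking.
```

More elaborate network-based extractors replace weighted degree with PageRank-style iteration or other centralities over the same graph structure.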
The researchers tested their tool over the abstract databases of two science journals. Authors in [20] extracted keywords from Chinese microblogs in five steps, using three features (graph model, semantic space, and location of words). In the first step, the researchers downloaded a user's microblog data via the API. Secondly, they preprocessed the data by applying data cleaning, word segmentation, POS tagging, and stop-word removal. Third, to extract keywords, they created a graph model based on the co-occurrence between words, assigned sequence numbers to words according to their location, and computed word weights using a score formula. In the fourth step, they created a semantic space based on topic extraction and then computed the statistical weight of the words using Tf-Idf. In the fifth and last step, they identified the location of words and, based on those locations, computed the rank value of each word.

Authors in [21] focused on a structure-based approach and graph generation, creating graph models to identify bursty topics and events. In topic clustering, tweets were separated to produce homogeneous and heterogeneous graphs. For homogeneous graphs, the researchers used the OSLOM algorithm to find interactions among users; for heterogeneous graphs, the RankClus algorithm was used to construct a ranked set of tweets. Finally, from both graph results, the concept, theme, or event of a tweet was determined by joining tweets with the same name. The researchers planned to develop graph models for different types of events and to construct a method that can define events.

Authors in [22] developed a keyword extraction technique for tweets with high variance and lexical variant problems.
Lexical variants are examples of free variation in language: they are characterized by similarity in phonetic or spelling form and identity of both meaning and distribution. The authors used Brown clustering and continuous word vector methods. In the Brown clustering method, they clustered words having the same meaning (such as no, noo, etc.) and then extracted features for each cluster. In the continuous word vector method, the authors defined a layer by finding its probability, and each word was then mapped to a continuous word vector. Next, they predicted the length of the keyword set by calculating the ratio between the number of keywords and the total number of words in the tweets. Finally, linear regression was used to predict the number of keywords.

Authors in [23] developed a system to detect popular keyword trends and bursty keywords. Their system detects keyword abbreviations and typing and spacing errors. The first step was to collect candidate keywords (a word starting with a capital letter or enclosed in quotation marks is considered a candidate keyword). The second step was to merge keywords: acronyms and typo and spacing errors were considered, and term frequency (Tf) was computed accordingly. Finally, popular keywords were detected from the merged candidate keywords, and bursty keywords were selected using the burst ratio technique.

Authors in [24] proposed TOPOL, a topic detection method based on topology data analysis, which separates irrelevant noisy data from useful data. The authors' first step was preprocessing, in which hashtags, URLs, and non-textual symbols were eliminated from tweets. Their second step was mapping, in which a matrix was generated by applying the SVD technique. In the third step, which the authors called topic extraction, topics were selected based on interest.
Finally, the results were computed using topic recall, keyword precision, and keyword recall.

Authors in [25] presented and discussed different methods and approaches used in the keyword extraction task. They also proposed a graph-based keyword extraction method for the Croatian language based on node extraction. The authors used a selectivity-based keyword extraction method in which text is represented as vertices and edges, and results are computed from in-degree, out-degree, closeness, and selectivity.

Authors in [26] developed a keyword extraction method that represents text with a graph, applies centrality measures, and finds the relevant vertices. They proposed a three-step technique called TKG (Twitter Keyword Graph). The first step is pre-processing, in which stop words are removed. In the second step, a graph is built considering either the nearest neighbor or all neighbors. Finally, results are computed based on precision, recall, F-measure scores, and graph scalability.

Authors in [27] proposed an information summarization method for the large volume of information disseminated every day through tweets. Their method collects tweets using a specific keyword and then summarizes them to find the topics. The authors provide two algorithms: topic extraction using AGF (TDA) and topic clustering and tweet retrieval (TCTR). The methodology first extracts tweets from Twitter and then applies the Tf-Idf technique to find weights and word frequencies. The AGF is evaluated using keyword rating. Finally, the results are calculated based on class entropy, purity, and cluster entropy.

Authors in [28] proposed a technique in which a user can search using a search engine without entering any keywords. The Google similarity distance technique is used to find the keywords, and a log is maintained in which user behavior and the repository are saved.
Thus, the need for an offline repository is eliminated and everything is done online in real time. Keyword expansion and extraction methods are used to extract relevant and accurate information: keyword expansion helps the user enter the exact keyword and get the exact information, while keyword extraction analyzes words based on their occurrence, length, and frequency, relying on statistical and machine learning approaches. The proposed methodology is composed of three parts: 1-gram filtering, Google similarity distance calculation, and search results filtering. Finally, the results are calculated using precision and recall, and the relationship between the top k results is evaluated. Thus, the authors proposed a system in which the user just needs to browse a web page and the relevant keywords are generated. The system suits the science stream well, as the words are clear, but may be less accurate for the social sciences.

Authors in [29] produced a facility based on the Bayesian text classification approach, called high relevance keyword extraction (HRKE), to extract keywords at the classification stage without a pre-classification process. The facility uses a posterior probability value to extract keywords: HRKE first extracts the words from the text, then calculates the posterior probability, and finally uses the Tf-Idf method to assign weights to words. The authors claim that the HRKE facility improves the performance and accuracy of the Bayesian classifier and reduces time consumption. The experiment was conducted on three featured-article datasets, and the corresponding threshold and accuracy graph was plotted.
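The posterior-probability idea behind HRKE [29] can be illustrated with a minimal naive-Bayes-style sketch: words whose posterior P(class | word) is high for a class are treated as high-relevance keywords for it. The add-one smoothing, uniform class prior, and toy data are assumptions, since the source does not give the exact formulation.

```python
from collections import Counter, defaultdict

def train_word_posteriors(labeled_docs):
    """Given (tokens, class_label) pairs, compute P(class | word) via
    Bayes' rule with add-one (Laplace) smoothing and a uniform class
    prior. Words with a high posterior for a class act as keywords."""
    class_counts = Counter()          # total tokens seen per class
    word_class = defaultdict(Counter) # per-word counts per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += len(tokens)
        for w in tokens:
            word_class[w][label] += 1
            vocab.add(w)

    def posterior(word, label):
        # Smoothed P(word | class), normalized over classes so the
        # result is P(class | word) under a uniform prior.
        def lik(c):
            return (word_class[word][c] + 1) / (class_counts[c] + len(vocab))
        total = sum(lik(c) for c in class_counts)
        return lik(label) / total

    return posterior

# Hypothetical two-class toy corpus for illustration.
docs = [
    ("goal striker penalty goal".split(), "sports"),
    ("election vote senate vote".split(), "politics"),
]
posterior = train_word_posteriors(docs)
```

Class-discriminative words such as "goal" get a posterior well above 0.5 for their class, which is the signal HRKE-style filtering exploits before Tf-Idf weighting.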
Authors in [30] address the problem of part-of-speech (POS) tagging for the rich, informal text of Twitter. The authors first developed a POS tagset, then performed manual tagging on the dataset, after which features for the POS tagger were developed. Finally, experiments were conducted to produce an annotated dataset for the research community. Hashtags, URLs, and emoticons were considered, and results were obtained with 90 percent accuracy. The authors concluded that the approach can be applied to linguistic analysis of social media and that the annotated data can be used in semi-supervised learning.

Authors in [31] addressed statistical keyword extraction from text by adapting entropic and clustering approaches. They modified these approaches and proposed a new technique that detects keywords according to the user's needs, with the main objective of finding and ranking important words in the text. The two approaches were applied to short texts (such as web pages, articles, glossary terms, and generic short text) and long texts (such as books and periodicals). On evaluation, the clustering approach proved better in both cases, while the entropic approach suited long texts well but did not perform well for partitioned text.

Authors in [32] proposed a metric called entropy difference (ED) for ranking words in a Chinese dataset. They used Shannon entropy, taking the difference between intrinsic and extrinsic modes; the underlying idea is that meaningful words are grouped together, so words are extracted and ranked according to their entropy difference. The authors calculated the mean, mode, and median of the entropy differences, and the ED metric proved a good choice for word ranking. The method differentiates between words that express the author's purpose and irrelevant words that occur randomly in the text.
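The intuition behind entropy-based ranking such as ED [32], namely that meaningful words cluster in particular sections while irrelevant words spread evenly, can be sketched with a simplified single-entropy version. The chunking scheme and toy text are assumptions; [32]'s actual metric takes the difference between intrinsic and extrinsic entropies, which this sketch does not reproduce.

```python
import math
from collections import Counter

def entropy_over_chunks(tokens, n_chunks=4):
    """Score each word in a single document by the Shannon entropy of
    its distribution across equal-sized chunks. Content words tend to
    cluster (lower entropy); function words spread evenly, approaching
    the maximum entropy log(n_chunks)."""
    size = max(1, len(tokens) // n_chunks)
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    total = Counter(tokens)
    scores = {}
    for word, freq in total.items():
        probs = [Counter(c)[word] / freq for c in chunks]
        scores[word] = -sum(p * math.log(p) for p in probs if p > 0)
    return scores

# Toy text: "the" is spread evenly, "entropy" is clustered up front.
tokens = "the entropy entropy the cat the dog the".split()
scores = entropy_over_chunks(tokens)
```

Ranking words by ascending entropy (relative to their frequency) surfaces the clustered, topic-bearing words first, which matches the behavior [32] reports for its ED metric.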
This method is well suited to a single document about which no information is known in advance.

Authors in [33] provided a solution, called HybridSeg, to the inherently noisy and short nature of tweets in Twitter streams. They incorporated local context knowledge of tweets with global knowledge bases for better tweet segmentation. The tweet segmentation process was performed on two tweet datasets; tweets were split into segments to extract the meaning of the information they convey. Results show that HybridSeg significantly improved tweet segmentation quality compared with other traditional approaches, and the authors claim that segment-based entities are better than word-based entities.

The author in [34] provided a unique solution to the keyword extraction problem called ConceptExtractor. ConceptExtractor does not decide on the relevance of a term during the extraction phase; instead, it only extracts generic concepts from texts and postpones the decision about relevant terms to the needs of downstream applications. The author claims that, unlike other statistical extractors, ConceptExtractor can identify single-word and multi-word expressions using the same methodology. Results were evaluated on three languages using precision and recall, and a specificity metric applicable to both single- and multi-word expressions in other languages was also defined.

Authors in [35] considered various Chinese keyword extraction methods, defining an extended Tf approach that combines Chinese characteristics with the Tf method. They also developed a classification model based on the support vector machine (SVM) algorithm. Several improvement strategies were defined and four experiments were performed to evaluate the results. Results showed that SVM optimized the keywords, and precision and recall rates improved considerably.
The authors concluded that the improved Tf method is much better than the traditional Tf method in terms of accuracy and precision.

Authors in [36] discovered and classified terms that are either document titles or 'title-like'. Their idea was that terms that are titles or title-like should behave in the same way in a document. A classifier was trained using distributional and linguistic features to capture the behavior of terms, considering features such as location, frequency, and document size. Ratings were calculated on the basis of topical, thematic, and title terms, and evaluation was then performed using recall and precision. The recall rate for finding title terms was high, but precision was low because some words that were not titles were also identified as title terms.

Authors in [37] developed a sensitive text analysis for extracting task-oriented information from unstructured biological text sources using a combination of natural language processing, dynamic programming techniques, and text classification methods. Using computable functions, the model finds matching sequences, identifies the effects of various factors, and handles complex information sequences. The authors pre-processed the text contents and applied an entity-tagging component to find the causes of diseases related to low-quality food. Results show that bottom-up scanning of key-value pairs improves content finding, which can be used to generate sequences relevant to the testing task. The method improves information retrieval accuracy in biological text analysis and reporting applications.

III. ANALYSIS OF KEYWORD EXTRACTION APPROACHES

Table I provides inferences drawn from modern keyword detection techniques, their advantages and disadvantages over previous studies, and result analysis.
TABLE I. ANALYSIS OF EXISTING KEYWORD DETECTION TECHNIQUES

[20] Techniques: (a) graph model; (b) semantic space. Advantages: (a) can detect wrongly segmented words; (b) extracts keywords from a microblog. Disadvantages: (a) not suitable for large texts; (b) some terms are not distinguished. Results/Analysis: best performance obtained is 0.6972.

[21] Techniques: (a) OSLOM algorithm; (b) PageRank algorithm. Advantages: (a) able to identify the topics of Twitter events; (b) less expensive. Disadvantages: not able to identify events based on graph clusters. Results/Analysis: best result obtained from the structure-based approach.

[22] Techniques: (a) Brown clustering; (b) continuous word vectors. Advantages: (a) improved state of the art for keyword extraction; (b) automatic keyword extraction. Disadvantages: not suitable for Facebook text keyword extraction. Results/Analysis: precision of 72.05 and recall of 75.16.

[24] Technique: TOPOL. Advantages: (a) suitable for noisy data; (b) reduces computation time and improves topic extraction results. Disadvantages: suffers from data fragmentation. Results/Analysis: recall of 0.5380 and precision of 0.7500.

[26] Techniques: (a) Tf-Idf; (b) KEA; (c) proposed TKG. Advantages: (a) TKG proved robust and superior compared to other approaches; (b) TKG is simpler to use than KEA. Disadvantages: the best configuration of TKG was not found. Results/Analysis: TKG results better than KEA and Tf-Idf.

[28] Techniques: (a) statistical approach; (b) machine learning approach. Advantages: (a) search engine that can automatically extract important keywords; (b) system works well. Disadvantages: not suitable for the business management domain. Results/Analysis: high recall rate.

[29] Technique: Bayesian approach. Advantages: (a) low-cost, simple, and efficient method; (b) handles raw data without text preprocessing. Disadvantages: (a) presence of noisy data may degrade performance; (b) the feature selection method degrades the efficiency of the classification task. Results/Analysis: improved accuracy.

[31] Techniques: (a) entropic approach; (b) clustering approach. Advantages: (a) suitable for both long and short texts; (b) reliable results.
Disadvantages: the median and mode did not give the correct result. Results/Analysis: good clustering results for both short and long texts.

[32] Technique: Shannon entropy. Advantages: (a) suitable for text with no information known in advance; (b) easy to implement numerically. Disadvantages: the median and mode did not give the correct result. Results/Analysis: better results for a single document.

[33] Technique: hybrid segmentation. Advantages: high-quality tweet segmentation. Disadvantages: manual segmentation is expensive. Results/Analysis: improved precision.

[34] Technique: statistical, language independent. Advantages: good for extracting single and multi-word expressions. Disadvantages: not suitable for long texts. Results/Analysis: improved precision and recall.

[36] Techniques: (a) decision tree classifier; (b) pattern recognition. Advantages: easy title determination. Disadvantages: (a) not easy to determine the best document size; (b) precision was lower than recall. Results/Analysis: recall of 85% was achieved for title-like terms.

[37] Techniques: (a) sensitive text analysis; (b) context-based extraction method. Advantages: (a) category-oriented approach for extraction of task-specific information; (b) recall and precision were investigated. Disadvantages: not tested on generic data. Results/Analysis: (a) food safety is analyzed to prevent future consequences; (b) classification accuracy improved by utilizing optimization constraints; (c) causes of diseases related to low-quality food were identified.

IV. CONCLUSIONS

This paper extends the understanding of widely used approaches to keyword detection for identifying influential segments from large amounts of textual data or documents. To that end, extant literature on existing traditional TDT approaches to automatic identification of important words was reviewed and discussed. The techniques reviewed include collocation, word co-occurrence networks, topic modelling, and other machine learning approaches. Results show that the majority of these techniques are domain and language dependent.
It was observed that although traditional keyword extraction techniques have performed satisfactorily, there is a need to propose unsupervised, domain-independent, and language-independent techniques that use statistical computational methods. The keyword extraction task has been widely explored, but there is still considerable scope for identifying topics from uncertain user-generated data.

REFERENCES
[1] E. Landhuis, "Neuroscience: Big brain, big data", Nature, Vol. 541, No. 7638, pp. 559-561, 2017
[2] G. Ercan, I. Cicekli, "Using lexical chains for keyword extraction", Information Processing & Management, Vol. 43, No. 6, pp. 1705-1714, 2007
[3] R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L. M. Patnaik, "Feature extraction and duplicate detection for text mining: A survey", Global Journal of Computer Science and Technology, Vol. 16, No. 5, pp. 1-20, 2016
[4] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, Y. Yang, Topic detection and tracking pilot study final report, DARPA Broadcast News Transcription and Understanding Workshop, 1998
[5] P. Eckersley, G. F. Egan, S. Amari, F. Beltrame, R. Bennett, J. G. Bjaalie, T. Dalkara, E. De Schutter, C. Gonzalez, S. Grillner, A. Herz, K. P. Hoffmann, I. P. Jaaskelainen, S. H. Koslow, S.-Y. Lee, L. Matthiessen, P. L. Miller, F. M. da Silva, M. Novak, V. Ravindranath, R. Ritz, U. Ruotsalainen, S. Subramaniam, A. W. Toga, S. Usui, J. van Pelt, P. Verschure, D. Willshaw, A. Wrobel, Tang Yiyuan, "Neuroscience data and tool sharing", Neuroinformatics, Vol. 1, No. 2, pp. 149-165, 2003
[6] D. Kuttiyapillai, R. Rajeswari, "Insight into information extraction method using natural language processing technique", International Journal of Computer Science and Mobile Applications, Vol. 1, No. 5, pp. 97-109, 2013
[7] S. Rose, D. Engel, N. Cramer, W.
Cowley, Automatic keyword extraction from individual documents, Text Mining: Applications and Theory, John Wiley & Sons, 2010
[8] J. Wu, S. R. Choudhury, A. Chiatti, C. Liang, C. L. Giles, "HESDK: A hybrid approach to extracting scientific domain knowledge entities", ACM/IEEE Joint Conference on Digital Libraries, pp. 1-4, 2017
[9] D. B. Bracewell, F. Ren, S. Kuriowa, "Multilingual single document keyword extraction for information retrieval", IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 517-522, 2005
[10] D. Kuttiyapillai, R. Rajeswari, "Extended text feature classification with information extraction", International Journal of Applied Engineering Research, Vol. 10, No. 29, pp. 22671-22676, 2015
[11] S. C. Watkins, The young and the digital: What the migration to social-network sites, games, and anytime, anywhere media means for our future, Beacon Press, 2009
[12] I. M. Soboroff, D. P. McCullough, J. Lin, C. Macdonald, I. Ounis, R. McCreadie, "Evaluating real-time search over tweets", International Conference on Weblogs and Social Media, pp. 943-961, 2012
[13] H. L. Yang, A. F. Chao, "Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations", Information Systems Frontiers, Vol. 17, No. 6, pp. 1335-1352, 2015
[14] J. Yang, J. Leskovec, "Patterns of temporal variation in online media", 4th ACM International Conference on Web Search and Data Mining, pp. 177-186, 2011
[15] D. M. Blei, A. Y. Ng, M. I. Jordan, "Latent dirichlet allocation", Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003
[16] D. M. Blei, J. D. Lafferty, "Dynamic topic models", 23rd International Conference on Machine Learning, pp. 113-120, 2006
[17] M. Habibi, A. Popescu-Belis, "Keyword extraction and clustering for document recommendation in conversations", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 4, pp. 746-759, 2015
[18] S.
Beliga, Keyword extraction: A review of methods and approaches, University of Rijeka, Department of Informatics, 2014
[19] S. Usui, P. Palmes, K. Nagata, T. Taniguchi, N. Ueda, "Keyword extraction, ranking, and organization for the neuroinformatics platform", Biosystems, Vol. 88, No. 3, pp. 334-342, 2007
[20] H. Zhao, Q. Zeng, "Micro-blog keyword extraction method based on graph model and semantic space", Journal of Multimedia, Vol. 8, No. 5, pp. 611-617, 2013
[21] H. Hromic, N. Prangnawarat, I. Hulpus, M. Karnstedt, C. Hayes, "Graph-based methods for clustering topics of interest in Twitter", International Conference on Web Engineering, pp. 701-704, Springer, 2015
[22] L. Marujo, W. Ling, I. Trancoso, C. Dyer, A. W. Black, A. Gershman, D. M. de Matos, J. P. Neto, J. G. Carbonell, "Automatic keyword extraction on Twitter", ACL (2), pp. 637-643, 2015
[23] D. Kim, D. Kim, S. Rho, E. Hwang, "Detecting trend and bursty keywords using characteristics of Twitter stream data", International Journal of Smart Home, Vol. 7, No. 1, pp. 209-220, 2013
[24] P. Torres-Tramon, H. Hromic, B. R. Heravi, "Topic detection in Twitter using topology data analysis", International Conference on Web Engineering, pp. 186-197, 2015
[25] S. Beliga, A. Mestrovic, S. Martincic-Ipsic, "An overview of graph-based keyword extraction methods and approaches", Journal of Information and Organizational Sciences, Vol. 39, No. 1, pp. 1-20, 2015
[26] W. D. Abilhoa, L. N. De Castro, "A keyword extraction method from Twitter messages represented as graphs", Applied Mathematics and Computation, Vol. 240, pp. 308-325, 2014
[27] A. Benny, M. Philip, "Keyword based tweet extraction and detection of related topics", Procedia Computer Science, Vol. 46, pp. 364-371, 2015
[28] W. Chung, H. Chen, J. F. Nunamaker Jr, "A visual framework for knowledge discovery on the web: An empirical study of business intelligence exploration", Journal of Management Information Systems, Vol. 21, No. 4, pp. 57-84, 2005
[29] D.
Isa, L. H. Lee, V. P. Kallimani, R. Rajkumar, "Text document preprocessing with the bayes formula for classification using the support vector machine", IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1264-1272, 2008
[30] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, N. A. Smith, "Part-of-speech tagging for Twitter: Annotation, features, and experiments", 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Vol. 2, pp. 42-47, 2011
[31] P. Carpena, P. A. Bernaola-Galvan, C. Carretero-Campos, A. V. Coronado, "Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins", Physical Review E, Vol. 94, No. 5, pp. 052302, 2016
[32] Z. Yang, K. Gao, K. Fan, Y. Lai, "Sensational headline identification by normalized cross entropy-based metric", The Computer Journal, Vol. 58, No. 4, pp. 644-655, 2014
[33] C. Li, A. Sun, J. Weng, Q. He, "Exploiting hybrid contexts for tweet segmentation", 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 523-532, 2013
[34] J. M. J. Ventura, Automatic extraction of concepts from texts and applications, Dissertation, Universidade Nova de Lisboa, 2014
[35] B. Hong, D. Zhen, "An extended keyword extraction method", Physics Procedia, Vol. 24B, pp. 1120-1127, 2012
[36] C. W. Wong, R. W. Luk, E. K. Ho, "Discovering 'title-like' terms", Information Processing & Management, Vol. 41, No. 4, pp. 789-800, 2005
[37] D. Kuttiyapillai, R. Rajeswari, "A method for extracting task-oriented information from biological text sources", International Journal of Data Mining and Bioinformatics, Vol. 12, No. 4, pp. 387-399, 2015

AUTHOR PROFILE
Dr.
Zaffar Ahmed Shaikh received his PhD in Computer Science from the Institute of Business Administration, Karachi (IBA-Karachi) in 2017. He is currently working as an Assistant Professor at Benazir Bhutto Shaheed University, Lyari, Karachi, Pakistan. He has twenty-three research publications to his credit and has received several research grants from EPFL (Switzerland), Higher Education Commission (Pakistan), Ministry of Higher Education (KSA) and IBA-Karachi. His research interests include Data Sciences, Knowledge Management, Language & Technology, Learning Environments, MOOCs, Social Software, Technology Enhanced Learning etc. Dr. Shaikh is a professional member of ACM and IEEE.