To cite this article: Vegas Fernandez, F. (2020) Intelligent information extraction from scholarly document databases. Journal of Intelligence Studies in Business. 10 (2) 44-61. Article URL: https://ojs.hh.se/index.php/JISIB/article/view/570. This article is Open Access, in compliance with Strategy 2 of the 2002 Budapest Open Access Initiative.
Journal of Intelligence Studies in Business. Publication details, including instructions for authors and subscription information: https://ojs.hh.se/index.php/JISIB/index. Editor-in-chief: Klaus Solberg Søilen. ISSN: 2001-015X. Vol. 10, No. 2, 2020.

Intelligent information extraction from scholarly document databases

Fernando Vegas Fernandez*
Departamento de Ingeniería Civil: Construcción, Universidad Politécnica de Madrid, Spain
*Corresponding author: fvegas@ciccp.es

Received 4 January 2020; Accepted 5 May 2020

ABSTRACT Extracting knowledge from big document databases has long been a challenge. Most researchers do a literature review and manage their document databases with tools that just provide a bibliography; when it comes to retrieving information (a list of concepts and ideas), there is a severe lack of functionality. Researchers do need to extract specific information from their scholarly document databases depending on their predefined breakdown structure.
Those databases usually contain a few hundred documents, information requirements are distinct in each research project, and sophisticated algorithms are not always the answer. Since most retrieval and information extraction algorithms require manual training, supervision, and tuning, it can be faster and more efficient to do the work by hand and dedicate time and effort to defining an effective semantic search list, which is the key to obtaining the desired results. A robust relative importance index definition is the final step to obtain a ranked concept importance list that is helpful both for measuring trends and for finding a quick path to the most appropriate paper in each case. KEYWORDS Business intelligence, concept map, information extraction, knowledge management, literature review, natural language process, NLP, semantic search 1. INTRODUCTION According to the Cambridge dictionary, knowledge is “understanding of or information about a subject that you get by experience or study, either known by one person or by people generally”. It could also be defined as “the state of knowing about or being familiar with something” or “the creation of information from structured or unstructured data” (Upadhyay and Fujii 2016). In other words, knowledge is the result of settling information. “The general purpose of knowledge discovery is to extract implicit, previously unknown, and potentially useful information from data” (Matsuo and Ishizuka 2004). Information can be contained in many documents available in several kinds of formats (Mitra and Chaudhuri 2000), as can be seen in Figure 1. Nowadays there is no distinction between electronic and printed formats, given that any printed paper can be easily converted to an electronic format with scanning and OCR technologies that are commonplace.
The large amount of information available on the Internet has made it easier to reach a constantly increasing number of documents, but it has created the problem of finding the most relevant ones for the specific purpose that the user addresses. Information retrieval (IR) has attracted scientists' attention since the 1960s (Allan et al. 2002). Allan cites Salton's 1983 definition of IR: “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”. Recent publications define IR as “A system to identify a subset of documents in a large text database or, in a library scenario, a subset of resources in a library” (Grishman 2019). An information extraction system identifies a subset of information within a document to extract relevant information from documents. Information extraction (IE) should not be confused with the more mature technology of information retrieval (IR) (Gaizauskas and Wilks 1998). To sum up, IR retrieves relevant documents from collections and IE extracts relevant information from documents. The relevance of extracted information is always related to the interests, goals, and specific information requirements of the researcher; then, once it has been internally processed, information becomes knowledge. Extracting knowledge from big databases and document databases has long been a challenge because the large number of documents makes it hard to select the most relevant data. For that reason, many retrieval algorithms have been developed (Ahmad and Ansari 2012; Boden et al. 2012; Karol and Mangat 2013; Koval and Návrat 2012; Wang et al. 2013) applying distinct sophisticated techniques: fuzzy logic, artificial neural networks (ANN), clustering, machine learning, and hybrids.
There is a specific scenario where the challenge is not to find the right documents but to extract usable information from them: the literature review that every researcher faces when addressing a new research project (Nasar et al. 2018). This is a case of unstructured typed text (see Figure 1). In that situation, IR can be easily solved with the search engines available on the Internet. However, it is much harder to extract and manage information, because very high accuracy is needed and information about many distinct concepts must be extracted from documents depending on the researcher's requirements. In that scenario, knowledge management involves not just information about keywords, tags, and meta-data, but a structured and even quantitative representation of all the concepts that can be relevant for the researcher's objectives. The document database that researchers use in each specific research project is very small, typically 30 to a few hundred documents, a situation far from big data scenarios. For that reason, most of the time and effort should be dedicated to clearly defining specific user information requirements before thinking of a better way to extract information. This article addresses the case of the literature review. Researchers do a literature review, create a document database, and must manage that source of knowledge. There are several tools to manage that kind of document (e.g., EndNote, Mendeley, Word), but they just provide catalog management functionality; when it comes to extracting knowledge, there is a severe lack of functionality. This case is a “little brother” of the general problem of extracting information from PDF files, but the approach, methodology, and principles used in this case are the same as those used in bigger cases. However, the IT tools required are much simpler.
Before searching for concepts in a document database (e.g., ideas, topics) it is necessary to perform a prior concept analysis to define the semantic framework that will be used later (López-Robles et al. 2019; Sarwar and Allan 2019). Sometimes this analysis can be easily performed because it merely consists of defining words to be found in the text (e.g., to achieve a list of possible risks) and other times it is harder.

Figure 1 Distinct information formats.

This article proposes a simple and effective way of extracting information from research document databases depending on the researcher's predefined breakdown structure, obtaining a ranked list of concepts and items to define priorities or to make decisions. These results are relevant for researchers and are an example of what companies could do to organize and use their stored information simply and effectively. 2. PROBLEM DESCRIPTION Researchers use the literature review as a relevant part of their research studies to know the state of the art and to give a sound basis to the statements they include in their papers. Each new research project leads to the creation of a new tailored document database with a few hundred documents that, although possibly partially overlapping with previously used databases, is a fully new one from which researchers will take references to include in their new papers. In fact, they create a library that could be seen as their business intelligence document warehouse (Tseng and Chou 2006), because researchers do not use their document database just to cite previous works but also to extract knowledge from those documents. Scholarly documents address a specific subject and give a conclusion. Researchers can read abstracts and even write a summary for each document.
But there is much more information there, related to the main subject and to marginal topics that might concern researchers, for which they might need to keep a record by annotating statements, methods, algorithms, and authors' positions about specific issues and techniques (Rostami et al. 2015). To do that, researchers could think of a predefined information breakdown structure and a list of premises, concepts, ideas, issues, and techniques that they would like to confirm or refute with the database information. In the end, that is knowledge (Sirsat et al. 2014), and that sort of virtual list containing a reduced number of entries (typically 20 to 50) is itself a handy knowledge reference. Researchers need tools to efficiently carry out that task, but they usually do it by hand or with the help of desktop cataloging tools such as EndNote, Mendeley, or Word. A survey conducted at Universidad Politécnica de Madrid with a selected group of Ph.D. candidates and researchers confirmed this statement. Sophisticated algorithms are not always the right answer for extracting information and knowledge, and most researchers are not open to them because they do not have enough time to try them. Furthermore, most of the scholarly algorithms proposed require manual training, supervision, and tuning (Sirsat et al. 2014; Upadhyay and Fujii 2016) and, in the end, it is faster and more efficient to do the work by hand. Researchers need to retrieve information from scholarly papers and transform it into knowledge. A possible way is to create a list of concepts or items that are representative of each document with respect to what researchers are looking for in their research projects. That list of concepts can be weighted later on to achieve a ranked list of relevant concept elements across the overall reviewed literature. 3.
OBJECTIVE This article addresses the literature review and the knowledge extraction that researchers carry out using scholarly document databases in their research projects, and aims to give an affordable solution to improve that situation. Scientific document databases are much more than a collection of papers that need to be managed and cataloged, a task that several commercial solutions can do. Scientific document databases are a relevant source of information, and researchers need to extract knowledge from them and rank results according to their relevance. 4. RESEARCH METHOD This study analyzes the state of the art in intelligent information extraction from scientific document databases. To do that, a systematic literature review and interviews with researchers at Universidad Politécnica de Madrid were carried out. That way, requirements and available resources were identified. This study also takes advantage of my personal experience as a researcher and as a Chief Information Officer in multinational companies. Advances in linguistic structure definitions were studied in depth to try to find the most efficient way to analyze text and to use it for specified purposes. Newly proposed algorithms were considered to evaluate their adequacy for the objectives proposed. The author's previous experience with a competitive intelligence innovation project, carried out in 2015-2016 to predict risks in projects, is a significant reference as to what current technical solutions can provide and their possibilities to satisfy the requirements proposed in this study. 5. LITERATURE REVIEW A systematic literature review was performed to know the state of the art related to intelligent information extraction, following the searching method by Bettany-Saltikov (Bettany-Saltikov 2012; Kasperiuniene and Zydziunaite 2019; Snyder 2019).
A systematic search, unlike a narrative search that could yield a subset of haphazard and biased documents, achieves a neutral collection of documents to obtain an objective view of the state of the art. To carry out the information retrieval, the initial idea of using the string “intelligent information extraction” linked to scholarly and scientific documents was dismissed because it hardly gave any results; instead, a search for the concept “intelligent information extraction from document databases” was performed in several sources (Renault and Agumba 2016; Xia et al. 2018), with and without quotation marks and sometimes splitting that string into smaller fragments to achieve complementary results. As some sources retrieved more than 313,000 documents (e.g., Google Scholar), the first 400 hits were selected in each source, given that their search engines are supposed to show the most relevant results first. That outcome was filtered by screening titles, keywords, and abstracts to rule out documents that did not meet the subject proposed and those that were unreachable. The results obtained show that distinct sources do not always contain distinct databases; their search engines are different, and, for that reason, the first documents they retrieved were distinct. It is possible to find in Google Scholar almost any document found in the other sources; however, by using distinct sources it is possible to get more results. The number of remaining documents, after filtering and deleting duplicated results, was 58. Concepts such as natural language processing, semantics, and ontologies frequently appear in the documents reviewed.
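The merging and duplicate deletion described above can be done with a few lines of code rather than fully by hand. The sketch below assumes the hit lists from each source have been exported to CSV files with "title" and "authors" columns; file names and column names are illustrative, not part of the original method.

```python
# Minimal sketch: merge hit lists from several search sources and drop
# duplicate titles (assumption: one CSV per source, "title"/"authors" columns).
import csv
import re

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def merge_results(paths):
    """Merge hit lists from several sources, keeping the first copy of each title."""
    seen, merged = set(), []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                key = normalize(row["title"])
                if key and key not in seen:
                    seen.add(key)
                    merged.append(row)
    return merged
```

Normalizing titles before comparing them catches the common case where two sources list the same paper with different capitalization or punctuation; a fuzzier match (e.g., edit distance) could be substituted if needed.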
A linguistic approach to the ontology concept could help clarify its meaning, with several distinct definitions (Schalley 2019): “An explicit specification of a conceptualization”, “The study of the categories of things that exist or may exist in some domain”, and “Catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses a language L for the purpose of talking about D”. Some documents address only IR (Allan et al. 2002; Barde and Bainwad 2018), others only address IE (Lee 1998; Saik et al. 2017), and most of them address both IE and IR. Although IE and IR have been studied since the 1960s, there is a lack of scholarly documents addressing IE and IR from scientific publications: only 7 out of the 58 documents retrieved address them (Esposito et al. 2005; Marinai 2009; Nasar et al. 2018; Rodríguez et al. 2009; Saik et al. 2017; Upadhyay and Fujii 2016; Wang et al. 2013). Esposito et al. address semantic-based tag extraction using their DOMINUS system, achieving accuracies from 93% up to 98% (Esposito et al. 2005). However, those tags are title, author, abstract, and references, and nowadays it is easier to retrieve them with Google Scholar and tools such as EndNote and Mendeley. Marinai aims to extract administrative meta-data from digital articles (Marinai 2009). The paper uses the term “administrative meta-data” to describe details such as title, authors, and publisher (named hereinafter “administrative tags” to avoid confusion). Their outcome is, thus, a file card, the sort of data that tools such as EndNote and Mendeley can provide. Nasar et al.'s article distinguishes meta-data extraction and key-insight extraction and says that “the amount of time that is required to conduct a quality review can take up to 1 year” and that a “systematic literature review can take up to 186 weeks with single/multiple human resources”.
In their survey, they report an average accuracy of 92% in retrieving meta-data when the document includes a Report Document Page and 64% when it does not. When it comes to key-insight extraction, the precision is 42% and the recall is 52% (Nasar et al. 2018). In 2009, Rodríguez et al. wrote a promising article trying to classify software engineering publications with a three-step method using natural language processing (NLP), mainly focused on (but not limited to) HTML documents. No information is provided about their results, precision, and recall rates (Rodríguez et al. 2009). Saik et al.'s article addresses the agricultural biotechnology field, automatically extracting medical and biological knowledge from PubMed texts using semantic analysis and the relational database MySQL. They propose the use of an adapted version of their ANDSystem solution that “involved the creation of a subject domain ontology and semantic linguistic rules (templates) for analyzing natural language texts and extracting knowledge formalized according to a given ontology”. It requires “dictionaries of the objects” that must first be created using templates (Saik et al. 2017). Upadhyay and Fujii propose “a practical sentence extraction procedure and supporting system which we intended to call knowledge extraction system” by applying rules to identify and extract keywords, discourse keywords, and sentences, but human expert support is required and no precision or recall rates are provided (Upadhyay and Fujii 2016). Wang et al. focus on information retrieval (document retrieval) based on word concepts and text clustering. They apply the COSINE algorithm to classify documents (Wang et al. 2013). Natural language processing (NLP) is a constant reference in most publications (Hassan and Le 2020). Sometimes their proposals ask for structured documents and, when not, they need to transform documents into structured data (Dezsenyi et al. 2007; Oro and Ruffolo 2008).
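Precision and recall figures like those quoted above follow from the standard definitions: precision is the fraction of extracted items that are correct, recall the fraction of correct items that were extracted. A minimal sketch (the example numbers are illustrative, not taken from the cited studies):

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# E.g., a system that proposes 100 key insights of which 42 are correct,
# while 81 correct insights actually exist in the corpus:
# precision_recall(42, 58, 39) gives precision 0.42 and recall ~0.52.
```

Keeping both numbers in view matters because a system can trivially maximize one at the expense of the other.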
Other times they need to convert the original PDF files into HTML and text format files to be able to proceed (Hassan and Baumgartner 2005a; Rizvi et al. 2018; Seng and Lai 2010). The methods and algorithms proposed frequently require the involvement of experts and manual training and tuning of the system (Chen and Lynch 1992; Koval and Návrat 2012; Lambrix and Shahmehri 2000; Sirsat et al. 2014; Upadhyay and Fujii 2016). The documents analyzed propose algorithm-based systems and agents with rules to query document databases, although it is common to find unsolved problems when there are heterogeneous data sources (Seng and Lai 2010). Sometimes the solution proposed is just a query with Boolean logic (Lambrix and Shahmehri 2000; Lee 1998; Rahman et al. 2017; Sarwar and Allan 2019) and other times they propose sophisticated techniques such as artificial neural networks (Al-Hroob et al. 2018; Matos et al. 2010), machine learning (Fan et al. 2015; Hassan and Le 2020; Seedah and Leite 2015), and artificial intelligence (Ansari et al. 2016; Gupta and Gupta 2012; Matsuo and Ishizuka 2004), even though artificial intelligence is usually related to NLP (Kim and Chi 2019; Lee 1998). Some documents address information extraction from multimedia contents and files (Srihari et al. 2000; Wolf and Jolion 2004). Other works are intended for specific purposes such as biological knowledge extraction from biomedical web documents (Hu et al. 2004), medical document summarization (Afantenos et al. 2005), and software testing (Lutsky 2000). Some studies aim for “automatic keyword extraction” by considering co-occurrence and frequency to extract keywords (Matsuo and Ishizuka 2004), but do not consider the researcher's interests. Clustering and classifying techniques are often used, such as nearest neighbor classifiers, Bayes, and support vector machines (Shrihari and Desai 2015; Song et al. 2007).
Attempts to intelligently split unstructured PDF files into segments have been made by using ontologies and queries to generate an XML output with understandable data, trying to simulate how human readers would analyze a page (Hassan and Baumgartner 2005b). That “human visual” approach has also been addressed by other authors trying to make text visual, although there is a generalized lack of references and there are strong limitations (Nualart-Vilaplana et al. 2014). There are many proposals, although sometimes they have not been fully tested (Inui et al. 2008) and are just experimental proposals (Fan et al. 2015; Karthik et al. 2008; Li et al. 2015; Milward and Thomas 2000; Xie et al. 2019). The most frequent situation is that the systems proposed need human training, supervision, and tuning (Fan et al. 2015; Sirsat et al. 2014; Upadhyay and Fujii 2016), and even with that, the outcome is not always as good as desired, with poor precision and recall values (Adrian et al. 2015; Al-Hroob et al. 2018; Milward and Thomas 2000). 6. PROPOSED APPROACH In this section, several relevant components of the whole problem are analyzed, creating a breakdown structure to address them separately. The typical path that researchers follow in their literature review process has several stages (Xia et al. 2018). According to Xia, there are three stages: stage 1 includes review planning and searching for relevant articles using electronic databases; stage 2 involves deleting all duplicates according to the title and author and excluding irrelevant papers by reading their titles, abstracts, and keywords; and stage 3 refers to content analysis. We propose a more effective procedure with four stages (Figure 2). 6.1 Stage 1: planning and computer search In stage 1 an electronic search is performed using databases and search engines on the Internet.
To do that, a previous selection of databases is done considering the research subject, e.g., Google Scholar, Web of Science, Scopus, or ResearchGate. Some of those databases share documents: that means that they could have the same content, although the result of the search performed can be quite different because of their different search engines. It is relevant to notice that Google Scholar contains almost every reference included in the other databases, and Stage 3 will take advantage of this fact to automatically obtain document tags. After having selected the desired databases, it is necessary to define the keywords and patterns that will be used with the search engines selected. As it is very easy to perform search operations, it is possible to use several keywords and patterns, with and without quotation marks and sometimes splitting search strings into smaller fragments to achieve complementary results. With each search operation, the outcome is a list of documents that match the query. When the number of results is too high it is necessary to refine the search by changing the keywords and patterns or to select just the desired number of results. Those outcomes can be easily copied and pasted into a spreadsheet, like Excel, to transform them into easy-to-use reports.

Figure 2 Process stage description (Stage 1: planning and computer search; Stage 2: filtering and file retrieval; Stage 3: file reading and tagging; Stage 4: information extraction).
Depending on each database, those lists could contain a variable number of identification fields such as title, authors, date, and even abstract and other tags (“administrative tags”). All that information can be used in stage 2 for filtering purposes. The speed, agility, and flexibility of modern search engines make it reasonable to dismiss, in general, the more sophisticated algorithms proposed in the IR literature. 6.2 Stage 2: filtering and file retrieval In stage 2 a filtering operation is performed to refine the results obtained in the previous stage. Excel filters are used to select or unselect document titles to exclude irrelevant documents. For instance, a possible exclusion rule could be to find in the title the words “image”, “video”, and “media”. Additional available information, e.g., keywords, abstract, or other data, can be used to exclude, for instance, documents corresponding to patents: in this case, the filtering rule would be to find the word “patent” close to the title line. If necessary, documents can be downloaded to check their content and decide whether they fit the subject proposed. When the filtering operation is completed, duplicate results are detected according to the title and authors and then deleted. Finally, the documents are downloaded, and all unreachable documents are excluded. The outcome of this stage is a final list of documents and a database with downloaded PDF files. 6.3 Stage 3: file reading and tagging In stage 3, the documents retrieved should be tagged and reviewed. Meta-data in scientific documents is information commonly associated with administrative properties, such as author names, title, publication date, or journal (Esposito et al. 2005; Marinai 2009; Tseng and Chou 2006), and many researchers have tried to find ways to retrieve it automatically, even recently (Nasar et al. 2018). However, tagging files is very easy now because it can be done using free tools.
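The Stage 2 exclusion filtering and duplicate deletion described above can equally be scripted instead of done with spreadsheet filters. This is a minimal sketch under the assumption that the hit list is a list of dicts with "title" and "authors" fields; the exclusion words are the examples from the text, not a fixed vocabulary.

```python
# Sketch of the Stage 2 filter: drop titles containing exclusion words,
# then delete duplicates detected by title and authors (keeping the first).
EXCLUDE_IN_TITLE = ("image", "video", "media", "patent")  # illustrative rules

def passes_filter(record: dict) -> bool:
    """Keep a record only if no exclusion word appears in its title."""
    title = record.get("title", "").lower()
    return not any(word in title for word in EXCLUDE_IN_TITLE)

def dedupe(records):
    """Remove duplicates by (title, authors), case-insensitively."""
    seen, kept = set(), []
    for r in records:
        key = (r.get("title", "").lower(), r.get("authors", "").lower())
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

def stage2(records):
    return dedupe([r for r in records if passes_filter(r)])
```

The same two operations map directly onto Excel filters and its "Remove Duplicates" feature; the script form simply makes the rules repeatable across search runs.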
For this reason, other equivalently sophisticated algorithms proposed in the IR literature were also dismissed for this purpose. The most direct way to do it is to look for the document title on Google Scholar and to export the reference obtained to Mendeley, EndNote, or another catalog tool (not all of them are free). Both Mendeley and EndNote are desktop tools to catalog references and to allow researchers to include citations and a properly formatted reference list in their papers. With those tools it is also possible to edit tags and update them automatically. Tags considered in this step are only administrative properties, not other content-related tags (López-Robles et al. 2019; Xie et al. 2019). All documents are read at this stage and researchers begin to acquire knowledge. According to Xia, “the technique of content analysis is employed for compressing many words of text in an organized manner, identifying the focus of subject matter, and diagnosing emerging patterns in the current body of knowledge” (Xia et al. 2018). The researchers interviewed at Universidad Politécnica de Madrid had distinct ways and tools to carry out paper revision, but highlighting and summary elaboration are a constant for all of them. At this stage, the action proposed is a revision of the papers, highlighting parts of the text using different colors and even writing a short summary (about 150 words) with keywords, tips, and short sentences. This summary is not an abstract, but a cue to help researchers recall document content later on. 6.4 Stage 4: knowledge extraction According to Hobbs, “Information extraction is the process of scanning text for information relevant to some interest” and “it requires deeper analysis than key word searches” (Hobbs 2002). Natural language processing goes beyond the exact term-matching technique (Rahman et al.
2017) and focuses on concepts, semantics, and relationships between terms to try to retrieve most of the original ideas expressed by document writers. It is a hard task for algorithms and programmers to handle entities, relationships, and events and to process them automatically with a high level of both precision and recall, and they frequently require human-supervised help (Grishman 2019). However, that task is the daily work of the human brain: every time a person reads a paper, they unconsciously create a mind map which connects the most relevant concepts with their interests to generate knowledge. That virtual mind map could be explicitly created by defining key concepts corresponding to the concepts identified after having analyzed the relevant syntagmas, ontologies, and keywords existing in the text studied (Buzan 2004). The criteria to define those key concepts are not the frequency-based traditional model (Fan et al. 2015; Matsuo and Ishizuka 2004), but a tailored definition that researchers can make according to three factors (Sirsat et al. 2014): 1) the overall contribution of the documents studied to the research project, with concepts that attract the researcher's attention because they appear in several documents of the database studied; 2) the researcher's previous knowledge, which makes them search for specific concepts to clarify authors' positions about them; and 3) the researcher's experience, which helps them find concepts that could become relevant according to their perception. Some authors call them “keywords” and “discourse words” (Upadhyay and Fujii 2016). This step affects the final outcome and is directly related to the research project purposes (see Figure 3).
The aim of defining those concepts is not to summarize documents but to summarize their contribution to the research project, making it possible to characterize documents as a sort of layout and schematic summary, along the same lines as some proposals for document image layout analysis (Oliveira and Viana 2017). Accordingly, several distinct possible concept types are shown in Table 1. In this table, “type” refers to the way the concept is found in the text reviewed and how it is annotated. Regarding the way to find them (“trigger”), there are two main possibilities: a word (or group of words) or a sentence. It is a word (or group of words) when its occurrence undoubtedly means a concept expression, e.g., “ANN”, and it is a sentence when concepts are expressed in a more complex way, so that no single word is enough to summarize them. Regarding the way concepts may appear (“variation”), they could be specific words and groups of words or an open or closed name list. Regarding the way concepts are “annotated” in each document, they can be registered just with an “x” mark (they meet the required keyword, idea, or condition) or they can be labeled with a descriptive list element or name. Lastly, concepts can be numeric values; in that case, the value is annotated. To fully understand Table 1, a detailed description of the types is included in Table 2. Researchers can define as many concepts as needed to cover each detail that is relevant for their research and that they will want to include in their papers. Semantic analysis is an undeniable requirement to achieve a good annotation, which is the basis of a key concept definition (Malik et al. 2010). Once the concept definition has been done, a new document review is needed to identify the concepts in all the documents and to annotate their occurrences.
This operation is quicker than one might think, thanks to desktop tools that make the use of complicated algorithms and programs unnecessary. There are free solutions, such as Adobe Reader and DocFetcher. DocFetcher creates and uses an internal index (in the same way Adobe Acrobat does) that allows users to perform quick Boolean searches for any word or string in a document database.

Figure 3 Key concept definition.

Table 1 Concept types.
Type                    Trigger                    Variations             Annotation
Keyword                 Word                       Word, group of words   "x"
Idea/opinion/statement  Sentence                   N.A.                   "x"
Position                Sentence                   N.A.                   List element
Use case                Sentence / table / figure  List                   List element
Name                    Sentence                   List                   Name
Numeric                 Sentence / table / figure  N.A.                   Value
Condition               Sentence                   List                   "x"

Table 2 Type definition.
Keyword: applies to the undeniable meaning of a word or group of words in a specific context, e.g., Information Retrieval, Cosine, Query, Machine Learning, Ontology, ANN, or NLP.
Idea/opinion/statement: applies to a conceptual meaning that could be expressed with distinct words and sentences, e.g., "need for improvement", "knowledge extraction", "lack of objectivity", or "biases".
Position: applies to statements, use cases, and other passages where authors show whether they approve, reject, or merely cite a particular subject, e.g., regarding a specific technique, they "use or recommend", "criticize", or "cite" it.
Use case: applies to distinct options researchers might want to keep track of, such as kind of technology, type of chart, or type of scale.
Name: applies to concepts that can be registered with their names, e.g., system, country, or activity.
Numeric: applies to concepts that can be quantitatively measured so that it is possible to register their value, e.g., precision or recall.
Condition: applies to specific conditions that the document scope may meet to match the researcher's interests, e.g., a specific industry, country, or field.
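A DocFetcher-style Boolean search over a small document set can be sketched as follows. The prefix matching and the file names are simplifying assumptions; real tools use full Lucene-style indexing with wildcards:

```python
# Sketch of a Boolean keyword search over a document database, in the
# spirit of DocFetcher. Matching is by word prefix ("improv" matches
# "improving"); document names and texts are hypothetical examples.
import re

def words_of(text):
    return re.findall(r"[a-z]+", text.lower())

def search_any(docs, terms):
    """OR query: documents containing a word starting with any term."""
    terms = [t.lower() for t in terms]
    return sorted(name for name, text in docs.items()
                  if any(w.startswith(t) for w in words_of(text) for t in terms))

def search_all(docs, terms):
    """AND query: documents containing a word match for every term."""
    terms = [t.lower() for t in terms]
    return sorted(name for name, text in docs.items()
                  if all(any(w.startswith(t) for w in words_of(text)) for t in terms))

docs = {
    "kim2019.txt": "improving the performance of NLP-based tools",
    "li2015.txt": "is prone to several limitations that offer opportunities",
    "adrian2015.txt": "their sometimes low recall may be compensated by adjusting",
}
hits = search_any(docs, ["improv", "limitation"])  # kim2019 and li2015 match
```

The OR query mirrors the idea/opinion/statement search described in the next paragraph: one query per concept, covering the distinct word stems that can express it.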
For instance, to find whether documents indicate that further improvement is needed (an idea/opinion/statement type concept), it would be possible to search for "improve" and "limitation" and retrieve the texts "improving the performance of NLP-based tools" and "there are also practical limitations in rule generation …" (Kim and Chi 2019). However, the texts "their sometimes low recall may be compensated by adjusting" (Adrian et al. 2015) and "is prone to several limitations that, in turn, offer opportunities for future research" (Li et al. 2015) would not be retrieved. This manual process is comparable to the automated metadata-retrieval method of Li et al. (2015). Their process-lexicon extraction and task identification method for process mining requires manual task annotation to train a statistical model and yields over 75% classification accuracy, 70% precision, and 95% recall. The method proposed here improves accuracy, precision, and recall up to 100%, and it is no more time-consuming than most of the automated methods proposed in the literature. To efficiently register those knowledge tags, the use of a spreadsheet is suggested. This practice allows for an additional feature: a quantitative measure of the relevance of each concept, i.e., a relative importance index (RII). This idea can be found in many works (Alashwal and Al-Sabahi 2018; Jarkas and Haupt 2015; Nagalla et al. 2018); for this research project, the solution proposed by Vegas-Fernández was used (Vegas-Fernández 2019; Vegas-Fernández and Rodríguez López 2019). This method applies a weight to each document that considers the document type (standard or regulation, doctoral thesis, book, indexed journal, lecture source, unindexed journal, master's thesis, a website run by a renowned organization, or a standard website).
The document's date and scope are also considered, by adding 0.5 for documents published after 2010 and subtracting 0.5 when they are intended for a specific activity or a particular country. The final score is the weight assigned to each document, which is counted whenever the document matches a concept (regardless of whether the annotation is an "x", a name, or a value). The RII is the ratio between the weighted count of documents matching a concept and the maximum value that weighted count takes over all concepts. The outcome at this stage is a ranked list of key concepts, which is a quantitative outcome of knowledge extraction.

7. KNOWLEDGE EXTRACTION EXAMPLE USING THE PROPOSED SYSTEM

The process of knowledge extraction carried out for this study is explained next, to make it easy to understand the scope, possibilities, and limits of the proposed system. Each of the distinct steps at each stage is described here with data that will allow readers to form their own judgment of the system.

7.1 Stage 1: planning and computer search

Researchers are used to searching scholarly databases and choose them according to their preferences. Their previous experience and their knowledge of previous publications related to their research subject give them the orientation required to select the search strings and the best databases. Searching for documents in Google Scholar is a must, but the number of retrieved documents can be too high. In this case, the chosen search string was "intelligent information extraction from document databases", without quotation marks, in order to obtain results. That search yielded 313,000 results in Google Scholar, but the outcome was truncated to the 400 most relevant titles. The systematic search was conducted in eight sources, and 974 documents were originally retrieved from Google Scholar, Web of Science, Scopus, ScienceDirect, ResearchGate, ASCE, Elsevier, and Mendeley.
Outcomes were post-processed in an Excel workbook to manage each database report; that process consisted of converting the HTML information yielded by each search engine into understandable and easy-to-use Excel rows. This step took less than 3 hours. The number of documents retrieved is displayed in Table 3.

Table 3 Information retrieval initial summary (number of documents).
Source          Initial outcome
Google Scholar  383
Web of Science  2
Scopus          85
ScienceDirect   26
ResearchGate    350
ASCE            20
Elsevier        3
Mendeley        105
Total           974

7.2 Stage 2: filtering and file retrieval

This stage is laborious because it is often impossible to know whether a document will be useful without reading it. According to titles, keywords, and abstracts, an initial filter can be applied to reject documents that do not meet the requirements. Some search engines do not provide abstracts and keywords in their outcomes, and then the filter can only consider titles. In those cases, a first filter was applied by removing unwanted documents according to their titles, and the remaining ones were downloaded to check by skim-reading whether they met expectations. Each downloaded document that was finally accepted was saved in the computer library, labeled in author-title format. This step took about 60 hours, and the number of documents finally selected was 58, after manually adding three more documents. Table 4 shows the number of remaining documents after removing duplicates. There were three types of documents in the list: 62% were journal articles, 36% conference proceedings, and 2% books. The journal article impact distribution is shown in Figure 4.

Table 4 Information retrieval final summary (number of documents).
Source          Initial outcome   Resulting outcome
Google Scholar  383               24
Web of Science  2                 2
Scopus          85                6
ScienceDirect   26                0
ResearchGate    350               8
ASCE            20                4
Elsevier        3                 0
Mendeley        105               11
Others          -                 3
Total           974               58

Figure 4 Impact distribution of the retrieved journal articles (Q factor).
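The conversion of each engine's HTML output into spreadsheet rows, mentioned at the start of this stage's preparation, can be sketched as follows. The markup and the class="result" attribute are hypothetical simplifications of what a real results page contains:

```python
# Sketch of post-processing a search-engine results page into
# spreadsheet-ready rows. The HTML structure below is a hypothetical,
# simplified stand-in for real engine output.
from html.parser import HTMLParser
import csv, io

class ResultParser(HTMLParser):
    """Collect (title, url) pairs from <a class="result" href="...">title</a>."""
    def __init__(self):
        super().__init__()
        self.rows, self._href, self._buf = [], None, []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == "result":
            self._href, self._buf = a.get("href"), []
    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._buf).strip(), self._href))
            self._href = None

def to_csv(rows):
    """Render the rows as CSV text, ready to paste into a workbook."""
    out = io.StringIO()
    w = csv.writer(out)
    w.writerow(["title", "url"])
    w.writerows(rows)
    return out.getvalue()

html = ('<div><a class="result" href="http://x/1">Paper one</a>'
        '<a class="result" href="http://x/2">Paper two</a></div>')
p = ResultParser()
p.feed(html)
```

In practice each engine needs its own small parser, since result markup differs, but the pattern (parse anchors, emit rows) is the same.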
7.3 Stage 3: file reading and tagging

Two relevant tasks were done at this stage: reading and tagging documents. Google Scholar and its citation tool were used to find each document and to create an entry in the Mendeley catalog (Figure 5). Most tags are saved automatically, and Mendeley, EndNote, and other tools can find reference updates, although sometimes it is necessary to look for a specific missing tag, such as the DOI, the publisher, or the URL of the document (see Figure 6).

Figure 5 Tag retrieving with Google Scholar.

Figure 6 Tag management with Mendeley.

This process does not take long (5 hours for 58 documents), and researchers can perform this part while retrieving and reading documents. Reading documents takes much longer, and highlighting and writing the summary proposed in section 6.3 does not add any significant extra time.

7.4 Stage 4: knowledge extraction

At this key stage, 25 concepts were defined using the types defined in Table 2 (see Table 5). An Excel table was used to annotate documents when they met specific criteria, according to Table 5. Part of this work could be done while reading and highlighting documents. To complete the annotation task, the free program DocFetcher was used. Its outcome is a list of the files that meet the search criteria, showing the number of matches in each file, the context paragraph where the keywords were found, and a direct link to each file. These features make it possible to review the presence of any concept in 5-10 minutes once all the documents have been read, and it becomes extremely easy to carry out efficient searches. It is necessary to reject documents whose matches occur only in the "References" section. The total time dedicated to the 25 concepts defined was less than 4 hours. The outcome of this step is a table with the list of documents, their tags, summary, and concepts (Figure 7).
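The rejection of matches that occur only in the "References" section can be sketched as follows; detecting the section boundary with a heading regex is a simplifying assumption:

```python
# Sketch of the annotation filter described above: a concept hit only
# counts if the matching term appears before the "References" heading.
# Heading detection via regex is a simplifying assumption.
import re

def body_text(text):
    """Return the text before a 'References' heading (case-insensitive)."""
    m = re.search(r"^\s*references\s*$", text,
                  flags=re.IGNORECASE | re.MULTILINE)
    return text[:m.start()] if m else text

def concept_hit(text, term):
    """True if the term occurs in the document body (not only in references)."""
    return term.lower() in body_text(text).lower()

doc = ("We apply clustering to group papers.\n"
       "References\n"
       "Smith, J. Ontology methods.")
```

Here "clustering" counts as a hit, while "ontology" does not, because it only appears in the reference list.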
Figure 7 shows the concept map: most values are "x", there are numeric values for the precision and recall concepts, and there are names. The bottom line displays the count of documents that meet each concept requirement. The relative importance index (RII) method assigns a distinct importance to the hits obtained in each document; this way, a weighted count is obtained for each concept. "Semantics" is the most important concept and is the basis for calculating the RII of every other concept. In this case "semantics" is a rather broad concept, because almost every document mentions semantics without a specific purpose, but that is not a problem, as shown in the next section.

Table 5 Key concepts for knowledge extraction.
Concept                  Type       Explanation
Scientific papers        Condition  The document addresses scientific papers
IE                       Keyword    Information extraction is considered
IR                       Keyword    Information retrieval is considered
Improvement              Idea       Need for improvement of current IE/IR techniques
Concepts                 Keyword    Concept as an entity, related to semantics and ontologies
Cosine                   Keyword    Algorithm intended to evaluate similarity
NLP                      Keyword    Natural language processing is cited
Knowledge                Keyword    Knowledge extraction concept is cited
ANN                      Keyword    Artificial neural network is cited
Fuzzy                    Keyword    Fuzzy techniques and fuzzy logic are cited
Bayes                    Keyword    Bayes decision function (classification method) is cited
Semantics                Keyword    Semantics is cited
Ontology                 Keyword    Ontology is cited
Query                    Keyword    Query is cited, usually related to Boolean operations
Rule-based               Keyword    Rule-based and rule are cited in relation to queries
Clustering               Keyword    Clustering technique is used to classify documents
Machine learning         Keyword    Machine learning is cited
Artificial intelligence  Keyword    Artificial intelligence is cited
Manual                   Idea       Manual operation is needed for supervision, training, etc.
System                   Keyword    A system is proposed, although different in each paper
Precision                Numeric    Percentage of precision yielded by the proposed system
Recall                   Numeric    Percentage of recall yielded by the proposed system
Tags                     Keyword    Administrative tags are used and retrieved
Specific activity        Name       The document addresses some specific kind of papers
Specific country         Name       The document addresses some specific country

8. RESULTS AND DISCUSSION

The results of the knowledge extraction performed according to the proposed method can be expressed using the defined concepts and their RII. A ranked list of concepts using the RII gives an accurate view of how scientists address information extraction as a gateway to knowledge extraction (Table 6), and a Pareto diagram gives a better understanding of the relative importance of each concept (Figure 8).

Table 6 Ranked list of concepts.
#   Concept                  RII
1   Semantics                100%
2   Knowledge                81%
3   IE                       78%
4   Query                    74%
5   Improvement              69%
6   IR                       69%
7   Manual                   66%
8   Tags                     63%
9   Rule-based               61%
10  Machine learning         55%
11  Ontology                 49%
12  Concepts                 47%
13  Clustering               45%
14  System                   44%
15  Precision                40%
16  Recall                   38%
17  Specific activity        33%
18  NLP                      30%
19  Cosine                   23%
20  Fuzzy                    17%
21  Artificial intelligence  17%
22  Bayes                    14%
23  Scientific papers        12%
24  Specific country         11%
25  ANN                      11%

It is remarkable that "knowledge extraction" is the second most cited concept, after "semantics," whose presence is almost compulsory in this kind of document. "Information extraction" is third in the list and "information retrieval" is sixth, although the search string was "intelligent information extraction". This shows how close the two concepts are in the literature. Figure 8 shows that the results obtained do not follow the Pareto rule. It is possible to differentiate three groups according to concept relevance: 1 to 9, 10 to 18, and 19 to 25. The first group includes basic concepts related to automation, e.g., "query" and "rule-based".
However, this group contains concepts indicating that there are strong limitations in the state of the art: “Need for improvement of current IE/IR techniques” is placed fifth and “Manual operation is needed for supervision, training, etc.” is placed seventh. “Tags” is placed eighth (administrative tags) and this fact proves that the solutions proposed to extract information frequently address tags, less relevant than insights information. The second group includes concepts related to the technology applied to retrieve and extract information (machine learning, ontologies, concepts, and clustering). It also includes the concept “system” that represents all the systems proposed. All of them are different and, for that reason, they were grouped in that concept to make it possible to give them some visibility. The concept “specific activity,” placed seventeenth, shows that a significant part of the documents studied are intended for a specific purpose, and that fact makes them less applicable to this study. This group includes the concepts “precision” and Reference Type Author Year Title Comments Scientific papersIE IR Improvement Concepts Cosine NLP Knowledge ANN Fuzzy Bayes Semantics Ontology Query Rule-based Clustering Machine learning Artificial intelligenceManual System Precision Recall Tags Specific activity Specific country Journal Arti cl e Adri an, W. T., Leone, N., and Manna, M. 2015 Ontol ogy-dri ven i nformati on extracti on Archi vos con datos no es tructurados homogeneos . Anál i s i s de curri cul um. Res ul tado pobre.x x x x x x x KnowRex 50% 30% Curri cul um Journal Arti cl e Afantenos , S., Karkal ets i s , V., and Stamatopoul os , P. 2005 Summari zati on from medi cal documents : a s urvey Encues ta. Métodos de res umen de documentos médi cos .x x x x x x x x x x x x Review Medi cal Proceedi ng Ahmad, M. W., and Ans ari , M. 2012 A Survey: Soft Computi ng i n Intel l i gent Informati on Retri eval Sys tems Informati on retri eval IR. Survey. 
Expl i ca métodos IR: al gori tmos Fuzzy, ANN. Al tavi s ta.x x x x x x x x x x x Journal Arti cl e Al -Hroob, A., Imam, A. T., and Al -Hei s a, R. 2018 The us e of arti fi ci al neural networks for extracti ng acti ons and actors from requi rements documentCombi naci ón de NLP y ANN (arti fi ci al neural networks ). Defi ni ci ón de l exi cons , s i ntaxi s y anál i s i s s emánti co. Proponen IT4RE. Semi automáti co. Pobre res ul tado.x x x x x x x x x x IT4RE 47% 79% x Requi rements Proceedi ng Al l an, J., As l am, J., Bel ki n, N., Buckl ey, C., Cal l an, J., Croft, B., Dumai s , S., Fuhr, N., Harman, D., and Harper, D. J.2003 Chal l enges i n i nformati on retri eval and l anguage model i ng: report of a works hop hel d at the center for i ntel l i gent i nformati on retri evalDefi ni ci ón, retos futuro. Lenguaj e. Informati on retri eval IR. Res úmenes . Concl us i ón.x x x x x x x x x x x x x Journal Arti cl e Ans ari , A., Maknoj i a, M., and Shai kh, A. 2016 Intel l i gent i nformati on extracti on bas ed on arti fi ci al neural network QAS (ques ti on ans weri ng s ys tem). NLP (natural l anguage proces s i ng). ANN (Arti fi ci al Neural Network). DNN (Deep Neural Network). Obtenci ón de res pues tas . Us a IE para "extraer" i nformaci ón , no documentos . Muy el emental , no val e para nada.x x x x x x x Proceedi ng Barde, B. V., and Bai nwad, A. M. 2018 An overvi ew of topi c model i ng methods and tool s Cl as i fi caci ón por temas (topi c). IR. NLP. Entrenami ento de model os . Des cri be uti l i dades /herrami entas .x x x x x x x x Journal Arti cl e Boden, C., Lös er, A., Nagel , C., and Pi eper, S. 2012 Fact-aware document retri eval for i nformati on extracti on Bl ueFact. IR i nformati on retri eval . IE Informati on extracti on. Semantyc and Syntacti c i nformati on. Bayes y Heurís ti cos . Bas ado en pal abras . 
Ori entado a fi l trar documentos porque no obti ene detal l es .x x x x x x x x Bluefact x Journal Arti cl e Chen, H., and Lynch, K. J. 1992 Automati c cons tructi on of networks of concepts characteri zi ng document databas esSi s tema de i ndexaci ón. Informaci ón fragmentada. Indexaci ón manual . Campos como EndNote. Identi fi can conceptos . Cos i ne al gori thm.x x x x x x x x x x Journal Arti cl e Dezs enyi , C., Dobrowi ecki , T. P., and Mes zaros , T. 2007 Adapti ve i nformati on extracti on from uns tructured documents Si s tema. Trans formaci ón documento a es tructurado. No hay s oftware.x x x x x x x x x x x x Proceedi ng Es pos i to, F., Feri l l i , S., Bas i l e, T. M. A., and Di Mauro, N.2005 Semanti c-bas ed acces s to di gi tal document databas es Si s tema DOMINUS para extraer es tructuras . Cl as i fi caci ón documentos . Extracci ón i nformaci ón IE. Tags : ti tl e, authors , abs tract and bi bl i ographi c references .x x x x x x x x DOMINUS x Journal Arti cl e Fan, H., Xue, F., and Li , H. 2015 Proj ect-bas ed as -needed i nformati on retri eval from uns tructured AEC documentsPara proyectos pequeños con un número pequeño de documentos . Al gori tmos machi ne l earni ng y Bayes . Árbol deci s i ón. Encues tas . Experi mental . Supervi s i ón manual .x x x x x x x x x x x System Medi um-s i zed cons tructi on proj ectsHong-Kong Journal Arti cl e Gai zaus kas , R., and Wi l ks , Y. 1998 Informati on extracti on: Beyond document retri eval Di s ti ngue IE e IR. Requi ere expertos . Hi s tori a. Ci ta proyectos académi cos .x x x x x x x x x x x x Journal Arti cl e Gri s hman, R. 2019 Twenty-fi ve years of i nformati on extracti on Definiciones. Excl us i ón conoci mi entos y opi ni ones . NLP. Anal i zar es tructura y generar rel aci ones . IE Informati on extracti on. IR s ubs et of documents . IE s tructure: named enti ti es , enti ti es , rel ati ons , and events .x x x x x x x x 70% 70% x Journal Arti cl e Gupta, P., and Gupta, V. 
2012 A s urvey of text ques ti on ans weri ng techni ques Propues ta de arqui tectura. Extracci ón de res pues tas . IE. Nada concreto. Revi s i ón.x x x x x x x x x Journal Arti cl e Has s an, F. u., and Le, T. 2020 Automated Requi rements Identi fi cati on from Cons tructi on Contract Documents Us i ng Natural Language Proces s i ngPara i denti fi car requi s i tos contratos . Natural l anguage proces s i ng (NLP). Regl as (rul e bas ed) + 4 al gori tmos machi ne l earni ng. Preproces o. Por fas es . Bayes y Support Vector Machi nes (SVM). Li mi tado en al cance. Experi mental .x x x x x x x x x x Method 95% 90% x Cons tructi on contracts Proceedi ng Has s an, T., and Baumgartner, R. 2005 Intel l i gent text extracti on from pdf documents Extracci ón de datos de PDF. Convers i ón de PDF a HTML. No l ogran un avance.x x x x x LIXTO x Book Has s an, T., and Baumgartner, R. 2005 Intel l i gent wrappi ng from PDF documents Segmentaci ón documento en bl oques . Ontol ogía. Query. Experi mental .x x x x x x x Journal Arti cl e Hobbs , J. R. 2002 Informati on extracti on from bi omedi cal text Informati on extracti on IE. Defi ni ci ones . Preci s i on. Recal l . It requi res deeper analysis than key word searches. Neces i ta i ntervenci ón manual .x x x 60% 60% Bi omedi ci ne Proceedi ng Hu, X., Li n, T. Y., Song, I., Li n, X., Yoo, I., Lechner, M., and Song, M.2004 Ontol ogy-bas ed s cal abl e and portabl e i nformati on extracti on s ys tem to extract bi ol ogi cal knowl edge from huge col l ecti on of bi omedi cal web documentsSi s tema SPIE. Extracci ón automáti ca en entorno concreto. Al cance concreto. 
Poca i ntervenci ón manual .x x x x x x x x x x SPIE x Bi ol ogy Proceedi ng Inui , K., Abe, S., Hara, K., Mori ta, H., Sao, C., Eguchi , M., Sumi da, A., Murakami , K., and Mats uyos hi , S.2008 Experi ence mi ni ng: Bui l di ng a l arge-s cal e databas e of pers onal experi ences and opi ni ons from web documentsTecnol ogía de proces o de l enguaj e s obre conteni do Web para extraer i nformaci ón de experi enci as y opi ni ones . Pendi ente de val oraci ón.x x x x x x Experience Minning x Japanes e Web Japan Journal Arti cl e Karol , S., and Mangat, V. 2013 Eval uati on of text document cl us teri ng approach bas ed on parti cl e s warm opti mi zati onCl us ter. Cl as i fi caci ón documentos con técni cas Fuzzy. Informati on Retri eval IR. Propone dos técni cas híbri das : KPSO y FCPSO. Prueba con 3.000 documentos .x x x x x x x x x x x x Proceedi ng Karthi k, M., Mari kkannan, M., and Kannan, A. 2008 An i ntel l i gent s ys tem for s emanti c i nformati on retri eval i nformati on from textual web documentsExtraen i nformaci ón. Us an al gori tmo compl ej o en fas es . Mej oran res ul tados de XML. Experi mental .x x x x x x SEMINRET x Journal Arti cl e Ki m, T., and Chi , S. 2019 Acci dent cas e retri eval and anal ys es : us i ng natural l anguage proces s i ng i n the cons tructi on i ndus tryIE con regl as y condi ti onal random fi el d CRF. Extraen i nformaci ón de i nformes de acci dente. IE con OKAPI BM25. NLP. Semánti ca. Tokeni zaci ón. Li mi taci ones . Poca i nformaci ón.x x x x x x x x x x x x x System (Python) 85% 68% Cons tructi on acci dent Journal Arti cl e Koval , R., and Návrat, P. 2012 Intel l i gent s upport for i nformati on retri eval of web documents Informati on retri eval . IE. Obtenci ón documentos en l a Web que cumpl an con requi s i tos . Intervenci ón manual . Cl us teri ng.x x x x x x x x x x Tree Clustering 80% x Web Journal Arti cl e Lambri x, P., and Shahmehri , N. 
2000 Queryi ng documents us i ng content, s tructure and properti es Bús queda en propi edades y conteni do. Bus ca palabras. Cons ul ta manual y query. Toma deci s i ones . Creaci ón índi ce. Bús queda adaptada al conoci mi ento previ o. Al tavi s ta.x x x x x x x x x Query x Proceedi ng Lee, R. 1998 Automati c i nformati on extracti on from documents : A tool for i ntel l i gence and l aw enforcement anal ys tsSi s tema con querys para obtener i nformaci ón. No l a cl as i fi ca, s ól o l a al macena. Enti dades . IE i nformati on extracti on. Revi s i ón manual .x x x x x x Journal Arti cl e Li , J., Wang, H. J., and Bai , X. 2015 An i ntel l i gent approach to data extracti on and tas k i denti fi cati on for proces s mi ni ngExtracci ón i nformaci ón IE. Cons i guen metadatos . Experi mental . Machi ne l earni ng. Preci s i ón 90%. Fal s os pos i ti vos 30%.x x x x x x x x x Method 70% 87% Journal Arti cl e López-Robl es , J.-R., Gual l ar, J., Otegi -Ol as o, J.-R., and Gamboa-Ros al es , N.-K.2019 Bi bl i ometri c and themati c anal ys i s (2006-2017) Anal i za evol uci ón revi s ta EPI. Sci MAT para anál i s i s . Local i za l os temas (conceptos ). Interconexi ones .x x x x x x x Journal Arti cl e Luts ky, P. 2000 Informati on extracti on from documents for automati ng s oftware tes ti ng Us o de l enguaj e natural NLP. Comprobaci ón de s oftware. Val i daci ón. Si s tema s peci fi cati on i nformati on from text (SIFT).x x x SIFT x Software Journal Arti cl e Mal i k, S. K., Prakas h, N., and Ri zvi , S. 2010 Semanti c annotati on framework for i ntel l i gent i nformati on retri eval us i ng KIM archi tectureSi s tema. Entorno Web. Semánti ca. Ontol ogías . Lenguaj e natural NLP.x x x x x x x x x x KIM Proceedi ng Mari nai , S. 2009 Metadata extracti on from PDF papers for di gi tal l i brary i nges t Extracci ón metadatos de PDF. Convi erten PDF a XML. Us an Greens tone.x x x x x x pdf2gsdl 23% 74% Proceedi ng Matos , P. F., Lombardi , L. O., Pardo, T. 
A., Ci ferri , C. D., Vi ei ra, M. T., and Ci ferri , R. R.2010 An envi ronment for data anal ys i s i n bi omedi cal domai n: i nformati on extracti on for deci s i on s upport s ys temsOri entado a bi omedi ci na. Anemi a de cél ul as fal ci formes . Informatoi n extracti on IE. Datos numéri cos . Documentos no es tructurados .x x x x x x x x x x x x x x Bi omedi ci na Journal Arti cl e Mats uo, Y., and Is hi zuka, M. 2004 Keyword extracti on from a s i ngl e document us i ng word co-occurrence s tati s ti cal i nformati onExtrae pal abras con al gori tmo. No val ora el s enti do. Obti ene l os que más aparecen. Co-occurrence.x x x x x Proceedi ng Mi l ward, D., and Thomas , J. 2000 From i nformati on retri eval to i nformati on extracti on IE, IR. NLP. Hi ghl i ght. Query con operadores Bool eanos . Experi mental . Res ul tados pobres y l i mi tados .x x x x x x x x 77% 55% x Journal Arti cl e Mi tra, M., and Chaudhuri , B. 2000 Informati on retri eval from documents : A s urvey Encues ta es tado arte en bús queda e i ndexaci ón. Ti pos documentos . Des es tructuraci ón. Mul ti -domi ni o de ori gen. Model o Bool eano. Al gori tmos . OCR.x x x x x x x x Journal Arti cl e Nas ar, Z., Jaffry, S. W., and Mal i k, M. K. 2018 Informati on extracti on from s ci enti fi c arti cl es : a s urvey Extracci ón i nformaci ón artícul os académi cos . Al gori tmos HMM, CORA, CRF, SVM. Extrae metadatos (datos artícul o) y Key-i ns i ghts (mens aj es dentro del texto).x x x x x x x x x x x x x 42% 52% x Journal Arti cl e Nual art-Vi l apl ana, J., Pérez-Montoro, M., and Whi tel aw, M.2014 Cómo di buj amos textos : Revi s i ón de propues tas de vi s ual i zaci ón y expl oraci ón textualVi s i ón mul ti di mens i onal del texto. Mi nería de datos . Textos i ndi vi dual es y col ecci ones . Anál i s i s vi s ual de es tructura. Intentan es tructurar.x x x x Proceedi ng Ol i vei ra, D. A. B., and Vi ana, M. P. 
2017 Fas t CNN-bas ed document l ayout anal ys i s Si s tema uni di mens i onal anál i s i s automáti co. CNN (convol uti onal neural networks ). Anal i zan i mágenes .x x x x CNN Proceedi ng Oro, E., and Ruffol o, M. 2008 Xonto: An ontol ogy-bas ed s ys tem for s emanti c i nformati on extracti on from pdf documentsExtracci ón de PDF. Ontol ogía. ontol ogy-bas ed s ys tem for s emanti c IE from PDF documents XONTO. Convers i ón de documentos no es tructurados a es tructurados .x x x x x x x x x XONTO x Proceedi ng Rahman, N. A., Soom, A. B. M., and Is mai l , N. K. 2017 Enhanci ng Latent Semanti c Anal ys i s by Embeddi ng Taggi ng Al gori thm i n Retri evi ng Mal ay Text DocumentsLatent Semanti c Indexi ng (LSI). Apl i caci ón a l engua Mal ay. Mej ora de LSI. Términos y conceptos. Definiciones. Us a eti quetas (tags ).x x x x x x x LSAT 65% 70% x Mal ay l anguage Proceedi ng Ri zvi , S. T. R., Merci er, D., Agne, S., Erkel , S., Dengel , A., and Ahmed, S.2018 Ontol ogy-bas ed Informati on Extracti on from Techni cal Documents Extracci ón de i nformaci ón de tabl as . Convers i ón de PDF a HTML. Bas ado en ontol ogías . Automáti co.x x x x x x 88% 100% Tabl es Proceedi ng Rodríguez, A., Col omo, R., Gómez, J. M., Al or-Hernandez, G., Pos ada-Gomez, R., Juarez-Marti nez, U., Gayo, J. E. L., and Vi dyas ankar, K.2009 A propos al for a s emanti c i ntel l i gent document repos i tory archi tecture Li teratura académi ca. IE. IR. SIDRA s i s tema híbri do. Ori entado a HTML. Ontol ogía. Query. Keywords . Ranki ng por rel evanci a en cuanto al número de ci tas .x x x x x x x x SIDRA x Software Irel and Journal Arti cl e Ros tami , N. A. 2014 Integrati on of Bus i nes s Intel l i gence and Knowl edge Management – A l i terature revi ewDefi ne Knowl edge management. Rel aci ón con BI. 
Figure 7 Reference list with concepts.

"recall": the average values for precision and recall in the literature review performed are 64% and 70%, respectively, which are very far from a comfortable confidence level. The third group contains the least relevant concepts, which are related to the most sophisticated techniques, e.g., "artificial intelligence." This suggests that these techniques are still far from the maturity that would make them commonplace. The concept "scientific papers" is placed twenty-third because only seven of the 58 documents studied address this subject.

The specific field of knowledge extraction from scholarly documents calls for affordable solutions that are easy to work with. Nasar states that "Manual analysis is not scalable and efficient" and cites other authors who report that a systematic literature review can take one to three years (Nasar et al. 2018). This study used a manual method to extract knowledge, starting with a systematic literature review, and the whole process took less than one month. The results presented in this study show that knowledge extraction can be performed efficiently by hand with the help of commonplace desktop tools. The limited scalability of manual analysis matters little, because researchers usually face a scholarly library of only a few hundred documents in each research project.
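The precision and recall values discussed above follow their standard definitions. As a minimal illustration (a Python sketch; the document identifiers and relevance judgments are invented for the example, not taken from the study), this is how the two measures are computed for a single extracted concept:

```python
# Precision and recall of an automatic concept-extraction pass,
# judged against a manually built gold standard.
# All document IDs below are hypothetical.

def precision_recall(extracted, relevant):
    """precision = |extracted & relevant| / |extracted|
    recall = |extracted & relevant| / |relevant|"""
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Documents where a tool flagged the concept "NLP" ...
extracted = {"doc01", "doc02", "doc05", "doc07", "doc09"}
# ... versus the documents a human reviewer judged to actually address it.
relevant = {"doc01", "doc02", "doc03", "doc05", "doc08", "doc09", "doc10"}

p, r = precision_recall(extracted, relevant)
print(f"precision = {p:.0%}, recall = {r:.0%}")  # precision = 80%, recall = 57%
```

Averaging such per-system pairs across the reviewed papers is what yields summary figures like the 64% and 70% reported above.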
The method proposed was also used in a separate research project with a library that held 300 documents (Vegas-Fernández 2019). In practice, document reading takes up most of the time dedicated to the literature review in a research project, much more than retrieving and organizing documents. This paper proposes a feasible way to optimize knowledge extraction, setting aside, for now, a fully automatic information retrieval and extraction system, and identifying "concept definition" as the most relevant task.

9. CONCLUSIONS

Complex algorithms are not always the answer to efficient extraction of information from scholarly document databases, and sophisticated automatic systems do not seem to be the best fit for the researcher's needs. Any automated solution that requires manual training, supervision, and tuning is not worthwhile: too much time goes into those tasks, and it is quicker and more efficient to do the work by hand. The relevance of concept definition has frequently been underestimated; this paper proposes, and its results show, that proper concept definition is key to achieving outstanding knowledge extraction. The results of the analysis conducted with a scholarly document database confirm the suitability of the approach and the method explained here. This paper has presented a simple but efficient method that takes advantage of free, commonplace desktop tools. By following this method, it is easy to carry out a systematic literature review; to retrieve, filter, and organize results; and to extract information and transform it into knowledge. The conceptual basis is a semantics-oriented concept definition and a relative importance index (RII) to measure concept relevance in the literature studied.

Figure 8 Pareto diagram of concepts using their RII.
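The relative importance index can be sketched in code. A common formulation is RII = ΣW / (A·N), where W is the weight each document gives a concept, A the maximum possible weight, and N the number of documents; with binary presence weights it reduces to the share of documents addressing the concept. The exact weighting used in this study may differ, and the concept counts below are invented for illustration:

```python
# Relative importance index (RII) for concepts, ranked for a Pareto view.
# RII = sum(w_i) / (A * N): w_i = weight given by document i,
# A = maximum possible weight, N = number of documents.
# With 0/1 presence flags (A = 1) this is the share of documents
# in which the concept appears. All counts below are hypothetical.

def rii(weights, max_weight=1):
    n = len(weights)
    return sum(weights) / (max_weight * n) if n else 0.0

# concept -> per-document presence flags over a 10-document library
concept_hits = {
    "information extraction":  [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "NLP":                     [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "artificial intelligence": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
}

ranking = sorted(((c, rii(w)) for c, w in concept_hits.items()),
                 key=lambda cv: cv[1], reverse=True)
for concept, value in ranking:
    print(f"{concept:25s} RII = {value:.2f}")
```

Plotting concepts in descending RII order is what produces a Pareto diagram such as the one in Figure 8.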
The detailed explanation of the proposed procedure in four steps shows that most of the tasks require mental activity that automated systems cannot replace. The method proposed is intended for knowledge extraction from scholarly document databases, but it could also be applied in other settings, such as departmental document databases, whenever the library holds only a few hundred documents.

10. REFERENCES

Adrian, W. T., Leone, N., and Manna, M. (2015). "Ontology-driven information extraction." arXiv preprint arXiv:1512.06034.
Afantenos, S., Karkaletsis, V., and Stamatopoulos, P. (2005). "Summarization from medical documents: a survey." Artificial Intelligence in Medicine, 33(2), 157-177.
Ahmad, M. W., and Ansari, M. "A survey: soft computing in intelligent information retrieval systems." Proc., 2012 12th International Conference on Computational Science and Its Applications, IEEE, 26-34.
Al-Hroob, A., Imam, A. T., and Al-Heisa, R. (2018). "The use of artificial neural networks for extracting actions and actors from requirements document." Information and Software Technology, 101(2018), 1-15.
Alashwal, A. M., and Al-Sabahi, M. H. (2018). "Risk factors in construction projects during unrest period in Yemen." Journal of Construction in Developing Countries, 23(2), 43-62.
Allan, J., Aslam, J., Belkin, N., Buckley, C., Callan, J., Croft, B., Dumais, S., Fuhr, N., Harman, D., and Harper, D. J. "Challenges in information retrieval and language modeling: report of a workshop held at the center for intelligent information retrieval." Proc., ACM SIGIR Forum, ACM, New York, NY, USA, 31-47.
Ansari, A., Maknojia, M., and Shaikh, A. (2016). "Intelligent information extraction based on artificial neural network." International Journal in Foundations of Computer Science & Technology, 6(1).
Barde, B. V., and Bainwad, A. M. (2018). "An overview of topic modeling methods and tools."
Proc., 2017 International Conference on Intelligent Computing and Control Systems (ICICCS), IEEE, 745-750.
Bettany-Saltikov, J. (2012). How to do a systematic literature review in nursing: a step-by-step guide, McGraw-Hill Education (UK), Maidenhead, UK.
Boden, C., Löser, A., Nagel, C., and Pieper, S. (2012). "Fact-aware document retrieval for information extraction." Datenbank-Spektrum, 12(2), 89-100.
Buzan, T. (2004). Cómo crear mapas mentales, Ediciones Urano, Barcelona, Spain.
Chen, H., and Lynch, K. J. (1992). "Automatic construction of networks of concepts characterizing document databases." IEEE Transactions on Systems, Man, and Cybernetics, 22(5), 885-902.
Dezsenyi, C., Dobrowiecki, T. P., and Meszaros, T. (2007). "Adaptive information extraction from unstructured documents." International Journal of Intelligent Information and Database Systems, 1(2), 156-180.
Esposito, F., Ferilli, S., Basile, T. M. A., and Di Mauro, N. (2005). "Semantic-based access to digital document databases." Proc., International Symposium on Methodologies for Intelligent Systems, Springer, Berlin, Heidelberg, Germany, 373-381.
Fan, H., Xue, F., and Li, H. (2015). "Project-based as-needed information retrieval from unstructured AEC documents." Journal of Management in Engineering, 31(1), A4014012.
Gaizauskas, R., and Wilks, Y. (1998). "Information extraction: Beyond document retrieval." Journal of Documentation, 54(1), 70-105.
Grishman, R. (2019). "Twenty-five years of information extraction." Natural Language Engineering, 25(6), 677-692.
Gupta, P., and Gupta, V. (2012). "A survey of text question answering techniques." International Journal of Computer Applications, 53(4), 1-8.
Hassan, F. u., and Le, T. (2020). "Automated Requirements Identification from Construction Contract Documents Using Natural Language Processing." Journal of Legal Affairs and Dispute Resolution in Engineering and Construction, 12(2), 04520009.
Hassan, T., and Baumgartner, R. "Intelligent text extraction from pdf documents."
Proc., International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06), IEEE, 2-6.
Hassan, T., and Baumgartner, R. (2005b). Intelligent wrapping from PDF documents, CEUR Workshop Proceedings, Točná, Czech Republic.
Hobbs, J. R. (2002). "Information extraction from biomedical text." Journal of Biomedical Informatics, 35(4), 260-264.
Hu, X., Lin, T. Y., Song, I., Lin, X., Yoo, I., Lechner, M., and Song, M. "Ontology-based scalable and portable information extraction system to extract biological knowledge from huge collection of biomedical web documents." Proc., IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), IEEE, 77-83.
Inui, K., Abe, S., Hara, K., Morita, H., Sao, C., Eguchi, M., Sumida, A., Murakami, K., and Matsuyoshi, S. "Experience mining: Building a large-scale database of personal experiences and opinions from web documents." Proc., 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE, 314-321.
Jarkas, A. M., and Haupt, T. C. (2015). "Major construction risk factors considered by general contractors in Qatar." Journal of Engineering, Design and Technology, 13(1), 165-194.
Karol, S., and Mangat, V. (2013). "Evaluation of text document clustering approach based on particle swarm optimization." Open Computer Science, 3(2), 69-90.
Karthik, M., Marikkannan, M., and Kannan, A. "An intelligent system for semantic information retrieval information from textual web documents." Proc., International Workshop on Computational Forensics, Springer, Berlin, Heidelberg, Germany, 135-146.
Kasperiuniene, J., and Zydziunaite, V. (2019). "A systematic literature review on professional identity construction in social media." SAGE Open, 9(1), 2158244019828847.
Kim, T., and Chi, S. (2019).
"Accident case retrieval and analyses: using natural language processing in the construction industry." Journal of Construction Engineering and Management, 145(3), 04019004.
Koval, R., and Návrat, P. (2012). "Intelligent support for information retrieval of web documents." Computing and Informatics, 21(5), 509-528.
Lambrix, P., and Shahmehri, N. (2000). "Querying documents using content, structure and properties." Journal of Intelligent Information Systems, 15(3), 287-307.
Lee, R. "Automatic information extraction from documents: A tool for intelligence and law enforcement analysts." Proc., 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis, AAAI Press, Menlo Park, CA.
Li, J., Wang, H. J., and Bai, X. (2015). "An intelligent approach to data extraction and task identification for process mining." Information Systems Frontiers, 17(6), 1195-1208.
López-Robles, J.-R., Guallar, J., Otegi-Olaso, J.-R., and Gamboa-Rosales, N.-K. (2019). "Bibliometric and thematic analysis (2006-2017)." El profesional de la información, 28(4), e280417.
Lutsky, P. (2000). "Information extraction from documents for automating software testing." Artificial Intelligence in Engineering, 14(1), 63-69.
Malik, S. K., Prakash, N., and Rizvi, S. (2010). "Semantic annotation framework for intelligent information retrieval using KIM architecture." International Journal of Web & Semantic Technology (IJWesT), 1(4), 12-26.
Marinai, S. "Metadata extraction from PDF papers for digital library ingest." Proc., 2009 10th International Conference on Document Analysis and Recognition, IEEE, 251-255.
Matos, P. F., Lombardi, L. O., Pardo, T. A., Ciferri, C. D., Vieira, M. T., and Ciferri, R. R. (2010). "An environment for data analysis in biomedical domain: information extraction for decision support systems." Proc., International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Springer, Berlin, Heidelberg, Germany, 306-316.
Matsuo, Y., and Ishizuka, M. (2004). "Keyword extraction from a single document using word co-occurrence statistical information." International Journal on Artificial Intelligence Tools, 13(01), 157-169.
Milward, D., and Thomas, J. "From information retrieval to information extraction." Proc., ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, 85-97.
Mitra, M., and Chaudhuri, B. (2000). "Information retrieval from documents: A survey." Information Retrieval, 2(2-3), 141-163.
Nagalla, V., Dendukuri, S. C., and Asadi, S. S. (2018). "Analysis of risk assessment in construction of highway projects using relative importance index method." International Journal of Mechanical Engineering and Technology, 9(3), 1-6.
Nasar, Z., Jaffry, S. W., and Malik, M. K. (2018). "Information extraction from scientific articles: a survey." Scientometrics, 117(3), 1931-1990.
Nualart-Vilaplana, J., Pérez-Montoro, M., and Whitelaw, M. (2014). "Cómo dibujamos textos: Revisión de propuestas de visualización y exploración textual." El profesional de la información, 23(3), 221-235.
Oliveira, D. A. B., and Viana, M. P. (2018). "Fast CNN-based document layout analysis." Proc., IEEE International Conference on Computer Vision Workshops, IEEE Computer Society, 1173-1180.
Oro, E., and Ruffolo, M. "Xonto: An ontology-based system for semantic information extraction from pdf documents." Proc., 2008 20th IEEE International Conference on Tools with Artificial Intelligence, IEEE, 118-125.
Rahman, N. A., Soom, A. B. M., and Ismail, N. K. "Enhancing Latent Semantic Analysis by Embedding Tagging Algorithm in Retrieving Malay Text Documents." Proc., Asian Conference on Intelligent Information and Database Systems, Springer, 309-319.
Renault, B. Y., and Agumba, J. N. (2016). "Risk management in the construction industry: a new literature review." MATEC Web of Conferences, 66(2016), 0008.
Rizvi, S. T.
R., Mercier, D., Agne, S., Erkel, S., Dengel, A., and Ahmed, S. (2018). "Ontology-based Information Extraction from Technical Documents." Proc., ICAART (2), Science and Technology Publications, Lda, 493-500.
Rodríguez, A., Colomo, R., Gómez, J. M., Alor-Hernandez, G., Posada-Gomez, R., Juarez-Martinez, U., Gayo, J. E. L., and Vidyasankar, K. "A proposal for a semantic intelligent document repository architecture." Proc., 2009 Electronics, Robotics and Automotive Mechanics Conference (CERMA), IEEE, 69-75.
Rostami, A., Sommerville, J., Wong, I. L., and Lee, C. (2015). "Risk management implementation in small and medium enterprises in the UK construction industry." Engineering, Construction and Architectural Management, 22(1), 91-107.
Saik, O., Demenkov, P., Ivanisenko, T., Kolchanov, N., and Ivanisenko, V. (2017). "Development of methods for automatic extraction of knowledge from texts of scientific publications for the creation of a knowledge base Solanum TUBEROSUM." Agricultural Biology, 52(1), 1.
Sarwar, S. M., and Allan, J. "A Retrieval Approach for Information Extraction." Proc., 2019 ACM SIGIR International Conference on Theory of Information Retrieval, Association for Computing Machinery, 249-252.
Schalley, A. C. (2019). "Ontologies and ontological methods in linguistics." Language and Linguistics Compass, 13(11), e12356.
Seedah, D. P., and Leite, F. (2015). "Information Extraction for Freight-Related Natural Language Queries." Proc., Computing in Civil Engineering 2015, American Society of Civil Engineers, 427-435.
Seng, J.-L., and Lai, J. (2010). "An intelligent information segmentation approach to extract financial data for business valuation." Expert Systems with Applications, 37(9), 6515-6530.
Shrihari, R. C., and Desai, A. (2015). "A review on knowledge discovery using text classification techniques in text mining." International Journal of Computer Applications, 111(6).
Sirsat, S. R., Chavan, V., and Deshpande, S. P. (2014).
"Mining knowledge from text repositories using information extraction: A review." Sadhana - Academy Proceedings in Engineering Sciences, 39(1), 53-62.
Snyder, H. (2019). "Literature review as a research methodology: An overview and guidelines." Journal of Business Research, 104(2019), 333-339.
Song, D., Lau, R. Y., Bruza, P. D., Wong, K.-F., and Chen, D.-Y. (2007). "An intelligent information agent for document title classification and filtering in document-intensive domains." Decision Support Systems, 44(1), 251-265.
Srihari, R. K., Zhang, Z., and Rao, A. (2000). "Intelligent indexing and semantic retrieval of multimodal documents." Information Retrieval, 2(2-3), 245-275.
Tseng, F. S., and Chou, A. Y. (2006). "The concept of document warehousing for multi-dimensional modeling of textual-based business intelligence." Decision Support Systems, 42(2), 727-744.
Upadhyay, R., and Fujii, A. "Semantic knowledge extraction from research documents." Proc., 2016 Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE, 439-445.
Vegas-Fernández, F. (2019). "Factor de visibilidad. Nuevo indicador para la evaluación cuantitativa de riesgos." PhD thesis, Universidad Politécnica de Madrid, Madrid, Spain.
Vegas-Fernández, F., and Rodríguez López, F. (2019). "Risk management improvement drivers for effective risk-based decision-making." Journal of Business, Economics and Finance (JBEF), 8(4), 223-234.
Wang, Q., Qu, S. N., Du, T., and Zhang, M. J. "The Research and Application in Intelligent Document Retrieval Based on Text Quantification and Subject Mapping." Proc., Advanced Materials Research, Trans Tech Publications, 2561-2568.
Wolf, C., and Jolion, J.-M. (2004). "Extraction and recognition of artificial text in multimedia documents." Pattern Analysis & Applications, 6(4), 309-326.
Xia, N., Zou, P. X., Griffin, M. A., Wang, X., and Zhong, R. (2018).
"Towards integrating construction risk management and stakeholder management: A systematic literature review and future research agendas." International Journal of Project Management, 36(5), 701-715.
Xie, X., Fu, Y., Jin, H., Zhao, Y., and Cao, W. (2019). "A novel text mining approach for scholar information extraction from web content in Chinese." Future Generation Computer Systems.