INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, 14(1), 107-123, February 2019.

A Latent-Dirichlet-Allocation Based Extension for Domain Ontology of Enterprise’s Technological Innovation

Q. Zhang, S. Liu, D. Gong, Q. Tu

Qianqian Zhang, Shifeng Liu, Daqing Gong*, Qun Tu
School of Economics and Management
Beijing Jiaotong University, China
100044 No.3 Shangyuancun, Haidian, Beijing, China
15113121@bjtu.edu.cn, shfliu@bjtu.edu.cn
*Corresponding author: gongdq@bjtu.edu.cn
17113133@bjtu.edu.cn

Abstract: This paper proposes a method for automatically building a domain ontology of enterprise’s technological innovation from a plain text corpus based on Latent Dirichlet Allocation (LDA). The proposed method consists of four modules: 1) introducing the seed ontology for the domain of enterprise’s technological innovation, 2) using Natural Language Processing (NLP) techniques to preprocess the collected textual data, 3) mining domain-specific terms from document collections based on LDA, and 4) obtaining the relationships between the terms through defined rules. Experiments were carried out to demonstrate the effectiveness of the method, and the results indicate that many terms in the domain of enterprise’s technological innovation, and the semantic relations between them, are discovered. The proposed method is a process of continuous cycles and iterations: the obtained ontology can be fed back as the initial seed ontology, so that constant knowledge acquisition in the domain of enterprise’s technological innovation updates and improves the seed ontology.

Keywords: Latent Dirichlet Allocation (LDA), ontology extension, enterprise’s technological innovation, semantic web, text mining.

1 Introduction

With the pace of economic globalization accelerating significantly, the market has moved from the era of industrialization into the information age.
As market demand changes at a faster pace, market competition has become extremely fierce. In this context, technological innovation is increasingly becoming the inner motivation and main source of enterprise development. Evaluating the capability of enterprise technological innovation scientifically and efficiently is important for governments setting technological innovation policy, and for enterprises revising their technological innovation strategy and improving their technological innovation ability. The evaluation of enterprise’s technological innovation ability has drawn extensive attention from many scholars. Although much progress has been made in the theoretical research on enterprise’s technological innovation [13, 33], many problems remain in evaluation mechanisms and methodology, namely the considerable subjectivity of evaluation indexes, strong dependence on data declared by the evaluated enterprise, low evaluation accuracy, poor consistency of evaluation results, etc. Enterprises produce large amounts of textual information in the technological innovation process, including technological innovation activity reports, meeting minutes, annual reports and patent files. Hence, enterprises need not only to make use of these documents but also to mine and discover valuable hidden knowledge from large collections of data. It is also a pressing problem to transform massive textual data into knowledge that can serve the technological innovation of enterprises and support their decision-making. However, using text mining and machine learning techniques to analyze the massive textual information generated by enterprise’s technological innovation, and thereby determining the enterprise’s technological innovation ability from objective data, is a field that has not yet been explored.

Copyright ©2019 CC BY-NC
In this paper, we deal with three major problems:

• Is it possible to discover concepts from a large textual corpus in the domain of enterprise’s technological innovation?
• Is it possible to build rules for semantic relationship recognition to form the subsumption hierarchy of the enterprise’s technological innovation ontology?
• Is it possible to extend the enterprise’s technological innovation domain ontology automatically?

To address these problems, this paper presents an approach to extract core concepts from large textual data and proposes a new method of building rules for semantic relationship recognition based on the LDA algorithm. The rest of the paper is organized as follows. Section 2 provides background knowledge and a review of related literature. Section 3 explains the proposed method, while Section 4 presents the experimental results. Section 5 concludes the paper.

2 Background knowledge and related works

2.1 Technological innovation capability

Technological Innovation Capability (TIC) has become the key for enterprises to improve productivity and maintain competitiveness in constantly fluctuating environments. However, a definition of TIC is hard to agree upon, since technological innovation involves numerous organizational functions and resource integration among various departments [26]. The concept of innovation originates from the innovation theory proposed by Schumpeter. On this basis, Burgelman et al. [5] put forward that TIC can be defined as a series of characteristics of an organization that facilitate and support an innovation strategy. From differing perspectives, many scholars have proposed various components of the TIC of a firm [22, 30]. Therefore, the measurement of TIC is difficult and complicated, since the perceived objectives and criteria for TIC differ. Tsai et al. [24] established an evaluation model for the TIC of high-tech industries based on the AHP method.
Wang and Chang [25] proposed a model for diagnosing the value of TIC in enterprises and established an evaluation system using the AHP method. Wang et al. [26] evaluated and analyzed TIC by combining fuzzy evaluation with non-additive fuzzy evaluation. Deng et al. [12] established a TIC evaluation system by factor analysis and used the fuzzy synthetic assessment method to evaluate TIC. Guan et al. [13] developed an innovation measurement framework based on the traditional DEA method. Looking at the literature on the measurement of TIC [8, 32], few studies avoid relying on subjective judgement, previous experience and uncertain assessments by experts.

2.2 Ontologies construction and extension

In the last decade, many scholars have done a great deal of research on ontology definition, construction, extension and application. Ontologies, defined as "an explicit specification of shared conceptualization" [14], provide the key to machine-processable data on the Semantic Web and are fundamental components for sharing, reusing and reasoning over knowledge domains [1]. Although there has been great progress in knowledge acquisition and ontology construction, current ontology construction methods still rely heavily on manual parsing and existing knowledge bases. Ontology learning and extension is a costly, time-consuming and error-prone task when done manually. With the constant emergence of new domain knowledge, the automatic updating of domain ontologies faces new challenges. Many researchers have worked on automatic ontology construction and enrichment in recent years. In previous work, machine learning and statistical analysis methods, which have great advantages in precision and recall, have been proposed to solve this problem [9]. For instance, Jeroen et al.
[10] proposed the subsumption method and a hierarchical clustering algorithm to arrange domain terms hierarchically and compared the performance of the two methods. The studies in [18] and [23] used a fuzzy mechanism to extract domain concepts and generated the domain ontology through fuzzy conceptual clustering. Khan and Luo [17] presented a modified self-organizing tree algorithm (SOTA) which performs better than hierarchical agglomerative clustering (HAC) on automatic ontology construction. Gilles et al. [1] put forward the Mo’k workbench, a framework that uses agglomerative clustering techniques to generate concept hierarchies from parsed corpora. Cimiano and Völker [6] presented Text2Onto, which implements a variety of algorithms and techniques for ontology learning. However, most of the existing methods require a supervised training corpus of a certain scale as the learning object, and their results seldom take semantics into account, which makes it difficult to recognize the relationships between terms in a domain ontology. Although automatic ontology construction and extension has made some progress, there are still problems in this field. For example, the non-taxonomic relationships among terms are often omitted when constructing the hierarchical relations of the ontology. Besides, the parameter settings of the models and the complex computation involved cause a heavy computing burden and lead to overfitting, which limits their application.

2.3 LDA topic model

Research on latent topic discovery for hierarchical relation learning has gained much attention in recent years. Latent topic discovery was invented to overcome the bottleneck of the bag-of-words processing model in the information retrieval area, trying to advance text processing technology from pattern matching to semantic calculation [23].
Among the research on latent topic discovery, an early work in the literature is Latent Semantic Indexing (LSI), a retrieval technique that learns latent topics by performing a matrix decomposition (SVD) on the term-document matrix [31]. Through this technique, latent topics are revealed which are actually distributions over the words of the term space of the corpus [10]. For example, the work in [3] uses LSI to identify relationships among entities in large collections of text. The author in [4] also used LSI to discover new information relevant to a given topic in large textual databases. Although LSI based on SVD had some early success in latent topic discovery and relationship identification, it lacks a rigorous mathematical and statistical basis, and the SVD decomposition is time-consuming. Probabilistic Latent Semantic Indexing (PLSI) was proposed as an extension of LSI; it associates a latent context variable with each word occurrence and can deal with synonymous and polysemous words. The author in [16] noted that PLSI can be regarded as an unsupervised learning method for text learning tasks. The work in [15] also used PLSI to represent sentences and queries as probability distributions over latent topics to solve the multi-document summarization problem. Compared with LSI and PLSI, Latent Dirichlet Allocation (LDA) is more advantageous, since the LDA model can avoid overfitting and large sets of parameters. The LDA model, proposed by David Blei et al. [2], is a statistical topic model that can analyze hidden topics in large-scale data. Ontology learning using the LDA model is a relatively new research approach. Elias et al. [34, 35] used the LDA model to discover topics that represent ontology concepts and compared the high-probability terms of the topics to arrange the concepts in a subsumption hierarchy.
However, it cannot infer subsumption relations in the case where a topic subsumes only one other topic. Yeh and Yang [29] developed an automatic domain ontology construction method for historical documents. The LDA model was used to extract latent topics from raw textual Chinese Recorder data, and basic cosine similarity with hierarchical agglomerative clustering was used to cluster the topics, but the relationships between the topics cannot be defined since the clustered latent topics form a hierarchical tree structure. Francesco et al. [7] presented an automatic terminological ontology learning system in which the common hypernyms between the aggregate root node and the aggregated words are determined through the LDA model, and the semantically similar root node is then added to the ontology. However, measurement on large data sets may cause a heavy computing burden. Ni et al. [20] also used the LDA model to select domain terms and discovered the hierarchical relations among them through word association analysis. Raghuveer [21] used the LDA model to obtain topics from legal documents and clustered legal judgments by cosine similarity.

3 The proposed method

This paper combines ontology techniques with the LDA topic model, using the initial seed ontology to guide the LDA model in obtaining concepts in the field of enterprise technological innovation. New concepts are added to the initial domain ontology by defined rules, realizing the iterative updating and improvement of the ontology. The framework of enterprise’s technological innovation domain concept acquisition contains the following four modules:

• The module of seed ontology introducing. A seed ontology is constructed to guide concept acquisition for the enterprise’s technological innovation domain. The basic concepts and relationships of the seed ontology in the domain of enterprise technological innovation are mainly extracted from the Chinese Classified Thesaurus.
Protégé 4.3 was used to visualize the construction of the seed ontology. More details are given in Section 4.1.

• The module of text preprocessing. This is the process of converting a text into individual words or sequences of words, in which Natural Language Processing (NLP) techniques, including word segmentation, Part-of-Speech (POS) tagging and stop-word filtering, are used to preprocess the collected Chinese textual documents. To guarantee semantic accuracy after word segmentation, an entropy-based method was adopted to merge words [27, 28]. Two words are merged only if they are both adjacent and frequently co-occurring, calculated as follows:

$$E(w_{m-1}, w_m) = \frac{p(w_{m-1} w_m)}{\min\big(p(w_{m-1}),\, p(w_m)\big)} \tag{1}$$

where $p(w_m)$ denotes the frequency of word $w_m$ in the documents and $p(w_{m-1} w_m)$ denotes the frequency with which $w_{m-1}$ and $w_m$ occur consecutively in the documents.

• The module of mining domain specific terms. LDA (Latent Dirichlet Allocation) is a three-level hierarchical Bayesian model proposed by Blei et al. [2]. It assumes that each document in a corpus is represented as a random mixture over latent topics, where each topic is characterized by a distribution over all the words. LDA is constructed for documents under the "bag-of-words" assumption, which uses the statistical information of words to represent text in a vector space and explores the probabilistic relationships between words and text. The LDA model used in this paper is described below. LDA takes the corpus D produced by the text preprocessing module as input and, after training, outputs the topic distribution of each document and the word distribution of each topic. LDA generates the words in a two-stage process: words are generated from topics and topics are generated by documents. The graphical model of LDA is shown in Fig. 1.
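The two-stage process described above (a topic is drawn for each word position from the document's topic distribution, then a word is drawn from that topic's word distribution) can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the function names, the toy vocabulary and the seed are assumptions made for illustration, and the document lengths are passed in directly rather than drawn from a Poisson distribution.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn according to the probability vector probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, num_topics, doc_lengths, alpha, beta, seed=0):
    """Generate (topic, word) pairs for each document (hypothetical helper)."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a symmetric Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for length in doc_lengths:
        # One topic distribution per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(length):
            z = sample_categorical(theta, rng)   # topic of this word position
            w = sample_categorical(phi[z], rng)  # word index given the topic
            doc.append((z, vocab[w]))
        corpus.append(doc)
    return corpus
```

Calling `generate_corpus(["patent", "expert", "profit"], num_topics=2, doc_lengths=[5, 7], alpha=0.1, beta=0.3)` returns two documents of five and seven (topic, word) pairs. Training LDA is the inverse problem: recovering the topic and word distributions from observed words alone.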
The terms of LDA are defined as follows. A document is a sequence of N words denoted by $\vec{w} = (w_1, w_2, \cdots, w_N)$, where $w_n$ is the nth word in the sequence, and a corpus is a collection of M documents denoted by $D = \{d_1, d_2, \cdots, d_M\}$.

Figure 1: Graphical model representation of LDA

$\alpha$ and $\beta$ are Dirichlet prior hyperparameters. All the words in the M documents are clustered into K topics. For each topic $k \in \{1, 2, \cdots, K\}$, sample a word distribution $\varphi_k \sim \mathrm{Dirichlet}(\beta)$. Then, for each document m:

– Choose the document length $N_m \sim \mathrm{Poisson}(\xi)$
– Choose a topic distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$
– For each word $w_{m,n}$ in the mth document:
  ∗ Choose a topic for the word, $z_{m,n} \sim \mathrm{Multinomial}(\theta_m)$
  ∗ Choose a word $w_{m,n} \sim \mathrm{Multinomial}(\varphi_{z_{m,n}})$

Since the processes that generate the topics of the M documents are independent of one another, we have M conjugate structures, and the probability of the topics in the corpus is:

$$p(\vec{z} \mid \vec{\alpha}) = \prod_{m=1}^{M} p(\vec{z}_m \mid \vec{\alpha}) = \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{2}$$

where $\vec{n}_m$ is the vector of topic counts in document m and $\Delta(\cdot)$ is the normalizing constant of the Dirichlet distribution. Likewise, the processes that generate the words of the K topics are independent of one another, so we have K conjugate structures, and the probability of the words in the corpus is:

$$p(\vec{w} \mid \vec{z}, \vec{\beta}) = \prod_{k=1}^{K} p(\vec{w}_k \mid \vec{z}_k, \vec{\beta}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \tag{3}$$

where $\vec{n}_k$ is the vector of word counts assigned to topic k. Thus, the joint probability of words and topics specified by the LDA model is:

$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \cdot \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{4}$$

In this paper, the LDA topic model is trained on the candidate term set obtained by the text preprocessing module to obtain the word probability distribution of the domain concepts (topics), as shown in Fig. 2.

Figure 2: Word probability distribution of topics (a word-by-topic matrix), where $p_{w_n k}$ represents the probability of word n in topic k.

• The module of domain ontology updating. This module is the key point and main difficulty of this paper.
Each concept in the initial enterprise technological innovation ontology is taken as a document and fed into the LDA model trained in module 3). This gives the topic probability distribution of the documents, as shown in Fig. 3, where the corpus is a collection of M ontology concepts denoted by $C = (c_1, c_2, \cdots, c_{M-1}, c_M)$ and $p_{z_k m}$ represents the probability of topic k in concept (document) m.

Figure 3: Topic probability distribution of documents (a topic-by-concept matrix)

According to the LDA algorithm, we can then obtain the term probability of documents, namely the probability of words given the concepts in the initial ontology, denoted as $p(w_n \mid c_m)$:

$$p(w_n \mid c_m) = \sum_{j=1}^{K} p(w_n \mid z = j)\, p(z = j \mid c_m) \tag{5}$$

The relevant rules are then used to judge the relationship between the topics generated by the LDA model and the concepts in the initial domain ontology. When $p(w_n \mid c_m)$ is greater than the threshold value TH and the word $w_n$ is not in the list $C = (c_1, c_2, \cdots, c_{M-1}, c_M)$, the term $w_n$ is an associated term of $c_m$:

$$p(w_n \mid c_m) > TH \tag{6}$$

The rules for semantic relationship recognition are defined using the following weight:

$$W(w_n, c_m) = \frac{p(w_n \mid c_m)}{p(z = j \mid c_m) + p(w_n \mid c_m)} = \frac{\sum_{j=1}^{K} p(w_n \mid z = j)\, p(z = j \mid c_m)}{p(z = j \mid c_m) + \sum_{j=1}^{K} p(w_n \mid z = j)\, p(z = j \mid c_m)} \tag{7}$$

• Rule 1: Rules for synonymy relations recognition. If $W(w_n, c_m) \geq 0.01$, the semantic relationship between the word and the concept is equivalence, namely, the related term extracted by LDA is equivalent to the existing concept.
• Rule 2: Rules for hyponymy relations recognition. When Rule 1 is not satisfied, if $W(w_n, c_m) \geq 0.004$, the word includes the concept, namely, the related term extracted by LDA is a superclass of the existing concept, and the relationship is "is-a" or "subclass".
• Rule 3: Rules for correlation recognition.
When neither Rule 1 nor Rule 2 is satisfied, the relationship between the existing concept and the related term is recognized as "related", or the specific semantic relationship is identified manually with the help of an external knowledge base.

Based on the above rules, the semantic relations between the existing concepts and their related terms are identified, and the obtained related terms and semantic relations are added to the original ontology O, which is thereby updated to $O_i$.

4 Experiments and result

4.1 Ontology acquisition

Enterprise Ontology and TOVE (Toronto Virtual Enterprise Ontology) are the most popular ontology-based enterprise modeling methodologies. Both projects point out the common key influencing factors in the process of enterprise ontology construction, including resources, organization, strategy, market and activity. In this paper, these five factors are also taken as the first-level classes of the enterprise’s technological innovation ontology. The Chinese Classified Thesaurus has a clear semantic structure, which makes it well suited to the extraction of concepts and of relationships between concepts. The thesaurus is transformed into an ontology through further concept analysis and semantic relationship adjustment of the words in category F27 (Enterprise Economy) of the Chinese Classified Thesaurus. Five concepts are extracted from the thesaurus: Innovation resources, Marketing innovation, Strategic innovation, Organizational innovation and Innovation activities. The nested composite view provides a representation of the interrelations between the first-level classes in the entire ontology structure. It is convenient for considering whether the constructed domain ontology meets actual needs. The nested composite view of the enterprise’s technological innovation domain is shown in Fig. 4.
The relationships between domain ontology concepts include hyponymy relations and complex non-hierarchical relationships for specific applications. Fig. 5 shows the relationships between domain ontology concepts, taking Strategic innovation as the center, and reflects the complex relationships between concepts. The ontology of enterprise’s technological innovation is a prototype, in which many concepts and relationships are still insufficient and need to be continuously improved.

Figure 4: Nested composite view of enterprise technological innovation domain

4.2 Textual data collection

The textual data of enterprise technological innovation are collected from two sources. One is the internal information generated from daily production activities such as internal R&D, innovation activities, etc. The other is the data generated when the enterprise interacts with external customers and partners through social networks, mobile applications, etc. In total, 863 sets of valid data were obtained, which include 413 enterprise technology centers.

Figure 5: Visualization of relationship between concepts

4.3 Text preprocessing

Figure 6: Process of text preprocessing (segmentation of the corpus with ICTCLAS and a domain-specific dictionary refined by entropy and manual screening, POS selection, and stop-word filtering with a stop-word dictionary)

Chinese segmentation

Firstly, a domain-specific dictionary for the field of enterprise’s technological innovation was constructed from widely collected materials such as the cell thesaurus, and the dictionary was imported into the ICTCLAS segmentation system [32] developed by the Chinese Academy of Sciences. Secondly, the segmentation result can be problematic because a Chinese phrase may be wrongly divided into many words.
For example, "enterprise’s technological innovation" was divided into three fine-grained words: "enterprise", "technological" and "innovation". The entropy-based method of equation (1) was adopted to merge such words. Two words that satisfy the conditions are combined into a new phrase, which is added to the domain-specific dictionary after manual screening. The source documents are then segmented again, and the process is iterated.

POS selection

The documents of enterprise’s technological innovation are synthetic texts in which nouns are the most representative carriers of the semantic information of the source documents. Hence, nouns and noun-like words, such as verbs with a noun function and adjectives with a noun function, are selected as the research object.

Elimination of stop-words

Useless words selected from the domain of enterprise’s technological innovation are used to build a stop-word dictionary. The stop words are then filtered out of the documents processed by the above two steps. Eliminating stop words considerably reduces the size of the indexing structure.

4.4 Mining domain terms from text corpus based on LDA

• Terms selection. According to the word frequency over the whole corpus, words whose frequency lies in [50, 1000] were selected as terms to represent each document in the vector space model.
• Optimal number of topics. The perplexity index is adopted for selecting the optimal number of topics. Perplexity is an effective measure of the generalization ability of a model; a lower perplexity indicates better generalization performance. The perplexity is defined as follows:

$$\mathrm{perplexity}(w_n \mid c_m) = \exp\left(-\frac{\sum \log p(w_n \mid c_m)}{N}\right) \tag{8}$$

where $p(w_n \mid c_m)$ is the probability of each word in the candidate term set and N is the number of words. The perplexity of all documents under different numbers of topics is shown in Fig. 7. The 160-topic model has the lowest perplexity score.
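As a minimal sketch of how equation (8) can be computed and used to choose the number of topics, the following Python snippet assumes that the per-word probabilities have already been obtained from a trained model. The function names are hypothetical, and the snippet is an illustration rather than the code used in the experiments.

```python
import math


def perplexity(word_probs):
    """Equation (8): exp of the negative mean log-probability of the N words."""
    if not word_probs:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def best_topic_number(perplexity_by_k):
    """Return the number of topics whose model has the lowest perplexity.

    perplexity_by_k maps a candidate topic number to the perplexity
    measured for the model trained with that many topics.
    """
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

For instance, four words that each have probability 0.25 give a perplexity of 4, which matches the intuition that the model is as uncertain as a uniform choice among four words.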
Hence, the optimal number of topics K = 160 is selected for the whole corpus by perplexity analysis. The smoothing parameters α and β were fixed at 0.1 and 0.3, respectively. The threshold TH was set to 0.001.

Figure 7: Perplexity result on enterprise technological innovation corpora for LDA model

• When the number of topics is 160, LDA topic modelling is carried out to obtain the distribution of terms, that is, each topic comprises a series of related words. The top terms of each topic are ordered by probability and presented in Table 1, in which only the first 10 topics with high probability of topic distribution are shown. Fig. 8 shows the resulting graph visualization of the LDA model for the top terms of topics.

Table 1: The distribution probability of topics and words when topic K=160

| Topic 5 | P(wn \| K=160) | Topic 13 | P(wn \| K=160) | Topic 23 | P(wn \| K=160) |
|---|---|---|---|---|---|
| Technology | 0.001665396 | Expert | 0.0021303003 | Patent | 0.0095653171 |
| Material | 0.0014465271 | Doctor | 0.0009977445 | Name | 0.0047526103 |
| Technique | 0.0012552338 | Senior engineer | 0.0007695822 | Number | 0.0042560613 |
| New Material | 0.0007991531 | Counselor | 0.0006120452 | Technology | 0.0031178757 |
| Product | 0.0005905334 | Master | 0.0005237754 | Information | 0.0028379832 |
| Technological innovation | 0.0005886399 | Bachelor | 0.0005183982 | Type | 0.0027637654 |
| High-performance | 0.0005476442 | Post-doctoral | 0.0003545205 | Invent | 0.0038754337 |
| Precision | 0.0005436358 | Degree | 0.0002742724 | Copyright | 0.0016382793 |
| New-technology | 0.0004137853 | Professor | 0.0002726482 | Authorized-patent | 0.0007759799 |
| New-product | 0.0003711188 | College | 0.0002557352 | Conservation | 0.0006379388 |
| Stability | 0.0003021399 | Associate-professor | 0.0001320247 | Authorization | 0.0002794288 |
| Practical | 0.0002943092 | Academic | 0.0001297645 | Intellectual-property | 0.0002544165 |

| Topic 27 | P(wn \| K=160) | Topic 30 | P(wn \| K=160) | Topic 39 | P(wn \| K=160) |
|---|---|---|---|---|---|
| Project | 0.0008273301 | Enterprise | 0.0010061373 | New-product | 0.0008340621 |
| Types | 0.0008237544 | Name | 0.0009097208 | New-techniques | 0.0008340621 |
| Invisible-asset | 0.0004237544 | Development Organization | 0.0008218773 | Name | 0.0004272026 |
| Fix asset | 0.0004223029 | Contact-telephone | 0.0006368324 | Market Occupancy | 0.0004272025 |
| Equipment | 0.0004208453 | Organization | 0.0004940051 | Profit | 0.0004272025 |
| Facility | 0.0003637534 | Company | 0.0004912549 | Period | 0.0004272025 |
| Total-amount | 0.0003230094 | Department | 0.0004209615 | Sale-quota | 0.0004263008 |
| Quantity | 0.0003034324 | Laboratory | 0.0004209615 | Sales-volume | 0.0004262023 |
| Cost | 0.0002784593 | Contact person | 0.0004209615 | Competitive | 0.0004262010 |
| Fund | 0.0002764534 | Research-institute | 0.0004209615 | Economic-benefit | 0.0004260232 |
| Amount | 0.0002230895 | Contact details | 0.0004209615 | Popularization | 0.0003037646 |
| Instrument | 0.0002234943 | Information | 0.0003026468 | Technical management | 0.0003037564 |

Figure 8: Graph of LDA for top terms of topics (Topics 5, 13, 23, 27, 30 and 39)

4.5 Learning hierarchical relations among terms

The trained LDA model is used to perform inference on each concept in the initial ontology, taking each concept (or word) as a document to calculate the topic probability of the
document. The semantic relations between the existing concepts and their related terms are then identified, and the related terms are added as domain ontology concepts at the appropriate positions of the existing ontology, which completes one update of the domain ontology. Table 2 shows the results of related term extraction and relation recognition.

Table 2: Related terms extraction and relations recognition

| Existing concepts | Topic | P(cm \| z=j) | Related terms | Weights | Applicable rules | Semantic relations |
|---|---|---|---|---|---|---|
| Profit Management | 79 | 0.438621 | Innovation Resources | 0.00413586 | (2) | Subclass |
| | | | Total Amount | 0.00624005 | (2) | Subclass |
| Technical Information | 139 | 0.388462 | Material | 0.00651796 | (2) | Subclass |
| | | | Painting | 0.00331878 | (3) | Related |
| | | | Alloy | 0.00243573 | (3) | Related |
| Visible Asset | 27 | 0.236543 | Fixed asset | 0.01342533 | (1) | Equivalent |
| | | | Equipment | 0.00523433 | (2) | Subclass |
| | | | Instrument | 0.00243234 | (2) | Subclass |
| Technical-Quality | 5 | 0.388462 | Precision | 0.00257653 | (3) | Related |
| | | | New Technology | 0.00323643 | (3) | Related |
| High-tech Product | 136 | 0.446243 | Strategy Innovation | 0.00276406 | (3) | Related |
| | | | Technological Innovation | 0.00143524 | (3) | Related |
| Product Innovation | 23 | 0.237643 | Patent | 0.00332763 | (3) | Related |
| | | | Brand | 0.00236232 | (3) | Related |
| | | | Copyright | 0.00323422 | (3) | Related |
| | | | New Product | 0.00332542 | (3) | Related |
| Staff Management | 13 | 0.376432 | Expert | 0.00335476 | (3) | Related |
| | | | Doctor | 0.003276543 | (3) | Related |
| | | | Degree | 0.002387432 | (3) | Related |
| | | | Senior engineer | 0.003723423 | (3) | Related |
| Wage management | 63 | 0.3412663 | Wage | 0.01472652 | (1) | Equivalent |
| | | | Subsidy | 0.00234653 | (3) | Related |
| | | | Bonus | 0.00334523 | (3) | Related |
| | | | Insurance | 0.00343263 | (3) | Related |

Figure 9: Parts of produced enterprise technological innovation domain ontology

Fig. 9 shows part of the produced enterprise’s technological innovation domain ontology. The blue dots represent the original terms of the initial domain ontology and the red dots stand for the newly produced terms. The original relations among entities in the ontology are shown with solid lines, and the dashed lines represent the new relations. About 163 new terms in total were added to the enterprise technological innovation domain ontology; the figure shows only part of the result due to space limitations.

In the literature on ontology evaluation, there are two approaches to measuring an ontology: manual evaluation by human experts and gold standard-based approaches [11]. The first approach presents the learned ontology to one or more human experts, who judge how far the extracted information is correct. The second approach compares the learned ontology with a previously created gold ontology; examples of this kind of evaluation can be found in papers like [34]. The degree of matching between the learned ontology and the gold ontology determines the precision of the learned ontology.
The evaluation of ontologies produced by an automated learning procedure is still an open field of research. Because enterprise’s technological innovation is a newly developing academic field, no generally acknowledged ontology has yet been formed for it. Manual evaluation by human experts was therefore the most suitable approach available. Five groups, each containing 20 terms and relations, were chosen at random from the updated enterprise’s technological innovation domain ontology. Precision was defined, as in Eq. (9), as the ratio between the number of terms and relationships judged correct by the human experts and the total number of terms and relationships in the ontology. The result of the accuracy test, based on validation of the terms and relations with domain experts, is shown in Table 3.

$$\text{precision} = \frac{\text{right terms and relationships}}{\text{total terms and relationships in ontology}} \qquad (9)$$

Table 3: The accurate rate of the concepts in enterprise technological innovation domain

| Group | Number of terms and relations | Number judged correct | Precision |
|---|---|---|---|
| Group 1 | 20 | 19 | 95% |
| Group 2 | 20 | 19 | 95% |
| Group 3 | 20 | 18 | 90% |
| Group 4 | 20 | 17 | 85% |
| Group 5 | 20 | 19 | 95% |

The average accuracy rate over the five groups is (95 + 95 + 90 + 85 + 95)/5 = 92%, which is the same level of precision as that of traditional ontology construction methods such as OntoLearn and Text2Onto. The semantic content and relationships in the produced ontology are largely correct. The proposed automatic ontology extension method reduces the manual labor of ontology updating and addresses the problems of automatic domain ontology acquisition and dynamic maintenance.

5 Conclusion and future work

This paper presented an automatic ontology extension method for the domain of enterprise’s technological innovation.
The main contributions of this paper are as follows. Firstly, this paper proposes an ontology-based LDA topic model for concept extraction and applies it to the realm of enterprise technological innovation; the model not only discovers concepts from a large textual corpus but also provides data support for ontology construction. Secondly, this paper takes a large amount of enterprise technological innovation information in unstructured texts as the data source and proposes a method of building rules for semantic relationship recognition based on the LDA topic probability distribution, thereby realizing automated domain ontology updating based on the LDA topic model. Finally, the experimental results demonstrate the efficiency and validity of the proposed method.

The method focuses on discovering domain terms via the latent topics found by the LDA algorithm from a plain text corpus and on recognizing the semantic relations among domain terms based on word association analysis. The proposed method is a continuous, iterative process: the domain ontology of enterprise’s technological innovation is updated and refined automatically through constant knowledge acquisition in the domain. The paper introduces the ontology on the basis of the LDA topic model, and the ontology is extended by the obtained related topics. The proposed method is thus an improvement over the LDA algorithm used alone.

Future work needs to address several problems. Firstly, the proposed method should be improved to achieve better performance, and automatic evaluation approaches for thesaurus construction methods should be explored further. Secondly, the constructed enterprise technological innovation ontology should be combined with text mining methods to construct an evaluation mechanism for enterprise’s technological innovation.
Funding

This paper is supported by the Fundamental Research Funds for the Central Universities (2018YJS051, B18RC00070) and Beijing Social Science Funds (18JDGLA018).

Bibliography

[1] Bisson, G.; Nédellec, C.; Canamero, D. (2000); Designing Clustering Methods for Ontology Building: The Mo’K Workbench, ECAI Workshop on Ontology Learning, 31, 2000.
[2] Blei, D.M.; Ng, A.Y.; Jordan, M.I. (2003); Latent Dirichlet allocation, Journal of Machine Learning Research, 3(Jan), 993–1022, 2003.
[3] Bradford, R.B. (2006); Relationship discovery in large text collections using latent semantic indexing, Proceedings of the Fourth Workshop on Link Analysis, Counterterrorism, and Security, 2006.
[4] Bradford, R.B. (2005); Efficient discovery of new information in large text databases, International Conference on Intelligence and Security Informatics, 374–380, 2005.
[5] Burgelman, R.A.; Maidique, M.A.; Wheelwright, S.C. (1996); Strategic Management of Technology and Innovation, Chicago, IL: Irwin, 1996.
[6] Cimiano, P.; Völker, J. (2005); Text2Onto, International Conference on Application of Natural Language to Information Systems, 227–238, 2005.
[7] Colace, F.; De Santo, M.; Greco, L.; Amato, F.; Moscato, V.; Picariello, A. (2014); Terminological ontology learning and population using latent Dirichlet allocation, Journal of Visual Languages & Computing, 25(6), 818–826, 2014.
[8] Dai, Y.; Wu, W.; Zhou, H.B.; Zhang, J.; Ma, F.Y. (2018); Numerical simulation and optimization of oil jet lubrication for rotorcraft meshing gears, International Journal of Simulation Modelling, 17(2), 318–326, 2018.
[9] Dai, Y.; Zhu, X.; Zhou, H.; Mao, Z.; Wu, W. (2018); Trajectory tracking control for seafloor tracked vehicle by adaptive neural-fuzzy inference system algorithm, International Journal of Computers, Communications & Control, 13(4), 465–476, 2018.
[10] De Knijff, J.; Frasincar, F.; Hogenboom, F. (2013); Domain taxonomy learning from text: The subsumption method versus hierarchical clustering, Data & Knowledge Engineering, 83, 54–69, 2013.
[11] Dellschaft, K.; Staab, S. (2008); Strategies for the evaluation of ontology learning, Ontology Learning and Population, 167, 253–272, 2008.
[12] Deng, L.; Wang, X.; Lin, Y.; He, F.Z. (2005); Model of Multiple Fuzzy Synthetical Evaluation for Enterprise Technology Innovation, Journal of Chongqing University (Natural Science Edition), 7, 004, 2005.
[13] Guan, J.C.; Yam, R.C.; Mok, C.K.; Ma, N. (2006); A study of the relationship between competitiveness and technological innovation capability based on DEA models, European Journal of Operational Research, 170(3), 971–986, 2006.
[14] Guarino, N.; Poli, R. (1993); Toward principles for the design of ontologies used for knowledge sharing, In Formal Ontology in Conceptual Analysis and Knowledge Representation, Kluwer Academic Publishers, in press. Substantial revision of paper presented at the International Workshop on Formal Ontology, 1993.
[15] Hennig, L. (2009); Topic-based multi-document summarization with probabilistic latent semantic analysis, Proceedings of the International Conference RANLP-2009, 144–149, 2009.
[16] Hofmann, T. (2001); Unsupervised learning by probabilistic latent semantic analysis, Machine Learning, 42(1-2), 177–196, 2001.
[17] Khan, L.; Luo, F. (2002); Ontology construction for information selection, Proceeding of Tools with Artificial Intelligence, 122–127, 2002.
[18] Lee, C.S.; Kao, Y.F.; Kuo, Y.H.; Wang, M.H. (2007); Automated ontology construction for unstructured text documents, Data & Knowledge Engineering, 60(3), 547–566, 2007.
[19] Liu, Q.; Zhang, H.; Yu, H.; Cheng, X. (2004); Chinese lexical analysis using cascaded hidden Markov model, Journal of Computer Research and Development, 41(8), 1421–1429, 2004.
[20] Ni, N.; Liu, K.; Li, Y. (2011); An automatic multi-domain thesauri construction method based on LDA, 2011 10th International Conference on Machine Learning and Applications Workshops, 235–240, 2011.
[21] Raghuveer, K. (2012); Legal documents clustering using latent Dirichlet allocation, International Journal of Applied Information Systems, 2(1), 34–37, 2012.
[22] Saunila, M.; Ukko, J. (2012); A conceptual framework for the measurement of innovation capability and its effects, Baltic Journal of Management, 7(4), 355–375, 2012.
[23] Tho, Q.T.; Hui, S.C.; Fong, A.C.M.; Cao, T.H. (2006); Automatic fuzzy ontology generation for semantic web, IEEE Transactions on Knowledge and Data Engineering, 18(6), 842–856, 2006.
[24] Tsai, M.T.; Chuang, S.S.; Hsieh, W.P. (2008); Using Analytic Hierarchy Process to Evaluate Organizational Innovativeness in High-Tech Industry, Decision Sciences Institute 2008 Annual Meeting (DSI), 1231–1236, 2008.
[25] Wang, T.J.; Chang, L. (2011); The development of the enterprise innovation value diagnosis system with the use of systems engineering, System Science and Engineering (ICSSE), 2011 International Conference on IEEE, 373–378, 2011.
[26] Wang, C.; Lu, I.; Chen, C. (2008); Evaluating firm technological innovation capability under uncertainty, Technovation, 28(6), 349–363, 2008.
[27] Wei, W.; Guo, C.; Chen, J.; Tang, L.; Sun, L. (2017); CCODM: conditional co-occurrence degree matrix document representation method, Soft Computing, 1–17, 2017.
[28] Wei, W.; Guo, C.; Chen, J.; Zhang, Z. (2017); Textual topic evolution analysis based on term co-occurrence: A case study on the government work report of the State Council (1954–2017), Intelligent Systems and Knowledge Engineering, 1–6, 2017.
[29] Yeh, J.H.; Yang, N. (2008); Ontology construction based on latent topic extraction in a digital library, International Conference on Asian Digital Libraries, 93–103, 2008.
[30] Yliherva, J. (2004); Management model of an organization’s innovation capabilities; development of innovation capabilities as part of the management system, dissertation, Department of Industrial Engineering and Management, University of Oulu.
[31] Zhang, W.; Zhang, Z.; Chao, H.C.; Tseng, F.H. (2018); Kernel mixture model for probability density estimation in Bayesian classifiers, Data Mining and Knowledge Discovery, 32(3), 675–707, 2018.
[32] Zhang, W.; Zhang, Z.; Qi, D.; Liu, Y. (2014); Automatic crack detection and classification method for subway tunnel safety monitoring, Sensors, 14(10), 19307–19328, 2014.
[33] Zhao, W.; Zeng, Y. (2011); Construction and design of evaluation index system of innovative enterprises on innovative capacities, Science and Technology Management Research, 1, 005, 2011.
[34] Zavitsanos, E.; Paliouras, G.; Vouros, G.A.; Petridis, S. (2010); Learning subsumption hierarchies of ontology concepts from texts, Web Intelligence and Agent Systems: An International Journal, 8(1), 37–51, 2010.
[35] Zavitsanos, E.; Paliouras, G.; Vouros, G.A.; Petridis, S. (2007); Discovering subsumption hierarchies of ontology concepts from text corpora, Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 402–408, 2007.