ijcccv11n4.dvi INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL ISSN 1841-9836, 11(4):538-552, August 2016. Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model G.L. Shen, X.P. Yang, J. Sun Gui-lan Shen*; Xiao-ping Yang Information School, Renmin University Beijng, China guilan.shen@buu.edu.cn; yang@ruc.edu.cn *Corresponding author: guilan.shen@buu.edu.cn Jie Sun Business School, Beijing Union University Beijing,China jie.sun@buu.edu.cn Abstract: A large number of emerging information networks brings new challenges to the overlapping community detection. The meaningful community should be topic- oriented. However, the topology-based methods only reflect the strength of connec- tion, but ignore the consistency of the topics. This paper explores a topic-oriented overlapping community detection method for information work. The method utilizes a hybrid hypergraph model to combine the node content and structure information naturally. Two connections for hyperedge pair, including real connection and vir- tual connection are defined. A novel hyperedge pair similarity measure is proposed by combining linearly extended common neighbors metric for real connection and incremental fitness for virtual connection. Extensive experiments on two real-world datasets validate our proposed method outperforms other baseline algorithms. Keywords: information network, overlapping community detection, topic-oriented, hybrid hypergraph model. 1 Introduction Community is considered to be a fundamental property of complex network. Despite the variety of complex networks, community often accounts for the functionality of the system [1]. Research of recent years shows that the structure of community is not disjoint. Overlapping is an important property of many real-world networks, i.e., they are naturally characterized by multiple community memberships. For example, a person could join in several hobby groups in social networks; one academic paper could cover a number of fields, etc. It is therefore a very essential work to develop approach for efficient overlapping community detection, which will contribute to the links prediction, collaborative recommendation and influence propagation in many application fields. Although numerous techniques have been developed for overlapping community detection in recent years, most of them only focus on the structure information for real network. It is well understood, however, that there exist a large quantity of real networks with node content or semantic information, which is referred to as information network, such as www, scientific citation network, and social network. The meaningful detected community of the information network should be topic-oriented, which has two characteristics: the nodes inside one community should have dense connections and consistent or similar topics. Communities identified via those topological methods often incorporate different topics since stronger connections represent the interactions that occur across several different topics, which would confuse the meanings of the topic-oriented community [2]. Copyright © 2006-2016 by CCC Publications Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 539 In this paper, we propose a topic-oriented overlapping community method for information network which combines node content information and link information. Firstly, information network is modeled as a hybrid hypergraph composed of hyperedge that features the collection of nodes with common attributes. For different information network, the node attribute could be represented by interest, tab, word or topic. Secondly, a hyperedge pair similarity calculation method is proposed, which combines the content information by calculating common neighbors of hyperedge pair and structure information by measuring the link relationships between the nodes involved into two hyperedges. Finally, an agglomerative hierarchical clustering algorithm is applied to partition the hybrid hypergraph model into different topic-oriented overlapping communities. Compared with the existing methods, our method can identify communities from the per- spective of both content and link structure for information network. From this result, we can easily find more meaningful communities, such as topics, research fields and so forth. Moreover, with the inherent characteristics of hybrid hypergraph model, overlapping of communities could be identified easily. We proceed to report our work in the rest of the paper as follows. We discuss the related works in Section 2. In section 3, we propose our approach for identifying the meaningful overlapping communities based on hybrid hypergraph model for information network. In order to verify our approach, we conducted extensive experiments. The experimental design and results analysis are given in Section 4. Finally, a conclusion is drawn in Section 5. 2 Related works Overlapping community detection using topology. Some methods have been proposed to detect overlapping communities in a network. LFM proposed by Lancichinetti et al [3] is a kind of algorithms utilizing local expansion and optimization. This method relies on a local benefit function that characterizes the quality of a densely connected group of nodes. LFM expands a community from a random seed node to form a natural community until the fitness function f(c) = kcin (kcin + k c out) α (1) is locally maximal, where kcin and k c out are the total internal and external degree of the community c, and α is the resolution parameter controlling the size of the communities. After detecting one community, LFM randomly selects another node not assigned to any community to expand another new community. This method obviously can identify the overlapping community, since they allow a single node to be put into different community owing to different optimization process. In real-world network, it’s difficult to decide how many communities that a single node should be put in, however it’s very clear whether the edge incident on the node is in the community or not. Now, researcher suggests using links to defining community [4], owing to an edge only is in one community, but the nodes connected by edge could be put into different communities. Some methods [5–7] using line graph and link partitioning to detect overlapping community have been proposed. Among them, Ahn [6] partitions links into clusters via hierarchical clustering of edge similarity. Given a pair of links eik and ejk incident on a node k, the edge pair similarity can be computed via the Jarracd Index defined as, sim(eik,ejk) = |Nb+(i) ∩Nb+(j)| |Nb+(i) ∪Nb+(j)| (2) 540 G.L. Shen, X.P. Yang, J. Sun Where Nb+(i) is the inclusive neighbors of a node i, which the set contains the node itself and its neighbors. With this similarity, single-linkage hierarchical clustering is then used to build a link dendrogram. Cutting this dendrogram at a special threshold yields link communities. Although the link partitioning method can detect the overlapping communities naturally, there is no guarantee that is provides high quality detection for information network because the method only relies on links of the network while ignores the node content totally. There are many other methods to detect overlapping community. For example, the ones based on subgraphs, such as CPM [8], CPMw [9] etc.al, treat community structure as the com- position of adjacent subgraphs, as one node can belong to several subgraphs. However, these methods are usually considered to solve the pattern matching of complex networks rather than finding communities. In addition, the methods extended Girvan and Newman’s divisive clus- tering algorithm [10], such as CONGA [11], CONCO [12], allow a node to split into multiple copies. Despite the use of different techniques, the above methods can always detect overlapping dense connections in network. However, they only focus on the topology information but ignore the content information that contributes to improve the quality of the community [13]. Topic-oriented community detection using topology and content. Based on the assumption that the content information can improve the quality of the detected community, various approaches have been combined the links and contents for community detection. Some approaches have combined content information with structure information for community dis- covery. One of them is generative probabilistic modeling which considers both contents and links as being dependent on one or more latent variables, and then estimates the conditional distri- butions to find community assignments. PLSA-PHITS [4], Community-User-Topic model [15] and PMC [16] are three representatives in this category. Other fusing the content and structure methods, such as SA-clustering [17] via augmenting the underlying network to take into account the content information, heuristic algorithm CKC [18] to solve the connected k-Center problem, subspace clustering algorithm [19] on graphs with feature vectors. Different from those methods using topology, these methods account for the content informa- tion of the nodes, so the division results for the network is more cohesive in the topics. However, considering the content of nodes, the complexity of the algorithm is greatly increased which will lead to some new challenges, such as how to deal with the high dimensional sparse for node attributes. Furthermore, those methods are not designed for overlapping communities. 3 Methodology In this section, we present our method for fusing structure and content via hybrid hypergraph to detect the overlapping communities for information networks. Firstly, we present the defini- tions, and then introduce how to build the hybrid hypergraph model for information network. Thirdly, we give the method that how to measure hyperedge pairs similarity. Finally, we briefly introduce algorithmic details of HLP (Hyper Link Partition), a novel method extended from link graph partition algorithm. 3.1 Definitions Link partitioning method is a kind of topology methodology based on classic graph theory. It’s very simple and distinct to model information networks as simple graphs, in which nodes indicate entity object, and links indicate the binary relationships between node pairs. However, real information networks are characterized by node content attribute, hence simple graph is not suitable for representing the content information. As the generalization of simple graph, Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 541 hypergraph can represent the multiple relationships for nodes in finite set, and describe the relationship between general discrete structures, which overcome the defect of the knowledge represented by the simple graph. Hypergraph characterized by one hyperedge incident on any number of nodes, is a graph in generalization. The definition of hypergraph is provided as follows. Definition 1. A hypergraph [20], H = (V,E) is defined as a set of vertices V = {v1,v2, · · · ,vn}, and a set of hyperedges E = {e1,e2, · · · ,em}, where: (1) e1 6= φ(i = 1,2, · · · ,m) (2) m ⋃ i=1 ei = V According to definition, a hyperedge essentially is the set of vertices which are independent. That is, hypergraph cannot represent the original topology of vertices. Thus, we give the defini- tion of hybrid hypergraph. Definition 2. A Hybrid Hypergraph, HH = (V,E,ε,ψ) is defined as a set of vertices V = {v1,v2, · · · ,vn}, a set of hyperedges E = {e1,e2, · · · ,em} and a set of edges ε where: (1) ei 6= ∅ (i = 1,2, · · · ,m) (2) m ⋃ i=1 ei = V (3) ψ(εi) = (vi,vj) (i,j = 1,2, · · · ,n) 3.2 Modeling information network as hybrid hypergraph We need to extend a structural graph with tuples describing node attributes. This can be formally expressed as a quad AG = {V,ε,FV ,ψ}, where each node v is associated with a feature vector f(v). FV is the set of features for all nodes, where, f(v) ⊆ FV ,v ∈ V . Feature selection is an important issue in system anomaly detection applications [21]. With different node attributes in different information network, the feature vector f(v) can be varied as a topic, a keyword, a place, an author, an activity. The number of features of FV is m, formally, m = ∣ ∣ ∣ ∣ ∣ f(v) ⋃ v∈V ∣ ∣ ∣ ∣ ∣ . So, the question of how to build the information network as the hybrid hypergraph model is simplified as how to map a quad attribute graph AG into a quad HH. Here, we take each feature fi as a basic unit to build hyperedge e, when fi ∈ f(v) , the node v ∈ e We use incidence matrix and adjacent matrix to represent the data structures with related to hybrid hypergraph. The Nv ×ME Incidence matrix for a hybrid hypergraph HH, say I, is defined as that Ive = { 1, if v ∈ e 0, otherwise (3) The ME ×NV transposed matrix for I, say K, is defined as that Kev = { 1, if v ∈ e 0, otherwise (4) The NV ×NV adjacent matrix for a hybrid hypergraph HH, say A, is defined as that Aij = { 1, if (i,j) ∈ ε 0, otherwise (5) The ME ×ME similarity matrix for a hybrid hypergraph HH, say Sim, is defined as that Simij = simlarity(ei,ej) ei,ej ∈ E (6) 542 G.L. Shen, X.P. Yang, J. Sun 3.3 Similarity for hyperedge pairs In this section, we present the method of how to calculate the similarity of hyperedge pairs in HH. This technique was originally introduced by link partitioning algorithm and local expansion algorithm for the purposes of identifying the overlapping communities in network. However, Link partitioning algorithm and local expansion algorithm both only focus on topological information, but ignore the content information. We argue the HH model for information network can fuse the content information and struc- ture information naturally. According to the definition of HH, a hyperedge can be regarded as a sub-community characterized by a special feature, because the hyperedge is a set of nodes with the same feature. We know clearly that the sub-community represented by the hyperedge is different from the final detected community. The reason is that one node is usually associated with more than one feature, but a hyperedge only reflect one of the features. Nevertheless, we can determine whether the hyperedge pairs have the same topic or not by exploring the link relationship between them. If the node only has one feature, moreover, such as topic, our work can be simplified as the work presented by zhao [22]. The relationship between two hyperedges is the link between two node sets which is more complicated, rather than just that between the two nodes in the simple graph. The link type for two hyperedges comes into two kinds: one kind is shared common nodes; the other is that the nodes in one hyperedge are connected with those in the other hyperedge. In order to better illustrate this, we offer two formal definitions as follows. Definition 3. Given two hyperedge ei and ej, is real connection, where ei ∩ej 6= ∅. For instance, two hyperedges e1, e2 are shared with two common nodes v2, v3 (Fig1.a), that is e1 ∩e2 = {v2,v3} 6= ∅, therefore the link type of e1, e2 is real connection. Definition 4. Given two hyperedge ei and ej, is virtual connection, where (1) ei ∩ej 6= ∅ (2) ∀ε ∈{< vm,vn > |∃vm ∈ ei,∃vn ∈ ej} 6= ∅ Or (3) ei ∩ej 6= ∅ (4) ∀ε ∈{< vm,vn > |∃vm ∈ ei −ei ∩ej,∃vn ∈ ej −ei ∩ej} 6= ∅ For instance, as shown in Fig1.b e1 ∩ e2 = ∅, and, there are links < v1,v2 >, < v2,v3 > between nodes for e1 and e2, this is one case of virtual connection. Another case is also shown in Fig1.a, e1 ∩ e2 = {v2,v3} = ∅, considering edge < v4,v6 >, which match the conditions (4) v4 ∈ e1 −e2 ∩e2 and v6 ∈ e2 −e1 ∩e2. Therefore, hyperedge e1 and e2 has both real connection and virtual connection. Obviously, when measuring the hyperedge similarity, both virtual connection and real con- nection are meaningful. Virtual connection reflects the structure information of the original network essentially, while real connection reflects the number of nodes with common attributes. Intuitively, hyperedge similarity is dependent on the tightness of hyperedge pairs. As mentioned above, hyperedge is the group of nodes; therefore, measuring the tightness of hyperedge pair can be converted into how to measure the tightness of two groups of nodes. If there are a lot of links between two sets of nodes, the two sets are strongly tight. For instance, in citation network, if article a1 with keyword k1 cited article a2 with keyword k2, then k1 and k2 have certain correlation or similarity. Similarly, the more articles involving the keyword k1 are cited by the articles involving k2 keyword, that is, k1 set has strong tightness with k2 set, which turns out that k1 and k2 have higher correlation. Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 543 (a) Real connection (b) Virtual connection Figure 1: Two kinds of link type for hyperedge To better quantify the tightness between the two groups of nodes, we extend the local fitness function in LFM [3]. Local fitness function for a given local community S, is formally given as fs = ksin ksin + k s out (7) Where ksin and k s out are the total internal and external degree of the community. Fitness function fs can measure the internal and external tightness for a local community. In topology detecting community methods, it has shown a good performance, and was expanded or applied by more researchers [23, 24]. Inspired by this thought, we propose the fitness function for a hyperedge e to be given as fe = kein kein + k e out (8) Where kein and k e out are the total internal and external degree of the hyperedge. The total of k e in and keout is the total degree of all nodes in this hyperedge. Similar to fs, fe can measure the internal and external tightness for hyperedge. Whether to merge two hyperedge into a larger hyperedge depends on how the changed numbers of virtual connection influence the fitness of the combined hyperedge. We define incremental fitness for combined hyperedge as the similarity for virtual connection of two hyperedges. Given two hyperedge ei and ej Simvirtual(ei,ej) = ∆fij = fei ∪fei = k eiej in k eiej in + k eiej out (9) Where k eiej in is the numbers of virtual connection between ei and ej. The total of k eiej in and k eiej out is the total degree of all nodes in merged hyperedge. For instance, in Fig2, considering two hyperedge e1 and e4, k e1e4 in = 4, k e1e4 in + k e1e4 out = 32, so, ∆f14 = fe1 ∪fe4 = 18 . Another example in this figure is, ke3e4in = 3 andk e3e4 in + k e3e4 out = 41, so, ∆f34 = fe3 ∪fe4 = 341 . Whether to merge two hyperedges into a larger hyperedge depends on the value of incremental fitness function. Virtual connection reflects the tightness of hyperedge pair via the structure information, while, real connection reflects the semantic similarity of hyperedge via the content information. In our method, we do not directly calculate the similarity of the features implied by hyperedges, such as that of keyword or topic, instead, we evaluate the similarity via CN metric. CN(Common Neighbors) [25] is also called structural equivalence, namely, the nodes are similar if they share a lot of common neighbors. CN is one of the most widely used metrics when measuring the similarity in local community detection methods. Therefore, we extend the CN metric to make it suitable for measuring the similarity between two hyperedges, or, two groups of nodes. To define clearly the neighbors set for hyperedges, we give the following definitions. 544 G.L. Shen, X.P. Yang, J. Sun Figure 2: A sample of hybrid hypergraph model Definition 5 (Inductive nodes set). Given two hyperedge ei, ej in HH, where ei, ej has real connection, the Inductive nodes set for ei related to ej, say pj(ei) is formally as pj(ei) = {v|v ∈ ei −ei ∩ej}. In order to illustrate the neighbors for hyperedges, we specially designate those neighbors related to Inductive nodes set. Definition 6 (Extended Common Neighbors of Inductive nodes set). Given two hyperedge ei, ej and inductive nodes set pj(ei) in HH, the extended neighbors of pj(ei) including pj(ei), say n+(pj(ei)),is formally as n+(pj(ei)) = {x|d(v,x) ≤ 1,v ∈ pj(ei)} Where d(x,y) is the distance of two nodes, formally as follows: d(x,y) ≤ 1 if x ∈ ei,y ∈ ej and ei ∩ej 6= ∅ (10) In Fig2, e1 ∩ e2 = {v1}, p2(e1) = {v2}, p1(e2) = {v3,v4}, so we can calculate the ex- tended common neighbors of inductive nodes sets n+(p1(e2)) = {v1,v3,v4,v5,v6,v7,v8,v9,v10}, n+(p2(e1)) = {v1,v2,v9,v10,v11}. Based on Jaccard index, we propose the method that how to calculate the similarity for hyperedge pair ei, ej with real connection. Formally as: Simreal(ei,ej) = |n+(pj(ei)) ∩n+(pi(ej))| |n+(pj(ei)) ∪n+(pi(ej))| (11) In fact, this definition is consistent with the similarity for the link pair in Ahn[14_6] research. For instance, in Fig2, s(e1,e2) = 3 11 . We combine the Jarracd index and fitness function to compute the similarity for hyperedge pair ei, ej, formally as: Simij = Sim(ei,ej) = λSimreal(ei,ej) + (1 −λ)Simvirtual(ei,ej) (12) We give the detail of how to compute the similarity for Hyperedge pair in Algorithm 1. 3.4 Stop criterion In our method, we adopt divisive hierarchical clustering to cluster the hyperedges. The results of hierarchical clustering are presented in a dendrogram. One important job in this method is Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 545 Algorithm 1: Computing the similarity for hyperedge pair Input: e1, e2, K Output: simValue j = 0, f = 0 check e1 and e2 is whether real connection or virtual connection If (e1 ∩e2 6= ∅) //real connection for each v ∈ V if (v ∈ e1 & &v ∈ e2) vComm ← v //get extended neighbors of Inductive nodes set NIV1 ← veNeighbors(K,e1,vComm) NIV2 ← veNeighbors(K,e2,vComm) If NIV1 6= ∅ && NIV2 6= ∅ interNum=interSection(NIV1,NIV2) unionNum=unionSection(NIV1,NIV2) Jaccard = interNum / unionNum //fitness fin=blink(e1,e2) ftotal=totalDegree(e1)+ totalDegree(e2) f=fin/ftotal else fin=blink(e1, e2) ftotal=totalDegree(e1)+ totalDegree(e2) f=fin/ftotal simValue = λ*Jaccard + (1 - λ)*f Figure 3: Algorithm Similarity deciding the stop criterion for clustering. We define partition density D, as a function of the dendrogram cut threshold. The maximum of D indicates the discovered hyperedge communities are well structured. For a HH, {HP1,HP2, · · · ,HPK} is a partition of the hyperedges into K clusters. Cluster HPK has mK hyperedges, and nK = ∣ ∣ ∣ ∣ ∣ ⋃ ei∈HPK {v ∈ ei} ∣ ∣ ∣ ∣ ∣ nodes. Then we define this as normalized form: DK = mHPK −1 mK −1 (13) Where, mHPk = nK ∑ i=1 miHPK / n is the mean value for the nodes located in a number of hyperedges. miHPK is the number of hyperedges in which node i is located. The larger Dk indicates the probability, at which nodes are clustering into one cluster, the tighter internal connection in the cluster is. The partition density D, is the average of Dk. D = K ∑ i=1 Di / K (14) The goal of hierarchical clustering is to find K clusters when partition density D is maximal. When D = 1, all clusters are merged into one community. 546 G.L. Shen, X.P. Yang, J. Sun 3.5 HLP algorithm Based on the HH model for information network, we apply the previously-defined similarity algorithm to pairs of all hyperedges, and produce a similarity matrix S. In the process of clus- tering, the clustered hyperedge contain more than one initial hyperedge. For instance, ei′ has n initial hyperedges, ej′ has m initial hyperedges. The Jarracd similarity of ei′, ej′ is computed as: Simreal(ei′,e′j) = n ∑ t m ∑ c Simreal(e c i,e t j) / n∗m (15) We give a simple description of HLP in Algorithm 2. We obtain the clusters of merged hyperedges via this algorithm. The nodes located in each clustered hyperedges constitute the communities related to some special topics. Algorithm2: Clustering for hyperedges Input: K Output: K′ S ← 0, m = 0, n = 0, K′← K m = LengthofRow(K) n = LengthofCol(K) //Initialize the similarity matrix for K for(int i = 0; i < m; i++) for(int j = 0; j < n; j++) S(i,j) = Similarity(i,j) S′← S While m > 1 //find the coordinates for the maximal similarity in S′ maxi = maxI(S′) maxj = maxJ(S′) K′← clustering(K′(, maxi), K(, maxj)) If Density(K′) = maxDensity(K′) return K′ Figure 4: Algorithm HLP 4 Experiments In this section, we present experiments on real datasets to evaluate the performance of our method. We first applied our method to two datasets to choose the optimal Value of λ. Then we compared the performance of our method with two baseline methods. Before going to details, we first describe the datasets and the method to extract the node feature, and introduce the performance metric to be used in our experiments. 4.1 Datasets Two real datasets used in our experiments are described in the following: Cora Dataset: The Cora dataset [26] consists of the abstracts and references of about 34000 computer science research papers. Three subfields of Machine Learning (ML), Programming (PL) and Database (DB) are used and those articles without references to other articles in the set are removed. The detailed information about each subfield is shown in Table1. Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 547 Table 1: Cora dataset Area #of #of #of #of subfields total papers used papers links ML 7 4218 2708 5249 PL 9 4496 3292 7772 DB 9 1396 1060 2522 WebKB Dataset: The WebKB dataset [27] consists of about 6000 web pages from computer science departments of four schools. The web pages are classified into seven categories, including, staff(SA), course(CS),department(DP), faculty(FA), student(SD),project(PJ),other(OT). We se- lect five categories excluding DP and OT, in that the two categories contain only a few pages. The detailed information about these web pages is shown in Table 2. Table 2: WebKB dataset School #of #of #of #of #of #of #of CS FA SD PJ SA Total links CN 44 34 18 128 21 245 304 TX 36 46 20 148 2 252 328 WA 77 30 18 123 10 258 446 WI 85 38 25 156 12 316 530 4.2 Preprocessing For Cora, we treated each subfield as an independent dataset. After stemming and removing stopwords we were left with vocabulary of stemmed unique words for each subfield respectively. All words with document frequency less than 15 were removed. Taking DB dataset as an example, a vocabulary of size 5911 unique words was reduced to size 1086 words by removing the low document frequency words. The word cloud for preprocessed DB dataset is shown in Fig.5 Figure 5: Word cloud for DB dataset The same preprocessing is handled by WebKB. We extracted words whose document fre- 548 G.L. Shen, X.P. Yang, J. Sun quency is more than 10. 4.3 Evaluation metrics In our paper, we focus on the topic similarity of the detected overlapping communities, therefore when the ground-truth community is known, we utilize two measures of purity and NMI to evaluate the quality of overlapping communities detected by different methods. Purity measures the internal topic similarity within the community, and NMI is the most widely used measure to account for overlapping communities. Given the ground-truth community structure, G = {G1,G2, · · · ,GS} where GS contains the set of nodes that are in the sth community. The community structure given by the algorithms is represented by C = {C1,C2, · · · ,CS}, where Ck contains the set of nodes that are in the kth community. The purity of Ci is defined as: Purity(Ci) = 1 |Ci| max j {Ci ∩Gj} (16) Usually, the detected community Ci includes nodes that belong to other Gj in the ground- truth. For Ci, we compute the intersection set with each standard community Gj, then take the maximum as the final purity for it. The purity of C is defined as: Purity(C) = 1 K K ∑ i=1 Purity(Ci) (17) The higher the purity, the better the communities are partitioned from the perspective of topics. The mutual information between G and C is defined as MI(G,C) = ∑ x∈G,y∈C p(x,y) log p(x,y) p(x)p(y) (18) The NMI(normalized mutual information) is defined by NMI(G,C) = MI(G,C) max(H(G),H(C)) (19) where H(G) and H(C) are the entropies of the partitions G and C. The higher the NMI, the closer the partition is to the ground truth. 4.4 Optimal value of λ As we discussed in section 3, the parameter λ balances the Jaccard index and fitness function value when compute the similarity of hyperedge pairs. We perform experiments to study how the λ value affects the purity of detected communities. We set the step 0.1, the result is shown in Figure 4. The result shows that λ value is decided by the structure of network and the number of hyperedges. Jaccard index and fitness function value both affect the purity of the detected communities. It is proved that the structure and content information both have influence on the topical community detection. However, we observe that the characteristics of information networks detemine the value of parameter λ . Different network will lead to different λ . This is Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 549 Figure 6: The result of choosing parameter λ because there are noise problems to various degree, including both links and node content in the information network. For WebKB dataset,the best performace is achieved when "λ = 0.6", and for Cora dataset, the optimal value "λ = 0.5" which are used as default settings for the following experiments. 4.5 Results To evaluate the effectiveness of HLP, we compare our method with two baseline methods: one is topology-based method, Line graph partition [6], the other is LDA to cluster the nodes by using content information only. We use purity and NMI quantifying the performance each algorithm. The details are shown in Figure 5, which illustrates HLP achieve the best performance in real information networks. From the results, we also can observe some interesting things. In some datasets, such as DB,PL,CN and WS, LDA algorithm can achieve better performance than Line Graph algorithm, In some other datasets, nevertheless, such as ML,TA and WC, the results are opposite. This confirms our assumption again that both node information and link information affect the quality of detected overlapping communities. Therefore, we are sure that the combination of node information and link information can improve the quality of overlapping community detection. (a) Cora dataset (b) WebKB dataset Figure 7: The evaluations of community algorithms over two real information networks. 550 G.L. Shen, X.P. Yang, J. Sun Conclusion In this paper, we propose a topic-oriented overlapping community detection approach based on hierarchical clustering for hybrid hypergraph model, which can combine the content and structure information of information network naturally. Considering the complex of the hybrid hypergraph model, we classify the connections of hyperedges into real connection related to content information, and virtual connection related to structure information. We present incre- mental fitness to evaluate the tightness for hyperedge pairs in virtual connection. Meanwhile, we extend CN metric on hyperedge pairs to conduct the semantic similarity calculation in real connection. In order to balance the influence of two connections, we combine linearly the two measures for similarity of hyperedge pairs. The density function is employed to determine the appropriate number of communities. To evaluate the performance, we conducted experiments on two real datasets. Compared with the benchmark, Line graph partition algorithm focusing on topological detection, LDA focusing on clustering node contents, our approach gained a better performance in information network. Furthermore, the overlapping communities detected by our approach were more meaningful since they are topic-oriented. Our approach has many potential applications. It can be applied to many kinds of information networks, where nodes contain content. With our method detecting the communities, we are able to improve the efficiency of collaborative scientific research, discover experts for each topic, and analyze topic-oriented influence propagation. Future work includes qualifying the weight of each node in the hyperedge to improve the purity of detected communities. We also intend to take the time factor into account, so that we can detect the evolution communities. Acknowledgment This paper is supported by Natural Science Foundation of China (No.71572015), Scientific Research Project of Beijing Union University (No. Zk10201506), Beijing Higher Education Young Elite Teacher Project (No.YETP1503). Bibliography [1] Cobanoglu B, Zengin A, Ekiz H, et al (2014); Implementation of DEVS Based Distributed Network Simulator for Large-Scale Networks [J]. International Journal of Simulation Mod- elling (IJSIMM), 13(2): 147-158. [2] Ding Y. (2011), Community detection: topological vs. topical, Journal of Informetrics, DOI: 10.1016/j.joi.2011.02.006, 5(4): 498-514. [3] Lancichinetti, Andrea, Santo Fortunato, and János Kertész (2009); Detecting the overlap- ping and hierarchical community structure in complex networks, New Journal of Physics, 11(3): 033015. [4] Xie J., Kelley S., Szymanski B. K. (2013), Overlapping community detection in networks: The state-of-the-art and comparative study, ACM Computing Surveys (CSUR), 45(4): 43- 79. [5] Evans T. S., Lambiotte R. (2009), Line graphs, link partitions, and overlapping communities, Physical Review E, DOI:http://dx.doi.org/10.1103/PhysRevE.80.016105, 8(1): 92-105. Detecting Topic-oriented Overlapping Community Using Hybrid a Hypergraph Model 551 [6] Ahn Y. Y., Bagrow J. P., Lehmann S. (2010), Link communities reveal multiscale complexity in networks, Nature, doi:10.1038/nature09182, 466: 761-764. [7] He, C., Ma, H., Kang, S., Cui, R. (2014), An Overlapping Community Detection Algorithm Based on Link Clustering in Complex Networks, In Military Communications Conference (MILCOM) IEEE, 865-870. [8] Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (2005), Uncovering the overlapping community structure of complex networks in nature and society, Nature, 435: 814-818. [9] Farkas, I., Ábel, D., Palla, G., & Vicsek, T. (2007), Weighted network modules, New Journal of Physics, 9: 80-198. [10] Girvan M., Newman M. E. J. (2002), Community structure in social and biological networks, Proc. of the National Academy of Sciences, 99: 7821-7826. [11] Gregory S. (2007), An algorithm to find overlapping community structure in networks, Knowledge discovery in databases: PKDD 2007, Springer Berlin Heidelberg, 91-102. [12] Gregory S. (2008), A fast algorithm to find overlapping communities in networks, Machine learning and knowledge discovery in databases, Springer Berlin Heidelberg, 408-423. [13] T. Yang, R. Jin, Y. Chi, S. Zhu (2009), Combining link and content for community detection: a discriminative approach, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, Paris, 927-936. [14] Hric D., Darst R. K., Fortunato S. (2014), Community detection in networks: structural clusters versus ground truth, arXiv preprint arXiv:1406.0146. [15] Hofmann, David Cohn Thomas (2001), The missing link-a probabilistic model of document content and hypertext connectivity, Proceedings of the 2000 Conference on Advances in Neural Information Processing Systems, Vancouver, 430-436. [16] D. Zhou, E. Manavoglu, J. Li, C. Giles, and H. Zha (2006), Probabilistic models for dis- covering e-communities, In Proceedings of the 15th international conference on World Wide Web, Banff, 173-182. [17] Yang, B., Di, J., Liu, J., Liu, D. (2013), Hierarchical community detection with applications to real-world network analysis, Data & Knowledge Engineering, 83: 20-38. [18] M. Ester, R. Ge, B. Gao, Z. Hu, and B. Ben-Moshe (2006), Joint cluster analysis of attribute data and relationship data: the connected k-center problem, Proceedings of the 2006 SIAM International Conference on Data Mining, Maryland, USA, 25-46. [19] Günnemann S, B. Boden, and T. Seidl (2011), Db-csc: a density-based approach forsubspace clustering in graphs with feature vectors, Machine Learning and Knowledge Discovery in Databases, Springer Berlin Heidelberg, 565-580. [20] Berge, Claude (1989), Hypergraphs: Combinatorics of Finite Sets, North Holland. [21] Zhou X. et al.(2014); Information-value-based feature selection algorithm for anomaly de- tection over data streams [J]. Tehnički vjesnik, 21: 223-232. [22] Zhao, Z., Feng, S., Wang, Q., Huang, J. Z., Williams, G. J., Fan, J. (2012), Topic oriented community detection through social objects and link analysis in social networks, Knowledge- Based Systems, 26: 164-173. 552 G.L. Shen, X.P. Yang, J. Sun [23] Darst R. K., Nussinov Z., Fortunato S. (2014), Improving the performance of algorithms to find communities in networks, Physical Review E, 89(3): 42-58. [24] McAuley J., Leskovec J. (2014), Discovering social circles in ego networks, ACM Transac- tions on Knowledge Discovery from Data (TKDD), 8(1): 10-16. [25] Lorrain F., White H. C. (1971) , Structural equivalence of individuals in social networks, The Journal of mathematical sociology,1: 49-80. [26] Rayid Ghani (2014), CMU World Wide Knowledge Base (WebKB) project, Jan, 2001.[On- line]. Available: http://www.cs.cmu.edu/∼webkb. [Accessed: April 9, 2014]