INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL ISSN 1841-9836, 13(3), 365-382, June 2018. Identifying Essential Proteins in Dynamic PPI Network with Improved FOA X. Lei, S. Wang, L. Pan Xiujuan Lei, Siguo Wang School of Computer Science Shaanxi Normal University Xian 710119,Shaanxi, China xjlei@snnu.edu.cn, wangsiguo@snnu.edu.cn Linqiang Pan* 1. Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China School of Automation Huazhong University of Science and Technology Wuhan 430074, Hubei, China 2. School of Electric and Information Engineering Zhengzhou University of Light Industry Zhengzhou 450002, Henan, China *Corresponding author: lqpan@mail.hust.edu.cn Abstract: Identification of essential proteins plays an important role for under- standing the cellular life activity and development in postgenomic era. Identification of essential proteins from the protein-protein interaction (PPI) networks has become a hot topic in recent years. In this work, fruit fly optimization algorithm (FOA) is extended for identifying essential proteins, the extended algorithm is called EPFOA, which merges FOA with topological properties and biological information for essen- tial proteins identification. The algorithm EPFOA has the advantage of identifying multiple essential proteins simultaneously rather than completely relying on ranking score identification individually. The performance of EPFOA is analyzed on dynamic PPI networks, which are constructed by combining the gene expression data. The experimental results demonstrate that EPFOA is more efficient in detecting essential proteins than the state-of-the-art essential proteins detection methods. Keywords: essential proteins, protein-protein interaction (PPI), dynamic PPI net- works, subcellular localization data, fruit fly optimization algorithm (FOA). 1 Introduction Protein plays an important role in the cellular life activity, and essential proteins are critical for the growth and development of organisms under a variety of conditions [27]. The absence of a single essential protein is sufficient to cause lethality or infertility [50]. Some recent results suggest that a comprehensive analysis of essential proteins can provide a deeper understanding of the relationship between mutations and human diseases, revealing the general principles of human diseases [12,15,59]. Therefore, the identification of essential proteins is closely related to disease prediction and drug design [53]. With the development of high-throughput technologies, various biological data are available, e.g., yeast-two-hybrid, tandem affinity purification, and mass spectrometry. In [2], a greedy algorithm is proposed to optimize the detection of protein communities. Existing methods for identifying essential proteins can be roughly divided into two types. The first type includes the biological experiment-based methods, e.g., gene knockouts [11], RNA interference [7], and conditional knockouts [35], which are expensive and time-consuming. The Copyright ©2018 CC BY-NC 366 X. Lei, S. Wang, L. Pan other type includes the topology-based centrality method, e.g., Degree Centrality (DC) [14], Betweenness Centrality (BC) [24], Closeness Centrality (CC) [52], Subgraph Centrality (SC) [9], Eigenvector Centrality (EC) [3], Information Centrality (IC) [40], Neighborhood Centrality (NC) [20], and Local Average Connectivity-based method (LAC) [19]. By defining and computing the topologically potential value of each protein, these methods can obtain a precise ranking score reflecting the importance of proteins in the protein-protein interaction (PPI) network [18]. Some centrality analysis tools and RNA detection tools [54] have been developed. For example, CytoNCA [43], a Cytoscape plugin, integrated eight centrality measures, i.e., DC, BC, CC, EC, IC, SC, NC and LAC. Obviously, the topology-based centrality methods can improve the efficiency with less cost. However, these centrality methods also have their own shortcomings. It is well known that the performance of topology-based methods is closely related to the quality of the PPI networks, but there are many false positive and false negative in the PPI networks. In order to deal with the drawbacks of these methods, some new methods are proposed to predict essential proteins by integrating their topological properties with their biological prop- erties. Considering the interaction data and Gene Ontology (GO) annotations, Hsing et al. introduced a method to predict highly-interacting proteins [13]. Later, a new prediction method called PeC was proposed by Li et al. [21], and another method called WDC was proposed by Tang et al. [41], which integrate network topology with gene expression profiles. Afterwards, Tang introduced a new method to identify essential proteins in which topological features of PPI network is combined with subcellular localization information [42]. Next, a new centrality mea- sure is proposed by Ren et al. to discover essential proteins, named harmonic centricity, which merges subgraph centrality with protein complexes to discover essential proteins [34]. Recently, a new prediction method, named UDoNC, that combine the domain features of proteins with their topological properties in PPI networks, was proposed by Peng et al. [30]. Some machine learning methods, e.g., Support Vector Machine, Naive Bayes, Bayes Network, and NBTree, were also adopted to predict essential proteins by using different features. For example, the random forest was adopted to predict essential proteins by Qin et al. [32]. These methods that combine the network topological features with biological data is capable of improving the accuracy and efficiency of prediction significantly. These existing methods regard the PPI networks as static networks that ignore the time-course of the networks. The real PPI networks in cell keeps chang- ing over different stages of the cell cycle [31], and they can be classified into stable or transient PPI networks [46], which are usually described as dynamic PPI networks (DPIN). Thus it is im- portant to construct dynamic PPI networks to investigate the temporal properties of individual proteins and protein interactions. Based on dynamic network topology and complex informa- tion, Luo and Kuang proposed a new method to predict essential proteins [22]. The results show that the identification of essential proteins in dynamic networks is more conducive than in static networks. Fruit fly optimization algorithm (FOA) is a novel swarm intelligent algorithm that mimics the foraging behavior of fruit flies for global optimization [25]. FOA is easy to be understood and implemented, which has few parameters to be adjusted. Due to its simplicity and efficiency, FOA showed great success in solving some real-world complex problems like multidimensional knapsack problem [48]. Here, FOA will be used to find the essential proteins. In this work, we present a new algorithm, called EPFOA, in which FOA is merged with topological properties and biological information for essential proteins identification. To the best of my knowledge, most of the methods of essential proteins identification focus on static PPI networks and ignore the intrinsic features of organisms. In our method, we first integrate gene expression data with static PPI network to construct the dynamic network model. Then a new topological centrality method that combines GO annotation and edge aggregation coefficient (ECC) is proposed to measure the topological char- Identifying Essential Proteins in Dynamic PPI Network with Improved FOA 367 acteristic of PPI networks with modular local average connectivity (LAC) in dynamic networks. Furthermore, the distribution of proteins in each compartment according to subcellular localiza- tion data is obtained, and the role of components in identifying essential proteins is analyzed. Finally, EPFOA is designed to identify essential proteins. To assess the performance of our method, EPFOA is compared with some existing methods including DC, EC, IC, SC, NC, LAC, PeC and UDoNC, and the experimental results indicate that our method significantly outperforms with the existing methods. 2 Method 2.1 Fruit fly optimization algorithm Fruit fly optimization algorithm (FOA) is a novel method for global optimization, which is inspired by the foraging behavior of fruit flies. In sensory perception, the fruit fly is superior to the other species, especially in olfactory and vision. The olfactory organs of fruit flies can collect all kinds of scents floating in the air, even smell the food source from 40 kilometers away. After the fruit fly gets close to the food, it can also use the sensitive vision to find food and the companyĄŻs flocking location, and fly to the direction [25]. The procedure of FOA is presented in pseudo code as follows. Step 1. Randomly initialize the location of the fruit flies (Xaxis,Yaxis). Step 2. Give the random direction and distance for the search of food using osphresis by an individual fruit fly. { Xi = Xaxis + RandomV alue Yi = Yaxis + RandomV alue (1) Step 3. The distance (Disti) to the origin is estimated, then the smell concentration judgment value (Si) is calculated, which is the reciprocal of the distance.{ Disti = √ x2 + y2 Si = 1 Disti (2) Step 4. Substitute smell concentration judgment value (Si) into smell concentration judgment function (or called fitness function) to find the smell concentration (Smelli) of the individ- ual location of the fruit fly. Smelli = Function(Si) (3) Step 5. Find the individual with the maximal smell concentration among the fruit fly swarm according to the smell concentration value. [bestSmell bestInedx] = max(Smell) (4) Step 6. Maintain the best smell concentration value x and y, where the fruit fly swarm will use vision to fly towards that location.  Smell = bestSmell Xaxis = X(bestIndex) Yaxis = Y (bestIndex) (5) Step 7. Repeat steps 2-5 until the smell concentration is superior to the previous smell concen- tration; otherwise, go to step 6. 368 X. Lei, S. Wang, L. Pan 2.2 Dynamic PPI network model construction Gene expression data is valuable for revealing the dynamic properties of proteins and PPI. We integrate gene expression data with high-throughput PPI data to construct a dynamic PPI network. Note that protein does not always become active at a cell cycle, a protein is active at the highest gene expression level. In order to mark the active time of each gene, the active threshold of each gene should be calculated, and the gene is active if its expression value is greater than the active threshold. The calculation of active threshold is proceeded on the 3-sigma model [45]. AT(p) = µ(p) + 3 ×σ(p) × (1 − 1 1 + σ(p)2 ), (6) where µ(i) is the mean gene expression value of protein i and σ(i) is the algorithm standard deviation of the expression values over time 1 to T for protein i. Since the gene expression data has three cycles and each cycle has 12 times tamps, the final gene expression at each time point is the average of the three cycles, which is defined as follows [16]: FT(i) = T(i) + T(i + 12) + T(i + 24) 3 , (i ∈ [1, 12]), (7) where T(i) denotes the gene expression value at time point i. At a certain times tamp, if both proteins are active with an interaction, the interaction of the two proteins is also active. Eventually the entire PPI network was divided into 12 sub-networks, the dynamic PPI network was constructed. 2.3 Topological characteristics of dynamic networks A PPI network is not only an important biological network but also a typical complex network, which meets the topological characteristics of complex network, such as small-world [49], scale- free [51], and modularity [10]. In this part, the role of the topological characteristics in the process of essential proteins identification is investigated, and a new topological centrality method based on the ECC and GO annotation is proposed. Furthermore, the modularity of the network that applied LAC is also considered. Dynamic network topology centrality strategy A PPI network can usually be expressed as an undirected graph G = (V,E), where the set of vertices V represents protein, and E represents all of interactions between pairs of proteins. In order to assess the centrality of dynamic network topology, we introduce the GO annotation (since the ECC cannot fully reflect the characteristics). GO annotation provides valuable information and a convenient method to study the gene function similarity, some researches have shown that the adoption of GO semantic similarity term can improve the prediction accuracy of protein complexes gene and disease [36,56,57]. Weighting the networks via ECC In order to measure the tightness of the two nodes, we use the ECC [41], which is defined as follows: ECC(u,v) = |Nu ∩Nv| + 1 min{du,dv} , (8) where Nu (or Nv) refers to the set of neighbours of node u (or v) in PPI networks, |Nu ∩ Nv| is the number of common neighbor nodes of u and v, which is consistent with the number of triangles which edge (u,v) belongs to du (or dv) indicates the degrees of node u (or v). Weighting the networks using the Gene Ontology Identifying Essential Proteins in Dynamic PPI Network with Improved FOA 369 The GO information consists of three sub-ontologies: Biological Process (BP), Cellular Com- ponent (CC) and Molecular function (MF) [6]. In order to measure the semantic similarity between the GO terms to protein annotations in an interaction network, we applied the method developed by Wang et al. [47]: GO_sim(u,v) = ∑ t∈Tu ⋂ Tv (Su(t) + Sv(t))∑ t∈Tu Su(t) + ∑ t∈Tv Sv(t) , (9) where where Tu and Tv are the annotations of protein u and v; Su(t) is the S-value of GO term t related to term u and Sv(t) is the S-value of GO term t related to term v. Generating new weighted networks Based on the definition of the ECC and gene functional similarity, a new centrality measure, named EG, is proposed. For a protein u, the essentiality EG(u) is defined as the probability between the ECC and GO information: EG(u) = ∑ v∈Nu ECC(u,v) ×GO_sim(u,v), (10) where N(u) denotes the set of all neighbors of node u. When computing dynamic EG(u), we should consider the number of times that each node appears in a dynamic PPI network, since some nodes are not included in all the time networks. Dynamic EG(u) can be defined as the follows: DEG(u) = ∑N i=1 EG i(u) tim(u) , (11) where N is the number of temporal networks in the dynamic network, EGi(u) the EG of node u in the ith time point, tim(u) the number of time networks that contain node u. If node u does not appear at time point i, EGi(u) is equal to zero. Dynamic local average connectivity The LAC of a node indicates its closeness [49], and the LAC of a node v is defined as: LAC(u) = ∑ v∈Nu deg Cu(v) |Nu| , (12) where Nu is the neighbors of node v, |Nu| the number of nodes in Nu, and Cu the subgraph induced by Nu. For a node u in Cv, its local connectivity in Cu is represented as deg(Cu)(v). Similar to DEG(u), we define Dynamic LAC as follows [30]: DLAC(u) = ∑N i=1 LAC i(u) tim(u) , (13) where N is the number of temporal networks in the dynamic network, LACi(u) the LAC of node v in the ith time point, and tim(u) the number of time networks that contain node u. 2.4 Subcellular localization score Subcellular location is divided into different compartments, different compartments play dif- ferent roles in cell activities. In order to understand the relationship between subcellular local- ization and essential proteins, we analyze the number of essential proteins in each subcellular location and propose a method to evaluate subcellular localization in previous research. Assume 370 X. Lei, S. Wang, L. Pan that in the Nucleus, the wider the distribution of the proteins is, the greater the possibility of essential protein becomes [17]. Let Cmax denote the protein with the largest number of times appearing in subcellular lo- calization of the nucleus, |u| represents the number of times of the protein u appearing in the nucleus. The importance of protein u, denoted as NSL(u), is calculated by the ratio of its size to the largest size of the nucleus. The value of NSL(u) is in the range of (0, 1]. NSL(u) = |u| |Cmax| (14) 2.5 EPFOA algorithm In order to make up for the shortcomings of traditional identification of essential proteins one by one, we propose the algorithm EPFOA. The algorithm can identify p candidate essential proteins simultaneously, which greatly improves the recognition efficiency. In what follows, we introduce the algorithm EPFOA. First, initialize the position of fruit fly and set the rules of location updating. Then find p candidate essential proteins according to the characteristic of FOA. Finally, the identified p essential proteins are compared with known essential proteins to verify the number of essential proteins identified correctly. The initialization and update of the location of fruit flies The initialization and location update rules of fruit fly play an important role in the perfor- mance of EPFOA. The position of the fruit fly is encoded as an integer set of p-dimensional set Xi = (xi1,xi2, . . . ,xip), (i = 1, 2, . . . ,n) which denotes a candidate essential protein set. Each element xij (xij ≤ |N|,j = 1, 2, . . . ,p) in Xi is the sequence number of a protein. First we randomly selected p proteins to initialize a fruit fly position Xi. Then we compare the selected p proteins with the known essential proteins and keep the proteins that are successfully matched. After that the remaining positions that represent proteins are updated. In order to speed up the convergence of the proposed EPFOA algorithm, we sort all the proteins based on degree except selected p proteins. A random value is assigned to the individual that is not essential protein in the Xi and update the position in a sequence that is ranked by degree. Encoding and decoding of EPFOA The framework of EPFOA is shown in Fig.2. We set every fruit fly as essential protein candidate set, and the location of fruit fly is the serial number of the candidate proteins. For the purpose of evaluating the topological characteristics of the network comprehensively, we combine LAC that represents the network modularity with the new network centrality. Thus, when a fruit fly is in a certain position, we suppose its smell concentration judgment value S(i) can be calculated as following equation: S(i) = p∑ j=1 (DLAC(µj) + DEG(µj)), (15) where DLAC(uj) denotes the dynamic local average connectivity of the jth protein among the p candidate essential proteins and DEG(uj) denotes dynamic network topology centrality of the jth protein among the p candidate essential proteins. The topological characteristics and biological data are both indispensable in the process of identifying essential proteins and subcellular localization data plays an important role in essential proteins identification. We set the following smell concentration judgement function to measure the possibility of essential proteins represented by a fruit fly individual: Fit(i) = α×S(i) + (1 −α) × p∑ j=1 NSL(µj), (16) Identifying Essential Proteins in Dynamic PPI Network with Improved FOA 371 !"#$$ %&'%#'()*(+&' ,Smelli !""#$%&'( )*+%#&$", -'$.&.'%#,"#% /+-'%&+$ 0 1 2 3 450065016 6507 451065116 6517 452065216 6527 453065316 6537 8%'%&-,))9, $#%:+*; 8<=-#((<('*, (+-'(&>'%&+$, .'%' ?#$#, #57*#""&+$, .'%' ?#$#,+$%+(+@A, .'%' BA$'3&-,))9, .'%' -#$%*'(&%A, 3#%C+. 4B!? 83#((,-+$-#$%*'%&+$, D<.@3#$%,E'(<#,S-i !"#$%%! &'(&$()*+),'(! -Smelli ! "#$%&'($)*(+,$&-%.(/-*..(0+'0*'$1%$&+'(%'2( .+0%$&+' 3&456&786&98 86&, : ;*$*1-&'*(<)*$)*1($)*( -%6&-=-('=-#*1( +>(&$*1%$&+'(&/(1*%0)*2 ;?'%-&0(@AB ;?'%-&0(CBB >&'2($)*(>1=&$(>.?(<&$)($)*(-%6&-%.(/-*..( 0+'0*'$1%$&+'(%'2(1*0+12(&$D/(.+0%$&+'(8$)*'( =,2%$*($)*(.+0%$&+'(%00+12&'E($+($)*(1=.*/ Figure 1: The framework of the algorithm EPFOA. 372 X. Lei, S. Wang, L. Pan where NSL(uj) denotes subcellular localization score of the jth protein among the p candidate essential proteins and α ∈ [0, 1], α is a parameter that regulates the proportion of the network topology and biological information in the process of identifying essential proteins. If α = 0, only subcellular location information works; else if α = 1, only network topology works. Pseudo code of EPFOA The process of EPFOA can be divided into two steps. The first step calculates the topological and biological characteristics of protein nodes. The second step applies the process of FOA algorithm to seek the optimal to find the essential proteins. The pseudo code of EPFOA is shown in Algorithm 1. Algorithm 1 The pseudo code of EPFOA Ensure: G = (V,E) (the PPI network), Gene expression data, Gene Ontology GO, Subcellular location data. Require: Essential protein set. 1: Construct the dynamic PPI network 2: for each interacting protein pair (a,b) in PPI do 3: Calculate ECC /*The closeness of the two nodes*/ 4: Calculate GO /*The functional similarity of the two nodes based on GO annotation*/ 5: end for 6: for each node in G do 7: Update the centrality DEG(u) 8: Calculate DLAC(u) 9: Calculate subcellular location score NSL(u) 10: end for 11: for fruit fly i do 12: Initialize location x(i) and its best location b_x(i) 13: Calculate the smell concentration smell(i)= Fit(S(i)) 14: end for 15: for m in [1,maxiter] do 16: for fruit fly i do 17: Update location X(i) = X(i) + random 18: if smell(i)