key: cord-314498-zwq67aph authors: van heck, eric; vervest, peter title: smart business networks: concepts and empirical evidence date: 2009-05-15 journal: decis support syst doi: 10.1016/j.dss.2009.05.002 sha: doc_id: 314498 cord_uid: zwq67aph Organizations are moving, or must move, from today's relatively stable and slow-moving business networks to an open digital platform where business is conducted across a rapidly formed network with anyone, anywhere, anytime, despite different business processes and computer systems. Table 1 provides an overview of the characteristics of the traditional and new business network approaches [2]. The disadvantages and associated costs of the more traditional approaches are caused by the inability to provide relatively complex, bundled, and rapidly delivered products and services. The potential of the new business network approach is to create these types of products and services by combining business network insights with telecommunication capabilities. The "business" is no longer a self-contained organization working together with closely coupled partners; it is a participant in a number of networks where it may lead or act together with others. The "network" takes on additional layers of meaning, from the ICT infrastructures to the interactions between businesses and individuals. Rather than viewing the business as a sequential chain of events (a value chain), actors in a smart business network seek linkages that are novel and different, creating remarkable, "better than usual" results. "Smart" has a connotation of being fashionable and distinguished, and also of being short-lived: what is smart today will be considered common tomorrow. "Smart" is therefore a relative rather than an absolute term. Smartness means that the network of co-operating businesses can create "better" results than other, less smart, business networks or other forms of business arrangement. To be "smart in business" is to be smarter than the competitors, just as an athlete who is considered fast is faster than the others. The pivotal question of smart business networks concerns the relationship between the strategy and structure of the business network on one hand and the underlying infrastructure on the other. As new technologies, such as RFID, allow a network of organizations almost complete insight into where its people, materials, suppliers, and customers are at any point in time, it is able to organize differently. But if all other players in the network space have that same insight, the result of the interactions may not be competitive. It is therefore necessary to develop a profound understanding of how these types of business networks function and of their impact on networked decision making and decision support systems. The key characteristic of a smart business network is its ability to "rapidly pick, plug, and play": to configure rapidly to meet a specific objective, for example, to react to a customer order or an unexpected situation (for example, dealing with emergencies) [4]. One might regard a smart business network as an expectant web of participants ready to jump into action (pick) and combine rapidly (plug) to meet the requirements of a specific situation (play). On completion they are dispersed to "rest" while, perhaps, being active in other business networks or more traditional supply chains.
this combination of "pick, plug, play and disperse" means that the fundamental organizing capabilities for a smart business network are: (1) the ability for quick connect and disconnect with an actor; (2) the selection and execution of business processes across the network; and (3) establishing the decision rules and the embedded logic within the business network. we have organized in june 2006 the second sbni discovery session that attracted both academics and executives to analyze and discover the smartness of business networks [1] . we received 32 submissions and four papers were chosen as the best papers that are suitable for this special issue. the four papers put forward new insights about the concept of smart business networks and also provide empirical evidence about the functioning and outcome of these business networks and its potential impact on networked decision making and decision support systems. the first paper deals with the fundamental organizing ability to "rapidly pick, plug, and play" to configure rapidly to meet a specific objective, in this case to find a solution to stop the outbreak of the severe acute respiratory syndrome (sars) virus. peter van baalen and paul van fenema show how the instantiation of a global crisis network of laboratories around the world cooperated and competed to find out how this deadly virus is working. the second paper deals with the business network as orchestrated by the spanish grupo multiasistencia. javier busquets, juan rodón, and jonathan wareham show how the smart business network approach with embedded business processes lead to substantial business advantages. the paper also shows the importance of information sharing in the business network and the design and set up of the decision support and infrastructure. the third paper focus on how buyer-seller relationships in online markets develop over time e.g. how even in market relationships buyers and sellers connect (to form a contract and legal relationship) and disconnect (by finishing the transaction) and later come back to each other (and form a relationship again). ulad radkevitch, eric van heck, and otto koppius identify four types of clusters in an online market of it services. empirical evidence reveals that these four portfolio clusters rely on either arms-length relationships supported by reverse auctions, or recurrent buying with negotiations or a mixed mode, using both exchange mechanisms almost equally (two clusters). the fourth paper puts forward the role and impact of intelligent agents and machine learning in networks and markets. the capability of agents to quickly execute tasks with other agents and systems will be a potential, sustainable and profitable strategy to act faster and better for business networks. wolf ketter, john collins, maria gini, alok gupta, and paul schrater identify how agents are able to learn from historical data and can detect different economic regimes, such as under-supply and over-supply in markets. therefore, agents are able to characterize the economic regimes of markets and forecast the next, future regime in the market to facilitate tactical and strategic decision making. they provide empirical evidence from the analysis of the trading agent competition for supply chain management (tac scm). we identify three important potential directions for future research. the first research stream deals with advanced network orchestration with distributed control and decision making. 
The first two papers indicate that network orchestration is a critical component of successful business networks. Research on intelligent agents is showing that distributed and decentralized decision making might provide smart solutions because it combines local knowledge of actors and agents in the network with coordination and control of the network as a whole. Agents can help to reveal business rules in business networks, or proactively gather new knowledge about the business network, and will empower the next generation of decision support systems. The second research stream deals with information sharing over and with network partners. For example, Diederik van Liere explores in his PhD dissertation the concept of the "network horizon": the number of nodes that an actor can "see" from a specific position in the network [3]. Most companies have a network horizon of "1": they know and exchange information with their suppliers and customers. However, what about the supplier of the suppliers, or the customer of the customers? One then develops a network horizon of "2". Diederik van Liere provides empirical evidence that with a larger network horizon a company can take a more advantageous network position, depending on the distribution of the network horizons across all actors and up to a certain saturation point. The results indicate that the expansion of the network horizon will in the near future be a crucial success factor for companies. Future research will shed more light on this type of network analysis and its impact on network performance. The third research stream will focus on the network platform with a networked business operating system (BOS). Most network scientists analyze the structure and dynamics of business networks independently of the technologies that enable them to perform. This research concentrates on what makes the network effective, the linked relationships between the actors, and how their intelligence is combined to reach the network's goals. Digital technologies play a fundamental role in today's networks. They have facilitated improvements and fundamental changes in the ways in which organizations and individuals interact and combine, as well as revealing unexpected capabilities that create new markets and opportunities. The introduction of new networked business operating systems will be feasible, and these operating systems will go beyond the networked linking of traditional enterprise resource planning (ERP) systems with customer relationship management (CRM) software packages. Implementation of a BOS enables the portability of business processes and facilitates the end-to-end management of processes running across many different organizations in many different forms. It coordinates the processes among the networked businesses, and its logic is embedded in the systems used by these businesses.
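The network horizon concept can be made concrete with a small sketch. The following is a minimal breadth-first search over an adjacency map; the function name and the example supply network are illustrative assumptions of ours, not taken from the cited dissertation.

```python
from collections import deque

def network_horizon_view(adjacency, start, horizon):
    """Return the set of actors visible from `start` within `horizon` hops.

    A horizon of 1 corresponds to direct suppliers/customers; a horizon
    of 2 adds the suppliers' suppliers and the customers' customers.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == horizon:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    seen.discard(start)
    return seen

# Hypothetical supply network: firm -> trading partners.
supply_net = {
    "firm": ["supplier_a", "customer_b"],
    "supplier_a": ["firm", "raw_producer"],
    "customer_b": ["firm", "end_consumer"],
}
print(network_horizon_view(supply_net, "firm", 1))  # direct partners only
print(network_horizon_view(supply_net, "firm", 2))  # partners of partners
```

Comparing the two calls shows exactly what expanding the horizon from "1" to "2" buys an actor: visibility of the raw producer and the end consumer that a horizon-1 firm never sees.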
Smart Business Network Initiative
Smart Business Networks: How the Network Wins
Network Horizon and Dynamics of Network Positions
Eric van Heck holds the chair of information management and markets at Rotterdam School of Management, Erasmus University, where he conducts research and teaches on the strategic and operational use of information technologies for companies and markets. Peter Vervest is professor of business networks at the Rotterdam School of Management, Erasmus University, and partner of D-Age, corporate counsellors and investment managers for digital-age companies. Firstly, we would like to thank the participants of the 2006 SBNi discovery session that was held at the Vanenburg Castle in Putten, the Netherlands. Inspiring sessions among academics and executives shed light on the characteristics and the functioning of smart business networks. Secondly, we thank the reviewers of the papers for all their excellent reviews. We had an intensive review process and would like to thank the authors for their perseverance and hard work in creating an excellent contribution to this special issue. We thank Kevin Desouza, Max Egenhofer, Ali Farhoomand, Erwin Fielt, Shirley Gregor, Lorike Hagdorn, Chris Holland, Benn Konsynski, Kenny Preiss, Amrit Tiwana, Jacques Trienekens, and DJ Wu for their excellent help in reviewing the papers. Thirdly, we thank Andy Whinston for creating the opportunity to prepare this special issue of Decision Support Systems on smart business networks. key: cord-307735-6pf7fkvq authors: walkey, allan j.; kumar, vishakha k.; harhay, michael o.; bolesta, scott; bansal, vikas; gajic, ognjen; kashyap, rahul title: the viral infection and respiratory illness universal study (virus): an international registry of coronavirus 2019-related critical illness date: 2020-04-29 journal: crit care explor doi: 10.1097/cce.0000000000000113 sha: doc_id: 307735 cord_uid: 6pf7fkvq The coronavirus disease 2019 pandemic has disproportionately strained intensive care services worldwide. Large areas of uncertainty regarding epidemiology, physiology, practice patterns, and resource demands for patients with coronavirus disease 2019 require rapid collection and dissemination of data. We describe the conception and implementation of an intensive care database rapidly developed and designed to meet data analytic needs in response to the coronavirus disease 2019 pandemic: the multicenter, international Society of Critical Care Medicine Discovery Network Viral Infection and Respiratory Illness Universal Study. Design: prospective cohort study and disease registry. Setting: multinational cohort of ICUs. Patients: critically ill patients with a diagnosis of coronavirus disease 2019. Interventions: none. Measurements and main results: within 2 weeks of conception of the Society of Critical Care Medicine Discovery Network Viral Infection and Respiratory Illness Universal Study, study leadership was convened, registry case report forms were designed, electronic data entry was set up, and more than 250 centers had submitted the protocol for institutional review board approval, with more than 100 cases entered. Conclusions: the Society of Critical Care Medicine Discovery Network Viral Infection and Respiratory Illness Universal Study provides an example of a rapidly deployed, international pandemic registry that seeks to provide near real-time analytics and information regarding intensive care treatments and outcomes for patients with coronavirus disease 2019.
Key words: coronavirus disease 2019; registry. The coronavirus disease 2019 (COVID-19) pandemic has introduced unprecedented challenges to healthcare systems worldwide. Due to the effects of COVID-19 on the respiratory system, geographic areas affected by the pandemic have experienced large surges in critically ill patients who require intensive care and multiple organ system support (1, 2). In addition, case reports of medications hypothesized to reduce viral replication or systemic inflammation have spurred widespread off-label use without the usual level of evidence that has long been accepted in modern medicine, resulting in critical shortages of medications and frequently missing the opportunity to evaluate potential benefits as well as risks of such drugs (3). Large-scale data that enable rapid communication of patient characteristics, treatment strategies, and outcomes during a pandemic response would support nimble organizational planning and evaluation of effective critical care practices. We describe a novel ICU database rapidly designed in response to the COVID-19 pandemic to allow for near real-time data collection, analysis, and display: the multicenter, international Society of Critical Care Medicine (SCCM) Discovery Network Viral Infection and Respiratory Illness Universal Study (VIRUS). The SCCM formed Discovery, the critical care research network, in 2016 to provide a central resource linking critical care clinical investigators to scale up research in critical illness and injury, providing both centralized networking and tangible resources including data storage and management, statistical support, grant-writing assistance, and other project management needs. Remarkably, sites were recruited through the social media platform Twitter (4) (5); daily Twitter posts advertising the importance, availability, and purpose of the registry; and word of mouth.
Within 14 days of the initial Twitter post bringing together the VIRUS leadership team, a consortium of ~250 sites across North America, South America, East, South, and Western Asia, Africa, and Europe had submitted institutional review board (IRB) applications for participation in the registry (Fig. 1); Table 1 shows the number of sites reaching different landmarks of study participation during the first 2 weeks after the call for sites was announced. By week 2 after announcing the formation of the registry, data from more than 100 cases had been uploaded. The rapid enrollment of sites in the absence of external funding or support for the project is indicative of nearly universal enthusiasm across the international critical care community to collaborate across borders and silos in order to quickly learn from accumulating experience with COVID-19. The overarching purpose of the SCCM VIRUS Discovery database is to accelerate learning with regard to the epidemiology, physiology, and best practices in response to the COVID-19 pandemic. The de-identified, HIPAA-compliant database was developed to capture both core data collection fields containing clinical information collected for all patients, and an enhanced data set of daily physiologic, laboratory, and treatment information collected by sites with available research support and/or infrastructure allowing for more intensive data collection (see data overview in Fig. 2, and detailed case report forms in appendix 1, supplemental digital content 1, http://links.lww.com/ccx/a163). The case report forms were adapted from the World Health Organization template data collection forms (6), with edits to focus on an ICU-specific context. Case report forms went through rapid iterative editing to balance feasibility, efficiency, and comprehensiveness, with input from multiple clinical specialties. Many challenges exist when initiating an international collaboration in multicenter clinical data sharing. The SCCM VIRUS Discovery network will follow a four-pillar open science approach to data reporting and sharing. First, all centers have open access to their own data for internal quality assurance and pilot studies. Second, summary count data will be displayed on the SCCM VIRUS website (https://www.sccm.org/research/research/discovery-research-network/virus-covid-19-registry) as an interactive dashboard that will provide public reporting of real-time updates with regard to case counts, ICU resource use, and outcomes. Third, with appropriate data use agreements, investigators will be able to apply to use the pooled multicenter data for independent research questions. Fourth, the SCCM COVID-19 research team will identify urgent questions of clinical effectiveness, submit study protocols for independent methodological peer review, and design rigorous observational causal inference approaches (e.g., appropriate missing data methods, target trial emulation, use of directed acyclic graphs for covariate selection, quantitative sensitivity analyses [7]) paired with data visualizations that produce real-time results displayed on the dashboard for immediate dissemination. We seek to facilitate a timely, democratized, and crowd-sourced discovery process, similar to ICU databases such as the Medical Information Mart for Intensive Care (MIMIC-III) (8).
The SCCM VIRUS Discovery network will encourage all research projects using consortium data to post preprints in noncommercial archives, further facilitating the rapid reporting of research findings necessary for a nimble response to a pandemic. We learned many lessons in a short period while setting up an international ICU registry during a pandemic. Strategies that worked to facilitate rapid progress included a strong social media presence; open communications and data harmonization with other research networks (e.g., the National Heart, Lung, and Blood Institute Prevention and Early Treatment of Acute Lung Injury network); responsive IRBs that recognized the critical need for rapid approval of de-identified data collection in the setting of a pandemic; use of the established database infrastructure Research Electronic Data Capture (9) for construction of harmonized case report forms; an academic-professional society partnership that facilitated rapid processing of data use agreements; and early set-up of a central website for communication of study materials, frequently asked questions, and standard operating procedures. Early review of the literature and comparative research (10) on previous outbreaks was helpful in preparing the research resources; early reports from sentinel countries guided the targeting of relevant data fields. In addition, assembly of a multidisciplinary leadership team enabled multiple-stakeholder engagement and shared responsibility and mentorship across training levels. Finally, a daily reminder to focus on the goals of the database (ICU practices, physiology, and outcomes for patients with COVID-19) helped to mitigate scope creep and allowed for timely completion of the data infrastructure. Strategies that might have improved the process include the use of a central IRB, funding to support local data entry, and a preexisting team able to "flip the switch" on an existing infrastructure to immediately respond to a crisis. Much of the world was relatively unprepared for the rapidly spreading COVID-19 pandemic. Four days after the pandemic was recognized and declared by the World Health Organization, we assembled an ad hoc team to initiate the registry of critically ill patients with COVID-19 described herein. It is our profound hope that a similar registry will not be required in the future. However, it is likely that we will be applying lessons learned from COVID-19 to future pandemics. Our experience of quickly initiating the SCCM Discovery VIRUS registry and moving from conception to data accrual in less than a month has taught us several valuable lessons, the most important being that clinicians across the world want to donate their time for the greater good. As we continue to accrue data into the SCCM Discovery VIRUS COVID-19 registry, we anticipate that the newly established infrastructure and networks will enable more nimble responses to data collection and discovery that allow us to learn from the past and be better prepared for future pandemics. Supplemental digital content is available for this article. Direct URL citations appear in the HTML and PDF versions of this article on the journal's website (http://journals.lww.com/ccejournal). Dr. Harhay is partially supported by National Institutes of Health/National Heart, Lung, and Blood Institute grant R00 HL141678. The remaining authors have disclosed that they do not have any potential conflicts of interest.
For information regarding this article, e-mail: alwalkey@bu.edu
Clinical characteristics of coronavirus disease 2019 in China
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
FDA authorizes widespread use of unproven drugs to treat coronavirus, saying possible benefit outweighs risk
SCCM VIRUS COVID-19 registry. 2020. available at
ISARIC COVID-19 clinical research resources. 2020. available at
Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals
MIMIC-III, a freely accessible critical care database
Research Electronic Data Capture (REDCap)-a metadata-driven methodology and workflow process for providing translational research informatics support
Guide to understanding the 2019 novel coronavirus
key: cord-283793-ab1msb2m authors: chanchan, li; guoping, jiang title: modeling and analysis of epidemic spreading on community network with node's birth and death date: 2016-10-31 journal: the journal of china universities of posts and telecommunications doi: 10.1016/s1005-8885(16)60061-4 sha: doc_id: 283793 cord_uid: ab1msb2m abstract: In this paper, a modified susceptible-infected-susceptible (SIS) epidemic model is proposed on community-structure networks, considering the birth and death of nodes. Because the death of nodes changes the topology of the global network, the characteristics of a network with a death rate are discussed first. We then study the epidemic behavior based on mean-field theory and derive the relationships between the epidemic threshold and other parameters, such as the modularity coefficient, the birth rate, and the death rates (caused by disease or by other reasons). In addition, the stability of the endemic equilibrium is analyzed. Theoretical analysis and simulations show that the epidemic threshold increases with the increase of the two kinds of death rates, while it decreases with the increase of the modularity coefficient and the network size. With the development of complex network theory, many social, biological, and technological systems, such as transportation networks, the Internet, and social networks, can be properly analyzed from the perspective of complex networks, and many common characteristics of most real-life networks have been found, e.g., the small-world effect and the scale-free property. For some kinds of networks, the degree distributions have small fluctuations, and these are called homogeneous networks [1], e.g., random networks, small-world networks, and regular networks. In contrast to homogeneous networks, heterogeneous networks [2] show a power-law degree distribution. Based on mean-field theory, many epidemic models, such as susceptible-infected (SI), SIS, and susceptible-infected-recovered/removed (SIR), have been proposed to describe the epidemic spreading process and investigate the epidemiology. It has been demonstrated that a threshold value exists in homogeneous networks, while it is absent in heterogeneous networks of sufficiently large size [3]. Compared to the lifetime of individuals, the infectious period of the majority of infectious diseases is short. Therefore, in most epidemic models, researchers generally choose to ignore the impact of individuals' birth and death on epidemic spreading.
However, in real life, some infectious diseases have a high death rate and may result in people's death in just a few days or even a few hours, such as severe acute respiratory syndrome (SARS), hemagglutinin 7 neuraminidase 9 (H7N9), and, recently, Ebola. And some infectious diseases may have a longer spreading time, like HBV and tuberculosis. Besides, on the Internet, the adding and removing of nodes over time can also be treated as nodes' birth and death. In ref. [4], Liu et al. analyzed the spread of diseases with individuals' birth and death on regular and scale-free networks. They find that on a regular network the epidemic threshold increases with the increase of the treatment rate and death rate, while for a power-law degree distribution network the epidemic threshold is absent in the thermodynamic limit. Sanz et al. investigated a tuberculosis-like infection epidemiological model with constant birth and death rates [5]. It is found that the constant change of the network topology caused by the individuals' birth and death enhances the epidemic incidence and reduces the epidemic threshold. Zhang et al. considered the epidemic thresholds for a staged-progression model with birth and death on homogeneous and heterogeneous networks, respectively [6]. In ref. [7], an SIS model with a nonlinear infection rate, as well as birth and death of nodes, is investigated on heterogeneous networks. In ref. [8], Zhu et al. proposed a modified SIS model with a birth-death process and a nonlinear infection rate on an adaptive and weighted contact network. It is indicated that a fixed-weights setting can raise the disease risk, and that the variation of the weights cannot change the epidemic threshold but can affect the epidemic size. Recently, it has been revealed that many real networks have so-called community structure [9], such as social networks, the Internet, and citation networks. Many researchers focus on the study of epidemic spreading on community-structure networks. Liu et al. investigated epidemic propagation in the SIS model on a homogeneous network with community structure. They found that community structure suppresses the global spread but increases the threshold [10]. Many researchers have studied epidemic spreading in scale-free networks with community structure based on different epidemic models, such as the SI model [11], the SIS model [12], the SIR model [13][14], and the susceptible-exposed-asymptomatically-infected-recovered (SEAIR) model [15]. Chu et al. investigated epidemic spreading in weighted scale-free networks with community structure [16]. In ref. [17], Shao et al. proposed a traffic-driven SIS epidemic model in which the epidemic pathway is decided by the traffic of nodes in community-structure networks. It is found that community structure can accelerate epidemic propagation in the traffic-driven model, which differs from the traditional model. Social networks have the property of community structure, and some infectious diseases have high mortality rates or long infection periods, while previous studies only consider one of the aforementioned factors. So in this paper, we study epidemic spreading in a modified SIS epidemic model with birth and death of individuals on a community-structure network. The rest of this paper is organized as follows. In Sect. 2, we introduce in detail the network model and the epidemic spreading process, and discuss the network characteristics as well.
In Sect. 3, mean-field theory is utilized to analyze the spreading properties of the modified SIS epidemic model. Sect. 4 gives numerical results and simulations which support the theoretical analysis. At last, Sect. 5 concludes the paper. As the phenomena of individuals' birth and death exist in real networks, the topology of the network changes over time. We consider undirected and unweighted graphs in this paper. The generating algorithm of the network with community structure can be summarized as follows:
1) We assume that each site of this network is empty or occupied by only one individual.
2) The probability of a link between individuals (non-empty sites) in the same community is p_i.
3) We create a link between two nodes (non-empty sites) belonging to different communities with probability p_e.
4) Every site has its own state, which may change with the evolution of the epidemic. In each time step, susceptible individuals and infected individuals may die with probability α and β, respectively; the corresponding site then becomes empty, and the links of these sites are broken.
5) For each empty site, a susceptible individual may be born with probability b, and it then creates links with other individuals with probability p_i within the same community or p_e across communities.
Suppose the initial number of edges is K; it is determined by p_i, p_e, and the community sizes n_i (eq. (1)). The state transition rules of the transmission process are schematically shown in Fig. 1 (the schematic diagram of state transition rules). All sites of the network are described by the parameters E, S, or I, which respectively represent empty states, susceptible-individual occupations, and infected-individual occupations. The specific process is as follows: an empty site can give birth to a healthy individual at rate b; a healthy individual can be infected by contact with infected neighbors at rate λ or die at rate α (due to other reasons); an infected individual can be cured at rate γ or die at rate β (on account of the disease). When an individual dies, its site becomes empty. In general, β > α, and all the parameters above are non-negative. An important measurement for community-structure networks is the modularity coefficient [18], defined as
Q = Σ_i [ e_ii - (Σ_j e_ij)² ]    (5)
where e_ij denotes the proportion of the total network edges that run between community i and community j, so e_ii and Σ_j e_ij can be expressed in terms of p_i, p_e, and the total edge number K. Thus, for the given parameters m, n_i, and K, combining eqs. (1) and (5), we can adjust the values of p_i and p_e to obtain community-structure networks with various modularity Q. Since the network has a time-varying topology, it is necessary to characterize its statistical properties. We plot the curves of the average degree ⟨k⟩, the average path length L, and the average clustering coefficient C of the networks changing with time. In Fig. 2, the lateral axis denotes the time step; one time step equals one second. According to the statistics of the birth and death rates of our country in recent years, we can approximately assume the birth rate b = 0.01 and the natural death rate α = 0.01. Different infectious diseases have different mortality rates, and the mortality rate is affected by many factors (such as region and personal habits), so we set the disease death rate accordingly. In addition, the network size is 1 000.
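The generating algorithm and the modularity computation above can be sketched in a few lines. The Newman-Girvan form Q = Σ_i (e_ii - a_i²) with a_i = Σ_j e_ij is the standard definition from [18]; the edge-sampling details and the parameter values below are our illustrative assumptions, not the paper's exact code.

```python
import itertools
import random

def generate_community_network(sizes, p_i, p_e, rng=random.Random(1)):
    """Undirected graph: dense intra-community links (prob. p_i) and
    sparse inter-community links (prob. p_e), as in steps 2) and 3)."""
    communities, node = [], 0
    for s in sizes:
        communities.append(list(range(node, node + s)))
        node += s
    edges = set()
    for comm in communities:                         # intra-community edges
        for u, v in itertools.combinations(comm, 2):
            if rng.random() < p_i:
                edges.add((u, v))
    for ca, cb in itertools.combinations(communities, 2):  # inter-community
        for u in ca:
            for v in cb:
                if rng.random() < p_e:
                    edges.add((u, v))
    return communities, edges

def modularity(communities, edges):
    """Q = sum_i (e_ii - a_i^2); a_i is the fraction of edge endpoints in
    community i, e_ii the fraction of edges internal to community i."""
    comm_of = {u: i for i, c in enumerate(communities) for u in c}
    m = len(edges)
    e_ii = [0.0] * len(communities)
    a = [0.0] * len(communities)
    for u, v in edges:
        if comm_of[u] == comm_of[v]:
            e_ii[comm_of[u]] += 1 / m
        a[comm_of[u]] += 1 / (2 * m)
        a[comm_of[v]] += 1 / (2 * m)
    return sum(eii - ai ** 2 for eii, ai in zip(e_ii, a))

communities, edges = generate_community_network([100] * 10, p_i=0.1, p_e=0.001)
print(len(edges), modularity(communities, edges))
```

Raising p_i relative to p_e while holding the total edge count roughly fixed is exactly the knob the text describes for sweeping Q.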
As shown in Fig. 2, the larger the network's link number K is, the higher the clustering coefficient C is, and the smaller the average path length L is. The statistical property values remain unchanged for small β; this is because isolated nodes are not easily generated when the disease death rate is sufficiently small. The simulation results are averaged over 100 simulations. Let the parameters s and i represent the densities of healthy individuals and infected individuals in the entire network, and let s_i and i_i be, respectively, the densities of susceptible and infected nodes within community i. Based on the classical SIS model [19], we establish a modified SIS epidemic model considering the characteristics of community structure; the circumstances of nodes' birth and death are taken into consideration as well. The resulting mean-field equations are given as eqs. (6) and (7). Setting ds/dt = 0 and di/dt = 0, we get two steady-state solutions. For the first (disease-free) solution, we compute the Jacobian matrix J; its determinant |J| and trace tr J are given by eq. (14). If |J| > 0, then tr J < 0, and the solution is stable. From this we obtain the critical value λ_c (eq. (15)). For the second (endemic) solution, the Jacobian matrix is given by eq. (16), where a is the same as above. Clearly, when λ > λ_c, the second solution is stable and the disease will diffuse in the network; otherwise the disease will die out. From eq. (17), we find that the threshold value is governed by α, β, and b in a given network. In this section, we make a set of Monte-Carlo simulations on n-node networks to find the relationships between the epidemic size and different parameters, such as the modularity coefficient, the death rates, the birth rate, and the total edge number. The following simulation results are averaged over 100 configurations with different sets of random community sizes n_i (i = 1, 2, …, m), and for each configuration, 200 simulations are taken with one randomly chosen seed node initially. Fig. 3 shows the time-evolution curves of the epidemic size, where β equals 0, 0.001, and 0.005, respectively. The related parameters are n = 1 000, m = 10, k = 10 000, q = 0.3, λ = 0.1, b = 0.01, and α = 0.01. It is shown that when β ≠ 0, the epidemic size increases to a peak value and then decays towards a stable value; otherwise the epidemic size keeps increasing and finally reaches a steady state. The existence of a disease death rate can prevent the spread of the disease by directly decreasing the infected fraction. The maximum prevalence of epidemic spreading without considering nodes' disease deaths is the largest. In addition, a larger β corresponds to a smaller stable epidemic size, which agrees well with reality. Fig. 4 shows that the critical epidemic value decreases with the increase of the birth rate, while the epidemic prevalence increases with the increase of the birth rate. The arrows in Fig. 4 indicate the theoretical epidemic threshold calculated through eq. (17), which clearly shows that the birth rate is inversely proportional to the critical value; this is consistent with the simulation results in Fig. 4. In real life, with the increase of the birth rate, the density of the whole population and the healthy proportion increase, which makes it easier for an infectious disease to diffuse. Next, we plot the curves indicating the influence of the two kinds of death rates (natural death rate α and disease death rate β) on the epidemic threshold and the average disease prevalence. The arrows in Figs. 5 and 6 indicate the theoretical epidemic threshold. In Fig. 5, β is held constant at 0.05.
For some infectious diseases, such as acquired immune deficiency syndrome (AIDS), it is necessary to consider the situation of individuals' natural deaths. From Fig. 5, we find that the existence of the natural death rate α helps prevent the spread of the disease: an increase of the threshold and a decrease of the epidemic size are expected as α increases. Individuals' natural deaths decrease the density of the total population and thus restrain the propagation of the epidemic. The arrows in Fig. 5 indicate the theoretical epidemic threshold. Fig. 6 shows the effect of individuals' deaths caused by the disease on the epidemic threshold. The related parameters are b = 0.005, q = 0.3, k = 5 000, and α = 0.005. By comparison, it is found that the epidemic threshold increases with the growth of β, while the epidemic size decreases with the growth of β. Disease deaths rapidly reduce the number of infected individuals in the population; thus the existence of a disease death rate inhibits the epidemic spreading. In Fig. 7, we study the effects of both the modularity coefficient q and the edge number k of the network on the epidemic threshold; the figure plots the relationship between i_∞ and λ for different modularity coefficients q and edge numbers k. A larger k means that the individuals in the network are linked more closely. It is found that the epidemic threshold decreases with the increase of the modularity coefficient of the network, and the epidemic size of a network with a higher modularity coefficient is larger around the epidemic threshold, while the inverse situation occurs when the infection rate is far greater than the threshold. This is because the infectious disease is mainly transmitted within the community; when the propagation rate is sufficiently large, the infectious disease spreads throughout the network through the edges between communities. The edge density of a network with a higher modularity coefficient is small, which is not conducive to spreading between communities, thereby reducing the spreading size over the entire network. In addition, the epidemic threshold is inversely correlated with the total edge number k. This is consistent with real network circumstances. Considering the circumstances of nodes' birth and death that may exist in real networks, a modified epidemic model based on the classical SIS model is proposed on a community-structure network. An approximate formula for the epidemic threshold is obtained by mathematical analysis to find the relationships between the different parameters. Then the stability of the endemic equilibrium is analyzed. The simulations in this study illustrate that the epidemic threshold increases with the increase of the death rate (natural death or disease death), while it decreases with the increase of the birth rate, the modularity coefficient, and the edge number. This study is helpful for predicting the spreading trends of infectious diseases that may cause the deaths of individuals (such as Ebola and H7N9) more accurately than before.
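The state transitions of Fig. 1 translate directly into a compact Monte-Carlo step. This is a simplified sketch of ours, not the paper's simulation code: the topology is held fixed (the rewiring of links on birth and death is omitted for brevity), and infection is modeled as an independent probability λ per infected neighbor.

```python
import random

def step(states, adjacency, lam, gamma, b, alpha, beta, rng):
    """One synchronous update of the E/S/I site states from Fig. 1.
    E: empty site, S: susceptible occupant, I: infected occupant."""
    nxt = dict(states)
    for site, s in states.items():
        if s == "E":
            if rng.random() < b:          # birth of a susceptible individual
                nxt[site] = "S"
        elif s == "S":
            if rng.random() < alpha:      # natural death
                nxt[site] = "E"
            else:
                infected = sum(states[n] == "I" for n in adjacency[site])
                # independent exposure to each infected neighbor
                if rng.random() < 1 - (1 - lam) ** infected:
                    nxt[site] = "I"
        else:  # "I"
            if rng.random() < beta:       # death caused by the disease
                nxt[site] = "E"
            elif rng.random() < gamma:    # recovery back to susceptible
                nxt[site] = "S"
    return nxt

# Tiny illustrative run on a ring of 100 sites with one initial seed.
rng = random.Random(0)
adj = {i: [(i - 1) % 100, (i + 1) % 100] for i in range(100)}
states = {i: "S" for i in range(100)}
states[0] = "I"
for _ in range(200):
    states = step(states, adj, lam=0.1, gamma=0.05, b=0.01,
                  alpha=0.01, beta=0.005, rng=rng)
print(sum(s == "I" for s in states.values()), "infected after 200 steps")
```

Sweeping lam while holding the other rates fixed reproduces the qualitative threshold behavior the paper reports: below λ_c the infected fraction dies out, above it the system settles at an endemic level that shrinks as beta grows.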
Collective dynamics of 'small-world' networks
Emergence of scaling in random networks
Epidemic dynamics and endemic states in complex networks
The spread of disease with birth and death on networks
Spreading of persistent infections in heterogeneous populations
Staged progression model for epidemic spread on homogeneous and heterogeneous networks
Global attractivity of a network-based epidemic SIS model with nonlinear infectivity
Epidemic spreading on contact networks with adaptive weights
Proceedings of the 3rd International Conference on Image and Signal Processing (ICISP'08)
PhotoOCR: reading text in uncontrolled conditions
Characterize energy impact of concurrent network-intensive applications on mobile platforms
IODetector: a generic service for indoor outdoor detection
The case for VM-based cloudlets in mobile computing
Community structure in social and biological networks
Epidemic spreading in community networks
Epidemic spreading in scale-free networks with community structure
How community structure influences epidemic spread in social networks
Community structure in social networks: applications for epidemiological modeling
A stochastic SIR epidemic on scale-free network with community structure
Epidemic spreading on complex networks with community structure
Epidemic spreading in weighted scale-free networks with community structure
Traffic driven epidemic spreading in homogeneous networks with community structure
Finding and evaluating community structure in networks
Epidemic outbreaks in two-scale community networks
This work was supported by the National Natural Science Foundation of China.
key: cord-200354-t20v00tk authors: miya, taichi; ohshima, kohta; kitaguchi, yoshiaki; yamaoka, katsunori title: experimental analysis of communication relaying delay in low-energy ad-hoc networks date: 2020-10-29 sha: doc_id: 200354 cord_uid: t20v00tk In recent years, more and more applications use ad-hoc networks for local M2M communications, but in some cases, such as when using WSNs, the software processing delay induced by packet relaying may not be negligible. In this paper, we planned and carried out a delay measurement experiment using the Raspberry Pi Zero W. The results demonstrated that, in low-energy ad-hoc networks, the processing delay of the application is always too large to ignore; it is at least ten times greater than that of kernel routing and corresponds to 30% of the transmission delay. Furthermore, if the task is CPU-intensive, such as packet encryption, the processing delay can be greater than the transmission delay, and its behavior is represented by a simple linear model. Our findings indicate that the key factor for achieving QoS in ad-hoc networks is an appropriate node-to-node load balancing that takes into account the CPU performance and the amount of traffic passing through each node. An ad-hoc network is a self-organizing network that operates independently of pre-existing infrastructures, such as wired backbone networks or wireless base stations, by having each node inside the network behave as a repeater. It is a kind of temporary network that is not intended for long-term operation. Every node of an ad-hoc network needs to be tolerant of dynamic topology changes and have the ability to organize the network autonomously and cooperatively.
Because of these specific characteristics, since the 1990s ad-hoc networks have played an important role as a means of instant communication in environments where the network infrastructure is weak or nonexistent, such as developing countries, disaster areas, and battlefields. In recent years, however, the ad-hoc network has also become a hot topic in urban areas where broadband mobile communication systems are well developed and always available. More and more applications use ad-hoc networks for local M2M communications, especially in key technologies that are expected to play a vital role in future society, such as intelligent transportation systems (ITS) supporting autonomous car driving, cyber-physical systems (CPS) like smart grids, wireless sensor networks (WSN), and applications like the IoT platform. These days, communication entities are shifting from humans to things; network infrastructures tend to require stricter delay guarantees, and the ad-hoc network is no exception. There have been many prior studies on delay-aware communication in the field of ad-hoc networks [1]-[4]. Most of these focus on the link delay, and only a few consider both node and link delays [1], [2]. However, in some situations where the power consumption is severely limited (e.g., with WSNs), the communication relaying cost of small devices with low-power processors may not be negligible relative to the end-to-end delay of each communication. It is necessary to discuss, on the basis of actual data measured on wireless ad-hoc networks, how much the link and node delays account for in the end-to-end delay. In the field of wired networks, there have been many studies reporting measurement experiments on packet processing delay, as well as various proposals for performance improvement [5]-[10]. In addition, the best practice of QoS measurement has been discussed in the IETF [11]. In the past, measurement experiments on ASIC routers were carried out for the purpose of benchmarking routers working on ISP backbones [5]-[7]; in contrast, since the software router has emerged as a hot topic in the last few years, recent studies mainly concentrate on bottleneck analysis of the Linux kernel's network stack [8]-[10]. There has also been a study focusing on the processing delay caused by low-power processors, assuming interconnection among small robots [12]. However, as far as we know, no similar measurement exists in the field of wireless ad-hoc networks. Therefore, although many processing delay models have been considered so far, e.g., simple linear approximation [13] or queueing-model-based nonlinear approximation [14], it is hard to determine which one is the most reasonable for wireless ad-hoc networks. In this work, we analyze the communication delay in an ad-hoc network through a practical experiment using the Raspberry Pi Zero W. We assume an energy-limited ad-hoc network composed of small devices with low-power processors. Our goal is to support the design of QoS algorithms on ad-hoc networks by clarifying the impact of software packet processing on the end-to-end delay and presenting a general delay model to which the measured delay can be adapted. This is an essential task for future ad-hoc networks and their related technologies. First, we briefly describe the structure of the Linux kernel network stack in Sect. II. We explain the details of our measurement experiment in Sects. III and IV, and report the results in Sect. V. We conclude in Sect. VI with a brief summary and mention of future work.
In this section, we present a brief description of the Linux kernel's standard network stack from the viewpoints of the packet receiving and sending sequences. Figure 1 shows the flow of packets in the network stack from the perspective of packet queueing. First, as preparation for receiving packets, the NIC driver allocates memory resources in RAM that can store a few packets, and has packet descriptors (RX descriptors) hold these addresses. The RX ring buffer is a descriptor ring located in RAM, and the driver notifies the NIC of the head and tail addresses of the ring. The NIC then fetches some unused descriptors by direct memory access (DMA) and waits for packets to arrive. The workflow after packet arrival is as follows; as a side note, this is the receiving mechanism called New API (NAPI), supported in Linux kernel 2.6 and later.
i) Once a packet arrives, the NIC writes the packet out as an sk_buff structure to RAM with DMA, referring to the RX descriptors cached beforehand, and issues a hardIRQ after completion.
ii) The IRQ handler receiving the hardIRQ pushes it by napi_schedule() to the poll list of a specific CPU core and then issues a softIRQ so as to get the CPU out of the interrupt context.
iii) The soft IRQ scheduler receiving the softIRQ calls the interrupt handler net_rx_action() at the best timing.
iv) net_rx_action() calls poll(), which is implemented not in the kernel but in the driver, for each poll list.
v) poll() fetches sk_buffs by referring to the ring indirectly and pushes them to the application on the upper layer. At this time, packet data is transferred from RAM to RAM; that is, the data is copied from memory in the kernel space to the receiving socket buffer in the user space by memcpy(). This memory copy is repeated until the poll list becomes empty.
vi) The application takes the payload from the socket buffer by calling recv(). This operation is asynchronous with the above workflow in the kernel space. The packet receiving sequence is completed when all the payloads have been retrieved.
In the packet sending sequence, all packets basically follow the reverse path of the receiving sequence, but they are stored in a buffer called qdisc before being written to the TX ring buffer (Fig. 1). The ring buffer is a simple FIFO queue that treats all arriving packets equally. This design simplifies the implementation of the NIC driver and allows it to process packets fast. qdisc corresponds to the abstraction of the traffic queue in the Linux kernel and makes it possible to achieve a more complicated queueing strategy than FIFO without modifying the existing code of the kernel network stack or the drivers. qdisc supports many queueing strategies; by default, it runs in pfifo_fast mode. If the packet addition fails due to a lack of free space in qdisc, the packet is pushed back to the upper-layer socket buffer.
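One way to observe the softIRQ-side behavior described above on a running Linux host is /proc/net/softnet_stat, which exposes per-CPU NAPI counters. The column meanings used below (first: packets processed in net_rx_action(), second: drops, third: time_squeeze, i.e., budget exhaustions) follow common kernel documentation and should be checked against the kernel version in use; this is an observation aid of ours, not part of the paper's measurement tooling.

```python
def read_softnet_stat(path="/proc/net/softnet_stat"):
    """Per-CPU NAPI statistics: packets processed in net_rx_action(),
    packets dropped, and poll budget/time exhaustions (time_squeeze)."""
    stats = []
    with open(path) as f:
        for cpu, line in enumerate(f):
            fields = [int(v, 16) for v in line.split()]  # hex columns
            stats.append({
                "cpu": cpu,
                "processed": fields[0],
                "dropped": fields[1],
                "time_squeeze": fields[2],
            })
    return stats

for row in read_softnet_stat():
    print(row)
```

A rising time_squeeze count during a measurement run is a hint that net_rx_action() is repeatedly exhausting its budget, i.e., that the receive path itself is contributing to node delay.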
As discussed in Sect. I, the goal of this study is to evaluate the impact of software packet processing, induced by packet relaying, on the end-to-end delay, on the basis of an actual measurement assuming an ad-hoc network consisting of small devices with low-power processors. Figure 2 shows our experimental environment, whose details are described in Sect. IV. We define the classification of communication delays as below; both the processing delay and the queueing delay correspond to the application delay in a broad sense.
• End-to-end delay: total of node delays and link delays
• Node delay: sum of the processing delay, the queueing delays, and any processing delays occurring in the network stack
The proxy node (Fig. 2) relays packets with the three methods below, and we evaluate the effect of each in terms of the end-to-end delay. By comparing the results of OLSR and AT, we can isolate the delay caused by packets passing through the network stack.
• Kernel routing (OLSR): the proxy relays packets by kernel routing based on the OLSR routing table. In this case, the relaying process is completed in kernel space because all packets stay within L3 of the network stack. Accordingly, both the processing delay and the queueing delay defined above become zero, and the node delay is purely equal to the processing delay of the network stack in kernel space.
• Address translation (AT): the proxy works as a TCP/UDP proxy, and all packets are raised to the application running in user space. The application simply relays packets by switching sockets, which is equivalent to a fixed-length header translation.
• Encryption (ENC): the proxy works as a TCP/UDP proxy. Besides AT, the application also encrypts payloads using AES 128-bit in CTR mode, so that the relaying load depends on the payload size.
For each relaying method, we conduct measurements with variations of the following conditions:
• payload size
• packets per second (pps)
• additional CPU load (stress)
We express all the results as multiple percentile values in order to remove delay spikes. Because the experiment takes several days, we record the RSSI of the ad-hoc network, including five surrounding channels. In this section, we explain the technical details of the experimental environment and the measurement programs. We use three Raspberry Pi Zero Ws (see Table I for the hardware specs). The Linux distribution installed on the Raspberry Pis is Raspbian, and the kernel version is 4.19.97+. We use OLSR (RFC 3626), which is a proactive routing protocol, and adopt olsrd as its actual implementation. Since all three nodes are location-fixed, even if we used a reactive routing protocol like AODV instead of OLSR, only OLSR's periodic HELLO messages would be replaced by the periodic RREQs induced by route-cache expiry; that is, in this experiment, whether the protocol is proactive or reactive does not have a significant impact on the final results. The ad-hoc network uses channel 9 (2.452 GHz) of IEEE 802.11n, the transmission power is fixed to -31 dBm, and the bandwidth is 20 MHz. As WPA (TKIP) and WPA2 (CCMP) do not support ad-hoc mode, the network is not encrypted. Although the three nodes could configure an OLSR mesh, as they are located physically close to each other, we have the sender/receiver drop OLSR HELLOs from the receiver/sender, as well as the ARP responses, by netfilter, so that the network topology becomes a logically inline single-hop network, as shown in Fig. 2. We use iperf as a traffic generator and measure the UDP performance as it transmits packets from sender to receiver via the proxy. iperf embeds two timestamps and a packet ID in the first 12 bytes of the UDP data section (Fig. 3), and the measurement programs we implement, described below, use this ID to identify each packet. Random data is generated when iperf starts, getting entropy from /dev/urandom, and the same series is embedded in all packets.
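The 12-byte probe metadata can be packed and parsed in a few lines. The field order assumed here (a 32-bit packet ID followed by seconds and microseconds of the send time) is our reading of iperf's UDP datagram header and is an assumption for illustration, not a verified byte-for-byte match.

```python
import struct
import time

HEADER = struct.Struct("!iII")  # packet id, seconds, microseconds (assumed layout)

def make_probe(packet_id, payload_size):
    """Build a probe datagram: 12-byte header plus padding to payload_size."""
    now = time.time()
    sec, usec = int(now), int((now % 1) * 1_000_000)
    header = HEADER.pack(packet_id, sec, usec)
    return header + bytes(payload_size - HEADER.size)  # padding stands in for random data

def parse_probe(datagram):
    """Recover the id and a one-way delay estimate (valid only if the
    sender and receiver clocks are synchronized, e.g., via NTP/PTP)."""
    packet_id, sec, usec = HEADER.unpack_from(datagram)
    sent = sec + usec / 1_000_000
    return packet_id, time.time() - sent

probe = make_probe(42, 1000)
print(parse_probe(probe))
```

The same ID is what both the kernel module and the user-space proxy record alongside their own timestamps, which is what lets the per-packet delays be stitched together afterwards.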
We create a loadable kernel module using netfilter and measure the queueing delay in the receiving and sending UDP socket buffers. The workflow is summarized as follows: the module hooks the received packets with NF_INET_PRE_ROUTING and the sent packets with NF_INET_POST_ROUTING (Fig. 1), retrieves the packet IDs marked by iperf by indirectly referencing the sk_buff structure, and then writes them out to the kernel ring buffer via printk() with a timestamp obtained by ktime_get(). The proxy program is an application running in user space. It creates AF_INET sockets between sender and proxy as well as between proxy and receiver, and then translates IP addresses and port numbers by switching sockets. Furthermore, it records timestamps obtained by clock_gettime() immediately after calling recv() and sendto(), and encrypts every payload while protecting the first 12 bytes of metadata marked by iperf so that they are not rewritten. The above refers to the UDP proxy; the TCP proxy we prepared simply by using socat. We execute a dummy process whose CPU utilization rate is limited by cpulimit as controlled noise in the user space, in order to investigate and clarify its impact on the node delay. We performed the delay measurement experiments under the conditions shown in Table II using the methods described in the previous section. Due to space constraints, we omit the results of the preliminary experiment. Note that all experiments were carried out at the author's home; due to the Japanese government's declaration of the COVID-19 state of emergency, we had to stick to the "stay home" initiative unless absolutely necessary. The experiment was divided into nine measurements. Figure 4a shows the time variation of the RSSI during a measurement. We were unable to obtain SNRs owing to the specifications of the Wi-Fi driver, and thus the noise floors were unknown, but the ESSIDs observed in the five surrounding channels were all below -80 dBm. The RSSI variability was also within the range that did not affect the modulation and coding scheme (MCS) [15]; therefore, it appears that the link quality was sufficiently high throughout all measurements. Figures 4b, 4c, and 4d show the average time variations of the node delay, measured under the condition of 1000 bytes, 200 pps, and 0% stress. The blue highlighted bars indicate upper outliers (delay spikes) detected with a Hampel filter (σ = 3). There were 53 outliers in OLSR, 115 in AT, and 9 in ENC. In general, when the CPU receives periodic interrupts (e.g., routing updates, SNMP requests, GCs of RAM), packet forwarding is paused temporarily, so periodic delay spikes can be observed in the end-to-end delay. This phenomenon is called the "coffee-break effect" [7] and has been mentioned in several references [5], [8], [9]. In this experiment, as seen in the results of AT (Fig. 4c), it is evident that in low-energy ad-hoc networks the CPU-robbing by other processes, coffee-break-like events, had a significant impact on the communication delay. Incidentally, there were fewer spikes under both 1) OLSR and 2) ENC than under AT: 1) since packet forwarding was completed within kernel space, the node delay was less susceptible to applications running in user space; 2) since the payload encryption was overwhelmingly CPU-intensive, the influence of other applications was hidden and difficult to observe in the node delay.
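The spike detection above can be reproduced with a standard Hampel filter: flag a sample when it sits more than σ = 3 scaled median absolute deviations above a rolling median. The window length and the sample values below are our choices for illustration; only the σ = 3 upper-outlier criterion comes from the text.

```python
import statistics

def hampel_outliers(samples, window=25, n_sigma=3.0):
    """Indices of upper delay spikes:
    x > rolling median + n_sigma * 1.4826 * MAD (one-sided, as in the text)."""
    half = window // 2
    spikes = []
    for i, x in enumerate(samples):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        block = samples[lo:hi]
        med = statistics.median(block)
        mad = statistics.median(abs(v - med) for v in block)
        if mad and x > med + n_sigma * 1.4826 * mad:
            spikes.append(i)
    return spikes

delays = [110, 112, 109, 111, 950, 113, 108, 110, 1200, 111] * 5  # µs, illustrative
print(hampel_outliers(delays))
```

The 1.4826 factor rescales the MAD so that it estimates the standard deviation for Gaussian data, which is what makes "σ = 3" comparable to a three-sigma rule.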
there were no significant differences between olsr and at, which suggests that lifting packets to the application layer does not affect jitter. jitter increased in proportion to the payload size only in the case of enc. similarly, only in the case of enc with 200 pps or more, the packet loss rate tended to increase with payload size, drawing a logarithmic curve as seen in fig. 5c; in all other cases, no packet loss occurred regardless of the conditions. figure 6 shows the tendency of the node delay variation against several conditions, and fig. 7 shows the likelihood of occurrence as an empirical cdf. according to these figures, in the cases of olsr and at, the delay was nearly constant irrespective of pps and stress. there was a correlation between the variation and pps in olsr, while in at there was not; this suggests that application-level packet forwarding is less stable than kernel routing from the perspective of node delay. in the case of enc, the processing delay increased to the millisecond order and grew approximately linearly with respect to the payload size, and the delay variance became large overall. in addition, the graph tended to be smoothed as the pps increased; this arises from the fact that packet encryption takes up more cpu time, which makes the influence of other processes less conspicuous. it appears that the higher the pps, the lower the average delay (figs. 6a and 6b), and that the delay variance decreases around 1200 bytes (fig. 6c), but the causes of these remain unknown, and further investigation is required. one thing is certain: on the raspberry pi, pulling the packets up to the application through the network stack results in a delay of more than 100 microseconds. figure 8 shows the breakdown of the end-to-end delay and also describes the node-delay to link-delay ratio (nlr). as we saw in fig. 2, for this experimental environment, the end-to-end delay included two link delays, and the link delay shown in fig. 8 is the sum of them. the link delay was calculated from the effective throughput reported by iperf. as iperf does not support pps as an option, we achieved the target pps by adjusting the amount of transmitted traffic. the results showed that, in the cases of olsr and at, the nlr was almost constant with respect to the payload size, while in enc, it showed an approximately linear increase. the nlr was less than 5% in olsr, while in at, it was around 30%, which cannot be considered negligible. furthermore, node delay was greater than link delay when the payload size was over 1200 bytes in enc. in this work, we have designed and conducted an experiment to measure the software processing delay caused by packet relaying. the experimental environment is based on an olsr ad-hoc network composed of raspberry pi zero ws. the results were qualitatively explainable and suggested that, in low-energy ad-hoc networks, there are some situations where the processing delay cannot be ignored.

• the relaying delay of kernel routing is usually negligible, but when relaying is handled by an application, the delay can be more than ten times greater, however simple the task is.
• if an application performs cpu-intensive tasks such as encryption or full translation of protocol stacks, the delay increases linearly and can be greater than the link's transmission delay.

for this reason, node-to-node load balancing considering cpu performance or the amount of passing traffic could be extremely useful for achieving delay-guaranteed routing in ad-hoc networks.
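the nlr computation described above could be sketched as follows, approximating each link delay from the effective throughput reported by iperf; all numbers in the example call are placeholders, not measured values.

def link_delay_s(payload_bytes, throughput_bps):
    # time to push one udp payload onto the link at the effective rate
    return payload_bytes * 8 / throughput_bps

def nlr(node_delay_s, payload_bytes, throughput_bps, n_links=2):
    # the end-to-end path sender -> proxy -> receiver contains two links
    total_link_delay = n_links * link_delay_s(payload_bytes, throughput_bps)
    return node_delay_s / total_link_delay

# e.g., a 150 us node delay, 1000-byte payload, 20 mbit/s effective throughput
print(nlr(node_delay_s=150e-6, payload_bytes=1000, throughput_bps=20e6))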
particularly in heterogeneous ad-hoc networks (hanets), where each node's hardware specs differ from one another, the accuracy of transit node selection would have a significant impact on the end-to-end delay. as we did not take any noise countermeasures in this experiment, our future work will involve similar measurements in an anechoic chamber, to reduce the noise from external waves, and an investigation of the differences in the results.

[1] the upper limit of flow accommodation under allowable delay constraint in hanets
[2] a unified solution for gateway and in-network traffic load balancing in multihop data collection scenarios
[3] qos-aware routing based on bandwidth estimation for mobile ad hoc networks
[4] qos based multipath routing in manet: a cross layer approach
[5] measurement and analysis of single-hop delay on an ip backbone network
[6] dx: latency-based congestion control for datacenters
[7] experimental assessment of end-to-end behavior on internet
[8] measurement of processing and queuing delays introduced by an open-source router in a single-hop network
[9] a study of networking software induced latency
[10] scheme to measure packet processing time of a remote host through estimation of end-link capacity
[11] a one-way delay metric for ip performance metrics (ippm)
[12] real-time linux communications: an evaluation of the linux communication stack for real-time robotic applications
[13] characterizing network processing delay
[14] processor-sharing queues: some progress in analysis
[15] ieee 802.11n/ac data rates under power constraints

key: cord-285872-rnayrws3 authors: elgendi, mohamed; nasir, muhammad umer; tang, qunfeng; fletcher, richard ribon; howard, newton; menon, carlo; ward, rabab; parker, william; nicolaou, savvas title: the performance of deep neural networks in differentiating chest x-rays of covid-19 patients from other bacterial and viral pneumonias date: 2020-08-18 journal: front med (lausanne) doi: 10.3389/fmed.2020.00550 sha: doc_id: 285872 cord_uid: rnayrws3

chest radiography is a critical tool in the early detection, management planning, and follow-up evaluation of covid-19 pneumonia; however, in smaller clinics around the world, there is a shortage of radiologists to analyze the large number of examinations performed, especially during a pandemic. the limited availability of high-resolution computed tomography and real-time polymerase chain reaction in developing countries and regions of high patient turnover also emphasizes the importance of chest radiography as both a screening and a diagnostic tool. in this paper, we compare the performance of 17 available deep learning algorithms to help identify imaging features of covid-19 pneumonia. we utilize an existing diagnostic technology (chest radiography) and preexisting neural networks (darknet-19) to detect imaging features of covid-19 pneumonia. our approach eliminates the extra time and resources needed to develop new technology and associated algorithms, thus aiding front-line healthcare workers in the race against the covid-19 pandemic. our results show that darknet-19 is the optimal pre-trained neural network for the detection of radiographic features of covid-19 pneumonia, scoring an overall accuracy of 94.28% over 5,854 x-ray images. we also present a custom visualization of the results that can be used to highlight important visual biomarkers of the disease and disease progression. on march 11, 2020, the world health organization declared covid-19 an international pandemic (1).
the virus spreads among people via physical contact and respiratory droplets produced by coughing or sneezing (2). the current gold standard for diagnosing covid-19 pneumonia is real-time reverse transcription-polymerase chain reaction (rt-pcr). the test itself takes about 4 h; however, the process before and after running the test, such as transporting the sample and sending the results, requires a significant amount of time. pcr testing is not a panacea, as its sensitivity ranges from 70 to 98% depending on when the test is performed during the course of the disease and on the quality of the sample, and in certain regions of the world it is simply not routinely available. more importantly, the rt-pcr average turnaround time is 3-6 days, and it is also relatively costly at an average of ca$4,000 per test (3). the need for a faster and relatively inexpensive technology for detecting covid-19 is thus crucial to expedite universal testing. the clinical presentation of covid-19 pneumonia is very diverse, ranging from mild to critical disease manifestations. early detection becomes pivotal in managing the disease and limiting its spread. in 20% of the affected patient population, the infection may lead to severe hypoxia, organ failure, and death (4). to meet this need, high-resolution computed tomography (hrct) and chest radiography (cr, known as chest x-ray imaging) are commonly available worldwide. patterns of pulmonary parenchymal involvement in covid-19 infection and its progression in the lungs have been described in multiple studies (5). however, despite the widespread availability of x-ray imaging, there is unfortunately a shortage of radiologists in most low-resource clinics and developing countries to analyze and interpret these images. for this reason, artificial intelligence and computerized deep learning that can automate the process of image analysis have begun to attract great interest (6). note that an x-ray costs about ca$40 per test (3), making it a cost-effective and readily available option. moreover, the x-ray machine is portable, making it versatile enough to be utilized in all areas of the hospital, even in the intensive care unit. since the initial outbreak of covid-19, a few attempts have been made to apply deep learning to the radiological manifestations of covid-19 pneumonia. narin et al. (7) reported an accuracy of 98% on a balanced dataset for detecting covid-19 after investigating three pre-trained neural networks. sethy and behera (8) explored 10 different pre-trained neural networks, reporting an accuracy of 93% on a balanced dataset, for detecting covid-19 on x-ray images. zhang et al. (9) utilized only one pre-trained neural network, scoring 93% on an unbalanced dataset. hemdan et al. (10) looked into seven pre-trained networks, reporting an accuracy of 90% on a balanced dataset. apostolopoulos and bessiana (11) evaluated five pre-trained neural networks, scoring 98% accuracy on an unbalanced dataset. however, these attempts did not make clear which existing deep learning method would be the most efficient and robust for covid-19 compared to the many others. moreover, some of these studies were carried out on unbalanced datasets. note that a balanced dataset is a dataset where the number of subjects in each class is equal. our study aims to determine the optimal learning method for covid-19 detection by investigating different types of pre-trained networks on a balanced dataset.
additionally, we attempt to visualize the optimal network's weights, which were used for decision making, on top of the original x-ray image to visually represent the output of the network. we investigated 17 pre-trained neural networks: alexnet, squeezenet (12), googlenet (13), resnet-50 (14), darknet-53 (15), darknet-19 (15), shufflenet (16), nasnet-mobile (17), xception (18), places365-googlenet (13), mobilenet-v2 (19), densenet-201 (20), , inception-resnet-v2 (21), inception-v3 (22), resnet-101 (14), and vgg-19 (23). all the experiments in our work were carried out in matlab 2020a on a workstation (gpu nvidia geforce rtx 2080ti 11 gb, ram 64 gb, and intel processor i9-9900k @3.6 ghz). the dataset was divided into 80% training and 20% validation. the last fully connected layer was replaced to adapt each network to the new task of classifying two classes. the following parameters were fixed for the 17 pre-trained neural networks: the learning rate was set to 0.0001, the validation frequency was set to 5, the maximum number of epochs was set to 8, and the mini-batch size was set to 64. the class activation mapping was carried out by multiplying the image activations from the last relu layer by the weights of the last fully connected layer of the darknet-19 network, called "leaky18," as follows: $c(x, y) = \sum_{m=1}^{1024} w_m f_m(x, y)$, where $c$ is the class activation map, $l$ is the layer number, $f$ is the image activations from the relu layer ($l = 60$) with dimensions of 8 × 8 × 1,024, and $w$ refers to the weights at $l = 61$ with dimensions of 1 × 1 × 1,024. thus, the dimensions of $c$ are 8 × 8. we then resized $c$ to match the size of the original image and visualized it using a jet colormap. two datasets are used. the first dataset is the publicly available coronahack-chest x-ray-dataset, which can be downloaded from this link: https://www.kaggle.com/praveengovi/coronahack-chest-xraydataset. this dataset contains the following numbers of images: 85 covid-19, 2,772 bacterial, and 1,493 viral pneumonias. the second dataset is a local dataset collected from an accredited level i trauma center: vancouver general hospital (vgh), british columbia, canada. this dataset contains only 85 covid-19 x-ray images. the coronahack-chest x-ray-dataset contains only 85 x-ray images for covid-19, and to balance the dataset for neural network training, we had to downsize the sample from 85 to 50 by random selection. to generate the "other" class, we downsized the samples by selecting 50 radiographic images that were diagnosed as healthy to match and balance the covid-19 class. radiographs labeled as bacterial or other viral pneumonias have also been included in the study to assess specificity. the number of images used in training and validation to retrain the deep neural network is shown in table 1. data collected from vancouver general hospital (vgh) contained 58 chest radiographs with pulmonary findings ranging from subtle to severe radiographic abnormality, which were confirmed individually on visual assessment by two radiologists, with over 30 years of radiology experience combined, who provided the final interpretations. these 58 radiographs were obtained from 18 rt-pcr-positive covid-19 patients. serial radiographs acquired during a patient's hospital stay showing progressive disease were also included in the data set. the data set contained anteroposterior and posteroanterior projections. portable radiographs acquired in intensive care units with lines and tubes in place were also included in the data set.
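the paper's pipeline is matlab 2020a; a pytorch-based sketch of an equivalent fine-tuning setup is shown below for illustration. the 80/20 split, two-class output layer, learning rate, epoch count, and batch size follow the text, while the backbone choice (resnet-50 here), the adam optimizer, the image size, and the dataset path are assumptions.

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights="IMAGENET1K_V1")       # one of the 17 backbones
model.fc = nn.Linear(model.fc.in_features, 2)          # covid-19 vs. other
model = model.to(device)

tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("dataset1/train", tf)  # 80% split (assumed path)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

opt = torch.optim.Adam(model.parameters(), lr=1e-4)     # learning rate 0.0001
loss_fn = nn.CrossEntropyLoss()
for epoch in range(8):                                  # max epochs = 8
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()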
the images (true positives) submitted for analysis by the vgh team were anonymized and mixed with an equal number of normal chest radiographs to create a balanced data set. the remaining images from the coronahack-chest x-ray-dataset were used to test the specificity of the algorithm. dataset 2 was used as an external dataset to test the robustness of the algorithm, with a total of 5,854 x-ray images (58 covid-19, 1,560 healthy, 2,761 bacterial, and 1,475 viral pneumonias), as shown in table 1. note that there is no overlap between dataset 1 and dataset 2. to determine the optimal existing pre-trained neural network for the detection of covid-19, we used the coronahack-chest x-ray-dataset. the chest x-ray images dataset contains 85 images from patients diagnosed with covid-19 and 1,576 images from healthy subjects. five x-ray images collected in the lateral position were deleted for consistency. we then balanced the dataset to include 50 rt-pcr-positive covid-19 patients and 50 healthy subjects. from the group of 85 rt-pcr-positive cases, patients were randomly selected with varying extents of pulmonary parenchymal involvement. after creating a balanced dataset, which is important for producing solid findings, 17 pre-trained networks were analyzed following the framework shown in figure 1. the 17 pre-trained neural networks were originally trained on a large data set of more than a million images; as a result, the algorithms developed can classify new images into 1,000 different object categories, such as keyboard, mouse, pencil, and various animals. through artificial intelligence and machine learning, each network can classify images based on unique features representative of a particular category. by replacing the last fully connected layer, as shown in figure 1, and retraining (fine-tuning) each network, we obtained the results summarized in table 2. interestingly, we found that the following two pre-trained neural networks achieved an accuracy of 100% during the training and validation phases using dataset 1: resnet-50 and darknet-19. inception-v3 and shufflenet achieved an overall validation accuracy below 90%, suggesting that these neural networks are not robust enough for detecting covid-19 compared to, for example, resnet-50 and darknet-19. although inception-resnet-v2 was pre-trained on more than a million images from the imagenet database (21), it was not ranked the highest in terms of overall performance, suggesting that it is not suitable for detecting covid-19. each pre-trained network has a structure that differs from the others, e.g., in the number of layers and the size of the input. the most important characteristics of a pre-trained neural network are as follows: accuracy, speed, and size (24). greater accuracy increases the specificity and sensitivity of covid-19 detection. increased speed allows for faster processing. smaller networks can be deployed on systems with fewer computational resources. therefore, the optimal network is the one that increases accuracy, utilizes less training time, and is relatively small. typically, there is a tradeoff between the three characteristics, and not all can be satisfied at once. however, our results show that it is possible to satisfy all three requirements. darknet-19 outperformed all other networks, offering increased speed and increased accuracy in a relatively small network, as shown in figure 2, where a visual comparison between all investigated pre-trained neural networks is presented with respect to the three characteristics.
the x-axis is the training time (logarithmic scale) in seconds, the y-axis is the overall validation accuracy, and the bubble size represents the network size. note that darknet-19 and resnet-50 achieved an accuracy of 100%; however, darknet is much faster and requires less memory. a comparison of the optimal neural networks recommended in previous studies, along with the optimal neural network suggested by this work, is shown in table 3. narin et al. (7) used a balanced sample size of 100 subjects (50 covid-19 and 50 healthy). they investigated three pre-trained neural networks, resnet50, inceptionv3, and inceptionresnetv2, with a cross-validation ratio of 80-20%. they found that resnet50 outperformed the other two networks, scoring a validation accuracy of 98%. sethy and behera (8) explored 10 pre-trained neural networks on a balanced dataset and reported an accuracy of 93%. it is worth noting that the studies discussed in table 3 did not use other populations, such as bacterial pneumonia, to test specificity. moreover, they did not use an external dataset to test reliability; in other words, they had only training and validation datasets. note that we used two datasets: dataset 1 for training and validation and dataset 2 for testing. interestingly, the resnet-50 network achieved a high accuracy in three different studies. note that these studies only compared resnet-50 to a select few neural networks, whereas here we compared a total of 17. one possible reason that our resnet-50 achieved 100% is that the dataset (dataset 1) in our study differed from the datasets in other studies. another reason is the network's parameter settings (e.g., learning rate). however, darknet-19 also achieved a validation accuracy of 100%, and it is not clear which network more accurately detects radiographic abnormalities associated with covid-19 pneumonia. two approaches will be used to compare the performance of the darknet-19 and resnet-50 networks: (1) model fitting and (2) performance over dataset 2. 1. model fitting: achieving a good model fit is the target of any learning algorithm, providing a model that suffers from neither over-fitting nor under-fitting (25). typically, a well-fitted model is obtained when both training and validation loss curves decrease to a stability zone where the gap between the loss curves is minimal (25). this gap is referred to as the "generalization gap," and it can be seen in figure 3; the gap between the loss curves in darknet-19 is smaller than the gap in resnet-50. this suggests that darknet-19 is more optimal compared to resnet-50, even though both achieved 100% accuracy on the training and validation images using dataset 1. 2. performance over the testing dataset: in this step, the reliability and robustness of darknet-19 and resnet-50 over dataset 2 are examined. as can be seen in table 4, both neural networks were able to differentiate the patterns. as we are interested in finding the model that achieves high sensitivity with a minimal generalization gap, the optimal neural network to be used is darknet-19. the availability of efficient algorithms to detect and categorize abnormalities on chest radiographs into subsets can be a useful adjunct in clinical practice. darknet-19's accuracy in detecting radiographic patterns associated with covid-19 in portable and routine chest radiographs at varied clinical stages makes it a robust and useful tool. the use of such efficient algorithms in everyday clinical practice can help address the shortage of skilled manpower, contributing to the provision of better clinical care.
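a minimal sketch of how the dataset-2 comparison could be scored is given below: overall accuracy plus sensitivity (true-positive rate) for the covid-19 class. the label arrays are toy placeholders standing in for the 5,854 test images, and the positive-class encoding is an assumption.

import numpy as np

def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)
    pos = y_true == positive
    sens = np.mean(y_pred[pos] == positive)  # true-positive rate on covid cases
    return acc, sens

y_true = np.array([1, 1, 0, 0, 0, 1])  # 1 = covid-19, 0 = other (toy labels)
y_pred = np.array([1, 0, 0, 0, 0, 1])
print(accuracy_and_sensitivity(y_true, y_pred))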
more institution-based research is, however, required in this area. while the darknet-19 algorithm can distinguish covid-19 patients from other populations with 94.28% accuracy, we note the following limitations:

1. the covid-19 sample size used in the training and validation phase was relatively small (50 images).
2. the images were not segregated based on the technique of acquisition (portable or standard supine ap chest radiograph) or positioning (posteroanterior vs. anteroposterior). thus, any possible errors that might arise because of the patient's positioning have not been addressed in the study. lateral chest radiographs were excluded from the data set.
3. our investigation compared radiographic features of covid-19 patients to those of healthy individuals. as a next step in our investigation, the radiographic data from covid-19 patients should also be compared with other respiratory infections in order to improve the specificity of the algorithm for the detection of covid-19.

an important component of the automated analysis of the x-ray data is the visualization of the x-ray images, using colors to identify the critical visual biomarkers as well as to indicate disease progression. this step can make disease identification more intuitive and easier to understand, especially for healthcare workers with minimal knowledge about covid-19. the visualization can also expedite the diagnosis process. as shown in figure 4 (true positive), covid-19 subjects were identified based on the activation images and weights. examples of a false positive (a non-covid subject identified as covid), a false negative (a covid subject identified as non-covid), and a true negative (a non-covid subject identified as non-covid) are also shown. note that the main purpose of this paper is not to investigate the difference between pre-trained and trained neural networks; the purpose is rather to provide a solution based on already existing and proven technology to use for covid-19 screening. if the accuracy achieved by the pre-trained neural network is not acceptable to radiologists, then exploring different untrained convolutional neural networks could be worthwhile. also, including the patient's demographic information, d-dimer, oxygen saturation level, troponin level, neutrophil-to-lymphocyte ratio, glucose level, heart rate, degree of inspiration, and temperature may improve the overall detection accuracy. in conclusion, fast, versatile, accurate, and accessible tools are needed to help diagnose and manage covid-19 infection. the current gold standard laboratory tests are time-consuming and costly, adding delays to the testing process. chest radiography is a widely available and affordable tool for screening patients with lower respiratory symptoms or suspected covid-19 pneumonia. the addition of computer-aided radiography can be a useful adjunct in improving throughput and early diagnosis of the disease; this is especially true during a pandemic, particularly during the surge, and in areas with a shortage of radiologists. in this paper, we have reviewed and compared many deep learning techniques currently available for detecting radiographic features of covid-19 pneumonia. after investigating 17 different pre-trained neural networks, our results showed that darknet-19 is the optimal pre-trained deep learning network for the detection of imaging patterns of covid-19 pneumonia on chest radiographs.
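a sketch of the described overlay visualization is given below, assuming the 8 × 8 × 1,024 activation tensor and 1,024 fully-connected weights given earlier. the blending ratio and the nearest-neighbor resizing are arbitrary illustrative choices, and the input arrays here are random placeholders rather than real network outputs.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

def cam_overlay(xray, activations, weights):
    # activations: 8 x 8 x 1024 (last relu layer); weights: 1024 (last fc layer)
    c = np.tensordot(activations, weights, axes=([2], [0]))  # 8 x 8 map
    c = (c - c.min()) / (np.ptp(c) + 1e-9)
    # nearest-neighbor resize of c to the x-ray's resolution
    ys = np.linspace(0, 7, xray.shape[0]).round().astype(int)
    xs = np.linspace(0, 7, xray.shape[1]).round().astype(int)
    heat = cm.jet(c[np.ix_(ys, xs)])[..., :3]                # jet colormap
    gray = np.repeat(xray[..., None], 3, axis=2)
    return 0.6 * gray + 0.4 * heat                           # blended overlay

img = cam_overlay(np.random.rand(224, 224),
                  np.random.rand(8, 8, 1024), np.random.rand(1024))
plt.imshow(img); plt.axis("off"); plt.show()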
work to improve the specificity of these algorithms in the context of other respiratory infections is ongoing. the coronahack-chest x-ray-dataset used in this study is publicly available and can be downloaded from this link: https://www.kaggle.com/praveengovi/coronahack-chestxraydataset. requests to access the dataset collected at vancouver general hospital should be directed to savvas nicolaou, savvas.nicolaou@vch.ca. dataset 1 and all trained neural networks can be accessed via this link: https://github.com/elgendi/covid-19-detection-using-chest-x-rays. me designed the study, analyzed the data, and led the investigation. mn, wp, and sn provided an x-ray dataset, annotated the x-ray images, and checked the clinical perspective. me, mn, qt, rf, nh, cm, rw, wp, and sn conceived the study and drafted the manuscript. all authors approved the final manuscript. this research was supported by nserc grant rgpin-2014-04462 and the canada research chairs (crc) program. the funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

(1) who declares covid-19 a pandemic
(2) clinical features of patients infected with 2019 novel coronavirus in wuhan
(3) cost analysis of multiplex pcr testing for diagnosing respiratory virus infections
(4) characteristics of and important lessons from the coronavirus disease 2019 (covid-19) outbreak in china: summary of a report of 72,314 cases from the chinese center for disease control and prevention
(5) chest ct findings in patients with coronavirus disease 2019 and its relationship with clinical features
(6) deep learning in radiology
(7) automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks
(8) detection of coronavirus disease (covid-19) based on deep features
(9) covid-19 screening on chest x-ray images using deep learning based anomaly detection
(10) covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images
(11) covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks
(12) alexnet-level accuracy with 50× fewer parameters and <0.5 mb model size. arxiv
(13) going deeper with convolutions
(14) deep residual learning for image recognition
(15) open source neural networks in c
(16) an extremely efficient convolutional neural network for mobile devices
(17) learning transferable architectures for scalable image recognition
(18) xception: deep learning with depthwise separable convolutions
(19) mobilenetv2: inverted residuals and linear bottlenecks
(20) densely connected convolutional networks
(21) inception-v4, inception-resnet and the impact of residual connections on learning
(22) inception-v3 for flower classification
(23) very deep convolutional networks for large-scale image recognition
(24) imagenet large scale visual recognition challenge
(25) learning curve models and applications: literature review and research directions

key: cord-324256-5tzup41p authors: feng, shanshan; jin, zhen title: infectious diseases spreading on a metapopulation network coupled with its second-neighbor network date: 2019-11-15 journal: appl math comput doi: 10.1016/j.amc.2019.05.005 sha: doc_id: 324256 cord_uid: 5tzup41p

traditional infectious disease models on metapopulation networks focus on direct transportation (e.g., direct flights), ignoring the effect of indirect transportation.
based on the global aviation network, we turn the problem of indirect flights into a question of second neighbors, and propose a susceptible-infectious-susceptible model to study disease transmission on a connected metapopulation network coupled with its second-neighbor network (snn). we calculate the basic reproduction number, which is independent of human mobility, and we prove the global stability of the disease-free and endemic equilibria of the model. furthermore, the study shows that the behavior that all travelers travel along the snn may hinder the spread of disease if the snn is not connected. however, the behavior that individuals travel along the metapopulation network coupled with its snn contributes to the spread of disease. thus, for an emerging infectious disease, if the real network and its snn keep the same connectivity, indirect transportation may be a potential threat and needs to be controlled. our work can be generalized to high-speed train and rail networks, which may further promote other research on metapopulation networks. with the rapid development of technology, the rate of globalization has increased, which brings not only opportunities to countries, but also many challenges, such as the global transmission of infectious diseases. for example, severe acute respiratory syndrome (sars) [1], originating from guangdong province, china, spread around the world along international air travel routes. influenza a (h1n1) [2] in 2009, which was first reported in mexico, became a global issue, followed by the emergence of avian influenza [3], middle east respiratory syndrome coronavirus (mers-cov) [4], ebola virus disease [5], and zika [6]. the outbreak of any infectious disease has a great impact on humans, whether physically, mentally, or economically. how to forecast and control the global spread of infectious diseases has always been a focus of research. one effective method to address this problem is the introduction of metapopulation networks. a metapopulation network is a network whose nodes (subpopulations) represent well-defined social units, such as countries, cities, towns, and villages, with links standing for the mobility of individuals. using heterogeneous mean-field (hmf) theory and assuming that subpopulations with the same degree are statistically equivalent, colizza and vespignani proposed two models to describe the transmission of diseases on heterogeneous metapopulation networks under two different mobility patterns, which sheds light on the calculation of the global invasion threshold [7]. next, different network structures, including bipartite metapopulation networks [8], time-varying metapopulation networks [9], local subpopulation structure [10], and interconnected metapopulation networks [11], have been found to play an essential role in the global spread of infectious diseases. furthermore, studies have shown that the adaptive behavior of individuals contributes to the global spread of epidemics, contrary to their intentions [12-15]. these works mostly focus on large-scale, air-travel-like mobility patterns, without individuals going back to their origins. there are also some studies on recurrent mobility patterns. balcan and vespignani investigated the invasion threshold on metapopulation networks with recurrent mobility patterns [16,17]. heterogeneous dwelling times in subpopulations were considered in ref. [18].
nearly all the studies above are under the assumption that mobility between two linked subpopulations is based on direct flights or other direct transportation. for aviation networks, sometimes there is no direct flight when people travel, which makes them traverse other places before reaching their destinations. even in the case of a direct flight, individuals may have to make two or more stops before reaching their destinations. actually, these two cases reflect the same problem of individual transfer in a metapopulation network. for an emerging infectious disease, taking a single transfer as an example, the movement of infectious individuals may result in more susceptible subpopulations being infected, since infectious individuals in an infected subpopulation can arrive not only at its neighbors but also at the neighbors of its neighbors. to address this problem, we define the second neighbor and the second-neighbor network (snn) on an arbitrary undirected network. then we investigate the spread of an infectious disease on a connected metapopulation network coupled with its snn and study how indirect flights affect the global spread of the infectious disease. we show that the behavior that individuals travel along the metapopulation network coupled with its snn contributes to the spread of disease. the paper is organized as follows. in section 2, we introduce the second neighbor and give some definitions on the snn of an arbitrary undirected network. next, an infectious disease model is derived in section 3 to study how the transfer rate affects the global transmission of a disease. further, the basic reproduction number and the stability analysis of the model are given in section 4. section 5 presents some simulation results. conclusions are given in section 6. in order to investigate the effect of indirect flights on disease transmission on a metapopulation network, we introduce the concepts of second neighbor and snn in the following. definition 2.1. a second neighbor of node i in an (undirected) network is a node whose distance from i is exactly two. according to the definition above, j being a second neighbor of i means that there exists at least one self-avoiding path of length two from i to j. as illustrated in fig. 1, the number of self-avoiding paths of length two between two nodes may be larger than 1. since what we focus on is the existence of these paths and not their number, we say these paths are equivalent when the number of self-avoiding paths of length two is larger than 1. in a similar way, one can define the third neighbor and the kth neighbor: a third neighbor of node i in an (undirected) network is a node whose distance from i is exactly three, and a kth (k > 3) neighbor of node i is a node whose distance from i is exactly k. based on the definitions above, we give the definition of the snn. an snn for an undirected network is composed of all second neighbors of nodes. in other words, an snn keeps the same nodes as the given network, and a link between two nodes means that one node is a second neighbor of the other. fig. 2 illustrates an undirected network and its snn. in panel a, for node 1, for example, the second neighbors are nodes 2 and 6 according to the definition of second neighbor. in the same way, we obtain all second neighbors of each node and construct the snn (see panel b). similarly, one can get a third-neighbor network, a fourth-neighbor network, and so on.
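the definition of second neighbor above translates directly into code: j is a second neighbor of i iff some length-two path exists (the square of the adjacency matrix is positive) while i and j are neither identical nor directly linked. the sketch below illustrates this on a small toy ring network, not the specific network of fig. 2.

import numpy as np

def snn_adjacency(a):
    a = np.asarray(a)
    a2 = a @ a                       # a2[i, j] = number of length-2 paths
    b = ((a2 > 0) & (a == 0)).astype(int)
    np.fill_diagonal(b, 0)           # a node is not its own second neighbor
    return b

# toy example: a ring of six nodes
a = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]:
    a[i, j] = a[j, i] = 1
b = snn_adjacency(a)
print(b.sum(axis=1))                 # next-nearest degrees: 2 for every node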
consider a simple undirected network with n nodes, and label the nodes with integer labels $1, \ldots, n$. the adjacency matrix $a = (a_{ij})_{n \times n}$ is a matrix with entries $a_{ij} = 1$ if there is a link between nodes $i$ and $j$, and $a_{ij} = 0$ otherwise; similarly, the adjacency matrix of the snn is denoted by $b = (b_{ij})_{n \times n}$, with $b_{ij} = 1$ if node $j$ is a second neighbor of node $i$, and $b_{ij} = 0$ otherwise. accordingly, the number of second neighbors of node i, named the next-nearest degree, is $k^{(2)}_i = \sum_{j=1}^{n} b_{ij}$. similarly, we use $p^{(2)}_k$ to denote the probability that the number of second neighbors is exactly k. notice that $a^2 = (a^{(2)}_{ij})_{n \times n}$, where $a^{(2)}_{ij} \,(\geq 0)$ is the number of paths of length two between nodes i and j. with the definition of second neighbor, b can be uniquely expressed by a and $a^2$ via $b_{ij} = 1$ if $i \neq j$, $a_{ij} = 0$, and $a^{(2)}_{ij} > 0$, and $b_{ij} = 0$ otherwise. the matrix b is symmetric due to the fact that the matrix a is symmetric. for complete graphs and networks with all nodes' degrees being 0 or 1, $b = 0$, that is, b is a zero matrix. for an aviation network, apart from direct flights, the most common flights are indirect flights with one stop, since people are rarely willing to have more than one stop during their journey. actually, the network constructed by all these indirect flights is exactly the snn of the aviation network. thus, solving the problem of these indirect flights is equivalent to fixing the problem of the snn for a metapopulation network. consider a connected and undirected metapopulation network with n subpopulations coupled with its snn, and an acute infectious disease (such as influenza) with a susceptible-infectious-susceptible (sis) transmission process intra subpopulation. as illustrated in fig. 3, each node represents a population in which individuals with different disease states (blue circle for susceptible individuals, red pentagram for infectious individuals) are well-mixed, while links represent individuals' mobility between two nodes. here, dashed links represent the second-neighbor relationship. on the other hand, metapopulation networks are weighted, where weights measure the traffic flows between two linked subpopulations. for node i, let the weight of a link be defined as the probability with which individuals in node i travel along the link. in consideration of a general form, the weight matrix for a metapopulation network is of the form $w^{(1)} = (w^{(1)}_{ij})_{n \times n}$, where $w^{(1)}_{ij} \geq 0$ ($i, j = 1, \ldots, n$) and equality holds when $a_{ij} = 0$. in addition, the matrix $w^{(1)}$ satisfies the condition that each row sum equals one, that is, $\sum_{j=1}^{n} a_{ij} w^{(1)}_{ij} = 1$. in the same way, the weight matrix for its snn takes the form $w^{(2)} = (w^{(2)}_{ij})_{n \times n}$, where $w^{(2)}_{ij} \geq 0$, equality holds when $b_{ij} = 0$, and $\sum_{j=1}^{n} b_{ij} w^{(2)}_{ij} = 1$. furthermore, for the disease spreading process intra subpopulation, let β denote the transmission rate and γ denote the recovery rate of an infectious individual. referring to the mobility process inter subpopulations, the mobility rate at which an individual leaves a given subpopulation to its neighbors or second neighbors is denoted by δ. to depict the case of individual transfer, we denote by q the transfer rate, the rate at which an individual leaves a given subpopulation to its second neighbors, so the rate of an individual leaving a given subpopulation to its neighbors is $1 - q$. we note that $q = 0$ when $k^{(2)}_i = 0$. we assume that these rates are the same for all subpopulations and that these rates are all per unit time (per day). upon these bases, we consider the following model:

$\dot{s}_i = -\beta \dfrac{s_i i_i}{n_i} + \gamma i_i - \delta s_i + \delta (1-q) \sum_{j=1}^{n} a_{ji} w^{(1)}_{ji} s_j + \delta q \sum_{j=1}^{n} b_{ji} w^{(2)}_{ji} s_j,$   (3.1a)

$\dot{i}_i = \beta \dfrac{s_i i_i}{n_i} - \gamma i_i - \delta i_i + \delta (1-q) \sum_{j=1}^{n} a_{ji} w^{(1)}_{ji} i_j + \delta q \sum_{j=1}^{n} b_{ji} w^{(2)}_{ji} i_j.$   (3.1b)

for eq. (3.1a), the first and the second terms represent the disease transmission and recovery processes in a given subpopulation i, respectively. meanwhile, the latter three terms express the mobility process inter subpopulations. in detail, the fourth term represents the case that individuals arrive at subpopulation i from its neighbors, and the fifth term shows the case of individuals traveling along the snn.
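the reconstructed model (3.1) can be integrated numerically; the sketch below does so for a toy six-node ring with uniform weights, using scipy. the disease parameters match the simulation section, while the network size, the seeding, and q = 0.5 are assumptions for illustration.

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, a, b, w1, w2, beta, gamma, delta, q):
    m = a.shape[0]
    s, i = y[:m], y[m:]
    n = s + i
    infect = beta * s * i / np.maximum(n, 1e-12)
    # arrival to node i is sum_j a_ji w1_ji x_j, i.e. ((a * w1).T @ x)_i
    move = lambda x: (-delta * x
                      + delta * (1 - q) * (a * w1).T @ x
                      + delta * q * (b * w2).T @ x)
    return np.concatenate([-infect + gamma * i + move(s),
                           infect - gamma * i + move(i)])

# ring of 6 subpopulations, its snn, and uniform weights
a = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)
b = ((a @ a > 0) & (a == 0)).astype(float)
np.fill_diagonal(b, 0)
w1 = a / a.sum(axis=1, keepdims=True)
w2 = b / b.sum(axis=1, keepdims=True)

s0 = np.full(6, 1000.0); s0[0] -= 1.0        # one infectious seed
i0 = np.zeros(6); i0[0] = 1.0
sol = solve_ivp(rhs, (0, 200), np.concatenate([s0, i0]),
                args=(a, b, w1, w2, 0.4, 0.2, 0.1, 0.5))
print(sol.y[6:, -1])                          # infectious per node at t = 200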
the process of individuals in subpopulation i traveling to other subpopulations, including neighbors and second neighbors, is described by the third term, which is equivalent to the expression $\delta(1-q) s_i + \delta q\, s_i = \delta s_i$. remark 3.1. when $q = 0$, no individuals travel along the snn, and this model is the traditional metapopulation network model [19]. remark 3.2. $q = 1$ portrays the case that all individuals travel along the snn of the metapopulation network, which corresponds to the situation that governments prohibit direct flights when an infectious disease occurs, or to areas with underdeveloped economies and poor traffic; it is expressed by model (3.1) with $q = 1$, where all mobility takes place along the snn. for the metapopulation network, assuming that subpopulations with the same degree and next-nearest degree are statistically equivalent and that link weights depend on the degree (for the metapopulation network) and next-nearest degree (for the snn) of nodes, according to ref. [20], we obtain an equivalent mean-field model:

$\dot{s}_{k^{(1)},k^{(2)}} = -\beta \dfrac{s_{k^{(1)},k^{(2)}}\, i_{k^{(1)},k^{(2)}}}{n_{k^{(1)},k^{(2)}}} + \gamma i_{k^{(1)},k^{(2)}} - \delta s_{k^{(1)},k^{(2)}} + \delta(1-q)\, k^{(1)} \sum_{l^{(1)},l^{(2)}} \dfrac{p^{(1)}(l^{(1)}|k^{(1)})}{l^{(1)}}\, s_{l^{(1)},l^{(2)}} + \delta q\, k^{(2)} \sum_{l^{(1)},l^{(2)}} \dfrac{p^{(2)}(l^{(2)}|k^{(2)})}{l^{(2)}}\, s_{l^{(1)},l^{(2)}},$   (3.4a)

with the analogous equation for $i_{k^{(1)},k^{(2)}}$.   (3.4b)

here the subscripts are the degree $k^{(1)}$ and the next-nearest degree $k^{(2)}$, respectively. $n_{k^{(1)},k^{(2)}}$ is the average population of subpopulations with the same degree $k^{(1)}$ and the same next-nearest degree $k^{(2)}$, and the definitions of $s_{k^{(1)},k^{(2)}}$ and $i_{k^{(1)},k^{(2)}}$ are similar. $p^{(1)}(l^{(1)}|k^{(1)})$ denotes the conditional probability that a subpopulation with degree $k^{(1)}$ is connected to a subpopulation of degree $l^{(1)}$, and $p^{(2)}(l^{(2)}|k^{(2)})$ is the analogous quantity on the snn. summing eqs. (3.1a) and (3.1b) gives

$\dot{n}_i = -\delta n_i + \delta(1-q) \sum_{j=1}^{n} a_{ji} w^{(1)}_{ji} n_j + \delta q \sum_{j=1}^{n} b_{ji} w^{(2)}_{ji} n_j, \quad i = 1, \ldots, n.$   (3.6)

writing (3.6) in matrix form as $\dot{n} = m n$, $(-m)$ is a singular m-matrix. from (3.6), letting $n = \sum_{i=1}^{n} n_i$, we obtain that the total population n is constant (because $\dot{n} = 0$). subject to this constraint, by theorem 3.3 in [21], we show that (3.6) has a unique positive equilibrium $n_i = n^*_i$, which is globally asymptotically stable. since we are only interested in the asymptotic dynamics of the global transmission of the disease on the metapopulation network coupled with its snn, we will study the limiting system of (3.1):

$\dot{i}_i = \beta \dfrac{(n^*_i - i_i)\, i_i}{n^*_i} - (\gamma + \delta) i_i + \delta(1-q) \sum_{j=1}^{n} a_{ji} w^{(1)}_{ji} i_j + \delta q \sum_{j=1}^{n} b_{ji} w^{(2)}_{ji} i_j, \quad i = 1, \ldots, n.$   (3.7)

in this section, we calculate the basic reproduction number and prove the existence and stability of the disease-free equilibrium (dfe) and the endemic equilibrium (ee). before studying the global stability of the dfe, we calculate the basic reproduction number following the approach of van den driessche and watmough [22]. obviously, there exists a unique dfe $e_0 = (0, \ldots, 0)$ for system (3.7). according to eq. (3.7), the rate of appearance of new infections f and the rate of transfer of individuals out of the compartments v at $e_0$ are given by $f = \beta i_n$ and $v = (v_{ij})$, with $v_{ii} = \gamma + \delta$ and $v_{ij} = -\delta(1-q)\, a_{ji} w^{(1)}_{ji} - \delta q\, b_{ji} w^{(2)}_{ji}$ for $i \neq j$; here f and v are n × n matrices. using next-generation matrix theory [22], the basic reproduction number is $r_0 = \rho(f v^{-1})$, where ρ is the spectral radius of the matrix $f v^{-1}$. in the following, we calculate the basic reproduction number $r_0$. note that the sum of each column of matrix v is γ and the matrix v is column diagonally dominant, so v is an irreducible nonsingular m-matrix; thus $v^{-1}$ is a positive matrix. matrix v has column sum γ, i.e., $\mathbf{1}^{\mathrm{t}} v = \gamma \mathbf{1}^{\mathrm{t}}$, so $f v^{-1}$ has column sum β/γ. by theorem 1.1 in chapter 2 of ref. [23], the basic reproduction number is $r_0 = \beta/\gamma$. the threshold value $r_0$ depends only on the disease parameters β and γ and not on the mobility rate δ or the transfer rate q; thus, the mobility of individuals has no impact on the basic reproduction number. however, the movement of individuals between subpopulations accelerates the global spread of infectious diseases on metapopulation networks [24].
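the next-generation calculation above can be checked numerically: building f and v from the reconstructed entrywise formulas, the spectral radius of $f v^{-1}$ comes out as β/γ regardless of δ and q. the sketch below does so on the same toy ring network as before.

import numpy as np

# toy ring network, its snn, and uniform weights (as in the earlier sketch)
a = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)
b = ((a @ a > 0) & (a == 0)).astype(float)
np.fill_diagonal(b, 0)
w1 = a / a.sum(axis=1, keepdims=True)
w2 = b / b.sum(axis=1, keepdims=True)

beta, gamma, delta, q = 0.4, 0.2, 0.1, 0.5
f = beta * np.eye(6)
v = ((gamma + delta) * np.eye(6)
     - delta * (1 - q) * (a * w1).T - delta * q * (b * w2).T)
r0 = max(abs(np.linalg.eigvals(f @ np.linalg.inv(v))))
print(r0, beta / gamma)   # both 2.0: r0 is independent of delta and q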
notice that if $r_0 < 1$, then $e_0$ is locally asymptotically stable, while if $r_0 > 1$, $e_0$ is unstable. in fact, we can further prove the global stability of $e_0$. we give the following lemma first. proof. first, we show that $i_i(t) > 0$ for any $t > 0$, $i = 1, \ldots, n$, and initial value $i(0) \in n$. otherwise, assume that there exist $i_0 \in \{1, \ldots, n\}$ and $t_0 > 0$ such that $i_{i_0}(t_0) = 0$, and let $t^*$ be the first such time. then $i'_{i_0}(t^*) > 0$, but the definition of $t^*$ implies $i'_{i_0}(t^*) \leq 0$, which is a contradiction. second, we show that for any $t \geq 0$, $i_i(t) \leq n^*_i$, $i = 1, \ldots, n$. for any initial value $i(0) \in n$, let $x_i(t) = n^*_i - i_i(t)$. according to (3.7), we obtain the corresponding system for $x_i(t)$. we will show that for any $t > 0$, $x_i(t) > 0$. if this is not true, there exist $i_0$ ($1 \leq i_0 \leq n$) and $t_0 > 0$ such that $x_{i_0}(t_0) = 0$; obviously, this also leads to a contradiction. thus $i_i(t) \leq n^*_i$. define an auxiliary linear system, namely $\dot{i} = (f - v)\, i$ (4.1); the right side of (4.1) has coefficient matrix $f - v$. since (4.1) is a linear system and $r_0 < 1$, the dfe of this system is globally asymptotically stable. by the comparison principle, each non-negative solution of (3.7) satisfies $\lim_{t \to +\infty} i_i(t) = 0$. notice that $e_0$ is locally asymptotically stable; thus $e_0$ is globally asymptotically stable. next, we study the existence and global stability of the ee of system (3.7). proof. to prove the existence and global stability of the endemic equilibrium, we use the cooperative system theory of corollary 3.2 in [25]. in fact, let $f : n \to n$ be defined by the right-hand side of (3.7), and note that $f(\alpha i) > \alpha f(i)$ for all $\alpha \in (0, 1)$ and $i_i > 0$; thus f is strongly sublinear on n. by lemma 2 and corollary 3.2 in [25], we conclude that system (3.7) admits a unique ee $e^* = (i^*_1, \ldots, i^*_n)$, which is globally asymptotically stable. prior to simulating an infectious disease spreading on a metapopulation network coupled with its snn, it is necessary to make clear what the network topology is and which distribution the next-nearest degrees follow. in [26], newman derived an expression for the probability $p^{(2)}_k$ as $p^{(2)}_k = \sum_{m} p^{(1)}_m\, q^{*m}_k$, where $q^{*m}$ denotes the m-fold convolution of the excess-degree distribution, $q_k$ is the probability that the excess degree is exactly k, and it is given by $q_k = (k + 1)\, p^{(1)}_{k+1} / \langle k \rangle$. for a network with a small number of nodes or a simple structure (such as a regular network), this probability is easily calculated, while for a general complex network, calculating $p^{(2)}_k$ directly is complex. the introduction of generating functions makes this problem easier, but extracting an explicit probability distribution for next-nearest degrees is quite difficult. in fig. 4, we illustrate $p^{(2)}_k$ for two kinds of networks with average degree 7: homogeneous networks whose degrees follow a poisson distribution, and heterogeneous networks with power-law distributed degrees, which makes the distribution of next-nearest degrees clearer. obviously, the heterogeneity of network structures makes a big difference to $p^{(2)}_k$. for poisson-distributed degrees, the probability distribution of next-nearest degrees is almost symmetric about k = 49, the average next-nearest degree (the average number of second neighbors). in contrast, in the case where degrees follow a power-law distribution, next-nearest degrees present high heterogeneity, with k ranging from 4 to 537. however, the two distributions also share common features. in section 4, we calculated $r_0$, which is independent of the transfer rate, meaning that the transfer rate has little impact on the stability of the system.
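the next-nearest-degree distribution can also be examined empirically instead of analytically; the sketch below builds a configuration-model network with poisson degrees of mean 7 and measures the next-nearest degrees directly, whose mean should be close to the value 49 quoted above. networkx is used for illustration, and the network size and seed are assumptions.

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
deg = rng.poisson(7, 2000)
if deg.sum() % 2:                      # the degree sum must be even
    deg[0] += 1
g = nx.configuration_model(deg.tolist(), seed=0)
g = nx.Graph(g)                        # drop parallel edges
g.remove_edges_from(nx.selfloop_edges(g))

a = nx.to_numpy_array(g)
b = ((a @ a > 0) & (a == 0)).astype(int)
np.fill_diagonal(b, 0)
k2 = b.sum(axis=1)                     # empirical next-nearest degrees
print(k2.mean())                       # close to 49 for poisson networks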
hence, to know the significance of the transfer rate or the snn, we simulate an sis infectious disease on three kinds of metapopulation networks coupled with their respective snns. metapopulation networks with 2000 subpopulations are generated following the molloy-reed algorithm [27]. for the sake of simplicity, we assume that individuals in the same subpopulation travel along each link with the same probability. hence, the link weights are $w^{(1)}_{ij} = 1/k^{(1)}_i$ for the metapopulation network and $w^{(2)}_{ij} = 1/k^{(2)}_i$ for its snn. with regard to each subpopulation i, the initial population size depends on the degree of this subpopulation, i.e., $n_i = \bar{n}\, k^{(1)}_i / \langle k^{(1)} \rangle$, where $\bar{n}$ denotes the average population of the whole network. focusing on the effect of the transfer rate, we keep the other parameters unchanged: $\bar{n} = 1000$, β = 0.4, γ = 0.2, δ = 0.1. first, we consider the simplest connected metapopulation network, whose nodes are arranged in a straight line (named a linear metapopulation network), and simulate the spread of disease on this network coupled with its snn (shown in fig. 5). this network can be regarded as a regular network with degree 2 when the number of nodes is large enough. in each panel, we make a comparison among three values of the transfer rate: q = 0 for no transfer, q = 0.5 for half of the travelers choosing to transfer, and q = 1 for all individuals traveling along the snn. from fig. 5, the fractions of infected subpopulations and infectious individuals both increase almost linearly, and the speed of disease transmission when q = 0.5 is nearly twice as fast as in the other two cases. when the transmission process reaches a steady state, the fractions of infected subpopulations and infectious individuals when all individuals travel along the snn are nearly half of those in the other two cases. the reason is obvious: the snn is not connected, and it is composed of two linear subnets (shown in fig. 6, and verified in the sketch after this paragraph). the change in network connectivity hinders the spread of disease to some degree. however, a moderate transfer rate does accelerate the transmission of disease. these two results hold for all cases where the metapopulation network is connected while its snn is not. in such cases, controlling direct flights may limit the spread of disease to a relatively small area. second, we investigate two kinds of typical networks with the same average degree 7, where the networks and their snns have the same network connectivity. as illustrated in figs. 7 and 8, comparing the left and right panels of these two figures, we find that in the early phase of transmission the disease occurs and breaks out in a small number of subpopulations, and the number of infectious individuals increases slowly. when infectious individuals are present in the majority of subpopulations, the fractions of infectious individuals rise sharply and then reach a steady state in a short time. it is easily seen that the behavior of individual transfer accelerates the transmission of disease. increasing the transfer rate paves the way for infectious individuals to transmit the disease to more susceptible subpopulations. however, the transfer rate has little effect on the final fraction of infectious individuals, which is consistent with the theoretical results in section 4. although the transfer rate contributes to the global spread of infectious diseases, its effect differs with the heterogeneity of next-nearest degrees. in fig. 8, for power-law networks, a moderate transfer rate is most conducive to the spread of infectious diseases.
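the claim that the snn of a linear network splits into two linear subnets is easy to verify mechanically: on a path, second neighbors connect even positions only to even positions and odd only to odd. the sketch below confirms this on a small toy path (the paper's network has 2000 nodes).

import networkx as nx

n = 10                                      # small toy size
path = nx.path_graph(n)
snn = nx.Graph((i, j) for i in path for j in path
               if i < j and nx.shortest_path_length(path, i, j) == 2)
snn.add_nodes_from(path)
print(nx.is_connected(path))                # True: the path is connected
print(nx.number_connected_components(snn))  # 2: the even and odd chains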
however, when q = 1, the speed of transmission is slightly slower than for q = 0.5. for poisson networks (shown in fig. 7), as the transfer rate increases, the speed of spread displays an increasing trend, at odds with power-law networks. this is owing to the next-nearest degrees: for power-law networks (see fig. 4(a)), as the number of second neighbors climbs from 4 to 537, the weights of the snn links gradually decrease, which lowers the probability of individuals traveling to particular second neighbors. in contrast, the distribution of next-nearest degrees for poisson networks is relatively concentrated. under these circumstances, controlling direct flights may accelerate the global spread of disease; the role played by indirect flights cannot be ignored, and perhaps no travel at all would be the best measure. in this paper, we took the neglected indirect flights into account and put forward a definition of the snn for an undirected network. similar to general networks, we defined the adjacency matrix, the next-nearest degree, and its distribution on this network. upon these bases, we proposed a system of ordinary differential equations to capture the effect of the transfer rate on the global transmission of an infectious disease. next, we obtained the limiting system of the model and gave the expression of the basic reproduction number, which depends only on disease parameters. further, the global stability of the dfe and ee has been proven. then, we presented some simulation results on three kinds of connected metapopulation networks with different average degrees and different degree distributions. one is a linear metapopulation network with average degree approximately equal to 2, and the other two have the same average degree 7. we find that if the snn is not connected, controlling direct flights may hinder the spread of disease. on the contrary, if the snn is also connected, controlling direct flights may accelerate the spread of disease. in all cases, a moderate transfer rate contributes to the global spread of infectious diseases. in detail, for a linear network, the numbers of infected subpopulations and infectious individuals increase almost linearly. for a poisson network, second neighbors play the dominant role because of their relatively homogeneous distribution. however, for the other two networks, a moderate transfer rate is most conducive to the spread of infectious diseases, which means that although the existence of second neighbors may promote the global transmission of infectious diseases, the roles played by neighbors are still significant. therefore, when an infectious disease occurs, governments should adjust measures to local conditions. that is, if the network connectivity is reduced after controlling direct flights, this measure is effective; otherwise, if the network connectivity stays the same as the original network, this measure fails. it may be more effective to control all flights (direct or indirect) properly. our studies shed light on disease control and prevention, but there remain some problems to be solved. when people travel, they may traverse more than one place before reaching their destinations. this case is rare for aviation networks but common for high-speed train and rail networks. under similar hypotheses, our model can be generalized to the third-neighbor network, the fourth-neighbor network, and so on, and then be applied to high-speed train and rail networks.
it is worth noting that more stops may lead to a corresponding change of time scale; for example, an ordinary train is so slow that traveling between two distant places is time-consuming.

[1] forecast and control of epidemics in a globalized world
[2] pandemic potential of a strain of influenza a (h1n1): early findings
[3] human infection with a novel avian-origin influenza a (h7n9) virus
[4] assessing the pandemic potential of mers-cov
[5] assessing the impact of travel restrictions on international spread of the 2014 west african ebola epidemic
[6] potential for zika virus introduction and transmission in resource-limited countries in africa and the asia-pacific region: a modelling study
[7] epidemic modeling in metapopulation systems with heterogeneous coupling pattern: theory and simulations
[8] rendezvous effects in the diffusion process on bipartite metapopulation networks
[9] contagion dynamics in time-varying metapopulation networks
[10] effects of local population structure in a reaction-diffusion model of a contact process on metapopulation networks
[11] epidemic spread on interconnected metapopulation networks
[12] epidemic spreading by objective traveling
[13] modeling human mobility responses to the large-scale spreading of infectious diseases
[14] safety-information-driven human mobility patterns with metapopulation epidemic dynamics
[15] interplay between epidemic spread and information propagation on metapopulation networks
[16] phase transitions in contagion processes mediated by recurrent mobility patterns
[17] invasion threshold in structured populations with recurrent mobility patterns
[18] heterogeneous length of stay of hosts' movements and spatial epidemic spread
[19] human mobility and spatial disease dynamics
[20] mean-field diffusive dynamics on weighted networks
[21] a multi-species epidemic model with spatial dynamics
[22] reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission
[23] nonnegative matrices
[24] moment closure of infectious diseases model on heterogeneous metapopulation network
[25] global asymptotic behavior in some cooperative systems of functional differential equations
[26] networks: an introduction
[27] a critical point for random graphs with a given degree sequence

key: cord-308249-es948mux authors: dokuka, sofia; valeeva, diliara; yudkevich, maria title: how academic achievement spreads: the role of distinct social networks in academic performance diffusion date: 2020-07-27 journal: plos one doi: 10.1371/journal.pone.0236737 sha: doc_id: 308249 cord_uid: es948mux

behavior diffusion through social networks is a key social process. it may be guided by various factors, such as network topology, the type of propagated behavior, and the strength of network connections. in this paper, we claim that the type of social interaction is also an important ingredient of behavioral diffusion. we examine the spread of academic achievement among first-year undergraduate students through friendship and study assistance networks, applying stochastic actor-oriented modeling. we show that informal social connections transmit performance while instrumental connections do not. the results highlight the importance of friendship in educational environments and contribute to debates on behavior spread in social networks. the social environment has a significant impact on individual decisions and behavior [1-3]. people tend to assimilate the behavior, social norms, and habits of their friends and peers.
it is empirically shown that social interactions play a key role in the spread of innovations [4], health-related behavior [5, 6], alcohol consumption and smoking [7, 8], delinquent behavior [9, 10], happiness [11], political views [12, 13], cultural tastes [14], and academic performance [15-19]. although there is an extensive body of research showing that a large proportion of social practices disseminates across social networks [3], the question of what types of social contacts cause the spread of specific behaviors remains open [1, 20]. in this paper, we analyze the diffusion of academic performance across different types of student social networks. while these social networks are extensively studied in the literature [15-18, 21, 22], there is a lack of agreement on whether social networks are effective channels for the spread of academic performance [16, 18, 23], and, if they are, what types of networks serve best for the propagation of academic-related behavior. we analyze the spread of academic achievement within two different social networks of first-year undergraduate students. we test two mechanisms of academic performance diffusion in the student social networks. first, we analyze the spread of academic performance through the friendship network, which can be considered a network of informal social interactions.
influence by many peers, or so-called "complex contagion", results in faster and easier behavior adoption than influence by a single person, or "simple contagion" [26]. the efficacy of social contagion is often associated with the type of propagated behavior. centola and macy outline the danger of conducting social contagion studies in a 'whatever is to be diffused' way [30]. for example, the adoption of information is much less risky, costly, and time-consuming than the adoption of health-related behavior, sports habits, or academic achievements. the nature of social ties is also a significant factor for behavior transmission. social connections are traditionally divided into "weak" and "strong" ties, and they exhibit completely different spreading patterns [31]. strong ties are formed within dense network communities such as family or friends, while weak ties, according to granovetter's definition, emerge during the whole life and represent people who are marginally included in the network of contacts, such as old college friends or colleagues [31]. the empirical literature shows that both types of relationships can serve as channels for the diffusion of behavior or information [1, 20], but weak ties are important instruments for information propagation, while strong ties are more successful in costly behavior transmission. although the vast array of theoretical and empirical studies has improved our understanding of behavior transmission processes, there is still an open question regarding the differences of behavior spread in networks of different natures. social ties can vary both in the level of their strength and intensity, as we outlined above, and in their origins. networks can be based on friendship, romantic relations, advice seeking, social support, and many other relationships. despite the huge variance in social network types, the majority of the research on social diffusion is concentrated on networks of friendship ties. however, relationships of distinct nature can result in completely different behavior transmission processes. in this paper, we consider the transmission of academic performance within student social networks. this process has attracted the attention of researchers since the publication of the famous "coleman report" [21]. this report showed that students tend to obtain similar grades as their peers, classmates, and friends, and that this effect remains strong after controlling for a variety of socio-economic and cognitive variables. further empirical studies demonstrated the presence of this effect in various case studies. for example, it was shown that a student's grade point average (gpa) increases if her dormmate is in the highest 25th gpa percentile [32]. in [15], mba students tend to assimilate the grades of their friends and advisers. it was also demonstrated that this social influence is associated with the personal characteristics of students and the nature of their social connections. for instance, lower-achieving students are more influenced by their peers [33, 34], the diffusion of academic performance is stronger among women than men [35], can be related to the race of a peer [36, 37], and is stronger from close peers such as friends [38]. at the same time, online communication networks do not serve as effective channels for performance transmission. students tend to segregate in online networks based on their performance, and this prevents the diffusion of achievements through online ties [18, 19].
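to make the distinction concrete, the following minimal sketch (our own illustration, not from the paper; it assumes python with the networkx package) simulates a threshold contagion on a clustered small-world graph: with a threshold of one exposed neighbour it behaves as simple contagion, while a threshold of two requires reinforcement from multiple adopting neighbours, as in complex contagion.

import networkx as nx

def spread(graph, seeds, threshold, steps=50):
    """return the set of adopters of a threshold contagion started from `seeds`."""
    adopters = set(seeds)
    for _ in range(steps):
        new = {
            node
            for node in graph.nodes
            if node not in adopters
            and sum(nb in adopters for nb in graph.neighbors(node)) >= threshold
        }
        if not new:  # nothing changed: the process has converged
            break
        adopters |= new
    return adopters

G = nx.watts_strogatz_graph(n=200, k=6, p=0.05, seed=1)  # clustered "small world"
seeds = [0, 1, 2]  # three adjacent initial adopters
print("simple contagion (threshold 1):", len(spread(G, seeds, threshold=1)))
print("complex contagion (threshold 2):", len(spread(G, seeds, threshold=2)))

on such a graph the simple contagion typically reaches everyone, while the complex contagion spreads only where local clustering provides enough reinforcing neighbours, which is exactly the mechanism discussed above.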
summarizing, the majority of studies demonstrate that social networks are effective channels for performance diffusion. it was shown that achievements spread well within friendship networks, while other types of ties (e.g. online relationships) do not serve as channels for performance transmission. in this paper, we examine the diffusion of academic achievements in two distinct social networks: friendship and study assistance. we demonstrate that, despite the significant overlap between these networks, they exhibit different patterns of behavior transmission. we analyze longitudinal data on the friendship and study assistance networks and the gpa of a first-year student cohort of the economics department in one of the leading russian universities in the 2013-2014 academic year. in this university, students are randomly assigned by the administration to different study groups of up to 30 students. lectures are usually delivered to the whole cohort simultaneously, while seminar classes are delivered to each study group separately. in the first year, most of the courses are obligatory. therefore, students have a limited possibility to form networks with students from other groups, programs, or year cohorts. the academic year consists of four modules of two or three months. at the end of each module, students take final tests and exams. the grading system is on a 10-point scale where a higher score indicates a higher level of academic achievement. the course grade is the weighted average of midterm and final exams, homework, essays, and other academic activities during the course. the sample consists of 31% males and 69% females. the data for this study were gathered from two sources: the longitudinal student questionnaire survey (3 waves during the first academic year: october 2013, february 2014, and june 2014) and the university administrative database. in total, our dataset consists of 117 students who took part in at least two surveys, with up to 700 connections between them in total. the detailed over-time aspect of the networks gives us a rich dataset of links of diverse nature. the sample can be considered representative of student cohorts in selective universities. in the questionnaire survey, we ask students about their connections within their cohort. the questions were formulated in the following way: 1. please indicate the classmates with whom you spend most of your time together; 2. please indicate the classmates whom you ask for help with your studies. there were no limitations on the number of nominations. additionally, students were asked to indicate those classmates whom they knew before admission to the university. we also gather information about students' study-group affiliation from the administrative database. in total, we have four different network types: friendship, study assistance, knowing each other before studies, and being in the same study group. from the administrative database of the university, we gather data about student performance (grade point average, or gpa, at the end of the first year), which is measured on a scale from 0 to 10.
we transform the performance data from a continuous to a categorical scale and distinguish four performance groups based on the grading system of the university: high performing students (gpa equal to or higher than 8), medium high performing students (gpa from 6 to 8), medium low performing students (gpa from 4 to 6), and low performing students (gpa lower than 4). it is important to mention that information about individual student grades is publicly available in this university. this is common in some russian universities but very different from educational systems in the european union and the us. in russian universities, grades are often publicly announced by teachers to the class. in the studied university, final grades are additionally published online on the university website. this creates a specific case in which students know each other's grades and can coordinate their social connections depending on this information. individuals who did not participate in the questionnaire survey were excluded from the analysis (14 individuals, 10.7% of the sample). these missing data were not treated in a special way. we followed the recommendations of [40], suggesting that 'up to 10% missing data will usually not give many difficulties or distortions, provided missingness is indeed non-informative'. data collection procedures are described in the "data collection" section in s1 file. the descriptive statistics of the sample are presented in the si ("case description" section and tables 1-3 in s1 file). the network visualizations are presented in figs 1-6. standard statistical techniques such as regression models are not applicable for the analysis of social networks due to the interdependence of network observations [39]. therefore, we apply a stochastic actor-oriented model (saom) that allows us to reveal the coevolution of network properties and the behavior of actors [25, 40]. this dynamic model is widely used for studying the joint evolution of social networks and actor attributes, and for separating the processes of social selection and social influence. in total, we estimate two models: the first model estimates the coevolution of the friendship network and academic performance, the second one the coevolution of the study assistance network and academic performance. the saom's underlying principles are the following. firstly, network and behavior changes are modeled as markov processes, which means that the network state at time t depends only on the network state at time t-1. secondly, the saom is grounded in the methodological approach of structural individualism. it is assumed that all actors are fully informed about the network structure and the attributes of all other network participants. thirdly, time moves continuously and all macro-changes of the network structure are modeled as the result of a sequence of corresponding micro-changes. this means that an actor, at each point in time, can either change one of the outgoing ties or modify his or her behavior. the last principle is crucial for the separation of the social selection and social influence processes. there are four sub-components of the coevolution of network and behavior: the network rate function, the network objective function, the behavior rate function, and the behavior objective function [25, 40]. the rate functions represent the expected frequencies per unit of time with which actors get an opportunity to make network and/or behavioral micro-changes [40].
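as a minimal illustration (ours, plain python; the handling of the boundaries at exactly 6 and 4 is our assumption, since the paper only gives the ranges), the discretisation described above can be written as:

def performance_group(gpa):
    """map a gpa on the 0-10 scale to one of the four achievement groups."""
    if gpa >= 8:
        return "high"
    elif gpa >= 6:
        return "medium high"
    elif gpa >= 4:
        return "medium low"
    return "low"

print([performance_group(g) for g in (9.1, 7.0, 4.5, 3.2)])
# -> ['high', 'medium high', 'medium low', 'low']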
the objective functions are the primary determinants of the probabilities of changes. the probabilities of a network and/or behavior change are higher if the values of the objective functions for the network/behavior are higher [25, 40]. the objective functions for network change (eq 1) and behavior change (eq 2) are calculated as a linear combination of a set of components called effects:

f_i(\beta, x) = \sum_k \beta_k s_{ki}(x), (eq 1)

f_i^z(\beta^z, x, z) = \sum_k \beta_k^z s_{ki}^z(x, z), (eq 2)

where s_{ki}(x) are the analytical functions (also called effects) that describe the network tendencies [40]; s_{ki}^z(x, z) are functions that depend on the behavior of the focal actor i, but also on the behavior of his or her network partners and on the network position [22]; and \beta_k and \beta_k^z are statistical parameters that show the significance of the effects. saom coefficients are interpreted as logistic regression coefficients. parameters are unstandardized; therefore, the estimates for different parameters are not directly comparable. during the modeling, the saom allows the inclusion of endogenous and exogenous covariates. as endogenous variables, we include in our models network density, reciprocity, popularity, activity, transitivity, 3-cycles, transitive reciprocated triplets, and betweenness [40]. density and reciprocity show the tendency of students to form any ties and to form mutual ties. transitivity, 3-cycles, transitive reciprocated triplets, and betweenness measure the propensity of students to form triadic connections with their peers. popularity and activity are included to control for the tendency of actors to receive many ties from others and to nominate a large number of actors. to control for social selection, we include the selection effect based on academic achievement. it shows whether students with similar levels of academic achievement tend to form connections with each other. we also controlled for the tendency of students with high grades to increase their popularity and activity over time. to test for the presence of social influence, we include the effect of performance assimilation. it shows whether students tend to assimilate the academic achievement levels of their peers. in addition, we controlled for the propensity of students with high levels of popularity and activity to change their academic performance. in the model construction we follow the general network modeling requirements necessary for the saom [40]. all research protocols were approved by the hse (higher school of economics) committee on interuniversity surveys and ethical assessment of empirical research. all human subjects gave their informed verbal consent prior to their participation in this research, and adequate steps were taken to protect participants' confidentiality. table 1 presents the modeling results of the two separate models. in the first, we model the coevolution of the friendship network and academic performance. in the second one, we model the coevolution of the study assistance network and academic performance. social influence [effect 24] is positive and significant in the friendship network. this means that the academic performance of students tends to become similar to the performance of their friends. in other words, academic achievements diffuse through friendship ties. in the study assistance network, however, social influence is not present. this indicates that students
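saoms are typically estimated with the rsiena package in r; the following schematic python sketch (ours, not the authors' code) only illustrates the core of eq 1: each candidate micro-change is scored by a linear combination of effect statistics, and the change probabilities follow a multinomial-logit form, so higher objective values yield higher change probabilities.

import math

def objective(beta, stats):
    """f_i = sum_k beta_k * s_ki(x); `stats` holds the effect values s_ki."""
    return sum(b * s for b, s in zip(beta, stats))

def choice_probabilities(beta, candidates):
    """multinomial-logit probabilities over candidate micro-changes."""
    scores = [objective(beta, stats) for stats in candidates]
    m = max(scores)  # subtract the maximum to stabilise the exponentials
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# toy example: three candidate tie changes described by two effects
# (density, reciprocity) with hypothetical parameters beta = (-2.0, 1.5)
beta = (-2.0, 1.5)
candidates = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]  # made-up effect statistics
print(choice_probabilities(beta, candidates))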
do not assimilate the performance of their study assistants; this network channel does not propagate the spread of academic achievements. the positive indegree effect [effect 25] suggests that students who are often asked for help increase their performance over time. the non-significant estimates of the linear and quadratic shape parameters [effects 22 and 23] for friendship indicate that the influence of peers sufficiently explains the performance dynamics [15]. the negative effect of the quadratic shape parameter [effect 23] for the study assistance network shows the convergence of academic performance to a unimodal distribution [15]. the effect of performance selection [effect 19] is positive for the study assistance network. it suggests that students with similar levels of academic achievement tend to ask each other for help. the effect of social selection in the friendship network is not significant. this means that students do not have a preference to befriend students with similar academic achievements. positive estimates for the performance of alter [effect 17] in both social networks suggest that individuals with high performance are popular in the friendship and study assistance networks. the positive effect of the performance of ego [effect 18] for the friendship network shows that high performing students tend to create friendship connections. we find the presence of gender homophily [effect 16] in both the friendship and study assistance networks. students tend to create friendship and study assistance connections with individuals of the same gender. the positive effect of ego for males [effect 15] in the friendship network suggests that males tend to nominate more friends. the network control effects [effects 3, 4, 7, 8, 9, and 10] that were included in the models show the expected signs and significance scores, as in most student social networks [25]. the transitive reciprocated triplets effect shows that transitivity is less important for friendship ties when reciprocity is present (and vice versa) [41]. the combination of negative betweenness [effect 10] and positive transitivity [effect 7] in both networks demonstrates that individuals do not seek brokerage positions and do not want to connect peers from different network communities and study groups. the positive activity effect [effect 6] in the friendship network indicates that students with many ties tend to create new friendship relationships. the positive effect of popularity [effect 5] in the study assistance network suggests that individuals ask for help those students who are often asked for help by others. in the friendship network this effect is negative, which means that students do not tend to befriend popular individuals, i.e. those who already have a lot of friends. in both networks, the rate parameters are larger in the first period than in the second, indicating that tie formation stabilizes over time. the modeling results also show that students tend to create friendship and study assistance ties with individuals they knew before enrollment [effect 11] and individuals from the same study group [effect 12]. also, students tend to create friendship connections with their study assistants [effect 13.1], and they seek study assistance from their friends [effect 13.2]. we conducted the time heterogeneity test for both network models [40].
this test is used to examine whether the parameter values β_k of the objective function are constant over the periods of observation. we find time heterogeneity in the models. in both networks, parameters such as betweenness, acquaintance before enrollment, and the popularity and activity of high performing individuals are heterogeneous. in the friendship network, there is also time heterogeneity for the gender of alter and ego and for performance social selection and influence. (significance codes for table 1: *** p < 0.001, ** p < 0.01, * p < 0.05. the models converged according to the t-ratios for convergence and the overall maximum convergence ratio criteria suggested in [40]. goodness of fit is adequate for all models.) in the study assistance network, we find time heterogeneity for studying in the same group, the gender of ego, and the performance of ego. the cases of previous acquaintance or being in the same study group can be explained by the nature of these types of ties. for instance, acquaintance before enrollment can play a significant role at the beginning of studies, while after several months students will tend to expand their networks and will not seek connections with individuals they knew before their studies. the same explanation can be used for the case of being in the same study group. at the beginning of studies, students will form ties within their study groups, but later they will tend to expand their network and form ties with other group members. differences in the time heterogeneity of academic achievements may be related to the decreased statistical power of these effects between different models. the effects of academic performance on network evolution processes may be understood in detail by considering all the performance-related effects simultaneously [40]. in table 2, we present log-odds for performance selection within different achievement groups. the higher the estimate, the higher the probability of a study assistance tie formation between students from different performance groups. table 2 shows that there is a significant tendency toward the selection of high-performing individuals as study assistants, and this tendency is present among all groups of students. similarly, in table 3 we present precise estimates of the social influence process for all achievement groups. each row of the table corresponds to a given average behavior of the friends of an ego. values in the row show the relative 'attractiveness' of the different potential values of the ego's behavior. the maximum lying on the diagonal indicates that for each value of the average friends' behavior the actor 'prefers' to have the same behavior as all these friends [40]. this shows that individuals tend to assimilate their friends' performance. in this paper we explore academic performance diffusion through two social networks of different natures: friendship and study assistance. we empirically confirm that the educational outcomes of students are diffused in different ways within friendship and study assistance networks. ties in the friendship network transmit academic achievements, while ties in the study assistance network do not.
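since the estimates in tables 2 and 3 are on the log-odds scale, they can be converted into probabilities with the logistic transform; a small numeric illustration (ours, with a made-up estimate of 1.2):

import math

def logit_to_prob(log_odds):
    """logistic transform: probability implied by a log-odds value."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

print(round(logit_to_prob(1.2), 2))  # -> 0.77, all else being equal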
the absence of a social influence process alongside the presence of social selection in the study assistance network may suggest the presence of social segregation based on performance [19]. this can be related to the high competitiveness of the university environment under study. we expect that some students are highly motivated to receive higher grades and prefer to invest time and effort in their own high academic results rather than help their less academically successful peers. our findings demonstrate that the efficacy of academic achievement diffusion is determined by the nature of the social network. it has been established that social integration in the classroom is positively associated with higher academic performance of students [17, 19]. here we claim that it is extremely important to integrate individuals specifically into the network of informal friendship interactions and to motivate them to create connections with higher-performing students. these findings support the idea that the nature of social relationships is crucial for the transmission of specific types of information and behavior in social networks. close friendship relationships serve as effective channels for the spread of various complex behaviors, including very costly behavior types such as health behavior [1]. academic performance is one example of a behavior type that is not easily transmittable. in contrast, the instrumental study assistance ties do not produce a propagation of academic achievements from successful students to their lower-performing peers. to sum up, we show that costly and complex behavior (such as academic achievement) diffuses more effectively in a network of strong close connections such as friendship. these findings contribute to the current debates on behavior propagation in social networks and propose new insights on the factors that impact the success of behavior transmission. this study has several limitations. first, we analyze the social networks of first-year students. this time frame, when students start their educational path at the undergraduate level, receives a lot of attention in the literature [17, 19, 42] due to the fast speed of social tie formation. at the same time, it would be beneficial to investigate the diffusion of academic achievements through social networks along the full period of studies. second, we examine only two types of social relations; however, the spectrum of social ties that can serve as channels of performance diffusion is much wider. it is a potential avenue for future studies to estimate the effects of other types of social networks, such as cooperation, competition, romantic relationships, and negative ties, on the process of academic achievement diffusion. the data on some of these networks are difficult to collect (e.g., negative relationships) due to the high sensitivity of the studied relationships, but these types of ties can nevertheless be significant for behavior transmission. in the time of the covid-19 pandemic and after it, it is also extremely important to examine the effect of online networks on academic performance transmission because online interaction remains the only communication channel for students. our empirical findings have several policy implications. academic achievements are one of the key components of financial success and individual well-being [43, 44], which makes performance improvement one of the main goals of the educational system.
however, individual achievements are quite stable and largely driven by heritable factors [45], which makes interventions aimed at academic performance growth highly complex and difficult to implement. one of the possible mechanisms of performance increase is social influence, as we show in this paper. teachers can pay additional attention to the development of informal friendship relationships between students with various performance levels during classes. this can be achieved by group work assignments in which group membership is defined by the teachers and is not based on the personal preferences of students. long-term group assignments, such as working on a research project together, can stimulate students from different achievement groups to develop friendship ties with each other. the creation of recreation and open spaces within the university building can also give students with distinct performance levels additional options to meet, interact, and form friendship ties. the combination of these actions would help students to build and sustain their informal networks, which, in turn, serve as key channels of academic performance diffusion and lead to a positive behavior change. supporting information: s1 file (docx). author contributions: conceptualization, sofia dokuka, diliara valeeva, maria yudkevich; formal analysis, sofia dokuka. references: how behavior spreads: the science of complex contagions; social cohesion; social contagion theory: examining dynamic social networks and human behavior; network interventions; the spread of obesity in a large social network over 32 years; friendship as a social mechanism influencing body mass index (bmi) among emerging adults; dynamics of adolescent friendship networks and smoking behavior; teen alcohol use and social networks: the contributions of friend influence and friendship selection; peer influences on moral disengagement in late childhood and early adolescence; why and how selection patterns in classroom networks differ between students: the potential influence of network size preferences, level of information, and group membership; dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study; a 61-million-person experiment in social influence and political mobilization; peer networks and the development of illegal political behavior among adolescents; social selection and peer influence in an online social network; why are some more peer than others? evidence from a longitudinal study of social networks and individual academic performance; academic achievement and its impact on friend dynamics; integration in emerging social networks explains academic failure and success; formation of homophily in academic performance: students change their friends rather than performance; the rich club phenomenon in the classroom; complex contagions: a decade in review,
in: complex spreading phenomena in social systems; equality of educational opportunity (us department of health, education, and welfare, office of education); changing friend selection in middle school: a social network analysis of a randomized intervention study designed to prevent adolescent problem behavior; it's not your peers, and it's not your friends: some progress toward understanding the educational peer effect mechanism; norms, status and the dynamics of advice networks: a case study; introduction to stochastic actor-based models for network dynamics; the social origins of networks and diffusion; the spread of behavior in an online social network experiment; anomalous structure and dynamics in news diffusion among heterogeneous individuals; threshold models of collective behavior; complex contagions and the weakness of long ties; the strength of weak ties; peer effects with random assignment: results for dartmouth roommates; peer effects in academic outcomes: evidence from a natural experiment; does your cohort matter? measuring peer effects in college achievement; parental and peer influences on adolescents' educational plans: some further evidence; the effects of sex, race, and achievement on schoolchildren's friendships; students' characteristics and the peer-influence process; classroom peer effects and student achievement; a tutorial on methods for the modeling and analysis of social network data; reciprocity, transitivity, and the mysterious three-cycle; short-term and long-term effects of a social network intervention on friendships among university students; intelligence predicts health and longevity, but why?; influence of sex and scholastic performance on reactions to job applicant resumes; the stability of educational achievement across school years is largely explained by genetic factors.
key: cord-319658-u0wjgw50 authors: guven-maiorov, emine; tsai, chung-jung; nussinov, ruth title: structural host-microbiota interaction networks date: 2017-10-12 journal: plos comput biol doi: 10.1371/journal.pcbi.1005579 sha: doc_id: 319658 cord_uid: u0wjgw50 hundreds of different species colonize multicellular organisms, making them "metaorganisms". a growing body of data supports the role of microbiota in health and in disease. grasping the principles of host-microbiota interactions (hmis) at the molecular level is important since it may provide insights into the mechanisms of infections. the crosstalk between the host and the microbiota may help resolve puzzling questions such as how a microorganism can contribute to both health and disease. integrated superorganism networks that consider host and microbiota as a whole may uncover their code, clarifying perhaps the most fundamental question: how they modulate immune surveillance. within this framework, structural hmi networks can uniquely identify potential microbial effectors that target distinct host nodes or interfere with endogenous host interactions, as well as how mutations on either host or microbial proteins affect the interaction. furthermore, structural hmis can help identify master host cell regulator nodes and modules whose tweaking by the microbes promotes aberrant activity. collectively, these data can delineate pathogenic mechanisms and thereby help maximize beneficial therapeutics. to date, challenges in experimental techniques limit large-scale characterization of hmis.
here we highlight an area in its infancy which we believe will increasingly engage the computational community: predicting interactions across kingdoms, and mapping these onto host cellular networks to figure out how commensal and pathogenic microbiota modulate host signaling and, more broadly, the cross-species consequences. rather than existing as independent organisms, multi-cellular hosts together with their inhabiting microbial cells have been viewed as "metaorganisms" (also termed superorganisms or holobionts) [1]. millions of commensal, symbiotic, and pathogenic microorganisms colonize our body. together, they comprise the "microbiota". microbiota are indispensable for the host, as they contribute to the functioning of essential physiological processes including immunity and metabolism. hosts co-evolved with the microbiota. while some commensals are beneficial (symbionts), others may become harmful (pathobionts) [2, 3]. microbiota also shape immune system development. the immune system recognizes antigens of microorganisms, e.g. dna, rna, cell wall components, and many others, through pattern recognition receptors, such as toll-like receptors (tlrs), and downstream intracellular signaling circuitries are activated to generate immune responses [4]. however, like self-antigens, antigens from commensal microbiota are tolerated with no consequent inflammatory responses. this makes gut microbiota accepted as an "extended self" [5]. still, under some circumstances, commensals may act as pathogens. for example, staphylococcus aureus [6] and candida albicans [7] are human commensals, but in "susceptible" hosts they can undergo a commensal-to-pathogen transition. thus, identifying the microorganisms that reside in the host and, within these, those that are responsible for distinct host phenotypes, as well as the host pathways through which they act, are significant goals in host-microbiota research. microbiota survival strategies within the host are likely to be limited. analysis of their repertoire may reveal core modules, thereby helping in classification, mechanistic elucidation and profile prediction. here we provide an overview of structural host-microbiota interaction networks from this standpoint. the host interacts with microbiota through proteins, metabolites, small molecules and nucleic acids [8, 9]. the microbiota employs a range of effectors to modulate host cellular functions and immune responses. they have sophisticated relationships with the host, and network representation enables an effective visualization of these relationships [10]. most proteins of bacterial and eukaryotic pathogens are not accessible to bind to host proteins, but some of their proteins either bind to host surface receptors [11] or enter the host cell and interact with host cytoplasmic proteins. various bacterial species have a secretion system, a syringe-like apparatus, through which they inject bacterial effectors directly into the host cell cytoplasm [12]. via hmis, they specifically hone in on key pathways, alter host physiological signaling, evade the host immune system, modify the cytoskeletal organization [13, 14], alter membrane and vesicular trafficking [2, 11, 13], promote pathogen entry into the host, shift the cell cycle [15, 16], and modulate apoptosis [17]. all are aimed at ensuring their survival and replication within the host. host signaling pathways that are targeted by microbiota and turned on or off may change the cell fate.
unraveling the hmis of both commensals and pathogens can elucidate how they repurpose host signaling pathways and help develop new therapeutic approaches. hmis have complex and dynamic profiles. studies often focus on individual protein interactions and try to explain the pathogenicity of a microorganism with a single interaction. however, considering host-microbiota interactions one at a time may not reflect the virulence scheme [18]. for instance, replication of vaccinia virus necessitates the establishment of a complex protein interaction network [19], and hence focusing on only one hmi is incomplete and may be misleading. at any given time, hundreds of different species reside in the gut. different microbial compositions, and hence effector protein combinations from these microbial species, may have additive (cross-activation) or subtractive (cross-inhibition) [4] impacts on the host pathways, which lead to signal amplification or inhibition, respectively (fig 1). since numerous bacteria will be sensed by the host immune system at any given time, more than one signaling cascade will be active in a cell. communication and crosstalk among active, or active and inhibited, pathways determine the ultimate cellular outcome [4]: to survive, die, or elicit immune responses. the combinatorial ramifications of all active (or suppressed) host pathways and hmis will be integrated to shape the type and magnitude of the response, and thus the cell state. to tackle the pathogenicity challenge, it is reasonable to concomitantly consider all host pathways and hmis. transkingdom (metaorganism) network analysis is a robust research framework that considers host and microbiota as a whole [1]. here we ask how interspecies (superorganism) networks can facilitate the understanding of the role of microbiota in disease and health. we focus on host-microbiota protein interaction networks since many bacteria- or virus-induced pathological processes require physical interactions of host and microbial proteins [20]. the availability of genome-wide high-throughput omics data makes it possible to associate microbiota with certain host phenotypes at multiple levels and to construct host-pathogen interaction networks at the transcriptome [21], proteome [22], and metabolome levels [23]. (fig 1: combinatorial effects of microbial effectors and the active host pathways determine the cell response. different microbial compositions secrete different effector combinations; the additive effects of two pathways with the same outcome amplify the signal and promote inflammation (cross-activation), whereas the subtractive effects of pathways with opposing outcomes result in no inflammation (cross-inhibition).) steps toward the construction of host-microbiota networks of gene [1], mrna [24], protein-protein interaction (ppi) [25-28], and metabolic networks [29] have already been taken.
within this framework we highlight molecular mimicry, a common strategy that microorganisms exploit to bind to host proteins and perturb its physiological signaling. mimicry of interactions of critical regulatory nodes in core network modules in the immune system, may be a major way through which pathogens adversely subvert-and commensal microbiota may beneficially modulate-the host cell. microbiota developed several strategies to interact with host proteins and modulate its pathways. one efficient way is molecular mimicry, which has been extensively reviewed in our recent study [9] . molecular mimicry can take place at four levels: mimicking (i) both sequence and 3d structure of a protein, (ii) only structure without sequence similarity, (iii) sequence of a short motif-motif mimicry, and (iv) structure of a binding surface without sequence similarity-interface mimicry. interface mimicry (protein binding surface similarity) seems to be the most common type of molecular mimicry. global structural similarity is much rarer than interface similarity both within and across species. thus, employing interface mimicry instead of full-length sequence or structural homology allows microbes to target more host proteins. molecular mimicry follows the principle suggested over two decades ago that proteins with different global structures can interact in similar ways [30] [31] [32] . interface mimicry is frequently observed within intra-[33-35] and inter-species [18, 36] (fig 2) (intra-species interface mimicry: distinct proteins from the same species having the same/similar interfaces; inter-species interface mimicry: proteins from different species hijack the same interface architectures). interface similarity allows proteins to compete to bind to a shared target. if an interface is formed between proteins from the same species, it is an 'endogenous interface'. if it is formed by proteins from two different species, it is an 'exogenous interface' [18, 36] . endogenous (intra-species) interfaces mimic each other [33] [34] [35] , and exogenous (inter-species) interfaces mimic endogenous interfaces (fig 2) [18, 36]. by mimicking endogenous interfaces, exogenous interfaces enable pathogenic proteins to compete with their host counterparts and hence rewire host signaling pathways for their own advantage [9] . they can either inhibit or activate a host pathway. for example, the helicobacter pylori secreted protein caga interacts with human tumor suppressor tp53bp2, inhibits apoptosis and allows survival of infected host cells [37] . however, map protein of e. coli and sope protein of salmonella bacteria bind and activate human cdc42, a rho gtpase, and trigger actin reorganization in the host cell, facilitating bacterial entry into the host [38]. one of the most significant pattern recognition receptor families in the innate immune system is the tlr family. its members detect diverse bacterial compounds, like peptidoglycan, lipopolysaccharide, and nucleic acids of bacteria and viruses. they induce pro-inflammatory or anti-viral responses. once activated, they recruit other tir-containing proteins such as mal and myd88 or tram and trif through their cytoplasmic tir domains, forming the myd88-and trif-dependent tir domain signalosomes, respectively [39]. myd88 also assembles into a myddosome structure through its death domain together with irak4 and irak1/2 death domains. 
the myddosome then recruits e3 ubiquitin ligases, either traf6 or traf3, to catalyze the addition of k63-linked ubiquitin chains to themselves, which serve as a docking platform for other proteins to bind, such as tak1. subsequently, the nf-κb and mapk pathways are activated. in the nf-κb pathway, tak1 phosphorylates and activates ikk. activated ikk in turn phosphorylates iκb, which is the inhibitor of nf-κb. phosphorylated iκb is then ubiquitylated by other e3 ubiquitin ligases (k48-linked ubiquitin chain) and targeted for proteasomal degradation. this liberates the p65 subunit of nf-κb to translocate to the nucleus and initiate transcription. in the mapk pathway, tak1 serves as a map3k that activates the erk1/2, p38 and jnk pathways. the trif-dependent downstream path of tlrs recruits traf3 and leads to the activation of interferon regulatory factors (irfs) and the production of key antiviral cytokines, interferons (ifns). the tlr pathway is regulated by several endogenous negative regulators to prevent excess inflammation [40]. since this is one of the major immune pathways, its signaling is targeted by diverse microorganisms at various steps (fig 3): for example, microbial proteins harboring tir domains compete with endogenous tir-containing proteins, interfere with the assembly of the tir-domain signalosome and prevent downstream signaling. since these microbial proteins do not enzymatically modify the endogenous proteins, elucidation of their inhibition mechanism requires structural information. the availability of the structures of their complexes with the orchestrators of the tlr pathway can clarify how they inhibit downstream signaling. microbial proteases prevent both tlr-induced mapk and nf-κb signaling and lead to proteasomal degradation of the key orchestrators in these pathways: nled of e. coli cleaves jnk and p38, inhibiting the mapk pathway, and nlec cleaves p65, inhibiting nf-κb [46]. there are also bacterial acetyltransferases and other effectors [57, 58] that inhibit components of this pathway to limit ifn production [59]. here, we listed only a couple of microbial proteins targeting the tlr pathway as examples. there are many others. the tlr pathway does not constitute the whole innate immune system; other immune pathways also need to be considered, as well as how these microbial proteins affect them as a whole. this can help foresee what kind of responses the coordinated actions of these pathways together with tlrs would generate. most cellular processes are elicited by proteins and their interactions. graph representations of ppi networks, where proteins are the nodes and their interactions are edges, are helpful for delineating the global behavior of the network. topological features of networks, such as degree (number of edges), betweenness-centrality (how a node affects the communication between two nodes), lethality-centrality, hubs (proteins with high node-degree, i.e. several interaction partners), non-hubs (with only a few partners), and bottlenecks (nodes with high betweenness-centrality), help characterize the importance of the nodes, i.e. the contribution of a node to network integrity [60, 61]. (fig 2: interface mimicry. (a) a, b, c and d are host proteins and p is a pathogenic protein. protein a has two interfaces: through the blue interface it binds to b, and through the grey interface it binds to c and d. c and d employ similar interfaces to bind to a, so endogenous interfaces mimic each other. pathogenic protein p has an interface similar to that of b and competes for binding to the blue interface on a: an exogenous interface mimicking an endogenous one. (b) the f1l protein of variola virus interacts with human bid (5ajj:ab.pdb) and inhibits apoptosis in the host cell by hijacking the interface between human bid and bclxl (4qve:ab.pdb): an exogenous interface mimicking an endogenous one. human mcl1 binds to human bid (5c3f:ab.pdb) in a very similar fashion to bclxl: endogenous interfaces mimicking each other.) early on, hubs were classified as either party or date hubs. while party hubs interact with many partners at the same time since they use distinct interfaces, date hubs interact with their partners one at a time due to their overlapping interfaces. to infer whether a hub is a party or a date hub, structural information (interface residues) [62] or gene expression data (co-expressed proteins have higher chances of interacting with each other) [63] were used. later on, this definition was questioned. among the reasons were the many examples where a protein node can serve concomitantly as a party and a date hub. large assemblies typically fall into this category. biological networks are often scale-free, with many non-hubs and fewer hubs [64, 65]. not all nodes have the same effect on the network: random node attacks do not harm the network as much as removing hubs from scale-free networks [66]. degree and betweenness-centrality are measures of the contribution of nodes to network integrity. there are also "essential" nodes, the knock-out of which leads to lethality: a feature also known as "lethality-centrality". attack on a hub by microbiota is likely to influence the cell, either resulting in lethality or in beneficial modulation. thus, integrated superorganism interaction networks may suggest candidate host and microbial node targets. structural interspecies networks and their topological features can shed light on how microbiota alter host signaling and what the outcome will be in different settings. available hmi networks demonstrate that different bacteria often hijack the same host pathway in distinct ways [12], like the tlr pathway subversion by numerous microbial species (fig 3). however, importantly, the same host pathway is often targeted at several nodes, which was suggested to guarantee modulation of cellular function [12]. although there are a number of examples of constructed networks of host-pathogen superorganism interactions [12, 19, 67-75], there are far fewer attempts to integrate 3d structural data with hmi networks [18]. traditional network representation has low resolution, missing important details. structural interaction networks, however, provide a higher resolution with mechanistic insights. they can decipher and resolve details that are not obvious in binary interaction networks [36]. the potential of structural networks in unraveling signaling pathways was demonstrated earlier [39, 40, 76, 77]. they are essential to fully grasp the mechanisms exerted by pathogens to divert host cell signaling and attenuate immune responses. fig 4 displays an example of a structural hmi network, showing how host ppis can be affected by hmis. structures can detail which endogenous host ppis are disrupted by the hmis, possible consequences of mutations on either host or pathogenic proteins, and whether variants of a virulence factor in different strains of the same species have distinct hmis.
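as a minimal sketch (our toy example with made-up node names, assuming python with the networkx package), the topological measures above can be computed directly, flagging hubs by degree and bottlenecks by betweenness-centrality:

import networkx as nx

G = nx.Graph()
host_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"),
              ("E", "F"), ("E", "G"), ("C", "H")]
G.add_edges_from(host_edges)  # endogenous host PPIs
G.add_edge("P", "A")          # exogenous HMI: pathogen effector P targets host node A

degree = dict(G.degree())                   # node degree (number of edges)
betweenness = nx.betweenness_centrality(G)  # betweenness-centrality

hubs = [n for n, d in degree.items() if d >= 4]
bottlenecks = sorted(betweenness, key=betweenness.get, reverse=True)[:2]
print("hubs:", hubs)
print("top bottlenecks:", bottlenecks)

in this toy network the effector p attacks the highest-degree node a, which is the pattern the cited studies report for many viral and bacterial proteins.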
for instance, the pro-35 residue of the hiv accessory protein vpr is at the interface with human cypa, and its mutation to alanine abrogates the interaction [78]. the structure of the cypa-vpr complex shows that pro-35 is at the interface. if the structure of the vpr-cypa complex were unknown, it would have been difficult to understand why, or how, this mutation disrupts the ppi. previously built structural hmi networks demonstrated that endogenous interfaces that are hijacked by pathogens are involved in multiple transient interactions [18, 36]. these endogenous interfaces exhibit 'date-like' features, i.e. they are involved in interactions with several endogenous proteins at different times [18, 36]. hub and bottleneck proteins at the crossroads of several host pathways were suggested to be the major targets of viral and bacterial proteins [26, 28], and interface mimics allow transient interactions with the hub [79]. this allows them to interfere with multiple endogenous ppis. it was proposed that microorganisms causing acute infections, which are dramatic for the host, are likely to interfere with hubs, whereas others that lead to persistent infections tend to target non-hubs [80]. during acute infection, pathogens replicate very quickly and are transmitted to new hosts. however, during chronic infections, they adapt to the host environment, which allows them to reside there for a long period of time. thus, how microbiota target certain proteins and pathways at the molecular level is of paramount importance. detecting the hmis, mapping them onto networks, and determining their 3d structures as complexes are the major steps in constructing structural hmi networks. despite the progress in experimental techniques, it is still challenging to determine structures of ppi complexes, particularly hmis. since large-scale experimental characterization of host-pathogen ppis is difficult, time consuming, and costly, experimentally verified hmi data are scarce. it is important to note that available endogenous protein structures are biased towards permanent, rather than transient, interactions. if the majority of hmis are transient, this presents another hurdle since they will be under-represented in the structural space. several hmi databases have been developed, such as phisto [81], hpidb [82], proteopathogen [83], patric [84], phi-base [85], phidias [86], hopaci-db [87], virhostnet [88], virbase [89], virusmentha [90], hcvpro [91], and likely some others as well. however, these databases cover only a limited number of pathogens and their interactions. given the thousands of species residing in the host, thousands of hmis are yet to be identified. computational approaches are becoming increasingly important in prioritizing putative hmis and complementing experiments. hence, the construction of comprehensive metaorganism networks and increasing the coverage of the host-microbiota interactome will still mostly rely on computational models in the near future [92]. computational modeling of intra-species interactions is a well-established area; detection of inter-species interactions is relatively new. available computational tools to predict host-pathogen interactions have recently been reviewed by nourani et al. [93]. current methods mostly depend on global sequence and structure homology. sequence-based methods focus only on orthologs of host proteins.
however, sequence by itself is insufficient to detect the targets of pathogenic proteins because several virulence factors do not have any sequence homologs in human. for instance, the vaca protein of helicobacter pylori, the most dominant species in gastric microbiota, has a unique sequence that does not resemble any human protein [94] . still, it alters several host pathways [95] . with sequence-based methods, it is impossible to find hmis for vaca. as noted above, global structural mimicry is much rarer than interface mimicry. hence, utilizing interface similarity, rather than global structural similarity in a computational approach would generate a more enriched set of hmi data together with atomic details [9] . several studies suggested that the available interface structures are diverse enough to cover most human ppis [96] [97] [98] [99] . therefore, success of template-based methods for prediction of human ppis is very high [34] . since exogenous interfaces mimic endogenous ones, both available endogenous and exogenous interface structures can be used as templates to detect novel hmis. thanks to the rapid increase in the number of resolved 3d structures of human-pathogen ppis in recent years [100] and advances in structural and computational biology, the performance of interface-based methods is expected to increase. both experimental and computational approaches have false-positives and false-negatives with varying rates depending on the approach. although the coverage of interface-based methods is higher, their false-positive rate is also higher. despite this, attempts to complete the host-microbiota interactome will improve our knowledge of microbiota and their roles in health and disease. advances in host-microbiota research will revolutionize the understanding of the connection between health and a broad range of diseases. building the rewired host-microbiota multiorganism interaction network, along with its structural details, is vital for figuring out the molecular mechanisms underlying host immune modulation by microbiota. topological features of such networks can reveal the selection of host targets by the microbiota. structural details are essential to fully grasp the mechanisms exerted by microbiota to subvert the host immunity. identification of the hmis will also help drug discovery and integrated superorganism networks would suggest how inhibition of an hmi can influence the whole system. here we highlighted the importance of building structural hmi networks. however, not only hmis are important; although to date data are scant, crosstalk among microorganisms is also emerging as critical. alterations in their population dynamics may lead to dysbiosis. signals from gut microbiota resulting from population shifts can affect profoundly several tissues, including the central nervous system. dysbiosis of microbiota is involved in several diseases, such as inflammatory bowel disease [101] , autoimmune diseases (e.g. multiple sclerosis) [102] , neurodegenerative diseases (e.g. parkinson's) [103] , and cancer [104, 105] . identifying bacterial effectors, or effector combinations, which are responsible for specific phenotypes, is challenging. in line with this, recently, parkinson's disease (pd) patients are found to have altered gut microbiota composition [106, 107] . transplanted microbiota from pd patients, but not from healthy controls, induce motor dysfunction and trigger pd in mice. 
it is not clear, however, whether dysbiosis triggers pd or arises as a consequence of the disease [103]. the role of microbiota in host health and disease might be even more complex than thought: commensals that were once benign can convert to disease-causing pathogens; different compositions of microbial communities trigger different phenotypes; more than one host pathway is targeted by more than one effector; the same microbial effector/antigen is sensed by several pattern recognition receptors (a back-up mechanism, compensatory microbial sensing [4]); and genetic variation in hosts results in different responses (i.e. some commensals transition to pathogens only in "susceptible" individuals). current knowledge of microbiota and their interactions with the host is still in its infancy, but given the advances accomplished so far and the attention this field has started to attract, it is likely that many unknowns and questions will be resolved soon. references: investigating a holobiont: microbiota perturbations and transkingdom networks; cellular hijacking: a common strategy for microbial infection; diet, microbiota and autoimmune diseases; integration of innate immune signaling; self or non-self? the multifaceted role of the microbiota in immune-mediated diseases; differential expression and roles of staphylococcus aureus virulence determinants during colonization and disease; from commensal to pathogen: stage- and tissue-specific gene expression of candida albicans; a review on computational systems biology of pathogen-host interactions; pathogen mimicry of host protein-protein interfaces modulates immunity; network representations of immune system complexity; anti-immunology: evasion of the host immune system by bacterial and viral pathogens; manipulation of host-cell pathways by bacterial pathogens; structural mimicry in bacterial virulence; structural microengineers: pathogenic escherichia coli redesigns the actin cytoskeleton in host cells; human papillomavirus oncoproteins: pathways to transformation; the human papillomavirus 16 e6 protein binds to tumor necrosis factor (tnf) r1 and protects cells from tnf-induced apoptosis; chronic helicobacter pylori infection induces an apoptosis-resistant phenotype associated with decreased expression of; sars coronavirus papain-like protease inhibits the type i interferon signaling pathway through interaction with the sting-traf3-tbk1 complex; the ny-1 hantavirus gn cytoplasmic tail coprecipitates traf3 and inhibits cellular interferon responses by disrupting tbk1-traf3 complex formation; hantaviral proteins: structure, functions, and role in hantavirus infection; competitive binding and evolvability of adaptive viral molecular mimicry; the importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics; network biology: understanding the cell's functional organization; relating three-dimensional structures to protein networks provides evolutionary insights; evidence for dynamically organized modularity in the yeast protein-protein interaction network; topological properties of protein interaction networks from a structural perspective; scale-free networks in cell biology; lethality and centrality in protein networks; herpesviral protein networks and their interaction with the human proteome; the protein network of hiv budding; epstein-barr virus and virus human protein interaction maps; hepatitis c virus infection protein network;
a physical and regulatory map of host-influenza interactions reveals pathways in h1n1 infection
a physical interaction network of dengue virus and human proteins
global landscape of hiv-human protein complexes
viral immune modulators perturb the human molecular network by common and unique strategies
interpreting cancer genomes using systematic host network perturbations by tumour virus proteins
the structural network of interleukin-10 and its implications in inflammation and cancer
the structural network of inflammation and cancer: merits and challenges. seminars in cancer biology
the host-pathogen interaction of human cyclophilin a and hiv-1 vpr requires specific n-terminal and novel c-terminal domains
use of host-like peptide motifs in viral proteins is a prevalent strategy in host-virus interactions
targeting of immune signalling networks by bacterial pathogens
phisto: pathogen-host interaction search tool
hpidb - a unified resource for host-pathogen interactions
proteopathogen, a protein database for studying candida albicans-host interaction
patric, the bacterial bioinformatics database and analysis resource
the pathogen-host interactions database (phi-base): additions and future developments
phidias: a pathogen-host interaction data integration and analysis system
hopaci-db: host-pseudomonas and coxiella interaction database
virhostnet 2.0: surfing on the web of virus/host molecular interactions data
virbase: a resource for virus-host ncrna-associated interactions
virusmentha: a new resource for virus-host protein interactions
hcvpro: hepatitis c virus protein interaction database
computational analysis of interactomes: current and future perspectives for bioinformatics approaches to model the host-pathogen interaction space
computational approaches for prediction of pathogen-host protein-protein interactions
a tale of two toxins: helicobacter pylori caga and vaca modulate host pathways that impact disease
the helicobacter pylori's protein vaca has direct effects on the regulation of cell cycle and apoptosis in gastric epithelial cells
structure-based prediction of protein-protein interactions on a genome-wide scale. proceedings of the national academy of sciences of the united states of america
structural space of protein-protein interfaces is degenerate, close to complete, and highly connected
templates are available to model nearly all complexes of structurally characterized proteins
structural models for host-pathogen protein-protein interactions: assessing coverage and bias
the microbiome in inflammatory bowel disease: current status and the future ahead
alterations of the human gut microbiome in multiple sclerosis
gut microbiota regulate motor deficits and neuroinflammation in a model of parkinson's disease
commensal bacteria control cancer response to therapy by modulating the tumor microenvironment
gastrointestinal cancers: influence of gut microbiota, probiotics and prebiotics
colonic bacterial composition in parkinson's disease
gut microbiota are related to parkinson's disease and clinical phenotype

key: cord-155440-7l8tatwq authors: malinovskaya, anna; otto, philipp title: online network monitoring date: 2020-10-19 journal: nan doi: nan sha: doc_id: 155440 cord_uid: 7l8tatwq

the application of network analysis has found great success in a wide variety of disciplines; however, the popularity of these approaches has revealed the difficulty of handling networks whose complexity scales rapidly.
one of the main interests in network analysis is the online detection of anomalous behaviour. to overcome the curse of dimensionality, we introduce a network surveillance method bringing together network modelling and statistical process control. our approach is to apply multivariate control charts based on exponential smoothing and cumulative sums in order to monitor networks determined by temporal exponential random graph models (tergm). this allows us to account for potential temporal dependence, while simultaneously reducing the number of parameters to be monitored. the performance of the proposed charts is evaluated by calculating the average run length for both simulated and real data. to assess the appropriateness of the tergm for describing network data, several measures of goodness of fit are inspected. we demonstrate the effectiveness of the proposed approach with an empirical application, monitoring daily flights in the united states to detect anomalous patterns. the digital information revolution offers a rich opportunity for scientific progress; however, the amount and variety of data available require new analysis techniques for data mining, interpretation and application of results to deal with the growing complexity. as a consequence, these requirements have influenced the development of networks, bringing their analysis beyond the traditional sociological scope into many other disciplines as varied as physics, biology and statistics (cf. amaral et al. 2000; simpson et al. 2013; chen et al. 2019 ). one of the main interests in network study is the detection of anomalous behaviour. there are two types of network monitoring, differing in the treatment of nodes and links: fixed and random network surveillance (cf. leitch et al. 2019) . we concentrate on the modelling and monitoring of networks with randomly generated edges across time, describing a surveillance method of the second type. when talking about anomalies in temporal networks, the major interest is to find the point in time when a significant change happened and, if appropriate, to identify the vertices, edges or graph subsets which considerably contributed to the change (cf. akoglu et al. 2014 ). further differentiation depends on at least two factors: the characteristics of the network data and the available time granularity. hence, given a particular network to monitor, it is worth first defining what is classified as "anomalous". to analyse the network data effectively and plausibly, it is important to account for its complex structure and the possibly high computational costs. our approach to mitigate these issues and simultaneously reflect the stochastic and dynamic nature of networks is to model them by applying a temporal random graph model. we consider a general class of exponential random graph models (ergm) (cf. frank and strauss 1986; robins et al. 2007; schweinberger et al. 2020) , which was originally designed for modelling cross-sectional networks. this class includes many prominent random network configurations, such as dyadic independence models and markov random graphs, making the ergm generally applicable to many types of complex networks. hanneke et al. (2010) developed a powerful dynamic extension of the ergm, namely the temporal exponential random graph model (tergm). these models retain the overall functionality of the ergm, additionally enabling time-dependent covariates.
thus, our monitoring procedure for this class of models allows for many applications in different disciplines which are interested in analysing networks of medium size, such as sociology, political science, engineering, economics and psychology (cf. carrington et al. 2005; ward et al. 2011; das et al. 2013; jackson 2015; fonseca-pedrero 2018) . in the field of change detection, according to basseville et al. (1993) there are three classes of problems: online detection of a change, off-line hypothesis testing and off-line estimation of the change time. our method belongs to the first class, meaning that the change point should be detected as soon as possible after it has occurred. in this case, real-time monitoring of complex structures becomes necessary: for instance, if the network is observed every minute, the monitoring procedure should be faster than one minute. to perform online surveillance for real-time detection, an efficient way is to use tools from the field of statistical process control (spc). spc corresponds to an ensemble of analytical tools originally developed for industrial purposes, which are applied to achieve process stability and to reduce variability (e.g., montgomery 2012). the leading spc tool for analysis is the control chart, which exists in various forms in terms of the number of variables, the data type and the statistics of interest. for example, earlier work monitors network topology statistics by applying the cumulative sum (cusum) chart and illustrates its effectiveness, while other studies present a comparative study of univariate and multivariate ewma charts for social network monitoring. an overview of further studies is provided by noorossana et al. (2018) . in this paper, we present an online monitoring procedure based on the spc concept, which enables one to detect significant changes in the network structure in real time. the foundations of this approach, together with the description of the selected network model and multivariate control charts, are discussed in section 2. section 3 outlines the simulation study and includes a performance evaluation of the designed control charts. in section 4 we monitor daily flights in the united states and explain the detected anomalies. we conclude with a discussion of the outcomes and present several directions for future research. network monitoring is a form of online surveillance to detect deviations from a so-called in-control state, i.e., the state when no unaccountable variation of the process is present. this is done by sequential hypothesis testing over time, which has a strong connection to control charts. in other words, the purpose of control charting is to identify occurrences of unusual deviation of the observed process from a prespecified target (or in-control) process, distinguishing common from special causes of variation (cf. johnson and wichern 2007) . to be precise, the aim is to test the null hypothesis $H_{0,t}$: the network observed at time point $t$ is in its in-control state, against the alternative $H_{1,t}$: the network observed at time point $t$ deviates from its in-control state. in this paper, we concentrate on the monitoring of networks which are modelled by the tergm, briefly described below. the network (also interchangeably called "graph") is represented by its adjacency matrix $Y := (y_{ij})_{i,j=1,\dots,n}$, where $n$ is the total number of nodes. two vertices (or nodes) $i, j$ are adjacent if they are connected by an edge (also called a tie or link). in this case, $y_{ij} = 1$; otherwise, $y_{ij} = 0$.
in the case of an undirected network, $Y$ is symmetric. connections of a node with itself are not applicable to the majority of networks; therefore, we assume that $y_{ii} = 0$ for all $i = 1, \dots, n$. formally, we define a network model as a collection $\{P_\theta(Y), Y \in \mathcal{Y} : \theta \in \Theta\}$, where $\mathcal{Y}$ denotes the ensemble of possible networks, $P_\theta$ is a probability distribution on $\mathcal{Y}$, and $\theta$ is a vector of parameters ranging over possible values in $\Theta \subseteq \mathbb{R}^p$ with $p \in \mathbb{N}$ (kolaczyk, 2009). this stochastic mechanism determines which of the $n(n-1)$ edges (in the case of directed labelled graphs) emerge, i.e., it assigns probabilities to each of the $2^{n(n-1)}$ graphs (see cannings and penman, 2003). the ergm functional representation is given by

$$P_\theta(Y = y) = \frac{\exp\{\theta^\top s(y)\}}{\sum_{y^* \in \mathcal{Y}} \exp\{\theta^\top s(y^*)\}},$$

where $y$ is the adjacency matrix of an observed graph and $s : \mathcal{Y} \rightarrow \mathbb{R}^p$ is a $p$-dimensional statistic describing the essential properties of the network based on $y$ (cf. frank, 1991; wasserman and pattison, 1996). there are several types of network terms, including dyadic dependent terms, for example a statistic capturing transitivity, and dyadic independent terms, for instance a term describing graph density (morris et al., 2008). the parameters $\theta$ can be defined as the respective coefficients of $s(y)$, which are of considerable interest in understanding the structural properties of a network. they reflect, on the network level, the tendency of a graph to exhibit certain sub-structures relative to what would be expected from a model by chance, or, on the tie level, the probability of observing a specific edge given the rest of the graph (block et al., 2018). the latter interpretation follows from the representation of the problem as a log-odds ratio. the normalising constant in the denominator ensures that the probabilities sum to one, meaning it includes all possible network configurations. in dynamic network modelling, a random sequence of $Y^t$ for $t = 1, 2, \dots$ with $Y^t \in \mathcal{Y}$ defines a stochastic process for all $t$. it is possible that the dimensions of $Y^t$ differ across the time stamps. to conduct surveillance over $Y^t$, we propose to consider only the dynamically estimated parameters of a random graph model, in order to reduce computational complexity and to allow for real-time monitoring. in most cases, dynamic network models serve as extensions of well-known static models. similarly, the discrete temporal expansion of the ergm is known as the tergm (cf. hanneke et al., 2010) and can be seen as a further advancement of a family of network models proposed by robins and pattison (2001). the tergm defines the probability of a network at the discrete time point $t$ both as a function of counted subgraphs in $t$ and by including network terms based on the previous graph observations up to a particular time point $t - v$, that is,

$$P_\theta(Y^t = y^t \mid y^{t-1}, \dots, y^{t-v}) = \frac{\exp\{\theta^\top s(y^t, y^{t-1}, \dots, y^{t-v})\}}{\kappa(\theta, y^{t-1}, \dots, y^{t-v})},$$

where $v$ represents the maximum temporal lag, capturing the networks which are incorporated into the estimation of $\theta$ at $t$ and hence defining the complete temporal dependence of $Y^t$. we assume a markov structure between the observations, meaning $(Y^t \perp \{Y^1, \dots, Y^{t-2}\} \mid Y^{t-1})$ (hanneke et al., 2010). in this case, the network statistics $s(\cdot)$ include "memory terms" such as dyadic stability or reciprocity (leifeld et al., 2018). a meaningful configuration of sufficient network statistics $s(y)$ determines the model's ability to represent and reproduce the observed network close to reality.
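to make the tie-level interpretation concrete, the conditional log-odds of a single tie can be computed from so-called change statistics. the following is a minimal python sketch, assuming an undirected graph and an ergm with only an edge and a triangle term; the function name and the restriction to these two terms are illustrative assumptions, not part of the paper:

```python
import numpy as np

def tie_log_odds(y, i, j, theta_edge, theta_triangle):
    """Conditional log-odds that tie (i, j) is present given the rest of the
    graph: logit P(y_ij = 1 | rest) = theta' * delta_s(y)_ij, where
    delta_s(y)_ij is the change in s(y) caused by toggling y_ij on."""
    delta_edge = 1.0                       # the edge count rises by one
    delta_triangle = float(y[i] @ y[j])    # one new triangle per common neighbour
    return theta_edge * delta_edge + theta_triangle * delta_triangle

# toy example: nodes 1 and 2 are common neighbours of nodes 0 and 3
y = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
print(tie_log_odds(y, 0, 3, theta_edge=-2.0, theta_triangle=0.5))  # -> -1.0
```

the dot product of the two adjacency rows counts the common neighbours of i and j, which is exactly the change statistic of the triangle term for an undirected graph with a zero diagonal.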
its dimension can differ over time; however, we assume the same network statistics $s(\cdot)$ at each time stamp $t$. in general, the selection of terms depends extensively on the field and context, although standard statistical modelling principles, such as avoiding linear dependencies among the terms, should also be considered (morris, handcock, and hunter, 2008). an improper selection can often lead to a degenerate model, i.e., one for which the algorithm does not converge consistently (cf. handcock, 2003; schweinberger, 2011). in this case, besides fine-tuning the configuration of statistics, one can modify some of the settings which govern the estimation procedure of the model parameters, for example the run time, the sample size or the step length (morris et al., 2008). currently, there are two widely used estimation approaches: bootstrap maximum pseudolikelihood (mple) estimation and markov chain monte carlo (mcmc) ml estimation (leifeld et al., 2018). another possibility would be to add robust statistics such as geometrically-weighted edgewise shared partnerships (gwesp) (snijders et al., 2006). however, the tergm is less prone to degeneracy issues, as ascertained by leifeld and cranmer (2019) and hanneke et al. (2010). regarding the selection of network terms, we assume that most network surveillance studies can reliably anticipate the type of anomalies that may occur. this assumption guides the choice of terms in the models throughout the paper. let $p$ be the number of network statistics which describe the in-control state and can reflect the deviations in the out-of-control state. thus, there are $p$ variables $\hat\theta_t = (\hat\theta_{1t}, \dots, \hat\theta_{pt})^\top$, namely the estimates of the network parameters $\theta$ at time point $t$. that is, we apply a moving window approach, where the coefficients are estimated at each time point $t$ using the current and the past $z$ observed networks. moreover, let $F_{\theta_0, \Sigma}$ be the target distribution of these estimates, with $\theta_0 = E_0(\hat\theta_t)$ being the expected value and $\Sigma$ the respective $p \times p$ variance-covariance matrix (montgomery, 2012). we also assume that the temporal dependence is fully captured by the past $z$ observed networks. thus,

$$E(\hat\theta_t) = \begin{cases} \theta_0, & t < \tau, \\ \theta \neq \theta_0, & t \geq \tau, \end{cases}$$

where $\tau$ denotes a change point to be detected. if $\tau = \infty$, the network is said to be in control, whereas it is out of control in the case of $\tau \leq t < \infty$. furthermore, we assume that the estimation precision of the parameters does not change across $t$, i.e., $\Sigma$ is constant for the in-control and out-of-control states. hence, the monitoring procedure is based on the expected values of $\hat\theta_t$. in fact, we can specify the above-mentioned hypotheses as

$$H_{0,t}: E(\hat\theta_t) = \theta_0 \quad \text{against} \quad H_{1,t}: E(\hat\theta_t) = \theta \neq \theta_0.$$

typically, a multivariate control chart consists of a control statistic depending on one or more characteristic quantities, plotted in time order, and a horizontal line, called the upper control limit (ucl), that indicates the amount of acceptable variation. the hypothesis $H_0$ is rejected if the control statistic is equal to or exceeds the value of the ucl. hence, to perform monitoring, a suitable control statistic and ucl are needed. subsequently, we discuss several control statistics and present a method to determine the respective ucls. the strength of the multivariate control chart over the univariate control chart is its ability to monitor several interrelated process variables. this implies that the corresponding test statistic should take into account the correlations of the data and be dimensionless and scale-invariant, as the process variables can differ considerably from each other.
the squared mahalanobis distance, which represents the general form of the control statistic, fulfils these criteria and is defined as

$$d_t^{(1)} = (\hat\theta_t - \theta_0)^\top \Sigma^{-1} (\hat\theta_t - \theta_0),$$

being part of the respective "data depth" expression, the mahalanobis depth, which measures the deviation from an in-control distribution (cf. liu, 1995). hence, $d_t^{(1)}$ maps the $p$-dimensional characteristic quantity $\hat\theta_t$ to a one-dimensional measure. it is important to note that the characteristic quantity at time point $t$ is usually the mean of several samples at $t$, but in our case we only observe one network at each instant of time. thus, the characteristic quantity $\hat\theta_t$ is the value of the obtained estimates and not the average of several samples. firstly, multivariate cusum (mcusum) charts (cf. woodall and ncube, 1985; joseph et al., 1990; ngai and zhang, 2001) may be used for network monitoring. one of the widely used versions was proposed by crosier (1988) and is defined as follows:

$$c_t = \left[(r_{t-1} + \hat\theta_t - \theta_0)^\top \Sigma^{-1} (r_{t-1} + \hat\theta_t - \theta_0)\right]^{1/2},$$

where

$$r_t = \begin{cases} 0, & \text{if } c_t \leq k, \\ (r_{t-1} + \hat\theta_t - \theta_0)(1 - k/c_t), & \text{if } c_t > k, \end{cases}$$

given that $r_0 = 0$ and $k > 0$. the respective chart statistic is

$$d_t^{(2)} = \left(r_t^\top \Sigma^{-1} r_t\right)^{1/2},$$

and the chart signals if $d_t^{(2)}$ is greater than or equal to the ucl. certainly, the values of $k$ and the ucl considerably influence the performance of the chart. the parameter $k$, also known as the reference value or allowance, reflects the variation tolerance, taking into consideration $\delta$, the deviation from the mean (measured in standard deviation units) that we aim to detect. according to page (1954) and crosier (1988), the chart is approximately optimal if $k = \delta/2$. secondly, we consider multivariate charts based on exponential smoothing (ewma). lowry et al. (1992) proposed a multivariate extension of the ewma control chart (mewma), which is defined as

$$l_t = \lambda(\hat\theta_t - \theta_0) + (1 - \lambda)\, l_{t-1},$$

with $0 < \lambda \leq 1$ and $l_0 = 0$ (cf. montgomery, 2012). the corresponding chart statistic is

$$d_t^{(3)} = l_t^\top \Sigma_{l_t}^{-1} l_t,$$

where the covariance matrix is defined as

$$\Sigma_{l_t} = \frac{\lambda\left[1 - (1 - \lambda)^{2t}\right]}{2 - \lambda}\, \Sigma.$$

together with the mcusum, the mewma is an advisable approach for detecting relatively small but persistent changes. however, the detection of large shifts is also possible by setting $k$ or $\lambda$ high. for instance, in the case of the mewma with $\lambda = 1$, the chart statistic coincides with $d_t^{(1)}$; thus, it is equivalent to hotelling's $t^2$ control procedure, which is suitable for the detection of substantial deviations. it is worth mentioning that the discussed methods are directionally invariant; therefore, an investigation of the data at the signal time point is necessary if the change direction is of particular interest. if the chart statistic is equal to or exceeds the ucl, the chart signals a change. to determine the ucls, one typically assumes that the chart has a predefined (low) probability of false alarms, i.e., signals when the process is in control, or a prescribed in-control average run length $arl_0$, i.e., the expected number of time steps until the first signal. to compute the ucls corresponding to $arl_0$, most multivariate control charts require a normally distributed target process (cf. johnson and wichern, 2007; porzio and ragozini, 2008; montgomery, 2012). in our case, this assumption would need to be valid for the estimates of the network model parameters. however, while there are some studies on the distributions of particular network statistics (cf. yan and xu, 2013; yan et al., 2016; sambale and sinulis, 2018), only a few results have been obtained about the distribution of the parameter estimates. primarily, the difficulty in determining the distribution is that the assumption of i.i.d. (independent and identically distributed) data is violated in the ergm case.
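the three chart statistics translate directly into code; the following python sketch is an illustrative implementation of the formulas above (the function names are our own, and $\Sigma$ is assumed to be known or estimated beforehand):

```python
import numpy as np

def hotelling_stat(theta_hat, theta0, sigma_inv):
    """Squared Mahalanobis distance d1_t (Hotelling-type statistic)."""
    d = theta_hat - theta0
    return float(d @ sigma_inv @ d)

def mcusum_stats(theta_hats, theta0, sigma_inv, k):
    """Crosier-style MCUSUM statistic d2_t for each time point."""
    r = np.zeros_like(theta0, dtype=float)
    out = []
    for x in theta_hats:
        s = r + x - theta0
        c = float(np.sqrt(s @ sigma_inv @ s))
        r = np.zeros_like(s) if c <= k else s * (1.0 - k / c)
        out.append(float(np.sqrt(r @ sigma_inv @ r)))
    return np.array(out)

def mewma_stats(theta_hats, theta0, sigma, lam):
    """MEWMA statistic d3_t for each time point."""
    l = np.zeros_like(theta0, dtype=float)
    out = []
    for t, x in enumerate(theta_hats, start=1):
        l = lam * (x - theta0) + (1.0 - lam) * l
        cov_l = lam * (1.0 - (1.0 - lam) ** (2 * t)) / (2.0 - lam) * sigma
        out.append(float(l @ np.linalg.solve(cov_l, l)))
    return np.array(out)
```

for $\lambda = 1$ the mewma recursion reduces to the hotelling statistic, which can serve as a quick consistency check of an implementation.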
in addition, the parameters depend on the choice of the model terms and the network size (he and zheng, 2015). kolaczyk and krivitsky (2015) proved asymptotic normality of the ml estimates in a simplified context of the ergm, pointing out the necessity of establishing a deeper understanding of the distributional properties of parameter estimates. thus, we do not rely on any distributional assumption, but determine the ucls via monte carlo simulations in section 3.2. to verify the applicability and effectiveness of the discussed approach, we design a simulation study, followed by the surveillance of real-world data with the goal of obtaining insights into its temporal development. in practice, the in-control parameters $\theta_0$ and $\Sigma$ are usually unknown and therefore have to be estimated. thus, one subdivides the sequence of networks into phase i and phase ii. in phase i, the process must coincide with the in-control state. hence, the true in-control parameters $\theta_0$ and $\Sigma$ can be estimated by the sample mean vector $\bar\theta$ and the sample covariance matrix $S$ of the estimated parameters $\hat\theta_t$ in phase i. using these estimates, the ucl is determined via simulations of the in-control networks, as we will show in the following part. it is important that phase i replicates the natural behaviour of the network: if the network grows constantly, it is vital to consider this aspect in phase i. similarly, if the network tends to remain unchanged in terms of additional connections or topological structure, this fact should be captured in phase i for reliable estimation and later network surveillance. after the necessary estimators of $\theta_0$, $\Sigma$ and the ucl are obtained, the calibrated control chart is applied to the actual data in phase ii. in the specific case of constantly growing or topologically changing networks, we recommend recalibrating the control chart after $arl_0$ time steps to guarantee trustworthy detection of outliers. to be able to compute $\bar\theta$ and $S$, we need a certain number of in-control networks. for this purpose, we generate 2300 temporal graph sequences of desired length $t < \tau$, where each graph consists of $n = 100$ nodes. the parameter $\tau$ defines the time stamp at which an anomalous change is implemented. the simulation of synthetic networks is based on the markov chain principle: in the beginning, a network called the "base network" is simulated by applying an ergm with predefined network terms, so that it is possible to control the "network creation" indirectly. in our case, we select three network statistics, namely an edge term, a triangle term and a parameter that defines asymmetric dyads. subsequently, a fraction $\phi$ of the elements of the adjacency matrix is randomly selected and redrawn according to the in-control transition matrix $M_0 = (m_{ij,0})$, where $m_{ij,0}$ denotes the probability of a transition from state $i$ to state $j$ (absence or presence of an edge) in the in-control state. next, we need to guarantee that the generated samples of networks behave according to the requirements of phase i, i.e., that they capture only the usual variation of the target process. for this purpose, we can exploit markov chain properties and calculate the steady-state equilibrium vector $\pi$, as it follows that the expected numbers of edges and non-edges are given by $\pi$. using eigenvector decomposition, we find the steady state to be $\pi = (0.8, 0.2)^\top$. consequently, with $n(n-1) = 9900$ possible directed edges, the expected number of edges in the graph in its steady state is $0.2 \times 9900 = 1980$. there are several possibilities to guarantee the generation of appropriate networks.
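the steady-state computation can be sketched as follows; the entries of m0 below are hypothetical values chosen only so that the stationary distribution equals (0.8, 0.2), since the actual in-control transition matrix is not reproduced in the text:

```python
import numpy as np

# hypothetical 2x2 in-control transition matrix between the states
# "no edge" (0) and "edge" (1); rows sum to one
M0 = np.array([[0.95, 0.05],
               [0.20, 0.80]])

# the steady state pi solves pi = pi @ M0, i.e. it is the left eigenvector
# of M0 for eigenvalue 1 (equivalently an eigenvector of M0.T)
vals, vecs = np.linalg.eig(M0.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

print(pi)                 # -> [0.8, 0.2]
print(pi[1] * 100 * 99)   # expected edges in a directed graph with n = 100 -> 1980
```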
in our case, we simulate 400 networks in a burn-in period, such that the in-control state of phase i starts at $t = 401$. nevertheless, the network density is only one of the aspects defining the in-control process, as the temporal development and the topology are also involved in the network creation. each network at time point $t$ is simulated from the network $y^{t-1}$ by repeating the steps described above. after the generation stage, the coefficients of the network statistics, and of an additional term which describes the stability of both edges and non-edges over time with $v = 1$, are estimated by applying a tergm with a certain window size $z$. the chosen estimation method is the bootstrap mple, which is appropriate for handling a relatively large number of nodes and time points (leifeld et al., 2018). eventually, we calibrate the different control charts by computing $\bar\theta$, $S$, and the respective ucl via the bisection method; the resulting control limits for the two window sizes $z \in \{7, 14\}$ are reported in table 1. in the next step, we analyse the performance of the proposed charts in terms of their detection speed. for this reason, we generate samples from phase ii, where $t \geq \tau$. the focus is on the detection of mean shifts, which are driven by an anomalous change in one of the following three parameters: the vector of coefficients related to the network terms $\hat\theta_t$, the fraction $\phi$ of randomly selected adjacency matrix entries, and the transition matrix $M$. hence, we subdivide these scenarios into three different anomaly types, which are briefly described in the flow chart presented in figure 1. we define a type 1 anomaly as a persistent change in the values of $M$; that is, there is a transition matrix $M_1 \neq M_0$ when $t \geq \tau$. furthermore, we consider anomalies of type 2 by introducing a new value $\phi_1$ into the generation process when $t \geq \tau$. anomalies of type 3 differ from the previous two, as they represent a "point change": the abnormal behaviour occurs only at a single point in time, but its outcome affects the further development of the network. we recreate this type of anomaly by converting a fraction $\zeta$ of asymmetric edges into mutual links. this process happens at time point $\tau$ only. afterwards, the new networks are created as in phase i by applying $M_0$ and $\phi_0$ until the anomaly is detected. all cases of different magnitude are summarised in table 3. as a performance measure, we calculate the conditional expected delay (ced) of detection, conditional on a false signal not having occurred before the change point $\tau$. again, the reference/smoothing parameter should be chosen according to the expected shift size (cases 2.2 and 2.3). for changes in the proportion of mutual edges, anomalies of type 3, the charts behave differently. first of all, the mewma chart outperforms in all cases except 3.1 and 3.2 with $z = 14$; however, the hotelling chart functions clearly worse in the first two cases with the shorter window size. thus, we would recommend choosing $\lambda = 0.1$ if the change in the network topology is relatively small, as in case 3.1. in the opposite case of a larger change, $\lambda$ could be chosen higher depending on the expected size of the shift, so that the control statistic also incorporates previous values. the disadvantage of both approaches is that small and persistent changes are not detected quickly when the parameters $k$ or $\lambda$ are not optimally chosen. for example, in figure 2 we can notice that the ced slightly exceeds the $arl_0$, reflecting poor performance.
however, a careful selection of the parameters and the window size can overcome this problem. to summarise, the effectiveness of the presented charts in detecting structural changes depends significantly on an accurate estimate of the anomaly size one aims to detect. thus, if information on the possible change is not available or not reliable, it can be effective to apply the charts in tandem and benefit from the strengths of each of them to detect varying types and sizes of anomalies. to demonstrate the applicability of the described method, we monitor the daily flight data of the united states. during the lock-down period, many direct connections were suspended, so that travel proceeds through territories which still allow travelling. that means, instead of having a direct journey from one geographical point to another, the route currently passes through several locations, which can be interpreted as nodes. thus, the topology of the graph has changed: instead of directed mutual links, the numbers of intransitive triads and asymmetric links start to increase significantly. we can incorporate both terms, together with the edge term and a memory term ($v = 1$), and expect the estimates of the coefficients belonging to the first two statistics to be close to zero or strongly negative in the in-control case. initially, we need to decide which data are suitable to define observations coming from phase i. from the estimates $\hat\theta_t$ of the tergm, described by a series of boxplots in figure 6, we can observe extreme changes in the values. before proceeding with the analysis, it is important to evaluate whether a tergm fits the data well. for each of the years, we randomly selected one period of length $z$ and simulated 500 networks based on the parameter estimates from each of the corresponding networks. to select appropriate control charts, we need to take into consideration the specifics of the flight network data. firstly, it is common to have 3-4 travel peaks per year around holidays, which are not explicitly modelled, so that we can detect these changes as verifiable anomalous patterns. it is worth noting that one could account for such seasonality by including nodal or edge covariates. secondly, as we aim to detect considerable deviations from the in-control state, we are more interested in sequences of signals (in the resulting control charts, the horizontal red line corresponds to the upper control limit and the red points to the signals that occurred). thus, we have chosen $k = 1.5$ for the mcusum and $\lambda = 0.9$ for the mewma chart. the target $arl_0$ is set to 100 days; therefore, we could expect roughly 3.65 in-control signals per year by construction of the charts. to identify smaller and more specific changes in the daily flight data of the us, one could also integrate nodal and edge covariates which would reflect further aspects of the network. alternatively, control charts with smaller $k$ and $\lambda$ can be applied. statistical methods can be remarkably powerful for the surveillance of networks. however, due to the complex structure and possibly large size of the adjacency matrix, traditional tools for multivariate process control cannot be applied directly; the network's complexity must be reduced first. for instance, this can be done by statistical modelling of the network. the choice of the model is crucial, as it determines the constraints and simplifications of the network which later influence the types of changes we are able to detect. in this paper, we show how multivariate control charts can be used to detect changes in tergm networks. the proposed methods can be applied in real time.
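putting the pieces together, a real-time monitoring loop might be organized as follows; this is a sketch under strong simplifications, where fit_tergm stands in for any tergm estimation routine (e.g., the bootstrap mple) and is not implemented here:

```python
import numpy as np

def monitor(network_stream, fit_tergm, theta0, sigma, ucl, lam=0.9, z=7):
    """Sliding-window online monitoring with a MEWMA statistic.
    `fit_tergm` must return the p coefficient estimates for a window of
    z networks; theta0, sigma and ucl come from the phase I calibration."""
    window = []
    l = np.zeros_like(theta0, dtype=float)
    t = 0
    for y in network_stream:
        window.append(y)
        if len(window) < z:
            continue                      # not enough networks for estimation yet
        t += 1
        theta_hat = fit_tergm(window[-z:])
        l = lam * (theta_hat - theta0) + (1.0 - lam) * l
        cov_l = lam * (1.0 - (1.0 - lam) ** (2 * t)) / (2.0 - lam) * sigma
        if float(l @ np.linalg.solve(cov_l, l)) >= ucl:
            yield t                       # signal: possible change at step t
```

the generator yields the monitoring steps at which the mewma statistic crosses the ucl, so a surveillance system can react to each signal as soon as the corresponding network snapshot arrives.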
this general approach is applicable to various types of networks in terms of edge direction and topology, and also allows for the integration of nodal and edge covariates. additionally, we make no distributional assumptions and account for temporal dependence. the performance of our procedure is evaluated for different anomalous scenarios by comparing the ced of the calibrated control charts. according to the classification and explanation of anomalies provided by ranshous et al. (2015), the surveillance method presented in this paper is applicable for event and change point detection in temporal networks. the difference between these problems lies in the duration of the abnormal behaviour: while change points indicate a time point from which the anomaly persists until the next change point, events indicate short-term incidents, after which the network returns to its natural state. finally, we illustrated the applicability of our approach by monitoring daily flights in the united states. both control charts were able to detect the beginning of the lock-down period due to the covid-19 pandemic. the mewma chart signalled a change just two days after a level 4 "no travel" warning was issued. despite the benefits of the tergm, such as the incorporation of the temporal dimension and the representation of the network in terms of its sufficient statistics, there are several considerable drawbacks. besides the difficulty of determining a suitable combination of network terms, the model is not suitable for networks of large size (block et al., 2018). furthermore, the temporal dependency statistics in the tergm depend on the selected temporal lag and the size of the time window over which the data are modelled (leifeld and cranmer, 2019). thus, accurate modelling of the network relies strongly on the analyst's knowledge about its nature. a helpful extension of the approach would be the implementation of the separable temporal exponential random graph model (stergm), which subdivides the network changes into two distinct streams (cf. krivitsky and handcock, 2014; fritz et al., 2020). in this case, it could be possible to monitor the dissolution and formation of links separately, so that the interpretation of changes in the network would become clearer. regarding the multivariate control charts, there are also some aspects to consider. referring to montgomery (2012), multivariate control charts perform well if the number of process variables is not too large, usually up to 10. also, a possible extension of the procedure is to design a monitoring process in which the values of $\Sigma$ can vary between the in-control and out-of-control states. whether this factor would beneficially enrich the surveillance remains open for future research. in our case, we did not rely on any distributional assumptions for the parameters, but used simulation methods to calibrate the charts. hence, the further development of adaptive control charts with different characteristics is interesting, as they could remarkably improve the performance of anomaly detection (cf. sparks and wilson, 2019).
references:
graph based anomaly detection and description: a survey
classes of small-world networks
detection of abrupt changes: theory and application
change we can believe in: comparing longitudinal network models on consistency, interpretability and predictive power
models of random graphs and their applications
models and methods in social network analysis
tail event driven networks of sifis
multivariate generalizations of cumulative sum quality-control schemes
the topological structure of the odisha power grid: a complex network analysis
a statistical approach to social network monitoring
network analysis in psychology
statistical analysis of change in networks
markov graphs
tempus volat, hora fugit: a survey of tie-oriented dynamic network models in discrete and continuous time
assessing degeneracy in statistical models of social networks
discrete temporal models of social networks
glmle: graph-limit enabled fast computation for fitting exponential random graph models to large social networks
performance evaluation of ewma and cusum control charts to detect anomalies in social networks using average and standard deviation of degree measures
goodness of fit of social network models
the past and future of network analysis in economics
applied multivariate statistical analysis, 6th edn
comparisons of multivariate cusum charts
on assessing the performance of sequential procedures for detecting a change
statistical analysis of network data
on the question of effective sample size in network modeling: an asymptotic inquiry. statistical science
a separable model for dynamic networks
a theoretical and empirical comparison of the temporal exponential random graph model and the stochastic actor-oriented model
temporal exponential random graph models with btergm: estimation and bootstrap confidence intervals
toward epidemic thresholds on temporal networks: a review and open questions
control charts for multivariate processes
analyzing dynamic change in social network based on distribution-free multivariate process control method
a multivariate exponentially weighted moving average control chart
detecting change in longitudinal social networks
statistical quality control
specification of exponential-family random graph models: terms and computational aspects
multivariate cumulative sum control charts based on projection pursuit
an overview of dynamic anomaly detection in social networks via control charts
continuous inspection schemes
multivariate control charts from a data mining perspective. recent advances in data mining of enterprise data
anomaly detection in dynamic networks: a survey
random graph models for temporal processes in social networks
an introduction to exponential random graph (p*) models for social networks
monitoring of social network and change detection by applying statistical process: ergm
change point detection in social networks using a multivariate exponentially weighted moving average chart
logarithmic sobolev inequalities for finite spin systems and applications
instability, sensitivity, and degeneracy of discrete exponential families
exponential-family models of random graphs: inference in finite-, super-, and infinite population scenarios
analyzing complex functional brain networks: fusing statistics and network science to understand the brain
new specifications for exponential random graph models
monitoring communication outbreaks among an unknown team of actors in dynamic networks
network analysis and political science
logit models and logistic regressions for social networks: i. an introduction to markov graphs and p*
modeling and detecting change in temporal networks via the degree corrected stochastic block model
multivariate cusum quality-control procedures
a central limit theorem in the β-model for undirected random graphs with a diverging number of vertices
asymptotics in directed exponential random graph models with an increasing bi-degree sequence

key: cord-253711-a0prku2k authors: mao, liang; yang, yan title: coupling infectious diseases, human preventive behavior, and networks - a conceptual framework for epidemic modeling date: 2011-11-26 journal: soc sci med doi: 10.1016/j.socscimed.2011.10.012 sha: doc_id: 253711 cord_uid: a0prku2k

human-disease interactions involve the transmission of infectious diseases among individuals and the practice of preventive behavior by individuals. both infectious diseases and preventive behavior diffuse simultaneously through human networks and interact with one another, but few existing models have coupled them together. this article proposes a conceptual framework to fill this knowledge gap and illustrates the establishment of such a model. the conceptual model consists of two networks and two diffusion processes. the two networks include: an infection network that transmits diseases and a communication network that channels inter-personal influence regarding preventive behavior. both networks are composed of the same individuals but different types of interactions. this article further introduces modeling approaches to formulate such a framework, including the individual-based modeling approach, network theory, disease transmission models and behavioral models. an illustrative model was implemented to simulate a coupled-diffusion process during an influenza epidemic. the simulation outcomes suggest that the transmission probability of a disease and the structure of the infection network have profound effects on the dynamics of coupled diffusion. the results imply that current models may underestimate disease transmissibility parameters, because human preventive behavior has not been considered. this issue calls for a new interdisciplinary study that incorporates theories from epidemiology, social science, behavioral science, and health psychology. despite outstanding advances in medical science, infectious diseases remain a major cause of death in the world, claiming millions of lives every year (who, 2002).
particularly in the past decade, emerging infectious diseases have received remarkable attention due to the worldwide pandemics of severe acute respiratory syndrome (sars), bird flu and the new h1n1 flu. although vaccination is a principal strategy for protecting individuals from infection, new vaccines often need a long time to develop, test, and manufacture (stohr & esveld, 2004). before sufficient vaccines are available, the best protection for individuals is to adopt preventive behavior, such as wearing facemasks, washing hands frequently, taking pharmaceutical drugs, and avoiding contact with sick people (centers for disease control and prevention, 2008). it has been widely recognized that both infectious diseases and human behaviors can diffuse through human networks (keeling & eames, 2005; valente, 1996). infectious diseases often spread through direct or indirect human contacts, which form infection networks. for example, influenza spreads through droplet/physical contacts among individuals, and malaria transmits via mosquitoes between human hosts. human behavior also propagates through inter-personal influence, which fashions communication networks. this is commonly known as the 'social learning' or 'social contagion' effect in behavioral science, i.e., people can learn by observing the behaviors of others and the outcomes of those behaviors (hill, rand, nowak, christakis, & bergstrom, 2010; rosenstock, strecher, & becker, 1988). in the current literature, models of disease transmission and behavioral diffusion have been developed separately for decades, both based on human networks (deffuant, huet, & amblard, 2005; keeling & eames, 2005; valente, 1996; watts & strogatz, 1998). few efforts, however, have been devoted to integrating infectious diseases and human behaviors. in reality, when a disease breaks out in a population, it is natural that individuals may voluntarily adopt preventive behavior in response, which in turn limits the spread of the disease. failing to consider these two interactive processes, current epidemic models may under-represent human-disease interactions and bias policy making in public health. this article aims to propose a conceptual framework that integrates infectious diseases, human preventive behavior, and networks together. the focus of this article is on issues that arise in establishing such a framework, including the basic principles, assumptions, and approaches for model formulation. the following section (section 2) describes the conceptual framework and basic assumptions, which abstract essential aspects of a disease epidemic. the third section discusses approaches to formulating the model framework into a design. the fourth presents an illustrative model built on various human network structures and compares the simulation results. the last section concludes the article with implications. the conceptual model consists of two networks and two diffusion processes (fig. 1). the two networks include an infection network that transmits disease agents (dark dashed lines), and a communication network that channels inter-personal influence regarding preventive behavior (gray dashed lines). both networks are composed of the same individuals but different types of interactions. these two networks could be non-overlapping, or partially or completely overlapping with one another. the two diffusion processes refer to the diffusion of infectious diseases (dark arrows) and that of preventive behavior (gray arrows) through the respective networks. as illustrated in fig. 1,
if individual #1 is initially infected, the disease can be transmitted to individuals #2 and #3, and then to individual #4, following the routes of the infection network. meanwhile, individual #2 may perceive the risk of being infected by individual #1, and then voluntarily adopt preventive behavior for protection, known as the effect of 'perceived risks' (becker, 1976). further, the preventive behavior of individual #2 may be perceived as a 'social standard' by individual #4 and motivate him/her toward adoption, i.e., the 'social contagion'. in such a manner, the preventive behavior diffuses over the communication network through inter-personal influence. during an epidemic, these two diffusion processes take place simultaneously and interact in opposite directions. the diffusion of diseases motivates individuals to adopt preventive behavior, which, in turn, limits the diffusion of diseases. this two-network, two-diffusion framework is dubbed a 'coupled diffusion' in the subsequent discussion. the conceptual framework entails five assumptions. first, individuals differ in their characteristics and behaviors, such as their infection status, adoption status, and individualized interactions. second, both the infection and communication networks are formed by interactions among individuals. third, the development of infectious diseases follows the natural history of diseases, including the incubation, latent, and infectious periods. fourth, individuals voluntarily adopt preventive behavior, depending on their own personality, experiences, and inter-personal influence from family members, colleagues, and friends (glanz, rimer, & lewis, 2002). fifth and lastly, the infection status of surrounding people or their behavior may motivate individuals to adopt preventive behavior, which then reduces the likelihood of infection. of the five assumptions, the first two provide networks as a basis for modeling. the third and fourth assumptions are relevant to the two diffusion processes, respectively. the last assumption represents the interactions between the two processes. corresponding to the five assumptions, this article introduces a number of approaches to represent individuals, networks, infectious diseases, and preventive behavior as four model components, and depicts the relationships among the four. the first model assumption requires a representation of discrete individuals and their unique characteristics and behaviors. this requirement can be well addressed by an individual-based modeling approach. in the last decade, this modeling approach has gained momentum in the research communities of both epidemiology and behavioral science (judson, 1994; koopman & lynch, 1999). specifically, the individual-based approach views a population as discrete individuals, i.e., every individual is a basic modeling unit with a number of characteristics and behaviors. the characteristics indicate the states of individuals, e.g., the infection status, adoption status, and number of contacts, while the behaviors change these states, e.g., receiving infection and adopting preventive behavior. by simulating at an individual level, this approach makes it possible to understand how population characteristics, such as the total number of infections and adopters, emerge from the collective behaviors of individuals (grimm & railsback, 2005).
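as a concrete illustration of such a modeling unit, an individual might be represented as follows; this is a minimal python sketch, and all attribute names and default values are assumptions for illustration rather than the authors' code:

```python
from dataclasses import dataclass, field

@dataclass
class Individual:
    """One basic modeling unit of the coupled-diffusion framework."""
    idx: int
    infection_status: str = "susceptible"  # susceptible | latent | infectious | recovered
    symptomatic: bool = False              # only symptomatic cases are observable
    adopter: bool = False                  # has adopted preventive behavior?
    risk_threshold: float = 0.3            # fraction of symptomatic contacts triggering adoption
    pressure_threshold: float = 0.5        # fraction of adopting contacts triggering adoption
    days_in_status: int = 0                # time spent in the current infection status
    infection_contacts: list = field(default_factory=list)      # links of the infection network
    communication_contacts: list = field(default_factory=list)  # links of the communication network
```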
from an implementation perspective, the characteristics and behaviors of individuals can be easily accommodated by object-oriented languages, a mainstream paradigm of programming technologies. various tools are also available to facilitate the design and implementation of the individual-based approach, such as netlogo and repast (robertson, 2005). with regard to the second assumption, both the infection and communication networks can be abstracted as a finite number of nodes and links. nodes represent individuals and links represent interactions among individuals. the network structure is compatible with the aforementioned individual-based approach, in that the individual nodes directly correspond to the basic modeling units, while links can be treated as a characteristic of individuals. interactions between individuals (through links) can be represented as behaviors of individuals. to be realistic in modeling, both networks can be generated to fit observed characteristics and structures of real-world networks. important characteristics of networks include: the number of links attached to a node (the node degree), the minimum number of links between any pair of nodes (the path length), the ratio between the existing number of links and the maximum possible number of links among certain nodes (the level of clustering), and so on (scott, 2000). particularly for human networks of social contacts, empirical studies have shown that the average node degree often varies from 10 to 20, depending on occupation, race, geography, etc. (edmunds, kafatos, wallinga, & mossong, 2006; fu, 2005). the average path length has been estimated to be around 6, popularly known as the 'six degrees of separation' (milgram, 1967). the level of clustering has typical values in the range of 0.1-0.5 (girvan & newman, 2002). besides these characteristics, studies on human networks have also disclosed two generic structures: "small-world" and "scale-free" structures. the "small-world" structure is named after the 'small-world' phenomenon, which argues that people are all connected by short chains of acquaintances (travers & milgram, 1969). theoretically, the small-world structure is a transition state between regular networks and random networks (watts & strogatz, 1998). regular networks represent one extreme, in which all nodes are linked to their nearest neighbors, resulting in highly clustered networks. random networks are the other extreme, in which all nodes are randomly linked with each other regardless of their closeness, resulting in short path lengths. a typical small-world structure has characteristics of both extremes, i.e., most nodes are directly linked to others nearby (highly clustered), but can be indirectly connected to any distant node through a few links (short path lengths). the "scale-free" structure has also been commonly observed in social, biological, disease, and computer networks (cohen, erez, ben-avraham, & havlin, 2001; jeong, tombor, albert, oltvai, & barabási, 2000; liljeros, edling, amaral, stanley, & aaberg, 2001). it depicts a network with highly heterogeneous node degrees, whose distribution follows a power-law decay function, $p(k) \sim k^{-\gamma}$ ($k$ denotes the node degree, and empirically $2 < \gamma < 3$). in other words, a few individuals have a significantly large number of links, while the rest have only a few (albert, jeong, & barabasi, 2000).
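both generic structures can be generated with standard routines; a brief sketch using the networkx library is given below, with the population size and mean degree anticipating the illustrative model described later in the text:

```python
import networkx as nx

n, k = 5000, 12   # population size and mean degree of the illustrative model

# small-world: ring lattice with k nearest neighbours, links rewired with
# probability p (p ~ 0.005 regular, ~ 0.05 small-world, ~ 0.5 nearly random)
g_sw = nx.watts_strogatz_graph(n, k, p=0.05, seed=42)

# scale-free: preferential attachment; each new node brings m = k // 2 links,
# keeping the mean degree near k. note this classic generator fixes the
# power-law exponent near 3; other exponents (e.g. 5 or 7) would need a
# different generator, such as a configuration model on a prescribed
# power-law degree sequence
g_sf = nx.barabasi_albert_graph(n, m=k // 2, seed=42)

print(nx.average_clustering(g_sw))     # clustering level of the small-world graph
print(2 * g_sf.number_of_edges() / n)  # mean degree of the scale-free graph
```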
all of these observed characteristics and structures can be used to calibrate the modeled networks, which then serve as a reliable basis for simulating the coupled-diffusion process. in epidemiology, the development of infectious diseases is characterized by a series of infection statuses, events, and periods, often referred to as the natural history of diseases (gordis, 2000). the progress of an infectious disease often starts with a susceptible individual. after having contact with an infectious individual, this susceptible individual may receive disease agents and develop an infection based on a transmission probability. the receipt of infection triggers a latent period, during which the disease agents develop internally in the body and are not emitted. the end of the latent period initiates an infectious period, in which the individual is able to infect other susceptible contacts and may manifest disease symptoms. after the infectious period, the individual either recovers or dies from the disease. among these disease characteristics, the transmission probability is critical for bridging infectious diseases to the other model components: individuals, networks, and preventive behavior. this probability controls the chance that disease agents are transmitted between individuals through network links. meanwhile, a reduction of the transmission probability reflects the efficacy of preventive behavior. the individual-based modeling approach enables the representation of disease progress for each individual. the infection statuses, periods, and transmission probability per contact can be associated with individuals as their characteristics, while infection events (e.g., receipt of infection and emission of agents) can be modeled as behaviors of individuals. each individual has one of four infection statuses at any time point: susceptible, latent, infectious, or recovered (kermack & mckendrick, 1927). the infection status changes when infection events are triggered by the behaviors of this individual or surrounding individuals. the simulation of disease transmission often starts with the introduction of a few infectious individuals (infectious seeds) into a susceptible population. then, the first generation of infections can be identified by searching the susceptible contacts of these seeds. stochastic methods, such as the monte carlo method, can be used to determine who will be infected. subsequently, the first generation of infections may further infect their contacts, leading over time to a cascading diffusion of the disease over the network; a minimal sketch of this daily transmission step is given after this paragraph. to parameterize the simulation, the transmission probability of a disease and the lengths of the latent and infectious periods can be derived from the established literature or from observational disease records. like other human behaviors, the adoption of preventive behavior depends on the individual's own characteristics (e.g., knowledge, experience, and personal traits) and inter-personal influence from surrounding individuals (e.g., family support and role model effects) (glanz et al., 2002). because individuals vary in their willingness to adopt, human behaviors often diffuse from a few early adopters to the early majority, and then over time throughout the social networks (rogers, 1995).
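a minimal sketch of the daily transmission and progression step, building on the hypothetical individual class introduced earlier; the period lengths and efficacy figures anticipate the illustrative influenza values given later in the text and are not general defaults:

```python
import random

LATENT_DAYS, INFECTIOUS_DAYS = 1, 4   # illustrative influenza values from the text

def daily_disease_update(individuals, p_transmit, rng):
    """One day of transmission and natural-history progression. `individuals`
    is a list of Individual objects indexed by idx; pass e.g. rng = random.Random(7)."""
    # 1) transmission along the infection network (Monte Carlo draws per contact)
    newly_latent = []
    for person in individuals:
        if person.infection_status != "susceptible":
            continue
        for c in person.infection_contacts:
            other = individuals[c]
            if other.infection_status != "infectious":
                continue
            p = p_transmit
            if other.adopter:
                p *= 1.0 - 0.40   # infectious adopters infect others 40% less
            if person.adopter:
                p *= 1.0 - 0.70   # susceptible adopters are 70% less likely infected
            if rng.random() < p:
                newly_latent.append(person)
                break
    # 2) progression through the natural history
    for person in individuals:
        if person.infection_status in ("latent", "infectious"):
            person.days_in_status += 1
        if person.infection_status == "latent" and person.days_in_status >= LATENT_DAYS:
            person.infection_status, person.days_in_status = "infectious", 0
            person.symptomatic = rng.random() < 0.5   # 50% manifest symptoms
        elif person.infection_status == "infectious" and person.days_in_status >= INFECTIOUS_DAYS:
            person.infection_status = "recovered"
    # 3) the newly infected enter the latent period
    for person in newly_latent:
        person.infection_status, person.days_in_status = "latent", 0
```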
a number of individual-based models have been developed by sociologists and geographers to represent such behavioral diffusion processes, e.g., the mean-information-field (mif) model (hägerstrand, 1967), the threshold model (granovetter, 1978), and the relative agreement model (deffuant et al., 2005). the mif model populates individuals on a regular network (or a grid), and assumes that a behavior diffuses through 'word-of-mouth' communication between an adopter and his/her neighbors. the mif is a moving window that defines the size of the neighborhood and the likelihood of communication with every adopter. the simulation centers the mif on every adopter and uses the monte carlo method to identify a new generation of adopters (hägerstrand, 1967). the threshold model assumes that individuals observe their surroundings and adopt a behavior based on a threshold effect (granovetter, 1978; valente, 1996). the threshold is the proportion of adopters among an individual's social contacts necessary to convince this individual to adopt. the behavioral diffusion begins with a small number of adopters, and spreads from the low-threshold population to the high-threshold population. the more recently proposed relative agreement model assumes that every individual holds an initial attitude, which is a value range specified by a mean value, a maximum and a minimum. based on the value ranges, individuals' attitudes are categorized as positive, neutral, or negative. individuals communicate through a social network and influence each other's attitudes (value ranges) reciprocally according to mathematical rules of relative agreement. if individuals hold positive attitudes for a certain time period, they will decide to adopt a behavior (deffuant et al., 2005). due to the individual-based nature of all these models, they can be easily incorporated under the proposed conceptual framework. to further discuss the individual-based design of behavioral models, this research chose the threshold model for illustration. in terms of complexity, the threshold model lies midway between the mif model and the relative agreement model, and its parameters can be feasibly estimated through social surveys. the mif model has been criticized for its simplicity, in that it assumes immediate adoption after a communication and oversimplifies the decision process of individuals (shannon, bashshur, & metzner, 1971). by contrast, the relative agreement model is too sophisticated: many parameters are difficult to estimate, for example the ranges of individual attitudes. the threshold model can be formulated as follows so as to become an integral part of the coupled-diffusion framework. first, individuals are assumed to spontaneously evaluate the proportion of adopters among their contacts, and so perceive the pressure of adoption. once the perceived pressure reaches a threshold (hereinafter called the threshold of adoption pressure), an individual will decide to adopt preventive behavior. second, in order to relate the preventive behavior to infectious diseases, individuals also evaluate the proportion of infected individuals (with disease symptoms) among their contacts, and so perceive the risk of infection. once the perceived risk reaches another threshold (hereinafter called the threshold of infection risk), an individual will also adopt preventive behavior. these two threshold effects can be further formulated as three characteristics and two behaviors of individuals, as sketched in the code below.
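the two-threshold rule can be sketched directly, again building on the hypothetical individual class; a synchronous update is assumed, so that all evaluations within a day are based on the previous day's states:

```python
def daily_adoption_update(individuals):
    """One day of behavioral diffusion under the two-threshold rule; new
    adopters are committed only after all evaluations (synchronous update)."""
    new_adopters = []
    for person in individuals:
        if person.adopter:
            continue
        contacts = [individuals[c] for c in person.communication_contacts]
        if not contacts:
            continue
        perceived_risk = sum(c.symptomatic for c in contacts) / len(contacts)
        perceived_pressure = sum(c.adopter for c in contacts) / len(contacts)
        if (perceived_risk >= person.risk_threshold
                or perceived_pressure >= person.pressure_threshold):
            new_adopters.append(person)
    for person in new_adopters:
        person.adopter = True
```

only symptomatic contacts enter the perceived risk, matching the assumption that infections without symptoms cannot be observed by surrounding individuals.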
the three characteristics include an adoption status (adopter or non-adopter) and two individualized thresholds toward adoption. the two behaviors represent the individual's evaluation of adoption pressure and infection risk from surrounding contacts, which in turn determines the adoption status. the individualized thresholds toward adoption reflect personal characteristics of individuals, while the behaviors of evaluation represent the inter-personal influence between individuals. to build a working model, the individualized thresholds toward adoption can best be estimated by health behavior surveys, as illustrated below. based on the discussion above, the conceptual framework (fig. 1) can be transformed into a formative design with four model components and their relationships (fig. 2). individuals are the building blocks of the proposed model, and their interactions compose networks as the core of the model. through the infection network, individuals may receive infection from others and have their infection status changed, propelling the diffusion of diseases. meanwhile, individuals may perceive risks and pressure through the communication network, and gradually adopt preventive behavior, resulting in the behavioral diffusion. the adoption of preventive behavior reduces the disease transmission probability, thus controlling and preventing the disease transmission. in this manner, the diffusion of diseases and preventive behavior in a population are coupled together. to illustrate the proposed coupled-diffusion model, an influenza epidemic was simulated in a hypothetical population of 5000 individuals (n = 5000), each with characteristics and behaviors as described in fig. 2. influenza was chosen because it is common and readily transmissible between individuals. the simulation simply assumes that the population is closed, i.e., no births, deaths, or migrations. with regard to the network component, the average number of links per individual was set to 12, reasonably assuming that an individual on average has contact with 2 family members and 10 colleagues. for the purpose of sensitivity analysis, the illustrative model allowed the disease and communication networks to take either a small-world (sw) structure or a scale-free (sf) structure. the generation of sw structures started with a regular network where all individuals were linked to their nearest neighbors. then, each individual's existing links were rewired with a probability to randomly selected individuals (watts & strogatz, 1998). the rewiring probability p ranged from 0 to 1, and governed the clustering level and average path lengths of the resultant networks (fig. 3a). the sf structures were created by a preferential attachment algorithm, which linked each new individual preferentially to those who already have a large number of contacts (pastor-satorras & vespignani, 2001). this algorithm produces a power-law degree distribution, p(k) ∼ k^(−γ) (k is the node degree), with various exponents γ (fig. 3b). based on fig. 3a and b, the rewiring probabilities p were set to 0.005, 0.05, and 0.5 to typically represent the regular, small-world, and random networks, respectively (fig. 3c–e). the exponent γ was set to 3, 5, and 7 to represent three scale-free networks with high, medium, and low levels of node heterogeneity (fig. 3f–h). a sensitivity analysis was performed to examine every possible pair of
network structures ⟨sw, sw⟩, ⟨sf, sf⟩, ⟨sw, sf⟩, and ⟨sf, sw⟩ as a network combination (3 parameter values for the infection network × 3 for the communication network × 4 structure pairings = 36 combinations in total), where the first element indicates the structure of the infection network and the second specifies the structure of the communication network. to simulate the diffusion of influenza, the latent period and infectious period were specified as 1 day and 4 days, respectively, based on published estimates (heymann, 2004). the transmission probability per contact was varied from 0.01 to 0.1 (with a 0.01 increment) to test its effects on the coupled-diffusion processes. 50% of infected individuals were assumed to manifest symptoms, following the assumption made by ferguson et al. (2006). only these symptomatic individuals could be perceived by their surrounding individuals as infection risks. recovered individuals were deemed immune to further infection during the rest of the epidemic. with respect to the diffusion of preventive behavior, the use of flu antiviral drugs (e.g., tamiflu and relenza) was taken as a typical example because its efficacy is more conclusive than that of other preventive behaviors, such as hand washing and facemask wearing. for symptomatic individuals, the probability of taking antiviral drugs was set to 75% (mcisaac, levine, & goel, 1998; stoller, forster, & portugal, 1993), and the consequent probability of infecting others was set to be reduced by 40% (longini, halloran, nizam, & yang, 2004). susceptible individuals may also take antiviral drugs due to the perceived infection risk or adoption pressure. if they use antiviral drugs, the probability of being infected was set to be reduced by 70% (hayden, 2001). the key to simulating the diffusion of preventive behavior was to estimate the thresholds of infection risk and adoption pressure for individuals. a health behavior survey was conducted online for one month (march 12–april 12, 2010) to recruit participants. voluntary participants were invited to answer two questions: 1) "suppose you have 10 close contacts, including household members, colleagues, and close friends; after how many of them get influenza would you consider using flu drugs?", and 2) "suppose you have 10 close contacts, including household members, colleagues, and close friends; after how many of them start to use flu drugs would you consider using flu drugs, too?". the first question was designed to estimate the threshold of infection risk, while the second was for the threshold of adoption pressure. the survey ended up with 262 respondents out of 273 participants (a 96% response rate), and their answers were summarized into two threshold-frequency distributions (fig. 4). the monte carlo method was then used to assign threshold values to the 5000 modeled individuals based on the two distributions, as sketched below. this survey was approved by the irb at the university at buffalo. to initialize the simulation, all 5000 individuals were set to be non-adopters and susceptible to influenza. one individual was randomly chosen to be infectious on the first day. the model took a daily time step and simulated the two diffusion processes simultaneously over 200 days. the simulation results were presented as disease attack rates (total percent of symptomatic individuals in the population) and adoption rates (total percent of adopters in the population).
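the monte carlo assignment of thresholds can be sketched as sampling from the two survey-derived frequency distributions. the frequencies below are placeholders standing in for the published distributions of fig. 4 (the actual survey values are not reproduced here); index i corresponds to a threshold of i contacts out of 10.

```python
import numpy as np

rng = np.random.default_rng(42)

# placeholder answer frequencies for the two survey questions (each sums to 1)
risk_freq     = np.array([0.05, 0.10, 0.20, 0.25, 0.15, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01])
pressure_freq = np.array([0.02, 0.08, 0.15, 0.25, 0.20, 0.12, 0.08, 0.05, 0.03, 0.01, 0.01])

n = 5000                              # modeled population size
thresholds = np.arange(11) / 10.0     # 0/10, 1/10, ..., 10/10
risk_thresholds     = rng.choice(thresholds, size=n, p=risk_freq)
pressure_thresholds = rng.choice(thresholds, size=n, p=pressure_freq)
```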
another two characteristics were derived to indicate the speed of the coupled-diffusion: the epidemic slope and the adoption slope. the former is defined as the total number of symptomatic individuals divided by the duration of the epidemic (in days). similarly, the latter is defined as the total number of adopters divided by the duration of the behavioral diffusion (in days). they are called slopes because graphically they approximate the slopes of the cumulative diffusion curves. a higher slope implies a faster diffusion because of more infections/adoptions (the numerator) in a shorter time period (the denominator). all simulation results were averaged over 50 model realizations to average out the randomness. simulation results are presented in two parts. first, the coupled-diffusion process under various transmission probabilities was analyzed and compared to an influenza-only process that is widely seen in the literature.
fig. 3. (a) standardized network properties (average path length and clustering coefficient) as a function of the rewiring probability p from 0 to 1, given n = 5000; (b) the power-law degree distributions for γ = 3, 5, and 7, given n = 5000; (c–e) an illustration of generated sw networks for three p values, given n = 100 for figure clarity; (f–h) an illustration of sf networks for three γ values, given n = 100.
the influenza-only process was simulated with the same parameters as the coupled-diffusion process except that individual preventive behavior was not considered. for ease of comparison, a typical "small-world" network (p = 0.05) was chosen for both the infection and communication networks, assuming the two overlap. the second part examined the dynamics of the coupled-diffusion under various structures of infection and communication networks, i.e., the 36 network combinations described above, while fixing the influenza transmission probability to 0.05 (resultant basic reproductive number r0 = 1–1.3). fig. 5a indicates that the diffusion of influenza with and without the preventive behavior differs significantly, particularly for medium transmission probabilities (0.04–0.06). for the influenza-only process (the black curve with triangles), the disease attack rate rises dramatically as the transmission probability exceeds 0.03, and reaches a plateau of 50% when the probability increases to 0.07. the coupled-diffusion process (the black curve with squares) produces lower attack rates, which slowly incline to the maximum of 45%. this is because individuals gradually adopt preventive behavior, thereby inhibiting disease transmission from infectious individuals to the susceptible. meanwhile, the adoption rate (the gray curve with squares) also increases with the transmission probability, and can reach a maximum of 65% of the population. this is not surprising, because the more individuals get infected, the greater the risks and pressure other individuals may perceive, motivating them to adopt preventive behavior. individuals who eventually have not adopted may have extremely high thresholds of adoption (see fig. 4), and thus resist adopting preventive behavior. fig. 5b displays an example of the coupled-diffusion process (transmission probability = 0.05), ending up with nearly 2000 symptomatic cases and approximately 3000 adopters of flu antiviral drugs. despite differences in magnitude, the two diffusion curves exhibit a similar trend that follows the 5-phase s-shaped curve of innovation diffusion (rogers, 1995).
the 'innovation' phase occurs from the beginning to day 30, followed by the 'early acceptance' phase (days 31–50), 'early majority' (days 51–70), 'late majority' (days 71–90), and 'laggards' (after day 90). this simulated similarity in temporal trend is consistent with many empirical studies regarding flu infection and flu drug usage. for example, das et al. (2005) and magruder (2003) compared the temporal variation of influenza incidence and over-the-counter flu drug sales in new york city and the washington, dc metropolitan area, respectively. both studies reported a high correlation between over-the-counter drug sales and cases of diagnosed influenza, and thus suggested that over-the-counter drug sales could be a possible early detector of disease outbreaks. this consistency with the observed facts, to some extent, reflects the validity of the proposed model. in addition to the transmission probability, the coupled-diffusion process is also sensitive to the various combinations of network structures, i.e., the 36 network combinations (fig. 6). the z axis represents either the epidemic or the adoption slope, and a greater value indicates a faster diffusion process. in general, both epidemic and adoption slopes change dramatically with the structure of the infection network, while they are less sensitive to the variation of the communication network. given a small-world infection network (fig. 6a–b and e–f), the epidemic and adoption slopes increase quickly as the rewiring probability p rises from 0.005 to 0.5. when p = 0.005 (a regular network), almost all individuals are linked to their nearest neighbors, and influenza transmission between two distant individuals needs to go through a large number of intermediate individuals. the slow spread of influenza induces a low perception of infection risks among individuals, thereby decelerating the dissemination of preventive behavior.
fig. 6. the sensitivity of coupled-diffusion processes to various network structures, including sw infection–sw communication, sf infection–sf communication, sw infection–sf communication, and sf infection–sw communication. each combination is displayed in one row from top to bottom. sw and sf denote the network structure, while the subscripts indicate the network function. parameter p is the rewiring probability of a sw network, taking values (0.005, 0.05, 0.5), while parameter γ is the exponent of a sf network, taking values (3, 5, 7). the z axis denotes epidemic slopes (the left column) and adoption slopes (the right column) as a result of a network structure. a greater z value indicates a faster diffusion process.
as p increases to 0.5 (a random network), a large number of shortcuts exist in the network, and the transmission of influenza is greatly accelerated by these shortcuts. as a result, the diffusion of preventive behavior is also accelerated, because individuals may perceive more risks of infection and take action quickly. likewise, given a scale-free infection network (fig. 6c–d and g–h), both influenza and preventive behavior diffuse much faster in a highly heterogeneous network (γ = 3) than in a relatively homogeneous network (γ = 7). this is because a highly heterogeneous network has a few super-spreaders who have numerous direct contacts. super-spreaders act as hubs, directly distributing the influenza virus to a large number of susceptible individuals and thus accelerating the disease diffusion.
as individuals perceive more risks of infection in their surroundings, they adopt preventive behavior faster. human networks, infectious diseases, and human preventive behavior are intrinsically inter-related, but little attention has been paid to simulating the three together. this article proposes a conceptual framework to fill this knowledge gap and offer a more comprehensive representation of the disease system. this two-network, two-diffusion framework is composed of four components: individuals, networks, infectious diseases, and the preventive behavior of individuals. the individual-based modeling approach can be employed to represent discrete individuals, while network structures support the formalization of individual interactions, including infection and communication. disease transmission models and behavioral models can be embedded into the network structures to simulate disease infection and adoptive behavior, respectively. the collective changes in individuals' infection and adoption statuses represent the coupled-diffusion process at the population level. compared to the widely used influenza-only models, the proposed model produces a lower percentage of infection, because preventive behavior protects certain individuals from being infected. sensitivity analysis identifies that the structure of the infection network is a dominant factor in the coupled-diffusion, while the variation of the communication network produces fewer effects. this research implies that current predictions about disease impacts might be under-estimating the transmissibility of the disease, e.g., the transmission probability per contact. modelers fit models to observed data from populations that are presumably performing preventive behavior, while the models they create do not account for that behavior. when they match their modeled infection levels to those in these populations, the fitted disease transmissibility needs to be lower than its true value so as to compensate for the effects of preventive behavior. this issue has been mentioned in a number of recent studies, such as ferguson et al. (2006), but the literature contains few in-depth analyses. this article moves the issue towards its solution, and stresses the importance of understanding human preventive behavior before policy making. the study raises an additional research question concerning social-distancing interventions for disease control, such as household quarantine and workplace/school closure. admittedly, these interventions decompose the infection network for disease transmission, but they may also break down the communication network and limit the propagation of preventive behavior. the costs and benefits of these interventions remain unclear, and a comprehensive evaluation is needed. the proposed framework also suggests several directions for future research. first, although the illustrative model is based on a hypothetical population, the representation principles outlined in this article can be applied to a real population. more realistic models can be established based on census data, workplace data, and health survey data. second, the proposed framework focuses on inter-personal influence on human behavior, but has not included the effects of mass media, another channel of behavioral diffusion. the reason is that the effects of mass media remain inconclusive and difficult to quantify, while the effects of inter-personal influence have been extensively studied before.
third, the proposed framework has not considered the 'risk compensation' effect, i.e., individuals will behave less cautiously in situations where they feel safer or more protected (cassell, halperin, shelton, & stanton, 2006). in the context of infectious diseases, risk compensation can be interpreted as individuals being less cautious about the disease if they have taken antiviral drugs, which may facilitate the disease transmission. this health-psychological effect could also be incorporated to refine the framework. to summarize, this article proposes a synergy between epidemiology, social sciences, and human behavioral sciences. for a broader view, the conceptual framework could easily be expanded to include more theories, for instance, from communications, psychology, and public health, thus forming a new interdisciplinary area. further exploration in this area would offer a better understanding of complex human-disease systems. the knowledge acquired would be of great significance given that vaccines and manpower may be insufficient to combat emerging infectious diseases.
references:
error and attack tolerance of complex networks
the health belief model and personal health behavior
hiv and risk behaviour: risk compensation: the achilles' heel of innovations in hiv prevention?
breakdown of the internet under intentional attack
monitoring over-the-counter medication sales for early detection of disease outbreaks - new york city
an individual-based model of innovation diffusion mixing social value and individual benefit
mixing patterns and the spread of close-contact infectious diseases
strategies for mitigating an influenza pandemic
measuring personal networks with daily contacts: a single-item survey question and the contact diary
community structure in social and biological networks
health behavior and health education: theory, research, and practice
epidemiology. philadelphia: wb saunders
threshold models of collective behavior
individual-based modeling and ecology
on monte carlo simulation of diffusion
perspectives on antiviral use during pandemic influenza
control of communicable diseases manual
infectious disease modeling of social contagion in networks
the large-scale organization of metabolic networks
the rise of the individual-based model in ecology
networks and epidemic models
a contribution to the mathematical theory of epidemics
individual causal models and population system models in epidemiology
the web of human sexual contacts
containing pandemic influenza with antiviral agents
evaluation of over-the-counter pharmaceutical sales as a possible early warning indicator of human disease
visits by adults to family physicians for the common cold
epidemic spreading in scale-free networks
agent-based modeling toolkits netlogo, repast, and swarm
diffusion of innovations
social learning theory and the health belief model
social network analysis: a handbook
the spatial diffusion of an innovative health care plan
will vaccines be available for the next influenza pandemic?
self-care responses to symptoms by older people. a health diary study of illness behavior
an experimental study of the small world problem
social network thresholds in the diffusion of innovations
collective dynamics of small-world networks
world health organization report on infectious diseases. world health organization
the authors are thankful for insightful comments from the editor and two reviewers.
key: cord-336747-8m7n5r85 authors: grossmann, g.; backenkoehler, m.; wolf, v.
title: importance of interaction structure and stochasticity for epidemic spreading: a covid-19 case study date: 2020-05-08 doi: 10.1101/2020.05.05.20091736 sha: doc_id: 336747 cord_uid: 8m7n5r85
in the recent covid-19 pandemic, computer simulations are used to predict the evolution of the virus propagation and to evaluate the prospective effectiveness of non-pharmaceutical interventions. as such, the corresponding mathematical models and their simulations are central tools to guide political decision-making. typically, ode-based models are considered, in which fractions of infected and healthy individuals change deterministically and continuously over time. in this work, we translate an ode-based covid-19 spreading model from the literature to a stochastic multi-agent system and use a contact network to mimic complex interaction structures. we observe a large dependency of the epidemic's dynamics on the structure of the underlying contact graph, which is not adequately captured by existing ode-models. for instance, the existence of super-spreaders leads to a higher infection peak but a lower death toll compared to interaction structures without super-spreaders. overall, we observe that the interaction structure has a crucial impact on the spreading dynamics, which exceeds the effects of other parameters such as the basic reproduction number r0. we conclude that deterministic models fitted to covid-19 outbreak data have limited predictive power or may even lead to wrong conclusions, while stochastic models taking interaction structure into account offer different and probably more realistic epidemiological insights. on march 11th, 2020, the world health organization (who) officially declared the outbreak of the coronavirus disease 2019 (covid-19) to be a pandemic. by this date at the latest, curbing the spread of the virus became a major worldwide concern. given the lack of a vaccine, the international community relied on non-pharmaceutical interventions (npis) such as social distancing, mandatory quarantines, or border closures. such intervention strategies, however, inflict high costs on society. hence, for political decision-making it is crucial to forecast the spreading dynamics and to estimate the effectiveness of different interventions. mathematical and computational modeling of epidemics is a long-established research field with the goal of predicting and controlling epidemics. it has developed epidemic spreading models of many different types: data-driven and mechanistic, as well as deterministic and stochastic approaches, ranging over many different temporal and spatial scales (see [49, 15] for an overview). computational models have been calibrated to predict the spreading dynamics of the covid-19 pandemic and have influenced public discourse. most models, and in particular those with high impact, are based on ordinary differential equations (odes). in these equations, the fractions of individuals in certain compartments (e.g., infected and healthy) change continuously and deterministically over time, and interventions can be modeled by adjusting parameters. in this paper, we compare the results of covid-19 spreading models that are based on odes to results obtained from a different class of models: stochastic spreading processes on contact networks. we argue that virus spreading models taking into account the interaction structure of individuals and reflecting the stochasticity of the spreading process yield a more realistic view of the epidemic's dynamics.
if an underlying interaction structure is considered, not all individuals of a population meet equally likely, as assumed in ode-based models. a well-established way to model such structures is to simulate the spreading on a network structure that represents the individuals of a population and their social contacts. effects of the network structure are largely related to the epidemic threshold, which describes the minimal infection rate needed for a pathogen to be able to spread over a network [37]. in the network-free paradigm, the basic reproduction number (r0), which describes the (mean) number of susceptible individuals infected by patient zero, determines the evolution of the spreading process. the value of r0 depends on both the connectivity of the society and the infectiousness of the pathogen. in contrast, in the network-based paradigm the interaction structure (given by the network) and the infectiousness (given by the infection rate) are decoupled. here, we focus on contact networks, as they provide a universal way of encoding real-world interaction characteristics like super-spreaders, grouping of different parts of the population (e.g., senior citizens or children with different contact patterns), as well as restrictions due to spatial conditions, mobility, and household structures. moreover, models based on contact networks can be used to predict the efficiency of interventions [38, 34, 5]. here, we analyze in detail a network-based stochastic model for the spreading of covid-19 with respect to its differences from existing ode-based models and the sensitivity of the spreading dynamics to particular network features. we calibrate both ode models and stochastic models with interaction structure to the same basic reproduction number r0, or to the same infection peak, and compare the corresponding results. in particular, we analyze the changes in the effective reproduction number over time. for instance, early exposure of super-spreaders leads to a sharp increase of the reproduction number, which results in a strong increase of infected individuals. we compare the times at which the number of infected individuals is maximal for different network structures, as well as the death toll. our results show that the interaction structure has a major impact on the spreading dynamics and, in particular, important characteristic values deviate strongly from those of the ode model. in the last decade, research has focused largely on epidemic spreading where interactions are constrained by contact networks, i.e., a graph representing the individuals (as nodes) and their connectivity (as edges). many generalizations exist, e.g., to weighted, adaptive, temporal, and multi-layer networks [31, 44]. here, we focus on simple contact networks without such extensions. spreading characteristics on different contact networks based on the susceptible-infected-susceptible (sis) or susceptible-infected-recovered (sir) compartment model have been investigated intensively. in such models, each individual (node) successively passes through the individual stages (compartments). for an overview, we refer the reader to [35]. qualitative and quantitative differences between network structures and network-free models have been investigated in [22, 2]. in contrast, this work considers a specific covid-19 spreading model and focuses on those characteristics that are most relevant for covid-19 and which have, to the best of our knowledge, not been analyzed in previous work.
sis-type models require knowledge of the spreading parameters (infection strength, recovery rate, etc.) and the contact network, which can partially be inferred from real-world observations. currently for covid-19, inferred data seem to be of very poor quality [24]. moreover, while the spreading parameters are subject to a broad scientific discussion, publicly available data that could be used for inferring a realistic contact network practically do not exist. real-world data on contact networks are therefore rare [30, 45, 23, 32, 43] and not available for large-scale populations. a reasonable approach is to generate the data synthetically, for instance by using mobility and population data based on geographical diffusion [46, 17, 36, 3]. for instance, this has been applied to the influenza virus [33]. due to the major challenge of inferring a realistic contact network, most of these works, however, focus on how specific network features shape the spreading dynamics. the literature abounds with proposed models of the covid-19 spreading dynamics. very influential is the work of neil ferguson and his research group, which regularly publishes reports on the outbreak (e.g. [11]). they study the effects of different interventions on the outbreak dynamics. the computational modeling is based on a model of influenza outbreaks [19, 12]. they present a very high-resolution spatial analysis based on movement data, air-traffic networks, etc., and perform sensitivity analysis on the spreading parameters, but, to the best of our knowledge, not on the interaction data. interaction data were also inferred locally at the beginning of the outbreak in wuhan [4], in singapore [40], and in chicago [13]. models based on community structures, however, consider isolated (parts of) cities and are of limited significance for a large-scale model-based analysis of the outbreak dynamics. another work focusing on interaction structure is the modeling of outbreak dynamics in germany and poland by bock et al. [6]. the interaction structure within households is modeled based on census data. inter-household interactions are expressed as a single variable and are inferred from data. they then generate "representative households" by re-sampling but remain vague on many details of the method. in particular, they only use a single value to express the rich types of relationships between individuals of different households. a more rigorous model of the stochastic propagation of the virus is proposed by arenas et al. [1]. they take the interaction structure and the heterogeneity of the population into account by using demographic and mobility data, and analyze the model by deriving a mean-field equation. mean-field equations are more suitable for expressing the mean of a stochastic process than other ode-based methods, but tend to be inaccurate for complex interaction structures. moreover, the relationship between network-constrained interactions and mobility data remains unclear. other notable approaches use sir-type methods but cluster individuals into age groups [39, 28], which increases the model's accuracy. rader et al. [41] combined spatial, urbanization, and census data and observed that the crowding structure of densely populated cities strongly shaped the epidemic's intensity and duration. in a similar way, a meta-population model for a more realistic interaction structure has been developed [8] without considering an explicit network structure.
the majority of research, however, is based on deterministic, network-free sir-based ode-models. for instance, the work of josé lourenço et al. [29] infers epidemiological parameters based on a standard sir model. similarly, dehning et al. [9] use an sir-based ode-model in which the infection rate may change over time. they use their model to predict a suitable time point to loosen npis in germany. khailaie et al. analyze how changes in the reproduction number ("mimicking npis") affect the epidemic dynamics using epidemic simulations [25], where a variant of the deterministic, network-free sir-model is used and modified to include states (compartments) for hospitalized, deceased, and asymptomatic patients. otherwise, the method is conceptually very similar to [29, 9], and the authors argue against a relaxation of npis in germany. another popular work is the online simulator covidsim (available at covidsim.eu). the underlying method is also based on a network-free sir-approach [50, 51]. however, the role of an interaction structure is not discussed, and the authors explicitly state that they believe that stochastic effects are only relevant in the early stages of the outbreak. a very similar method has been developed at the german robert-koch-institut (rki) [7]. jianxi luo et al. proposed an ode-based sir-model to predict the end of the covid-19 pandemic (available at ddi.sutd.edu.sg), which is regressed with daily updated data. ode-models have also been used to project the epidemic dynamics into the "post-pandemic" future by kissler et al. [27]. some groups also resort to branching processes, which are inherently stochastic but not based on a complex interaction structure [21, 42]. a very popular class of epidemic models is based on the assumption that during an epidemic individuals are either susceptible (s), infected (i), or recovered/removed (r). the mean number of individuals in each compartment evolves according to the following system of ordinary differential equations:
d/dt s(t) = −λ_ode · s(t) · i(t)/n
d/dt i(t) = λ_ode · s(t) · i(t)/n − β · i(t)    (1)
d/dt r(t) = β · i(t)
where n denotes the total population size, and λ_ode and β are the infection and recovery rates. typically, one assumes that n = 1, in which case the equations refer to fractions of the population, leading to the invariance s(t) + i(t) + r(t) = 1 for all t. it is trivial to extend the compartments and transitions.
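for reference, eq. (1) can be integrated numerically in a few lines; the parameter values and initial condition below are illustrative only (they follow the paper's convention n = 1, i.e., the states are fractions):

```python
from scipy.integrate import solve_ivp

lam_ode, beta, n = 0.4, 0.2, 1.0            # illustrative rates, n = 1 -> fractions

def sir_rhs(t, y):
    s, i, r = y
    new_infections = lam_ode * s * i / n    # mass-action infection term of eq. (1)
    return [-new_infections, new_infections - beta * i, beta * i]

sol = solve_ivp(sir_rhs, (0.0, 200.0), [0.997, 0.003, 0.0], max_step=1.0)
print(max(sol.y[1]))                        # height of the infection peak
```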
a stochastic network-based spreading model is a continuous-time stochastic process on a discrete state space. the underlying structure is given by a graph, where each node represents one individual (or other entities of interest). at each point in time, each node occupies a compartment, for instance: s, i, or r. moreover, nodes can only receive or transmit infections from neighboring nodes (according to the edges of the graph). for the general case with m possible compartments, this yields a state space of size m^n, where n is the number of nodes. the jump times until events happen are typically assumed to follow an exponential distribution. note that in the ode model, residual residence times in the compartments are not tracked, which naturally corresponds to the exponential distribution in the network model. hence, the underlying stochastic process is a continuous-time markov chain (ctmc) [26]. the extension to non-markovian semantics is trivial. we illustrate the three-compartment case in fig. 1. the transition rates of the ctmc are such that an infected node transmits infections at rate λ. hence, the rate at which a susceptible node is infected is λ · #neigh(i), where #neigh(i) is the number of its infected direct neighbors. spontaneous recovery of a node occurs at rate β. the size of the state space renders a full solution of the model infeasible, and mean-field approximations [14] or monte-carlo simulations are common ways to analyze the process. general differences to the ode model. the aforementioned formalism yields some fundamental differences to network-free ode-based approaches. the most distinct difference is the decoupling of infectiousness and interaction structure. the infectiousness λ (i.e., the infection rate) is a parameter expressing how contagious a pathogen inherently is. it encodes the probability of a virus transmission if two people meet. that is, it is independent of the social interactions of individuals (it might, however, depend on hygiene, masks, etc.). the influence of social contacts is expressed in the (potentially time-varying) connectivity of the graph. loosely speaking, it encodes the possibility that two individuals meet. in the ode-approach, both are combined in the basic reproduction number. note that, throughout this manuscript, we use λ to denote the infectiousness of covid-19 (as an instantaneous transmission rate). another important difference is that ode-models consider fractions of individuals in each compartment. in the network-based paradigm, we model absolute numbers of entities in each compartment, and extinction of the epidemic may happen with positive probability. while ode-models are agnostic to the actual population size, in network-based models, increasing the population by adding more nodes inevitably changes the dynamics. another important connection between the two paradigms is that if the network topology is a complete graph (resp. clique), then the ode-model gives an accurate approximation of the expected fractions of the network-based model. in systems biology, this assumption is often referred to as well-stirredness. in the limit of an infinite graph size, the approximation approaches the true mean. to transform an ode-model to a network-based model, one can simply keep the rates relating to spontaneous transitions between compartments, as these transitions do not depend on interactions (e.g., recovery at rate β or an exposed node becoming infected). translating the infection rate is more complicated. in ode-models, one typically has a given infection rate and assumes that each infected individual can infect susceptible ones. to make the model invariant to the actual number of individuals, one typically divides the rate by the population size (or assumes the population size is one and the odes express fractions).
naturally, in a contact network, we do not work with fractions, but each node relates to one entity. here, we propose to choose an infection rate such that the network-based model yields the same basic reproduction number r0 as the ode-model. the basic reproduction number describes the (expected) number of individuals that an infected person infects in a completely susceptible population. we calibrate our model to this starting point of the spreading process, where there is a single infected node (patient zero). we assume that r0 is either explicitly given or can implicitly be derived from an ode-based model specification. hence, when we pick a random node as patient zero, we want it to infect on average r0 susceptible neighbors (all neighbors are susceptible at that point in time) before it recovers or dies. let us assume that, as in the aforementioned sir-model, an infectious node infects its susceptible neighbors with rate λ and loses its infectiousness (by dying, recovering, or quarantining) with rate β. according to the underlying ctmc semantics of the network model, each susceptible neighbor gets infected with probability λ/(β + λ) [26]. note that we only take direct infections from patient zero into account and, for simplicity, assume all neighbors are only infected by patient zero. hence, when patient zero has k neighbors, the expected number of neighbors it infects is k · λ/(β + λ). since the mean degree of the network is k_mean, the expected number of nodes infected by patient zero is r0 = k_mean · λ/(β + λ). now we can calibrate λ to relate to any desired r0, that is, λ = r0 · β/(k_mean − r0). note that we generally assume that r0 > 1 and that no isolates (nodes with no neighbors) exist in the graph, which implies k_mean ≥ 1. hence, by construction, it is not possible to have an r0 which is larger than (or equal to) the average number of neighbors in the network. in contrast, in the deterministic paradigm this relationship is given by the equation r0 = λ_ode/β (cf. [29, 9]). note that the recovery rate β is identical in the ode- and network-model. we can translate the infection rate of an ode-model to a corresponding network-based stochastic model with the equation λ = λ_ode/(k_mean − λ_ode/β) while keeping r0 fixed. in the limit of an infinite complete network, this yields lim_{n→∞} λ = λ_ode/n, which is equivalent to the effective infection rate λ_ode/n in the ode-model for population size n (cf. eq. (1)). example. consider a network where each node has exactly 5 neighbors (a 5-regular graph) and let r0 = 2. we also assume that the recovery rate is β = 1, which then yields λ_ode = 2. the probability that a random neighbor of patient zero becomes infected is 2/5 = λ/(β + λ), which gives λ = 2/3.
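the calibration just derived is easy to transcribe into code; the function below solves k_mean · λ/(β + λ) = r0 for λ and reproduces the worked example:

```python
def calibrate_lambda(r0, beta, k_mean):
    """per-contact infection rate lambda yielding the desired r0 on a network
    with mean degree k_mean (requires 1 < r0 < k_mean, as argued above)."""
    assert 1.0 < r0 < k_mean
    return r0 * beta / (k_mean - r0)

print(calibrate_lambda(r0=2.0, beta=1.0, k_mean=5.0))   # 0.666..., i.e. lambda = 2/3
```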
it is trivial to extend the compartments and transitions, for instance by including an exposed compartment for the time period where an individual is infected but not yet infectious. the derivation of r0 remains the same; the only requirement is the existence of a distinct infection and recovery rate, respectively. in the next section, we discuss a more complex case. we consider a network-based model that is strongly inspired by the ode-model used in [25]; we document it in fig. 2. we use the same compartments and transition-types but simplify the notation compared to [25] to make the intuitive meaning of the variables clearer. we denote the compartments by c = {s, e, c, i, h, u, r, d}, where each node can be susceptible (s), exposed (e), a carrier (c), infected (i), hospitalized (h), in the intensive care unit (u), dead (d), or recovered (r). exposed agents are already infected but symptom-free and not infectious. carriers are also symptom-free but already infectious. infected nodes show symptoms and are infectious. therefore, we assume that their infectiousness is reduced by a factor of γ (γ ≤ 1, as sick people will reduce their social activity). individuals that are hospitalized (or in the icu) are assumed to be properly quarantined and cannot infect others. note that accurate spreading parameters are very difficult to infer in general, and the high number of undetected cases complicates the problem further in the current pandemic. here, we choose values that are within the ranges listed in [25], where the ranges are rigorously discussed and justified. we document them in table 1:
table 1. model parameters.
λ · #neigh(c) + λγ · #neigh(i): infection rate of a susceptible node
µe: rate of transitioning from e to c
rc = 0.08: recovery probability when node is a carrier
µc: rate of leaving c
ri = 0.8: recovery probability when node is infected
µi = 1/5: rate of leaving i
rh = 0.74: recovery probability when node is hospitalized
µh: rate of leaving h
ru = 0.46: recovery probability when node is in the icu
µu = 1/8: rate of leaving u
we remark that there is a high amount of uncertainty in the spreading parameters. however, our goal is not a rigorous fit to data but rather a comparison of network-free ode-models to stochastic models with an underlying network structure. note that the mean number of days in a compartment is the inverse of the cumulative instantaneous rate of leaving that compartment. for instance, the mean residence time in compartment h is 1/µh. as a consequence of the race condition of the exponential distribution [47], rh modulates the probability of entering the successor compartment. that is, with probability rh, the successor compartment will be r and not u. inferring the infection rate λ for a fixed r0 is somewhat more complex than in the previous section, because this model admits two compartments for infectious agents. we first consider the expected number of nodes that a randomly chosen patient zero infects while being in state c. we denote the corresponding basic reproduction number by r̃0. we calibrate the only unknown parameter λ accordingly (the relationships from the previous section remain valid); we explain the relation to r0 when taking both c and i into account in appendix a. substituting β by µc gives λ = r̃0 · µc/(k_mean − r̃0). naturally, it is extremely challenging to reconstruct large-scale contact networks based on data. here, we test different types of contact networks with different features, which are likely to resemble important real-world characteristics. the contact networks are specific realizations (i.e., variates) of random graph models. different graph models highlight different (potential) features of the real-world interaction structure. the number of nodes ranges from 100 to 10^5. we only use strongly connected networks (where each node is reachable from all other nodes). we refer to [10] or the networkx [18] documentation for further information about the network models discussed in the sequel. we provide a schematic visualization in fig. 3. we consider erdős-rényi (er) random graphs as a baseline, where each pair of nodes is connected with a certain (fixed) probability. we also compute results for watts-strogatz (ws) random networks. they are based on a ring topology with random re-wiring. the re-wiring yields a small-world property of the network. colloquially, this means that one can reach each node from each other node in a small number of steps (even when the number of nodes increases).
we further consider geometric random networks (gn), where nodes are randomly sampled in a euclidean space and randomly connected such that nodes closer to each other have a higher connection probability. we also consider barabási-albert (ba) random graphs, which are generated using a preferential attachment mechanism among nodes, and graphs generated using the configuration model (cm-pl), which are completely random except for being constrained to have a power-law degree distribution. both models contain a very small number of nodes with very high degree, which act as super-spreaders. we also test a synthetically generated household (hh) network that was loosely inspired by [2]. each household is a clique; the edges between households represent connections stemming from work, education, shopping, leisure, etc. we use a configuration model to generate the global inter-household structure that follows a power-law distribution. we also use a complete graph (cg) as a sanity check. it allows the extinction of the epidemic, but otherwise results similar to those of the ode are expected. we are interested in the relationship between the contact network structure, r0, the height and time point of the infection peak, and the number of individuals ultimately affected by the epidemic. therefore, we run different network models with different r̃0 (which is equivalent to fixing the corresponding values for λ or for r0). for one series of experiments, we fix r̃0 = 1.8 and derive the corresponding infection rate λ and the value for λ_ode in the ode model. in the second experiment, we calibrate λ and λ_ode such that all infection peaks lie on the same level. in the sequel, we do not explicitly model npis. however, we note that the network-based paradigm makes it intuitive to distinguish between npis related to the probability that people meet (by changing the contact network) and npis related to the probability of a transmission happening when two people meet (by changing the infection rate λ). political decision-making is faced with the challenge of transforming a network structure which inherently supports covid-19 spreading into one which tends to suppress it. here, we investigate how changes in λ affect the dynamics of the epidemic in section 5 (experiment 3).
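a sketch of how the network families above can be instantiated with networkx; the parameter values follow the experimental setup given in the next section, and the post-processing of the configuration model (collapsing parallel edges and removing self-loops) is an implementation assumption of ours:

```python
import networkx as nx

n = 1000
er = nx.gnm_random_graph(n, 3 * n)         # k_mean = 6
ws = nx.watts_strogatz_graph(n, 4, 0.2)    # k = 4 neighbors, re-wiring p = 0.2
gn = nx.random_geometric_graph(n, 0.1)     # radius r = 0.1
ba = nx.barabasi_albert_graph(n, 2)        # m = 2 attachment edges
cg = nx.complete_graph(100)                # sanity-check baseline

# cm-pl: configuration model with a power-law degree sequence (gamma = 2.0, k_min = 2)
degrees = [max(2, int(d)) for d in nx.utils.powerlaw_sequence(n, 2.0)]
if sum(degrees) % 2:                       # the degree sum must be even
    degrees[0] += 1
cm_pl = nx.Graph(nx.configuration_model(degrees))   # collapse parallel edges
cm_pl.remove_edges_from(nx.selfloop_edges(cm_pl))
```

in line with the setup described above, one would keep only connected variates (e.g., by re-sampling until nx.is_connected(...) holds).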
we compare the solution of the ode model (using numerical integration) with the solution of the corresponding stochastic network-based model (using monte-carlo simulations). code will be made available at github.com/gerritgr/stochasticnetworkedcovid19. we investigate the evolution of the mean fractions in each compartment over time, the evolution of the so-called effective reproduction number, and the influence of the infectiousness λ. setup. we used contact networks with n = 1000 nodes (except for the complete graph, where we used 100 nodes). to generate samples of the stochastic spreading process, we utilized event-driven simulation (similar to the rejection-free version in [16]). the simulation started with three random seed nodes in compartment c (and with an initial fraction of 3/1000 for the ode model). one thousand simulation runs were performed on a fixed variate of a random graph. we remark that results for other variates were very similar; hence, for better comparability, we refrained from taking an average over the random graphs. the parameters to generate a graph are: er: k_mean = 6; ws: k = 4 (number of neighbors), p = 0.2 (re-wiring probability); gn: r = 0.1 (radius); ba: m = 2 (number of nodes for attachment); cm-pl: γ = 2.0 (power-law parameter), k_min = 2; hh: household size 4, global network cm-pl with γ = 2.0, k_min = 3. experiment 1: results with homogeneous r̃0. in our first experiment, we compare the epidemic's evolution (cf. fig. 4) while λ is calibrated such that all networks admit an r̃0 of 1.8, with λ set (w.r.t. the mean degree) according to eq. (6). thereby, we analyze how well certain network structures generally support the spread of covid-19. the evolution of the mean fraction of nodes in each compartment is illustrated in fig. 4 and fig. 5. based on the monte-carlo simulations, we analyzed how the number r_t of neighbors that an infectious node infects changes over time (cf. fig. 6); r_t is thus the effective reproduction number at day t (conditioned on the survival of the epidemic until t); a sketch of this estimate is given below. for t = 0, the estimated effective reproduction number always starts around the same value and matches the theoretical prediction: independent of the network, r̃0 = 1.8 yields r0 ≈ 2.05 (cf. appendix a). in fig. 6 we see that the evolution of r_t differs tremendously for different contact networks. unsurprisingly, r_t decreases on the complete graph (cg), as nodes that become infectious later will not infect more of their neighbors. this also happens for gn- and ws-networks, but they cause a much slower decline of r_t, which stays around 1 in most parts (the sharp decrease at the end stems from the end of the simulation being reached). this indicates that the epidemic slowly "burns" through the network. in contrast, in networks that admit super-spreaders (cm-pl, hh, and also ba), it is principally possible for r_t to increase. for the cm-pl network, we have a very early and intense peak of the infection, while the number of individuals ultimately affected by the virus (and consequently the death toll) remains comparably small (when we remove the super-spreaders from the network while keeping the same r̃0, the death toll and the time point of the peak increase; plot not shown).
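the r_t estimate used in this experiment can be sketched as a grouping operation over the simulation output; the two mappings (node to day of exposure, node to number of secondary infections caused while a carrier or infected) are assumed to be recorded during simulation, and the names are ours:

```python
from collections import defaultdict

def effective_reproduction_numbers(exposure_day, secondary_infections):
    """mean number of secondary infections, grouped by the day of exposure."""
    by_day = defaultdict(list)
    for node, day in exposure_day.items():
        by_day[day].append(secondary_infections.get(node, 0))
    return {day: sum(c) / len(c) for day, c in sorted(by_day.items())}
```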
note that the high value of r_t in fig. 6c in the first days results from the fact that super-spreaders become exposed, and these later infect a large number of individuals. as there are very few super-spreaders, they are unlikely to be part of the seeds. however, due to their high centrality, they are likely to be among the first exposed nodes, leading to an "explosion" of the epidemic. in hh-networks this effect is much more subtle, but follows the same principle. experiment 2: calibrating r̃0 to a fixed peak. next, we calibrate λ such that each network admits an infection peak (regarding i_total) of the same height (0.2). results are shown in fig. 7. they emphasize that there is no direct relationship between the number of individuals affected by the epidemic and the height of the infection peak, which is particularly relevant in light of limited icu capacities. they also show that vastly different infection rates and basic reproduction numbers are acceptable when aiming at keeping the peak below a certain threshold.
fig. 6. x-axis: day at which a node becomes exposed; y-axis: (mean) number of neighbors this node infects while being a carrier or infected. note that at later time points results are more noisy as the number of samples decreases. the first data point is the simulation-based estimation of r0 and is shown as a blue square.
experiment 3: the influence of the infectiousness λ. the relationship between λ and the infection peak is shown in fig. 8. noticeably, the relationship is concave for most network models but almost linear for the ode model. this indicates that the network models are more sensitive to small changes of λ (and r̃0). it also suggests that the use of ode models might lead to a misleading sense of confidence because, roughly speaking, they tend to yield similar results when some noise is added to λ. that makes them seemingly robust to uncertainty in the parameters, while in reality the process is much less robust. assuming that ba-networks resemble some important features of real social networks, the non-linear relationship between infection peak and infectiousness indicates that small changes of λ, which could be achieved through proper handwashing, wearing masks in public, and keeping distance to others, can significantly "flatten the curve".
note that superspreaders do not necessarily have to correspond to certain individuals. it can also, on a more abstract level, refer to a type of events. we also observed that cm-pl networks have a very early and very intense infection peak. however, the number of people ultimately affected (and therefore also the death toll) remain comparably small. this is somewhat surprising and requires further research. we speculate that the fragmentation in the network makes it difficult for the virus to "reach every corner" of the graph while it "burns out" relatively quickly for the high-degree nodes. we presented results for a covid-19 case study that is based on the translation of an ode model to a stochastic network-based setting. we compared several interaction structures using contact graphs where one was (a finite version of the) the implicit underlying structure of the ode model, the complete graph. we found that inhomogeneity in the interaction structure significantly shapes the epidemic's dynamic. this indicates that fitting deterministic ode models to real-world data might lead to qualitatively and quantitatively wrong results. the interaction structure should be included into computational models and should undergo the same rigorous scientific discussion as other model parameters. contact graphs have the advantage of encoding various types of interaction structures (spatial, social, etc.) and they decouple the infectiousness from the connectivity. we found that the choice of the network structure has a significant impact and it is very likely that this is also the case for the inhomogeneous interaction structure among humans. specifically, networks containing super-spreaders consistently lead to the emergence of an earlier and higher peak of the infection. moreover, the almost linear relationship between r 0 , λ ode , and the peak intensity in ode-models might also lead to misplaced confidence in the results. regarding the network structure in general, we find that super-spreaders can lead to a very early "explosion" of the epidemic. small-worldness, by itself, does not admit this property. generally, it seems that-unsurprisingly-a geometric network is best at containing a pandemic. this would imply evidence for corresponding mobility restrictions. surprisingly, we found a trade-off between the height of the infection peak and the fraction of individuals affected by the epidemic in total. for future work, it would be interesting to investigate the influence of non-markovian dynamics. ode-models naturally correspond to an exponentially distributed residence times in each compartment [48, 16] . moreover, it would be interesting to reconstruct more realistic contact networks. they would allow to investigate the effect of npis in the network-based paradigm and to have a well-founded scientific discussion about their efficiency. from a risk-assessment perspective, it would also be interesting to focus more explicitly on worst-case trajectories (taking the model's inherent stochasticity into account). this is especially relevant because the costs to society do not scale linearly with the characteristic values of an epidemic. for instance, when icu capacities are reached, a small additional number of severe cases might lead to dramatic consequences. all rights reserved. no reuse allowed without permission. was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. 
the reachability probability from s-c to e-c is related to r 0. it expresses the probability that an infected node (patient zero, in c) infects a specific random (susceptible) neighbor. the infection can happen via two paths. furthermore, we assume that this happens for all edges/neighbors of patient zero independently. assume a randomly chosen patient zero that is in compartment c. we are interested in r 0 in the model given in fig. 2, assuming γ > 0. again, we consider each neighbor independently and multiply with k mean. moreover, we have to consider the likelihood that patient zero infects a neighbor while being in compartment c and the possibility of transitioning to i and then transmitting the virus. this can be expressed as a reachability probability (cf. fig. 9) and gives rise to the equation

r 0 = k mean · ( λ/(λ + γ) + γ/(λ + γ) · λ/(λ + μ) ),     (7)

where γ denotes the rate of leaving c for i and μ the rate of leaving i. in the brackets, the first part of the sum expresses the probability that patient zero infects a random neighbor as long as it is in c. in the second part of the sum, the first factor expresses the probability that patient zero transitions to i before infecting a random neighbor. the second factor is then the probability of infecting a random neighbor as long as being in i. note that, as we consider a fixed random neighbor, we need to condition the second part of the sum on the fact that the neighbor was not already infected in the first step.

derivation of the effective reproduction number r for covid-19 in relation to mobility restrictions and confinement
analysis of a stochastic sir epidemic on a random network incorporating household structure
generation and analysis of large synthetic social contact networks
epidemiology and transmission of covid-19 in shenzhen china: analysis of 391 cases and 1,286 of their close contacts
controlling contact network topology to prevent measles outbreaks
mitigation and herd immunity strategy for covid-19 is likely to fail
modellierung von beispielszenarien der sars-cov-2-ausbreitung und schwere in deutschland
the effect of travel restrictions on the spread of the 2019 novel coronavirus
inferring covid-19 spreading rates and potential change points for case number forecasts
a first course in network theory
strategies for mitigating an influenza pandemic
community transmission of sars-cov-2 at two family gatherings-chicago, illinois
binary-state dynamics on complex networks: pair approximation and beyond
mathematical models of infectious disease transmission
rejection-based simulation of non-markovian agents on complex networks
epidemic spreading in urban areas using agent-based transportation models
exploring network structure, dynamics, and function using networkx
modeling targeted layered containment of an influenza pandemic in the united states
schätzung der aktuellen entwicklung der sars-cov-2-epidemie in deutschland-nowcasting
feasibility of controlling covid-19 outbreaks by isolation of cases and contacts
representations of human contact patterns and outbreak diversity in sir epidemics
insights into the transmission of respiratory infectious diseases through empirical human contact networks
coronavirus disease 2019: the harms of exaggerated information and non-evidence-based measures
estimate of the development of the epidemic reproduction number rt from coronavirus sars-cov-2 case data and implications for political measures based on prognostics
mathematics of epidemics on networks
projecting the transmission dynamics of sars-cov-2 through the post-pandemic period
contacts in context: large-scale setting-specific social mixing matrices from the bbc pandemic project
fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the sars-cov-2 epidemic
an infectious disease model on empirical networks of human contact: bridging the gap between dynamic network data and contact matrices
temporal network epidemiology
comparison of three methods for ascertainment of contact information relevant to respiratory pathogen transmission in encounter networks
a small community model for the transmission of infectious diseases: comparison of school closure as an intervention in individual-based models of an influenza pandemic
analysis and control of epidemics: a survey of spreading processes on complex networks
epidemic processes in complex networks
an agent-based approach for modeling dynamics of contagious disease spread
threshold conditions for arbitrary cascade models on arbitrary networks
optimal vaccine allocation to control epidemic outbreaks in arbitrary networks
the effect of control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan, china: a modelling study
investigation of three clusters of covid-19 in singapore: implications for surveillance and response measures
crowding and the epidemic intensity of covid-19 transmission. medrxiv
pattern of early human-to-human transmission of wuhan 2019 novel coronavirus (2019-ncov)
a high-resolution human contact network for infectious disease transmission
spreading processes in multilayer networks
interaction data from the copenhagen networks study
impact of temporal scales and recurrent mobility patterns on the unfolding of epidemics
probability, markov chains, queues, and simulation: the mathematical basis of performance modeling
non-markovian infection spread dramatically alters the susceptible-infected-susceptible epidemic threshold in networks
an introduction to infectious disease modelling
modelling the potential health impact of the covid-19 pandemic on a hypothetical european country

key: cord-350646-7soxjnnk authors: becker, sara; chaple, michael; freese, tom; hagle, holly; henry, maxine; koutsenok, igor; krom, laurie; martin, rosemarie; molfenter, todd; powell, kristen; roget, nancy; saunders, laura; velez, isa; yanez, ruth title: virtual reality for behavioral health workforce development in the era of covid-19 date: 2020-10-09 journal: j subst abuse treat doi: 10.1016/j.jsat.2020.108157 sha: doc_id: 350646 cord_uid: 7soxjnnk the coronavirus 2019 disease (covid-19) pandemic emerged at a time of substantial investment in the united states substance use service infrastructure. a key component of this fiscal investment was funding for training and technical assistance (ta) from the substance abuse and mental health services administration (samhsa) to newly configured technology transfer centers (ttcs), including the addiction ttcs (attc network), prevention ttcs (pttc network), and the mental health ttcs (mhttc network).
samhsa charges ttcs with building the capacity of the behavioral health workforce to provide evidence-based interventions via locally and culturally responsive training and ta. this commentary describes how, in the wake of the covid-19 pandemic, ttcs rapidly adapted to ensure that the behavioral health workforce had continuous access to remote training and technical assistance. ttcs use a conceptual framework that differentiates among three types of technical assistance: basic, targeted, and intensive. we define each of these types of ta and provide case examples to describe novel strategies that the ttcs used to shift an entire continuum of capacity building activities to remote platforms. examples of innovations include online listening sessions, virtual process walkthroughs, and remote "live" supervision. ongoing evaluation is needed to determine whether virtual ta delivery is as effective as face-to-face delivery or whether a mix of virtual and face-to-face delivery is optimal. the ttcs will need to carefully balance the benefits and challenges associated with rapid virtualization of ta services to design the ideal hybrid delivery model following the pandemic.

the coronavirus 2019 disease pandemic emerged at a time of substantial investment in the united states substance use service infrastructure. between 2017 and 2019, congress released $3.3 billion in grants to scale up substance use prevention, treatment, and recovery efforts in an attempt to curtail the overdose epidemic (goodnough, 2019). a key component of this fiscal investment was funding for training and technical assistance (ta) from the substance abuse and mental health services administration (samhsa) to newly configured technology transfer centers (ttcs), including the addiction ttcs (attc network), prevention ttcs (pttc network), and mental health ttcs (mhttc network). to ensure the modernization of the behavioral health service system, samhsa charges ttcs with building the capacity of the behavioral health workforce to provide evidence-based interventions via locally and culturally responsive training and ta (katz, 2018). in march 2020, the covid-19 pandemic upended the united states healthcare system and challenged the behavioral health workforce in unprecedented ways. to meet the needs of the workforce, ttcs had to rapidly innovate to provide training and ta without service disruption. ttcs apply different ta strategies based on circumstances, need, and appropriateness (powell, 2015) and consider training (i.e., conducting educational meetings) as a discrete activity that can be provided as part of any ta effort. ttcs are guided by extensive evidence that strategies beyond training are required for practice implementation and organizational change (edmunds et al., 2013), underscoring the critical need for virtual ta in the wake of the covid-19 pandemic. in may 2020, we surveyed all 39 u.s.-based ttcs to identify example innovations in each layer of the ta pyramid that the covid-19 pandemic necessitated. thirty-five ttcs (90%) across three networks (pttc n=13; attc n=13; mhttc n=9) responded, representing both regional and national ttcs. ttcs typically deliver basic ta to large audiences and focus on building awareness and knowledge. common basic ta activities for untargeted audiences include conferences, brief consultation, and web-based lectures (i.e., webinars).
ttcs reported a surge in requests for basic ta during the covid-19 pandemic and responded with a significant increase in dissemination of information (i.e., best practice guidelines), as well as brief consultations to support interpretation of such information. ttcs emphasized virtual content curation, organizing content to enhance usability. additionally, ttcs employed novel delivery channels, such as live streaming, pre-recorded videos, podcasts, and webinars with live transcription, to reach wide audiences. another practice innovation was online listening sessions in which health professionals convened around a priority topic. for instance, two national ttcs co-hosted a series of listening sessions titled "emerging issues around covid-19 and social determinants of health" that experimented with "flipping the typical script" by first having participants engage in conversation and then having expert presenters address emergent topics via brief didactics. this series, which was not sequential or interconnected, built knowledge and awareness around evolving workforce needs. targeted ta is the provision of directed training or support to specific groups (e.g., clinical supervisors) or organizations (e.g., prevention coalitions) focused on building skill and promoting behavior change. targeted ta encompasses activities customized for specific recipients such as didactic workshop trainings, learning communities, and communities-of-practice. due to the focus on provider skill-building, targeted ta often relies on experiential learning activities such as role plays and behavioral rehearsal (edmunds et al., 2013). to transition targeted ta online, ttcs reduced didactic material to the minimum necessary; spread content over several sessions; and leveraged technology to foster interaction among small groups. for example, one regional ttc transformed a face-to-face, multi-day motivational interviewing skills-building series by moving the delivery to a multi-week virtual learning series. this ttc kept participants engaged by limiting the time for each session to 1-2 hours, utilizing the full capabilities of videoconferencing platforms (e.g., small breakout rooms and interactive polling), and extending learning through sms text messages containing reminders of core skills.

covid-net: a weekly summary of u.s. hospitalization data
coronavirus disease 2019 (covid-19): cases in the
the dynamic sustainability framework: addressing the paradox of sustainment amid ongoing change
dissemination and implementation of evidence-based practices: training and consultation as implementation strategies
implementation: the missing link between research and practice
states are making progress on opioids. now the money that is helping them may dry up
drug overdose deaths drop in u.s. for the first time since 1990
the substance abuse and mental health services administration

key: cord-338588-rc1h4drd authors: li, xuanyi; sigworth, elizabeth a.; wu, adrianne h.; behrens, jess; etemad, shervin a.; nagpal, seema; go, ronald s.; wuichet, kristin; chen, eddy j.; rubinstein, samuel m.; venepalli, neeta k.; tillman, benjamin f.; cowan, andrew j.; schoen, martin w.; malty, andrew; greer, john p.; fernandes, hermina d.; seifter, ari; chen, qingxia; chowdhery, rozina a.; mohan, sanjay r.; dewdney, summer b.; osterman, travis; ambinder, edward p.; buchbinder, elizabeth i.; schwartz, candice; abraham, ivy; rioth, matthew j.; singh, naina; sharma, sanjai; gibson, michael k.; yang, peter c.; warner, jeremy l. title: seven decades of chemotherapy clinical trials: a pan-cancer social network analysis date: 2020-10-16 journal: sci rep doi: 10.1038/s41598-020-73466-6 sha: doc_id: 338588 cord_uid: rc1h4drd clinical trials establish the standard of cancer care, yet the evolution and characteristics of the social dynamics between the people conducting this work remain understudied. we performed a social network analysis of authors publishing chemotherapy-based prospective trials from 1946 to 2018 to understand how social influences, including the role of gender, have influenced the growth and development of this network, which has expanded exponentially from fewer than 50 authors in 1946 to 29,197 in 2018. while 99.4% of authors were directly or indirectly connected by 2018, our results indicate a tendency to predominantly connect with others in the same or similar fields, as well as an increasing disparity in author impact and number of connections. scale-free effects were evident, with small numbers of individuals having disproportionate impact. women were under-represented and likelier to have lower impact, shorter productive periods (p < 0.001 for both comparisons), less centrality, and a greater proportion of co-authors in their same subspecialty. the past 30 years were characterized by a trend towards increased authorship by women, with new author parity anticipated in 2032. the network of cancer clinical trialists is best characterized as strategic or mixed-motive, with cooperative and competitive elements influencing its appearance. network effects such as low centrality, which may limit access to high-profile individuals, likely contribute to the observed disparities.

the modern era of chemotherapy began in 1946, with publications describing therapeutic uses of nitrogen mustard 1,2. over the next 70 years, the repertoire of available cancer treatments has expanded at an ever-increasing pace. chemotherapeutics have a notably low therapeutic index, i.e., the difference between a harmful and beneficial dose or combination is often quite small 3. consequently, a complex international clinical trial apparatus emerged in the 1970s to study chemotherapeutics in controlled settings, and prospective clinical trials remain the gold standard by which standard of care treatments are established 4,5. discoveries made by successive generations have led to overall improvement in the prognosis of most cancers 6. while social network analysis has been used to study patterns of co-authorship in scientific settings 7,8, the social component of clinical trial research is not well characterized. little is known about how social factors and social network effects have shaped the progress of the field as cancer care has become increasingly subspecialized.

baseline characteristics.
n = 5599 of 6301 reviewed publications with an aggregate of n = 29,197 authors met the inclusion criteria (consort figure s1). cumulatively, most authors in the network (n = 22,761, 78%) published at least one randomized trial, with n = 15,340 (52.5%) participating in the publication of a "positive" trial (table s2). most of the included authors (n = 28,087, 96.2%) participated in the primary publication of a clinical trial, while a smaller subgroup (n = 6,773, 23.2%) participated in the publication of updates. the most common venues for publication were high-impact clinical journals: the journal of clinical oncology (n = 1595, 28.5%), the lancet family (n = 710, 12.7%), the new england journal of medicine (n = 495, 8.8%), and the blood family (n = 495, 8.8%). co-authorship has changed in a non-linear fashion over time: the median number of authors per publication increased from n = 6 in 1946 to n = 20 (iqr 16-25) in 2018 (figure s2). across subspecialties, the median number of co-authors per publication varied somewhat, from a low of n = 10 (iqr 7-15) in gynecologic oncology to a high of n = 16 (iqr 11-22) in dermatologic oncology. median longevity is < 1 year at all times, although the number of authors with multiple years in the field grows substantially over time (figure s3). a small number of individuals maintained the highest impact over time, nearly 20 years each in the case of chemotherapy pioneers sidney farber and james f. holland (figure s4). in any given year, most authors had a betweenness centrality of < 1% of the maximum; conversely, a very small number of authors had an exceptionally high score, with 1% of authors accounting for 100% of the total in recent years (figure s5). accordingly, an increasingly small proportion of authors were both very highly connected and highly impactful; in 1970, the 10% highest-impact authors (n = 20) accounted for 21.4% of links and 54.9% of impact; in 2018, the same proportion (n = 2920) accounted for 37.1% of links and 62.3% of impact. first/last authorship has also become concentrated; in 2018 publications, 10% of authors had at least one such role, whereas prior to 1980 it was on average > 25% (figure s6). the structure of the network changes considerably over time, from relatively dense and connected to sparse and modular (fig. 1b). the final network is very sparse (0.16% of possible links are present); nevertheless, n = 29,029 (99.4%) authors are in a single connected component; the next-largest component comprises 14 authors. each of the 13 cancer subspecialties developed at different rates, with clear influence of seminal events in several subspecialties, e.g., the introduction of adjuvant therapy and tamoxifen for breast cancer, completely new classes of drugs for plasma cell disorders, and immunotherapy for melanoma (fig. 1c) 25-31.

network visualization and cumulative metrics. the final cumulative network visualization is shown in figs. 2 & s7. the impact score of authors is unevenly distributed, median 0.0532 (range 0-14.31); however, the log-transformed impact scores approximate a normal distribution (figure s8).
authors with longevity ≥ 1 year who changed primary subspecialty at least once (n = 2330) had nearly twice the median impact and longevity of those who remained in one subspecialty (n = 10,276): 0.25 (iqr 0.11-0.6) versus 0.14 (iqr 0.07-0.35), and 13 years (iqr 6-19) versus 7 years (iqr 3-12), respectively (p < 0.001 for both comparisons). cumulatively, subspecialized authors with calculable homophily (n = 24,560) have a median proportion of co-authors sharing the same subspecialty of 88% (iqr 76-95%); 945,167 (71.4%) of these authors' outlinks are within-subspecialty. this is reflected by a high assortativity by subspecialty since the mid-1960s (fig. 1b).

(figure 1 caption, panels b and c: (b) modularity follows a sigmoid pattern with a period of linear increase between 1960-80 followed by a plateau at high modularity; assortativity rapidly increases in early decades; median normalized pagerank decreases to a low plateau from the 1970s onward; (c) subspecialties develop at different but broadly parallel rates, with seminal events apparently preceding accelerations of individual subspecialties, e.g.: (1) in the four years after 1973, combination therapy (ac 25), adjuvant therapy 26, and tamoxifen 27 were introduced in breast cancer; (2) thalidomide 28 and bortezomib 29 were reported to be efficacious for multiple myeloma; and (3) immunotherapy (ipilimumab 30,31) was introduced in the treatment of melanoma.)

sensitivity analysis. normalized score distributions did not change significantly, although modulation of the trial design coefficient led to a bimodal peak (figure s11). correlation of assortativity and modularity was high, ranging from 0.815-0.999 for the former and 0.981-0.999 for the latter (table s3; figure s12).

the remarkable gains in the fields of hematology and oncology can be ascribed to the tireless work of numerous trialists and the generosity of countless patient participants. as a result, systemic antineoplastics now stand beside surgery and radiotherapy as a pillar of cancer care. our analysis of clinical trialists as a social network, particularly with respect to the density distribution of pagerank, reveals a mixed-motive network that differs substantially from "collegial" and "friend-based" online social networks.

(figure 2 caption: only authors assigned to a subspecialty are visualized; these account for 84% of all authors in the database. this figure highlights various clustering trends by subspecialty, such as the apparent sub-clusters of sarcoma research (yellow) and the two dominant clusters of breast cancer research (pink). it is clear as well that certain subspecialties are more cohesive than others, such as the tightly clustered dermatology (black) compared to the spread-out head and neck cancer authors (red).)

while clinical trials are conducted towards a collaborative goal (improved outcomes for all cancer patients), there are significant competitive pressures. examples of these pressures include resource limitations (e.g., funding and patients available for accrual), the tension between prioritization of cooperative group versus industry-funded trials, personal motivations such as academic promotion or leadership opportunities, and institutional reputation. the emergence of formal and informal leaders in scientific networks has been shown to facilitate research as well as create clusters 32. as fig.
2 shows, there is a strong tendency for clustering based on subspecialty in the complete network, although some subspecialties (e.g., lymphoid and myeloid malignancies) have many more interconnections than others (e.g., sarcoma and neuro-oncology). many of these clusters appear to be organized around an individual or group of individuals who have high impact and centrality. as an organizational principle, these individuals appear to rarely be in direct competition, but their presence is a clear indicator of scale-free phenomena within the network. the fact that betweenness centrality follows a power-law cumulative distribution bolsters this theory. scale-free phenomena, which are defined by a power-law distribution of connectedness, are very common in strategic networks, especially when they become increasingly sparse, as this network does 33. the two related theories for this network behavior are preferential attachment and fitness. the former observes that those with impact tend to attract more impact; the latter postulates that such gains for the "fittest" come at the expense of the "less fit" 34. seminal events (fig. 1c) are likely a driver of preferential attachment 35, and may partially explain why some authors change their primary subspecialty at least once over time (e.g., through a "bandwagon" effect driven by the diffusion of ideas 36). given that these authors were observed to have nearly twice the impact and longevity of their single subspecialty peers, this dynamic will be a focus of future study, including calculation of the q factor, a metric developed to quantify the ability of a scientist to "take advantage of the available knowledge in a way that enhances (q > 1) or diminishes (q < 1) the potential impact p of a paper" 37.

(figure 3 caption: (a) the network is overwhelmingly dominated by men until 1980, when a trend towards increasing authorship by women begins to be seen; however, representation by women in first/last authorship remains low; gray shaded lines are 95% confidence intervals of the loess curves; (b) men tend on average to have a longer productive period and to achieve a higher author impact score than women (p < 0.001 for both comparisons); (c) men tend on average to be more central and have more collaborations outside of their subspecialty. note that the homophily calculation requires a subspecialty assignment, which explains the slightly lower numbers in (c) as compared to (b).)

in the analysis of network dynamics (fig. 1b), the field as a whole appears to emerge in the 1970s, which is also when medical oncology and hematology were formally recognized through board certification. measurements of field maturity are by their nature subjective, but the pessimism 38 of the late 1960s was captured by sidney farber: "…the anticancer chemicals, hormones, and the antibiotics…marked the beginning of the era of chemotherapy of cancer, which may be described after 20 years as disappointing because progress has not been more rapid…" 39. these concerns prompted the us national cancer act of 1971, which was followed by the leveling of modularity at a very high level from 1976 onwards, suggesting that the subspecialties generated in the 1970s have remained stable. the assortativity by subspecialty has increased as well, with recent levels approximately twice those seen in a co-authorship network of physicists 20.
while median pagerank has decreased markedly, indicating decreasing influence for the average author, the distribution in 2018 is broadly right-skewed (figure s13). these findings reveal a high level of increasing exclusivity, suggesting that it is becoming progressively more difficult to join the top echelon of the network. this has major implications for junior investigators' mobility, and potentially for the continued health of the network as a whole. while there is much to be applauded in the continued success of translating research findings into the clinic, we observed clear gender disparities within the cancer clinical trialist network: women have a statistically significantly lower final impact score, shorter productive period, less centrality, and less collaboration with those outside of their primary subspecialty. these findings are consistent with and build upon previous literature on the challenges facing women in pursuing and remaining in academic careers 10,16,19,40. they are also consistent with more recent gender disparity findings, such as those observed in research published on covid-19 41. other studies investigating the basis for such a gender gap have identified several layers of barriers to the advancement of women in academic medicine. these include sexism in the academic environment, lack of mentorship, and inequity with regards to resource allocation, salary, space, and positions of influence 42,43. our study suggests that additional network factors such as relatively low centrality, which indicates a lack of access to other individuals of influence, and high homophily, which indicates a lack of access to new ideas and perspectives, also perpetuate the gender gap, corroborating recent findings from graduate school social networks 44. it is somewhat encouraging that there has been a steady increase in the proportion of authorship by women since 1980 (fig. 3a). this increase is observed approximately a decade after the passage of title ix of the us civil rights act in 1972. given that the majority of authors in this network are clinicians, a partial explanation could be that us-based women began to attend previously all-male medical schools in the early 1970s, completed their training, and began to contribute to the network as authors approximately 10 years later. if the nearly linear trend continues, we predict that gender parity for new authors entering the network will be reached by the year 2032 (see the extrapolation sketch at the end of this passage), 26 years after us medical school enrollment approached parity 45. however, the proportion of first/last authors who are women is growing much more slowly, and parity may not be reached for 50+ years, if at all. given that senior authorship is a traditional metric of scholarly productivity, it may be particularly difficult for clinical trialists who are women to obtain promotion under the current paradigm. one possible solution is to increase the role of joint senior authorship, which remains vanishingly rare in the clinical trials domain (furman et al. 2014 46 is one of very few examples that we are aware of), although this is predicated on the acceptance of these roles by advancement and promotion committees. the field itself may also suffer from slow entry of new talent and a lack of broad perspectives. while the gender mapping algorithm and manual lookups are imperfect, our approach is consistent with prior work in this area 16,47. unisex names posed a particular challenge 48.
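the 2032 figure is a straight-line extrapolation of the proportion of new authors who are women; a minimal sketch of that kind of calculation is shown below with invented yearly proportions, so the printed year is illustrative only, not the study's estimate.

```python
import numpy as np

# hypothetical proportion of women among new authors per year (not the study data)
years = np.arange(1980, 2019)
prop = 0.10 + 0.0075 * (years - 1980)          # synthetic, roughly linear trend

slope, intercept = np.polyfit(years, prop, 1)  # least-squares line
parity_year = (0.5 - intercept) / slope        # where the fitted line crosses 50%
print(round(parity_year))
```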
it should be noted that we could not account for all situations where an author changed their name (e.g., a person assumed their spouse's surname); this could have led to overestimation of representation by women and underestimation of impact, since this practice is more common with women. it is also possible that an individual's gender identity does not match the gender assignment of their given name. future work will include further analysis of gender disparities, factoring in institutional affiliation and highest degree(s) obtained, which are both likely to have significant influence on publication and senior authorship 49,50. there are several additional limitations to this work, starting with the fact that co-authorship is but one way to measure social network interactions, and this study reports results from published trials, which induces publication bias. although hemonc.org aims to be the most comprehensive resource of its kind, non-randomized trials and randomized phase ii trials are intentionally underrepresented, given that findings at this stage of investigation infrequently translate to practice-changing results (e.g., approximately 70% of oncology drugs fail during phase ii) 51,52,53. the effect of any biases introduced by this underrepresentation is unclear, given the confounding influence of publication bias, which may itself be subject to gender disparity 54. some older literature which no longer has practice-changing implications may have been overlooked. during name disambiguation, some names could not be resolved, primarily because neither medline nor the primary journal site contained full names. this effect is non-random, since certain journals do not publish full names. the choice of coefficients and their relative weights was based on clinical intuition and consensus; given that the "worth" of metrics such as first/last authorship is fundamentally qualitative, there must be some degree of subjectivity when formulating a quantitative algorithm. while the sensitivity analysis demonstrated that neither normalized author impact score distribution, assortativity, nor modularity are majorly changed by variation in the trial design and author role coefficients, it remains possible that other combinations of coefficients and relative weightings could lead to different results. furthermore, our impact algorithm weights first and last authorship heavily, but the definition of senior authorship has changed over time. for example, in the 1946 article by goodman et al. 2, the authors were listed in decreasing order of seniority (personal communication). in general, the impact score used in this paper, although similar to others proposed in the academic literature, is not validated and should be interpreted with caution. finally, the majority of authors in this database publish extensively, and their impact as measured here should not be misconstrued to reflect their contributions to the cancer field more broadly. in conclusion, we have described the first and most comprehensive social network analysis of the clinical trialists involved in chemotherapy trials. we found emergent properties of a strategic network and clear indications of gender disparities, albeit with improvement in representation in recent decades. the network has been highly modular and assortative for the past 40 years, with little collaboration across most subspecialties.
as the field pivots from an anatomy-based to a precision oncology paradigm, it remains to be seen how the network will re-organize so that the incredible progress seen to date can continue.

publications from 1946-2018 that are referenced on hemonc.org were considered for inclusion. hemonc.org is the largest collaborative wiki of chemotherapy drugs and regimens and has a formal curation process 55. in order for a reference to be included on hemonc.org, it generally must include at least one regimen meeting the criteria outlined here: https://hemonc.org/wiki/eligibility_criteria. as such, the majority of references on hemonc.org are randomized controlled trials (rcts) or non-randomized trials with at least 20 participants and/or practice-changing implications. one of the main goals of hemonc.org is creating a database of all standard of care systemic antineoplastic therapy regimens. this is difficult as there is no universally accepted definition of standard of care except in a legal capacity. for example, the state of washington, in its legislation on medical negligence, inversely defines the standard of care as "exercis[ing] that degree of skill, care, and learning possessed at that time by other persons in the same profession". we currently employ four separate definitions that meet the threshold of standard of care:

1. the control arm of a phase iii randomized controlled trial (rct). by implication, this means that all phase iii rcts with a control arm must eventually be included on the website.
2. the experimental arm(s) of a phase iii rct that provide(s) reasonable evidence (p-value less than 0.10) of superior efficacy for an intermediate surrogate endpoint (e.g., pfs) or a strong endpoint (e.g., os).
3. a non-randomized study that is either:
4. any study (including case series and retrospective studies) that is specifically recommended by a member of the hemonc.org editorial board. all section editors of the editorial board with direct oversight of disease-specific pages are board-eligible or board-certified physicians.

in order to identify new regimens and study references for inclusion on hemonc.org, we undertake several parallel screening methods: as part of the process of building hemonc.org, we have also systematically reviewed all lancet, jama, and new england journal of medicine tables of contents from 1946 to december 31, 2018. in addition, the citations of any included manuscript are hand-searched for additional citations. for any treatment regimen that has been subject to randomized comparison, we additionally seek to identify the first instance in which such a regimen was evaluated as an experimental arm; if no such determination can be made, we seek the earliest non-randomized description of the regimen for inclusion on the website. in order of prioritization, phase iii rcts are added first, then smaller rcts such as randomized phase ii, followed by non-randomized trials, followed by retrospective studies or case series identified by our editorial board as relevant to the practice of hematology/oncology. when a reference is added to hemonc.org, bibliographic information including authorship is recorded. this usually coincides with medline record details, although some older references in medline are capped at ten authors and are manually completed based upon the publication of record.
for trials that do not list individual authors (e.g., the elderly lung cancer vinorelbine italian study group 56), the original manuscript and appendices are examined for a writing committee. if a writing committee is identified, the members of this committee are listed as authors in the order that they appeared in the manuscript. if no writing committee is identified, the chairperson(s) of the study group are listed as the first & last authors. if no chairpersons are listed, the corresponding author is listed as the sole author. publications solely consisting of the evaluation of drugs not yet approved by the fda or other international approval bodies were not included. trials that appeared in abstract form only, reviews, retrospective studies, meta-analyses, and case reports were excluded, as were trials reporting only on local interventions such as surgery, radiation therapy, and intralesional therapy. non-antineoplastic trials (table s1) and trials of supportive interventions (e.g., antiemesis; growth factor support) were also excluded.

disambiguation of author names. for each included publication, author names were extracted and disambiguated. author names on hemonc.org are stored in the medline lastname_firstinitial (middleinitial) format, which can lead to two forms of ambiguity: (1) the short form, e.g., smith_j, can refer to two or more individuals, e.g., julian and jane smith; (2) two short forms can refer to the same individual, e.g., kantarjian_h and kantarjian_hm. additionally, names can be misspelled and individuals can change their name over time (e.g., a person assumes their spouse's surname). we undertook several steps to disambiguate names: (1) full first and middle names, when available, were programmatically accessed through the ncbi pubmed eutils 57 application programming interface; (2) when not available through medline, full first names were searched for on journal websites or through web search engines; (3) automatic rules were developed to merge likely duplicates; and (4) some names were manually merged (e.g., misspellings: benboubker_lofti and benboubker_lotfi; alternate forms: rigal-huguet_francoise and huguet-rigal_francoise; and subsumptions: baldotto_clarissa and serodio da rocha baldotto_clarissa). transformation algorithms are available upon request, and the full mapping table is provided in supplemental file 1.

gender mapping. once the name disambiguation step was complete, we mapped authors with full name available to gender. we first mapped names to genders using us census data, which include the relative frequencies of given names by gender among us births from 1880 to 2017. we calculated the gender ratio for names that appeared as both genders. for names with gender ratio > 0.9 for one gender (e.g., john, rebecca), we assigned the name to that gender. to expand gender mapping to include names that are more frequently seen internationally (e.g., jean, andreas), we used a program that searches from a dictionary containing gender information about names from most european countries as well as some asian and middle eastern countries 58. for unmatched first names (e.g., dana, michele), we manually reviewed for potential gender assignment. for some names that are masculine in certain countries and feminine in others (e.g., andrea, daniele, and pascale are masculine in italy and feminine elsewhere), we mapped based on surnames.
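the census-based assignment step reduces to a frequency-ratio rule; a minimal sketch is given below, assuming a name-count table of the kind the census provides (the counts shown are invented for illustration, not the study's data).

```python
# hypothetical per-name counts of the form {name: (female_births, male_births)}
counts = {
    "john": (2500, 5100000),
    "rebecca": (870000, 1200),
    "dana": (310000, 95000),
}

def assign_gender(name, threshold=0.9):
    """assign a gender when one gender accounts for > threshold of births; else leave unassigned."""
    f, m = counts[name]
    ratio = f / (f + m)
    if ratio > threshold:
        return "woman"
    if ratio < 1 - threshold:
        return "man"
    return None  # ambiguous (e.g., unisex names) -> manual review

for n in counts:
    print(n, assign_gender(n))
```

with these invented counts, "john" and "rebecca" are assigned while "dana" falls in the ambiguous band and would go to manual review, mirroring the workflow described above.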
finally, we performed manual internet searches to look for photographs and pronouns used in online content such as faculty profiles, book biographies, and professional social media accounts for the remaining unmapped full names associated with a longevity of greater than one year. a total of 25,698 (88%) authors were assigned to the categories of woman (n = 8511; 29.2%) or man (n = 17,187; 58.9%). the gender of most of the people with unassigned names could not be determined because they only appeared with initials (n = 2716; 9.3%) in the primary publication and medline. the remaining n = 685 (2.3%) were ambiguously gendered names that could not be resolved through manual searching, and were excluded in the gender-specific analyses. the full mapping table is provided in supplemental file 2.

author impact score. we considered existing metrics for measuring author impact 59-62, but ultimately proceeded with our own formulation given some of the unique considerations of prospective clinical trials and their impact. every author was assigned an impact score, using an algorithm calculated per manuscript using four coefficients: (1) author role; (2) trial type; (3) citation score; (4) primary versus updated analysis. the coefficients are multiplied to arrive at the score, and the total author impact score is summed across all of their published manuscripts.

author role: first and last author roles are assigned a coefficient of three; middle authors are assigned a coefficient of one. when joint authorship is denoted in a medline record, there is an additional attribute "equalcontrib" that is set to "y" (yes). we look for this during the parsing process and treat these authors as first or last authors when the attribute is detected.

trial type: any prospective trial with randomization is denoted as randomized, and the authors of any manuscript reporting on such a trial are assigned a coefficient of two. non-randomized trials are assigned a coefficient of one. for manuscripts that reported on more than one trial with mixed designs (i.e., one or more randomized and one or more non-randomized trials), the randomized coefficient was used.

citation score: we programmatically obtained a snapshot of citation counts from google scholar from september 2019 and used unadjusted total citations as the citation score coefficient for the years 1946-2008. as more recent publications are still accruing citations, raw citation count is not an appropriate measure of their impact. therefore, we have calculated a blended citation score for articles published between 2009-2018, adding the phased-in median citation count for the journal tier in which the article was published for the years 1946-2008 (see tables s4 & s5 and figure s14). the citation scores are normalized to the manuscript with the maximum number of citations (stupp et al. 2005 63, with 13,341 citations), such that the maximum citation score is one.

primary publications vs. updates: the baseline coefficient is one. for updates, this score is multiplied by a half-life decay coefficient; i.e., scores for the first update are multiplied by 50%, scores for the second update by 25%, and so forth. this rule is applied equally to updates and subgroup analyses. for manuscripts that reported on pooled updates of more than one trial, the score was multiplied by the half-life coefficient corresponding to the update that resulted in the maximum score. see examples in supplemental methods.
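putting the four coefficients together, the per-manuscript score is a product and the author total is a sum over manuscripts; the sketch below follows that description directly (the field names and the example records are our own simplification, not the study's data schema).

```python
def manuscript_score(role, randomized, citation_score, update_index):
    """impact contribution of one manuscript to one author.

    role: 'first', 'last', or 'middle' (joint first/last treated as first/last)
    randomized: True for any trial with randomization
    citation_score: citations normalized to the most-cited manuscript (0..1)
    update_index: 0 for the primary publication, 1 for the first update, ...
    """
    role_coef = 3 if role in ("first", "last") else 1
    trial_coef = 2 if randomized else 1
    decay = 0.5 ** update_index          # half-life decay for updates
    return role_coef * trial_coef * citation_score * decay

def author_impact(manuscripts):
    """total impact: sum of per-manuscript scores across an author's publications."""
    return sum(manuscript_score(**m) for m in manuscripts)

example = [
    {"role": "last", "randomized": True, "citation_score": 0.02, "update_index": 0},
    {"role": "middle", "randomized": False, "citation_score": 0.01, "update_index": 1},
]
print(author_impact(example))  # 3*2*0.02*1 + 1*1*0.01*0.5 = 0.125
```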
subspecialty designation of each publication. each publication was assigned to one of 13 disease-specific cancer subspecialties based on the cancer(s) studied (table s1). the majority of publications report on a clinical trial carried out in one disease or several diseases mapping to the same subspecialty. for publications studying diseases that map to more than one subspecialty, each author's impact score for that publication was divided evenly across the subspecialties. several clinical trials employ a site-agnostic approach, e.g., to a "cancer of unknown primary" or to biomarker-defined subsets of cancers (e.g., a basket trial 64); for these, impact across subspecialties was split manually (table s6).

subspecialty designation based on authorship. authors were eligible for assignment to a primary subspecialty based on whether they were a first or last author at least once in the subspecialty, or whether they had a cumulative impact of at least one standard deviation below the mean of the author impact score of all authors in the subspecialty. authors who met either of these criteria were assigned to a primary subspecialty based on where the majority of their impact lay; if an author had equal impact in two or more subspecialties they were assigned equally to the subspecialties. this assignment was recalculated on an annual basis if the author had new publications, and primary subspecialty was re-assigned if a new subspecialty met either of the criteria and the impact in that subspecialty was higher than in the previous primary subspecialty. authors not meeting either of these criteria were assigned a primary subspecialty of "none" and were not included in the homophily analysis or the network visualization.

social network construction and metrics. a dynamic social network was created with nodes representing authors and links representing co-authorship. the dynamic social network was discretized by year and the authors, scores, and links were cumulative (e.g., the 20th network was cumulative from 1946-1965). therefore, once an author is added to the network, they remain in the network, with their impact score cumulatively increasing as they publish and remaining constant if publication activity ceases.
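the year-by-year cumulative construction amounts to replaying publications in chronological order and never removing anything; a minimal sketch is shown below, with a made-up publication list standing in for the curated database.

```python
import networkx as nx

# hypothetical (year, authors) records; real input would come from the curated database
publications = [
    (1946, ["goodman_l", "gilman_a"]),
    (1947, ["farber_s", "gilman_a"]),
]

snapshots = {}
g = nx.Graph()
for year, authors in sorted(publications):
    for i in range(len(authors)):
        for j in range(i + 1, len(authors)):
            g.add_edge(authors[i], authors[j])   # authors and links only accumulate
    snapshots[year] = g.copy()                   # frozen cumulative network per year

print(snapshots[1947].number_of_edges())  # 2: goodman-gilman plus farber-gilman
```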
the following temporal metrics were calculated: (1) network density (the number of actual connections/links present divided by the total number of potential connections); (2) modularity 65 by subspecialty (a measure of how strongly a network is divided into distinct communities, in this case subspecialties, defined as the number of edges that fall within a set of specified communities minus the number expected in a network with the same number of vertices and edges whose edges were placed randomly); (3) assortativity 66 by subspecialty (a measure of the preference of nodes in a network to attach to others that are similar in a defined way, in this case the same subspecialty; assortativity is positive if similar vertices tend to connect to each other, and negative if they tend not to connect to each other); (4) betweenness centrality 67 (a measure reflecting how important an author is in connecting other authors, calculated as the proportion of times that an author is a member of the bridge that forms the shortest path between any two other authors); (5) pagerank 68 (another measure of centrality, this time considering the connection patterns among each author's immediate neighbors; its value for each author is the probability that a person starting at any random author and randomly selecting links to other authors will arrive at the author); and (6) the proportion of co-authors sharing either the same primary subspecialty designation or the same gender (hereafter referred to as homophily). network density, modularity, and assortativity are calculated at the network level, while betweenness centrality, pagerank, and homophily are calculated at the author (node) level. further definitions of these metrics are provided in the supplemental glossary. all metrics incorporated the weighted co-authorship score, which takes into account each co-author's impact modified by the number of authors of an individual publication. for each pairwise collaboration, as defined by co-authorship on the same manuscript, a co-authorship score was calculated and used as the edge weight; duplicated edges were allowed to reflect the fact that weights could be distributed in a non-even fashion (e.g., two co-authors could be middle authors on a lower-impact publication as well as senior authors on a separate high-impact publication). this score was first calculated by multiplying the individual authors' manuscript-specific impact scores together. in order to acknowledge the role of middle authors in large multi-institutional studies, this preliminary score was divided by the total number of authors on the manuscript. this has the effect of decreasing the weight of any individual co-authorship relationship in a paper with many authors, while allowing the overall weight of the neighborhood consisting of all co-authorship connections to increase linearly with the number of authors (see examples in supplemental methods). in order to visualize the final cumulative network, layout was determined using the distributed recursive graph algorithm 69. nodes were sized by author impact score rank and colored by primary subspecialty designation. edge width was determined by the weighted co-authorship score.
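the network-level and node-level metrics listed above are all available in standard graph libraries; the sketch below shows one way to compute them with networkx on a toy weighted co-authorship graph. the edge-weight formula follows the description above (product of manuscript-specific impacts divided by the author count), while the toy graph, attribute names, and the decision to sum duplicated edges are our own simplifications.

```python
import networkx as nx
from networkx.algorithms.community import modularity

g = nx.Graph()
# toy manuscripts: ({author: manuscript-specific impact}, number of authors)
papers = [({"a": 0.6, "b": 0.2, "c": 0.1}, 3), ({"a": 0.5, "d": 0.4}, 2)]
for impacts, n_authors in papers:
    names = list(impacts)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            u, v = names[i], names[j]
            w = impacts[u] * impacts[v] / n_authors
            prev = g.get_edge_data(u, v, {"weight": 0.0})["weight"]
            g.add_edge(u, v, weight=prev + w)    # duplicated edges summed here for simplicity

subspecialty = {"a": "breast", "b": "breast", "c": "sarcoma", "d": "breast"}
nx.set_node_attributes(g, subspecialty, "sub")

print(nx.density(g))
print(nx.attribute_assortativity_coefficient(g, "sub"))
communities = [{n for n in g if subspecialty[n] == s} for s in set(subspecialty.values())]
print(modularity(g, communities, weight="weight"))
# note: betweenness treats edge weight as a distance; analyses often pass inverse weights instead
print(nx.betweenness_centrality(g, weight="weight"))
print(nx.pagerank(g, weight="weight"))
```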
statistical analysis. non-independent network metrics including growth, density, assortativity, modularity, and pagerank are reported descriptively with medians and interquartile ranges (iqr). gender proportion over time was fit with locally estimated scatterplot smoothing (loess) regression using default settings of degree = 2 with smoothing parameter/span α = 0.75 70. for the final cumulative network, the independent variables author impact score and longevity were compared (1) between genders and (2) by whether the author changed subspecialties over time; only those authors with longevity ≥ 1 year were included in the second comparison. these comparisons were made with the two-sided wilcoxon rank sum test; a p value < 0.05 was considered statistically significant.

sensitivity analysis. to determine whether the scoring algorithm was robust to modifications, we conducted a sensitivity analysis where the author role and trial design coefficients were varied by ± 67% and ± 50%, respectively. normalized density distributions for the final cumulative network under each permutation were calculated, and temporal assortativity and modularity were compared to baseline with pearson's correlation coefficient.
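for reference, the two inferential pieces here (the rank-sum comparison and the pearson correlation used in the sensitivity analysis) are one-liners in scipy; the sketch below uses simulated log-normal impact scores, so the printed numbers are illustrative only, not the study's results.

```python
import numpy as np
from scipy.stats import ranksums, pearsonr

rng = np.random.default_rng(0)
# simulated log-normal impact scores for two groups (stand-ins for the study data)
group_a = rng.lognormal(mean=-2.2, sigma=1.0, size=500)
group_b = rng.lognormal(mean=-1.9, sigma=1.0, size=1000)
stat, p = ranksums(group_a, group_b)   # two-sided wilcoxon rank-sum test
print(p)

# sensitivity check: correlate a baseline yearly metric with a perturbed rerun
baseline = rng.random(50)
perturbed = baseline + rng.normal(0.0, 0.01, 50)
r, _ = pearsonr(baseline, perturbed)
print(r)
```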
a version of this manuscript is posted on the medrxiv preprint server, accessible at https://www.medrxiv.org/content/10.1101/19010603v1. a very early version of the work was presented in poster format at the 2018 visual analytics in healthcare workshop (november 2018). there are no other prior presentations. the datasets generated and analyzed in this study are available at harvard dataverse 71. received: 3 january 2020; accepted: 17 september 2020.

the biological actions and therapeutic applications of the b-chloroethyl amines and sulfides
nitrogen mustard therapy; use of methyl-bis (beta-chloroethyl) amine hydrochloride and tris (beta-chloroethyl) amine hydrochloride for hodgkin's disease, lymphosarcoma, leukemia and certain allied and miscellaneous disorders
general principles of cancer chemotherapy
historical and methodological developments in clinical trials at the national cancer institute
a history of cancer chemotherapy
cancer statistics
associating co-authorship patterns with publications in high-impact journals
breast cancer publication network: profile of co-authorship and co-organization
nepotism and sexism in peer-review
inequality quantified: mind the gender gap
expectations of brilliance underlie gender distributions across academic disciplines
gender contributes to personal research funding success in the netherlands
comparison of national institutes of health grant amounts to first-time male and female principal investigators
women and academic medicine: a review of the evidence on female representation
the 'gender gap' in authorship of academic medical literature-a 35-year perspective
bibliometrics: global gender disparities in science
gender disparities in high-quality research revealed by nature index journals
the gender gap in highest quality medical research-a scientometric analysis of the representation of female authors in highest impact medical journals
historical comparison of gender inequality in scientific careers across countries and disciplines
the structure of scientific collaboration networks
strategic networks
access to expertise as a form of social capital: an examination of race-and class-based disparities in network ties to experts
broadening the science of broadening participation in stem through critical mixed methodologies and intersectionality frameworks
the perils of intersectionality: racial and sexual harassment in medicine
combination chemotherapy with adriamycin and cyclophosphamide for advanced breast cancer
1-phenylalanine mustard (l-pam) in the management of primary breast cancer. a report of early findings
tamoxifen (antiestrogen) therapy in advanced breast cancer
antitumor activity of thalidomide in refractory multiple myeloma
phase i trial of the proteasome inhibitor ps-341 in patients with refractory hematologic malignancies
phase i/ii study of ipilimumab for patients with metastatic melanoma
improved survival with ipilimumab in patients with metastatic melanoma
leadership in complex networks: the importance of network position and strategic action in a translational cancer research network
a unified framework for the pareto law and matthew effect using scale-free networks
experience versus talent shapes the structure of the web
topology of evolving networks: local events and universality
threshold models of collective behavior
quantifying the evolution of individual scientific impact
cancer chemotherapy-present status and prospects
chemotherapy in the treatment of leukemia and wilms' tumor
women in academic medicine leadership: has anything changed in 25 years?
covid-19 amplifies gender disparities in research
why aren't there more women leaders in academic medicine? the views of clinical department chairs
the "gender gap" in authorship of academic medical literature-a 35-year perspective
a network's gender composition and communication pattern predict women's leadership success
distribution of medical school graduates by gender
idelalisib and rituximab in relapsed chronic lymphocytic leukemia
gender bias in scholarly peer review
name-centric gender inference using data analytics
research productivity in academia: a comparative study of the sciences, social sciences and humanities
the gender gap in peer-reviewed publications by physical therapy faculty members: a productivity puzzle
comparison of evidence of treatment effects in randomized and nonrandomized studies
can the pharmaceutical industry reduce attrition rates?
contradicted and initially stronger effects in highly cited clinical research
double-blind peer review and gender publication bias
hemonc.org: a collaborative online knowledge platform for oncology professionals
effects of vinorelbine on quality of life and survival of elderly patients with advanced non-small-cell lung cancer
trying an authorship index
measuring co-authorship and networking-adjusted scientific impact
how has healthcare research performance been assessed? a systematic review
a new index to use in conjunction with the h-index to account for an author's relative contribution to publications with high impact
radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma
new clinical trial designs in the era of precision medicine: an overview of definitions, strengths, weaknesses, and current use in oncology
modularity and community structure in networks
assortative mixing in networks
a set of measures of centrality based on betweenness
the anatomy of a large-scale hypertextual web search engine
drl: distributed recursive (graph) layout
locally weighted regression: an approach to regression analysis by local fitting
replication data for: seven decades of chemotherapy clinical trials: a pan-cancer social network analysis
go and eddy j. chen were members of the editorial board of hemonc.org. all positions at hemonc.org are voluntary and uncompensated, and the stock of hemonc.org llc has no monetary value. none of the funders had any direct role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication. supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-73466-6.
key: cord-333088-ygdau2px authors: roy, manojit; pascual, mercedes title: on representing network heterogeneities in the incidence rate of simple epidemic models date: 2006-03-31 journal: ecological complexity doi: 10.1016/j.ecocom.2005.09.001 sha: doc_id: 333088 cord_uid: ygdau2px abstract mean-field ecological models ignore space and other forms of contact structure. at the opposite extreme, high-dimensional models that are both individual-based and stochastic incorporate the distributed nature of ecological interactions. in between, moment approximations have been proposed that represent the effect of correlations on the dynamics of mean quantities. as an alternative closer to the typical temporal models used in ecology, we present here results on "modified mean-field equations" for infectious disease dynamics, in which only mean quantities are followed and the effect of heterogeneous mixing is incorporated implicitly. we specifically investigate the previously proposed empirical parameterization of heterogeneous mixing in which the bilinear incidence rate si is replaced by a nonlinear term k s^p i^q, for the case of stochastic sirs dynamics on different contact networks, from a regular lattice to a random structure via small-world configurations. we show that, for two distinct dynamical cases involving a stable equilibrium and a noisy endemic steady state, the modified mean-field model approximates successfully the steady state dynamics as well as the respective short and long transients of decaying cycles. this result demonstrates that early on in the transients an approximate power-law relationship is established between global (mean) quantities and the covariance structure in the network. the approach fails in the more complex case of persistent cycles observed within the narrow range of small-world configurations.
most population models of disease (anderson and may, 1992) assume complete homogeneous mixing, in which an individual can interact with all others in the population. in these well-mixed models, the disease incidence rate is typically represented by the term bsi that is bilinear in s and i, the number of susceptible and infective individuals (bailey, 1975), b being the transmission coefficient. with these models it has been possible to establish many important epidemiological results, including the existence of a population threshold for the spread of disease and the vaccination levels required for eradication (kermack and mckendrick, 1927; anderson and may, 1992; smith et al., 2005). however, individuals are discrete and not well-mixed; they usually interact with only a small subset of the population at any given time, thereby imposing a distinctive contact structure that cannot be represented in mean-field models. explicit interactions within discrete spatial and social neighborhoods have been incorporated into a variety of individual-based models on a spatial grid and on networks (bolker and grenfell, 1995; johansen, 1996; …, 2005). simplifications of these high-dimensional models have been developed to better understand their dynamics, make them more amenable to mathematical analysis and reduce computational complexity (keeling, 1999; eames and keeling, 2002; franc, 2004). these approximations are based on moment closure methods and add corrections to the mean-field model due to the influence of covariances, as well as equations for the dynamics of these second-order moments (pacala and levin, 1997; bolker, 1999; brown and bolker, 2004).
we address here an alternative simplification approach closer to the original mean-field formulation, which retains the basic structure of the mean-field equations but incorporates the effects of heterogeneous mixing implicitly via modified functional forms (mccallum et al., 2001). specifically, the bilinear transmission term (si) in the well-mixed equations is replaced by a nonlinear term s^p i^q (severo, 1969), where the exponents p, q are known as "heterogeneity parameters". this formulation allows an implicit representation of distributed interactions when the details of individual-level processes are unavailable (as is often the case, see gibson, 1997), and when field data are collected in the form of a time series (e.g., koelle and pascual, 2004). we henceforth refer to these modified equations as the heterogeneous mixing, or "hm", model following maule and filipe (in preparation). the hm model is known to exhibit important properties not observed in standard mean-field models, such as the presence of multiple equilibria and periodic solutions (liu et al., 1986, 1987; hethcote and van den driessche, 1991; hochberg, 1991). this model has also been successfully fitted to the experimental time series data of lettuce fungal disease to explain its persistence (gubbins and gilligan, 1997). however, it is not well known whether these modified mean-field equations can indeed approximate the population dynamics that emerge from individual-level interactions. motivated by infectious diseases of plants, maule and filipe (in preparation) have recently compared the dynamics of the hm model to a stochastic susceptible-infective (si) model on a spatial lattice. in this paper, we implement a stochastic version of the susceptible-infective-recovered-susceptible (sirs) dynamics, to consider a broader range of dynamical behaviors including endemic equilibria and cycles (bailey, 1975; murray, 1993; johansen, 1996). recovery from disease leading to the development of temporary immunity is also relevant to many infectious diseases in humans, such as cholera (koelle and pascual, 2004). for the contact structure of individuals in the population we use a small-world algorithm, which is capable of generating an array of configurations ranging from a regular grid to a random network (watts and strogatz, 1998). theory on the structural properties of these networks is well developed (watts, 2003), and these properties are known to exist in many real interaction networks (dorogovtsev and mendes, 2003). a small-world framework has also been used recently to model epidemic transmission processes of severe acute respiratory syndrome or sars (masuda et al., 2004; verdasca et al., 2005). we demonstrate that the hm model can accurately approximate the endemic steady states of the stochastic sirs system, including its short and long transients of damped cycles under two different parameter regimes, for all configurations between the regular and random networks. we show that this result implies the establishment early on in the transients of a double power-law scaling relationship between the covariance structure on the network and global (mean) quantities at the population level (the total numbers of susceptible and infective individuals).
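to make the hm formulation concrete, the following is a minimal sketch of a modified mean-field sirs system with the nonlinear incidence term; the (k, p, q) triple echoes the random-network estimates quoted later in the text, while the b, g, d values are hypothetical choices for illustration only.

```python
from scipy.integrate import solve_ivp

def hm_sirs(t, y, b, k, p, q, g, d):
    # modified mean-field ("hm") sirs equations: the bilinear incidence
    # b*s*i is replaced by the nonlinear form b*k * s**p * i**q
    s, i, r = y
    inc = b * k * s**p * i**q
    return [d * r - inc, inc - g * i, g * i - d * r]

n = 160_000                       # population size used later in the paper
y0 = [0.995 * n, 0.005 * n, 0.0]  # 0.5% initially infective
sol = solve_ivp(hm_sirs, (0.0, 1000.0), y0,
                args=(1.0, 0.0001, 0.94, 0.97, 0.1, 0.01),  # b,k,p,q,g,d
                max_step=1.0)
```

setting k = n_0/n and p = q = 1 in this sketch recovers the bilinear mean-field limit discussed in section 2.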
we also demonstrate the existence of a complex dynamical behavior in the stochastic system within the narrow small-world region, consisting of persistent cycles with enhanced amplitude and a well-defined period that are not predicted by the equivalent homogeneous mean-field model. in this case, the hm model captures the mean infection level and the overall pattern of the decaying transient cycles, but not their phases. the model also fails to reproduce the persistence of the cycles. we conclude by discussing the potential significance and limitations of these observations. 2. the model. 2.1. stochastic formulation. the population structure, that is, the social contact pattern among individuals in the population, is modeled using a small-world framework as follows. we start with a spatial grid with the interaction neighborhood restricted to eight neighbors (fig. 1a) and periodic boundary conditions, and randomly rewire a fraction f of the local connections (avoiding self and multiple connections) such that the average number of connections per individual is preserved at n_0 (= 8 in this case). we call f the "short-cut" parameter of the network. this is a two-dimensional extension of the algorithm described in watts and strogatz (1998). as pointed out by newman and watts (1999), a problem with these algorithms is the small but finite probability of the existence of isolated sub-networks. we consider only those configurations that are completely connected. for f = 0 we have a regular grid (fig. 1a), whereas f = 1 gives a random network (fig. 1c). in between these extremes, there is a range of f values near 0.01 within which the network exhibits small-world properties (fig. 1b). in this region, most local connections remain intact, making the network highly "clustered" like the regular grid, with occasional short-cuts that lower the average distance between nodes drastically as in the random network. these properties are illustrated with two quantities, the "clustering coefficient" c and the "average path length" l (watts, 2003). c denotes the probability that two neighbors of a node are themselves neighbors, and l denotes the average shortest distance between two nodes in the network. the small-world network exhibits the characteristic property of having a high value of c and simultaneously a low value of l (fig. 1d). once the network structure is generated using the algorithm described above, the stochastic sirs dynamics are implemented with the following rules: a susceptible individual gets infected at a rate n_i b, where n_i is the number of infective neighbors and b is the rate of disease transmission across a connected susceptible-infective pair. an infective individual loses infection at a rate g and recovers temporarily. a recovered individual loses its temporary immunity at a rate d and becomes susceptible again. stochasticity arises because each rate specifies a poisson process, with exponentially distributed time intervals between successive occurrences of the event and a mean interval of (rate)^(−1). total population size is assumed constant (demography and disease-induced mortality are not considered), and infection propagates from an infective to a susceptible individual only if the two are connected. correlations develop as the result of the local transmission rules and the underlying network structure.
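the construction and update rules just described can be sketched as follows; this sketch assumes networkx, does not re-check for isolated sub-networks, and replaces the continuous-time event-driven updating with a small fixed-step approximation.

```python
import math
import random
import networkx as nx

def small_world_grid(side, f, seed=0):
    """side x side moore lattice (8 neighbours, periodic boundaries) with a
    fraction f of edges randomly rewired, preserving mean degree n_0 = 8.
    unlike the paper, connectivity of the result is not re-checked here."""
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(side) for j in range(side)]
    g = nx.Graph()
    g.add_nodes_from(nodes)
    for i, j in nodes:  # half of the moore offsets, so each edge is added once
        for di, dj in [(-1, -1), (-1, 0), (-1, 1), (0, 1)]:
            g.add_edge((i, j), ((i + di) % side, (j + dj) % side))
    for u, v in list(g.edges()):
        if rng.random() < f:  # rewire the far end of the edge to a random node
            w = rng.choice(nodes)
            if w != u and not g.has_edge(u, w):
                g.remove_edge(u, v)
                g.add_edge(u, w)
    return g

def sirs_step(g, state, b, gam, d, tau, rng):
    """advance the stochastic sirs rules by a small interval tau; the
    per-event probabilities 1 - exp(-rate * tau) follow from the poisson
    process description above. this fixed-step scheme only approximates
    the paper's continuous-time, event-driven updating."""
    new = dict(state)
    for node, x in state.items():
        if x == "s":
            n_i = sum(1 for nb in g[node] if state[nb] == "i")
            if n_i and rng.random() < 1.0 - math.exp(-b * n_i * tau):
                new[node] = "i"
        elif x == "i":
            if rng.random() < 1.0 - math.exp(-gam * tau):
                new[node] = "r"
        elif rng.random() < 1.0 - math.exp(-d * tau):  # x == "r"
            new[node] = "s"
    return new
```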
therefore, holding b, g and d constant while varying the short-cut parameter f allows us to explore the effects of different network configurations (such as fig. 1a-c) on the epidemic. 2.2. analytical considerations. one way to analytically treat the above stochastic system is by using a pair-wise formulation (keeling, 1999), which considers partnerships as the fundamental variables and incorporates the pair-wise connections into the model equations. using the notations of keeling et al. (1997), this formulation gives the following set of equations for the dynamics of disease growth:
d[s_n]/dt = d[r_n] − b Σ_m [s_n i_m], d[i_n]/dt = b Σ_m [s_n i_m] − g[i_n], d[r_n]/dt = g[i_n] − d[r_n],
where [s_n], [i_n] and [r_n] denote respectively the number of susceptible, infective and recovered individuals each with exactly n connections, and [s_n i_m] denotes the number of connected susceptible-infective pairs with n and m connections. by writing Σ_n [s_n] = [s] = s, where s is the total number of susceptible individuals, and Σ_n Σ_m [s_n i_m] = Σ_n [s_n i] = [si], where [si] denotes the total number of connected susceptible-infective pairs, we can rewrite the equations for the number of susceptible, infective and recovered individuals as
ds/dt = d r − b[si], di/dt = b[si] − g i, dr/dt = g i − d r. (1)
even though this set of equations is exact, it is not closed, and additional equations are needed to specify the dynamics of the [si] pairs, which in turn depend on the dynamics of triples, etc., in an infinite hierarchy that is usually closed by moment closure approximations. however, a satisfactory closure scheme for a locally clustered network is still lacking (but see keeling et al., 1997; rand, 1999). here we pursue a different avenue to approximate the stochastic system with modified mean-field equations, which consider only the dynamics of mean quantities but replace the standard bilinear term bsi with a nonlinear transmission rate as follows:
ds/dt = d r − b k s^p i^q, di/dt = b k s^p i^q − g i, dr/dt = g i − d r, (2)
where k, p, q are the "heterogeneity" parameters (severo, 1969; liu et al., 1986; hethcote and van den driessche, 1991; hochberg, 1991). we call eq. (2) the "heterogeneous mixing" (hm) model (maule and filipe, in preparation). we note from eq. (1) that the incidence rate of the epidemic can be estimated by counting the number of connected susceptible-infective pairs [si] in the network. furthermore, [si] is directly related to the correlation c_si that arises between susceptible and infective individuals in the network (keeling, 1999). therefore, comparing eq. (2) with (1) we see that the hm model implicitly assumes a double power-law relationship between this covariance structure and the abundances of infective and susceptible individuals. for instance, in a homogeneous network (such as a regular grid) with an identical number of connections n_0 for all individuals, we have
[si] = (n_0/n) c_si s i, (3)
where n = s + i + r is the population size (keeling, 1999). relationships such as eq. (3) provide an important first step towards understanding how the phenomenological parameters k, p and q are related to network structure. for a homogeneous random network in which every individual is connected to exactly n_0 randomly distributed others (see appendix a), the susceptible and infective individuals are uncorrelated and the total number of interacting pairs [si] = (n_0/n)si. eq. (1) then reduces to
ds/dt = d r − (n_0/n) b s i, di/dt = (n_0/n) b s i − g i, dr/dt = g i − d r. (4)
these equations incorporate the familiar bilinear term si for the incidence rate, and provide a mean-field approximation for the stochastic system in which each individual randomly mixes with n_0 others. note that the transmission coefficient b is proportionately reduced by a factor n_0/n, which is the fraction of the population in a contact neighborhood of each individual.
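given a network and a node-state assignment, the pair count [si] of eq. (1) and the correlation c_si of eq. (3) can be computed directly; the helper below reuses the network and state representation of the previous sketch, and the inversion of eq. (3) reflects the reconstruction given above.

```python
def si_pairs(g, state):
    # count connected susceptible-infective pairs [si] on network g, with
    # state mapping each node to "s", "i" or "r"
    return sum(1 for u, v in g.edges()
               if {state[u], state[v]} == {"s", "i"})

def c_si(g, state, n0=8):
    # correlation between susceptible and infective individuals on a
    # homogeneous network, inverting the eq. (3) relation
    # [si] = (n0/n) * c_si * s * i (keeling, 1999)
    n = g.number_of_nodes()
    s = sum(1 for v in state.values() if v == "s")
    i = sum(1 for v in state.values() if v == "i")
    return (n / n0) * si_pairs(g, state) / (s * i)
```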
in a completely well-mixed population, n_0 = n, and these equations reduce to the standard kermack-mckendrick form (kermack and mckendrick, 1927). eq. (4) exhibits either a disease-free equilibrium, i(t) = 0, or an endemic equilibrium, i* = [dn/(g + d)][1 − g/(b n_0)], depending on whether the basic reproductive ratio r_0 = n_0 b/g is less than or greater than unity. it is to be noted that while eq. (4) describes a homogeneous random network exactly, it provides only an approximation for the random network with f = 1, in which individuals have a binomially distributed number of connections around a mean n_0 (appendix a). details of the implementation of the stochastic system are described in appendix b. one practical approach to estimate the parameters k, p and q of the hm model, when the individual-level processes are unknown, would be to fit these parameters using time series data (gubbins and gilligan, 1997; bjørnstad et al., 2002; finkenstädt et al., 2002). indeed, with a sufficient number of parameters a satisfactory agreement between the model and the data is almost always possible. a direct fit of time series, however, will not tell us whether the disease transmission rate is well approximated by the functional form k s^p i^q of the model. we instead fit the parameters k, p, q to the transmission rate "observed" in the output of the stochastic simulation. specifically, we obtain least-squared estimates of k, p, q by fitting the term k s^p i^q to the computed number of pairs [si] that gives the disease transmission rate of the stochastic system (see eq. (1)); a sketch of this fitting step is given at the end of this section. we then incorporate these estimates in eq. (2), and compare the infective time series produced by this hm model to that generated by the original stochastic network simulation. in this way, we can address whether the transmission rate is well captured by the modified functional form, and if that is the case, whether the hm model approximates successfully the aggregated dynamics of the stochastic system. we compare the stochastic simulation with the predictions of three sets of model equations, representing different degrees of approximation of the system. besides the hm model described above, we consider the bilinear mean-field model given by eq. (4), which assumes k = n_0/n and p = q = 1. this comparison demonstrates the inadequacy of the well-mixed assumption built into the bilinear formulation. we also discuss a restricted hm model with an incidence function of the form (n_0/n) s^(p_r) i^(q_r) in eq. (2), with only two heterogeneity parameters p_r and q_r, as originally proposed by severo (1969) and studied by liu et al. (1986), hethcote and van den driessche (1991) and hochberg (1991). the stochastic sirs dynamics are capable of exhibiting a diverse array of dynamical behaviors, determined by both the epidemic parameters b, g, d and the network short-cut parameter f. we choose the following three scenarios:
stable equilibrium: infection levels in the population reach a stable equilibrium relatively rapidly after a short transient (fig. 3a).
noisy endemic state: infection levels exhibit stochastic fluctuations around an endemic state following a long transient of decaying cycles (fig. 3b).
persistent cycles: fluctuations with a well-defined period and enhanced amplitude persist in the small-world region near f = 0.01 (fig. 3b).
the reason for choosing these different temporal patterns is to test the hm model against a wider range of behaviors of the stochastic system.
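a minimal sketch of the fitting step referenced above: the paper does not state whether the least-squares fit of k s^p i^q to the [si] series was performed in log space, so the log-linear regression below is one plausible reading of the procedure.

```python
import numpy as np

def fit_heterogeneity(s, i, si):
    """least-squares estimate of (k, p, q) by fitting k * s**p * i**q to
    the observed pair counts [si](t), done here in log space:
    log [si] = log k + p log s + q log i."""
    s, i, si = (np.asarray(a, dtype=float) for a in (s, i, si))
    m = (s > 0) & (i > 0) & (si > 0)          # keep strictly positive points
    x = np.column_stack([np.ones(m.sum()), np.log(s[m]), np.log(i[m])])
    coef, *_ = np.linalg.lstsq(x, np.log(si[m]), rcond=None)
    logk, p, q = coef
    return np.exp(logk), p, q
```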
the oscillatory case has epidemiological significance because of the observed pervasiveness of cyclic disease patterns (cliff and haggett, 1984; grenfell and harwood, 1997; pascual et al., 2000). fig. 3a presents simulation examples of the epidemic time series for three values of the short-cut parameter f, representing the regular grid (f = 0), the small-world network (f = 0.01) and the random network (f = 1). the transient pattern depends strongly on f: a high degree of local clustering in a regular grid slows the initial buildup of the epidemic, whereas in a random network with negligible clustering (fig. 1d) the disease grows relatively fast. the transient for the small-world network lies in between these two extremes. by contrast, the stable equilibrium level of the infection remains insensitive to f, implying that the equilibrium should be well predicted by the bilinear mean-field approximation (eq. (4)) itself. least-squared estimates of the two sets of heterogeneity parameters [k, p, q] and [k_r = n_0/n, p_r, q_r], for the full and restricted versions of the hm model respectively, are obtained for a series of f values corresponding to different network configurations, as described in section 3. the disease parameters b, g and d are kept fixed throughout, making the epidemic processes operate at the same rates across different networks, so that the effects of the network structure on the dynamics can be studied independently. transient patterns, however, present a different picture. the mean-field trajectory deviates the most from the stochastic simulation for the regular grid (f = 0), and the least for the random network (f = 1). the full hm model with its three parameters k, p and q, on the other hand, demonstrates an excellent agreement with the stochastic transients for all values of f. by comparison, the transient patterns of the restricted hm model with only two fitting parameters p_r and q_r differ significantly for low values of f (fig. 4a and b). the poor agreement of the restricted hm and the mean-field transients with the stochastic data for a clustered network (low f) is due to the failure of their respective incidence functions to fit the transmission rate of the stochastic system (fig. 2a). on the other hand, the random network has negligible clustering, and the interaction between susceptible and infective individuals is sufficiently well mixed for the restricted hm model to provide as good an approximation of the stochastic transient as the full hm model (fig. 4c). the estimates [k, p, q] = [0.0001, 0.94, 0.97] and [n_0/n, p_r, q_r] = [0.00005, 0.99, 1] for these two models are also quite similar. the discrepancy for the mean-field transients (fig. 4c) is due to the fact that the mean-field model gives only an approximate description of the random network with f = 1, as noted before. at the other extreme, for a regular grid the estimates of the full and restricted hm models are [k, p, q] = [1.66, 0.3, 0.69] and [n_0/n, p_r, q_r] = [0.00005, 0.84, 1.13], which differ considerably from each other. fig. 5a and b demonstrate how the parameters k, p and q of the full hm model depend on the short-cut parameter f. all three of them approach their respective well-mixed values (k = 0.00005, p = q = 1) as f → 1, and they deviate the most as f → 0, in accord with the earlier discussion.
in particular, k is significantly higher, and likewise p and q are lower, for the regular grid than for a well-mixed system, implying a strong nonlinearity of the transmission mechanism in a clustered network. such a large value of k can be understood within the context of the incidence function k s^p i^q, and explains why only two parameters, p_r and q_r in the restricted hm model, cannot absorb the contribution of k. in a homogeneous random network with n_0 connections per individual, the term (n_0/n)si gives the expected total number of pairs [si] that govern disease transmission in the network, and the exponent values p = q = 1 indicate random mixing (of susceptible and infective individuals). by contrast, local interactions in a clustered network lower the availability of susceptible individuals (infected neighbors of an infective individual act as a barrier to disease spread), resulting in a depressed value of the exponent p significantly below 1. this nonlinear effect, combined with a low initial infective number i_0 (0.5% of the total population randomly infected), requires k in the hm model to be large enough to match the disease transmission in the network. indeed, as table 1 demonstrates, both k and p are quite sensitive to i_0 for a regular grid, unlike the other exponent q that does not depend on initial conditions. increasing i_0 facilitates disease growth by distributing infective and susceptible individuals more evenly, which causes an increase of the value of p and a compensatory reduction of k. an interesting pattern in fig. 5a and b is that the values of the heterogeneity parameters remain fairly constant initially for low f, in particular within the interval 0 ≤ f < 0.01 for the exponents p and q (the range is somewhat shorter for k), and then start approaching their respective mean-field values as f increases to 1. this pattern of variation is reminiscent of the plot for the clustering coefficient c shown in fig. 1d, and suggests that the clustering of the network, rather than its average path length l, influences disease transmission strongly. a measure of the accuracy of the approximation can be defined by an error function erf, computed as a mean of the point-by-point deviation of the infective time series i_m(t) predicted from the models, relative to the stochastic simulation data i_s(t), over the length t of the transient (the equilibrium values of the models coincide with the simulation, see fig. 4):
erf = (100/t) Σ_t |i_m(t) − i_s(t)|/i_s(t), (5)
where multiplication by 100 expresses erf as a percentage of the simulation time series. fig. 5c shows erf as a function of f for the three models. the total failure of the mean-field approximation to predict the stochastic transients is evident in the large magnitudes of error (it is 25% even for the random networks). by contrast, the excellent agreement of the full hm model for all f results in a low error throughout. on the other hand, the restricted version of the hm model gives over 30% error for low f whereas it is negligible for high f. interestingly, erf for the restricted hm and mean-field models shows similar patterns of variation with f as in fig. 5b, staying relatively constant within 0 ≤ f < 0.01 and then decreasing relatively fast. local clustering in a network with low f causes disease transmission to deviate from a well-mixed approximation, and thus influences the pattern of erf for these simpler models.
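the error function can be computed directly from the two trajectories; the use of an absolute point-by-point deviation below is a reading of the verbal definition of eq. (5) rather than a confirmed form.

```python
import numpy as np

def erf_percent(i_model, i_sim):
    # mean point-by-point deviation of the model trajectory from the
    # simulation over the transient, expressed as a percentage (eq. (5));
    # taking the absolute value is an assumption of this reconstruction
    i_model = np.asarray(i_model, dtype=float)
    i_sim = np.asarray(i_sim, dtype=float)
    return 100.0 * np.mean(np.abs(i_model - i_sim) / i_sim)
```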
the second type of dynamical behavior of the stochastic system exhibits a relatively long oscillatory transient that settles onto a noisy endemic state for most values of f, near 0 as well as above (fig. 3b). stochastic fluctuations are stronger for f = 0 than f = 1. however, in a significant exception the cycles tend to persist with a considerably increased amplitude and well-defined period for a narrow range of f near 0.01, precisely where the small-world behavior arises in the network. such persistent cycles are not predicted by the homogeneous epidemic dynamics given by eq. (4), and are therefore a consequence of the correlations generated by the contact structure. to our knowledge such a nonmonotonic pattern for the amplitude of the cycles with the network parameter f has not been observed before (see section 5 for a comparison of these results with those of other studies). we estimate two quantities, the "coefficient of variation" (cv) and the "degree of coherence" (dc), which determine respectively the strength and periodicity of the cycles for different values of f. cv has the usual definition
cv = σ_i/⟨i⟩, (6)
where the numerator denotes the standard deviation of the infective time series of length t_s (in the stationary state, excluding transients), and the denominator denotes its mean over the same time length t_s. fig. 6a exhibits a characteristic peak for cv near f = 0.01, demonstrating a maximization of the cycle amplitudes in the small-world region compared to both the high and low values of f. the plot also shows that the fluctuations at the left-side tail of the peak are stronger than at its right-side tail. consistent with this pattern, sustained fluctuations in a stochastic sirs model on a spatial grid (f = 0) were also shown by johansen (1996). by contrast, the low variability in the random network (f → 1) is due to the fact that the corresponding mean-field model (eq. (4)) does not have oscillatory solutions. dc provides a measure of the sharpness of the dominant peak in the fourier power spectrum of the infective time series, and is defined as
dc = h_max (v_max/dv), (7)
where h_max, v_max and dv are the peak height, peak frequency and the width at one-tenth maximum, respectively, of a gaussian fit to the dominant peak. the sharp nature of the peak, particularly for the small-world network, makes it unfeasible to use the standard "width at half-maximum" (gang et al., 1993; lago-fernández et al., 2000), which is often zero here. the modified implementation in eq. (7) therefore considerably underestimates the sharpness of the dominant peak. even then, fig. 6b depicts a fairly narrow maximum for dc near f = 0.01, indicating that the cycles within the small-world region have a well-defined period. the low value of dc for f = 0 implies that the fluctuations in the regular grid are stochastic in nature (a sketch of both estimators is given below).
fig. 6 - the coefficient of variation, cv (eq. (6)), and the degree of coherence, dc (eq. (7)), are plotted against f in (a) and (b), respectively (see text for definitions); each point in (b) represents estimates using a fourier power spectrum averaged over 10 independent realizations of the stochastic simulations.
a likely scenario for the origin of these persistent cycles is as follows. stochastic fluctuations are locally maintained in a regular grid by the propagating fronts of spatially distributed infective individuals, but they are out of phase across the network.
the infective individuals are spatially correlated over a length ξ ∝ d^(−1) in the grid (johansen, 1996), which typically has a far shorter magnitude than the linear extent of the grid used here (increasing d reduces the correlation length ξ further, which weakens these fluctuations and gives the stable endemic state observed, for instance, in fig. 3a). the addition of a small number of short-cuts in a small-world network (fig. 1b) couples together a few of these local fronts, thereby effectively increasing the correlation length to the order of the system size and creating a globally coherent periodic response. as more short-cuts are added, the network soon acquires a sufficiently random configuration and the homogeneous dynamics become dominant. another important point to note in fig. 3b is that, in contrast to fig. 3a, the mean infection level i of the cycles is not independent of f: i now increases slowly with f. an immediate implication of this observation is that, unlike the earlier case of a stable equilibrium, the bilinear mean-field model of eq. (4) will no longer be able to accurately predict the mean infection for all f. fig. 7 shows the same examples of the stochastic time series as in fig. 3b, along with the solutions of the three models. as expected, the mean-field time series fails to predict the mean infection level to varying degrees in all three cases, deviating most for the regular grid (f = 0) and least for the random network (f = 1). by comparison, the equilibrium solutions of the full and restricted versions of the hm model both demonstrate good agreement with the mean infection level of the stochastic system. for the transient patterns, the two hm models exhibit similar decaying cycles of roughly the same period, and also of the same transient length, as the stochastic time series, but they occur at a different phase. even though the transient cycles of the hm models persist the longest for f = 0.01, they eventually decay onto a stable equilibrium and thus fail to predict the persistent oscillations of the small-world network. the mean-field model, on the other hand, shows damped cycles of much shorter duration and hence is a poor predictor overall. the close agreement of the two hm time series with each other for f = 0.01 (fig. 7b) is due to the fact that the least-squared estimate of k for the full hm model is 0.00005, equal to n_0/n of the restricted hm, and the exponents p, q likewise reduce to p_r, q_r. within the entire range 0 ≤ f ≤ 1, the estimates of p and p_r for the full and restricted hm models lie between [0.68, 1.08] and [0.82, 0.93], respectively, whereas q and q_r both stay close to a value of 1.1. it is interesting to note here that a necessary condition for limit cycles in sirs dynamics with a nonlinear incidence rate s^p i^q is q > 1 (liu et al., 1986), which both the full and restricted hm models appear to satisfy. one possible reason then for their failure to reproduce the cycles in the small-world region is the overall complexity of the stochastic time series, which results from nontrivial correlation patterns present in the susceptible-infective dynamics. the three-parameter incidence function k s^p i^q of the full hm model may not have sufficient flexibility to adequately fit the cyclic incidence pattern of the stochastic system.
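the two estimators cv (eq. (6)) and dc (eq. (7)) can be sketched as follows; the spectral-peak width is measured here directly on the periodogram at one-tenth maximum rather than on a gaussian fit, so the dc values only approximate the estimator described above.

```python
import numpy as np

def cv(i_series):
    # coefficient of variation, eq. (6), over the stationary window
    x = np.asarray(i_series, dtype=float)
    return np.std(x) / np.mean(x)

def dc(i_series, dt=1.0):
    """degree of coherence, eq. (7), read as h_max * v_max / dv. the width
    dv is taken directly from the periodogram at one-tenth of the peak
    height instead of from a gaussian fit, an approximation of the
    paper's recipe."""
    x = np.asarray(i_series, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=dt)
    k = 1 + int(np.argmax(power[1:]))      # dominant peak, skipping f = 0
    h_max, v_max = power[k], freqs[k]
    band = np.flatnonzero(power >= h_max / 10.0)
    dv = freqs[band[-1]] - freqs[band[0]]
    return np.inf if dv == 0 else h_max * v_max / dv
```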
we emphasize here that if the cycles are not generated intrinsically, but are driven by an external variable such as a periodic environmental forcing, the outcome is well predicted when an appropriate forcing term is included in eq. (2) (results not shown). as a final note, all of the above observations for both stable equilibria and cyclic epidemic patterns have been qualitatively validated for multiple sets of values of the disease parameters b, g and d. 5. discussion. stochastic sirs dynamics implemented on a network with varying degrees of local clustering can generate a rich spectrum of behaviors, including stable and noisy endemic equilibria as well as decaying and persistent cycles. persistent cycles arise in our system even though the homogeneous mean-field dynamics do not have oscillatory solutions, thereby revealing an interesting interplay of network structure and disease dynamics (also see rand, 1999). our results demonstrate that a three-variable epidemic model with a nonlinear incidence function k s^p i^q, consisting of three "heterogeneity" parameters [k, p, q], is capable of predicting the disease transmission patterns, including the transient and stable equilibrium prevalence, in a clustered network. the relatively simpler (and more standard) form s^(p_r) i^(q_r) with two parameters [p_r, q_r] falls short in this regard. this restricted model, however, is an adequate predictor of the dynamics in a random network, for which the bilinear mean-field approximation cannot explain the transient pattern. interestingly, even the function k s^p i^q cannot capture the complex dynamics of persistent cycles in a small-world network that has simultaneously high local clustering and long-distance connectivity. it is worth noting, however, that such persistent cycles appear within a small region of the parameter space for f, and therefore the hm model appears to provide a reasonable approximation for most cases of clustered as well as randomized networks. an implication of these findings is that an approximate relationship is established early on in the transients, lasting all the way to equilibrium, between the covariance structure of the [si] pairs and the global (mean) quantities s and i. this relationship is given by a double power law of the number of susceptible and infective individuals. it allows the closure of the equations for mean quantities, making it possible to approximate the stochastic dynamics with a simple model (hm) that mimics the basic formulation of the mean-field equations but with modified functional forms. it reveals an interesting scaling pattern from individual to population dynamics, governed by the underlying contact structure of the network. in lattice models for antagonistic interactions, which bear a strong similarity to our stochastic disease system, a number of power-law scalings have been described for the geometry of the clusters (pascual et al., 2002). it is an open question whether the exponents for the dynamic scaling (i.e., parameters p and q here) can be derived from such geometrical properties. it also needs to be determined under what conditions power-law relationships will hold between local structure and global quantities. the failure of the hm model to generate persistent cycles may result from an inappropriate choice of the incidence function k s^p i^q. it remains to be seen if there exists a different functional form that better fits the incidence rate of the stochastic system and is capable of predicting the variability in the data.
it is also not known whether a moment closure method including the explicit dynamics of the covariance terms themselves (pacala and levin, 1997; keeling, 1999) can provide a good approximation to the mean infection level in a network with a high degree of local clustering. of course, heterogeneities in space or in contact structure are not the only factors contributing to the nonlinearity in the transmission function s^p i^q; a number of other biological mechanisms of transmission can lead to such functional forms. by rewriting b k s^p i^q as [b k s^(p−1) i^(q−1)] s i ≡ b̃ s i, where b̃(s, i) now represents a density-dependent transmission efficiency in the bilinear (homogeneous) incidence framework, one can relate b̃ to a variety of density-dependent processes such as those involving vector-borne transmission, or threshold virus loads, etc. (liu et al., 1986). interestingly, it has been suggested that in such cases cyclic dynamics are likely to be stabilized, rather than amplified, by nonlinear transmission (hochberg, 1991). it appears then that network structure can contribute to the cyclic behavior of diseases with relatively simple transmission dynamics. it is interesting to consider the persistent cycles we have discussed here in light of other studies on fluctuations in networks. on one side, cycles have been described for random networks with f = 1 because the corresponding well-mixed dynamics also have oscillatory solutions (lago-fernández et al., 2000; kuperman and abramson, 2001). at the opposite extreme, johansen (1996) reported persistent fluctuations in a stochastic sirs model on a regular grid (f = 0), strictly generated by the local clustering of the grid since the mean-field equations do not permit cycles. recent work by verdasca et al. (2005) extends johansen's observation by showing that fluctuations do occur in clustered networks from a regular grid to the small-world configuration. they describe a percolation-type transition across the small-world region, implying that the fluctuations fall off sharply within this narrow interval. this observation is in significant contrast to our results, where the amplitudes of the cycles are maximized by the small-world configuration, and therefore require both local clustering and some degree of randomization. one difference between the two models is that verdasca et al. (2005) use a discrete time step for the recovery of infected individuals, while in our event-driven model time is continuous and the recovery time is exponentially distributed. a more systematic study of parameter space for these models is warranted. we should also mention that there are other ways to generate a clustered network than a small-world algorithm. for example, keeling (2005) described a method that starts with a number of randomly placed focal points in a two-dimensional square, and draws a proportion of them towards their nearest focal point to generate local clusters. network building can also be attempted from the available data on selective social mixing (morris, 1995). the advantage of our small-world algorithm is that besides being simple to implement, it is also one of the best-studied networks (watts, 2003). this algorithm generates a continuum of configurations from a regular grid to a random network, and many real systems have an underlying regular spatial structure, as in the case of hantavirus of wild rats within the city blocks of baltimore (childs et al., 1988).
moreover, emergent diseases like the recent outbreak of severe acute respiratory syndrome (sars) have been studied by modeling human contact patterns using small-world networks (masuda et al., 2004; verdasca et al., 2005). the network considered here remains static in time. while this assumption is reasonable when disease spreads rapidly relative to changes of the network itself, there are many instances where the contact structure would vary over comparable time scales. examples include group dynamics in wildlife resulting from schooling or spatial aggregation, as well as territorial behavior. dynamic network structure involves processes such as migration among groups that establishes new connections and destroys existing ones, but also demographic processes such as birth and death as well as disease-induced mortality. another topic of current interest is the effect of predation on disease growth, which splices together predator-prey and host-pathogen dynamics in which the prey is an epidemic carrier (ostfeld and holt, 2004). simple dynamics assuming the homogeneous mixing of prey and predators make interesting predictions about the harmful effect of predator control in aggravating disease prevalence, with potential spill-over effects on humans (packer et al., 2003; ostfeld and holt, 2004). it remains to be seen if these conclusions hold under an explicit modeling framework that binds together the social dynamics of both prey and predator. more generally, future work should address whether modified mean-field models provide accurate simplifications for stochastic disease models on dynamic networks. so far the work presented here for static networks provides support for the empirical application of these simpler models to time series data.
acknowledgements. we thank juan aparicio for valuable comments about the work, and ben bolker and an anonymous reviewer for useful suggestions on the manuscript. this research was supported by a centennial fellowship of the james s. mcdonnell foundation to m.p.
references:
statistical mechanics of complex networks
infectious diseases of humans: dynamics and control
the mathematical theory of infectious diseases
dynamics of measles epidemics: estimating scaling of transmission rates using a time series sir model
analytic models for the patchy spread of plant disease
space, persistence and dynamics of measles epidemics
the effects of disease dispersal and host clustering on the epidemic threshold in plants
the ecology and epizootiology of hantaviral infections in small mammal communities of baltimore: a review and synthesis
island epidemics
evolution of networks
modelling dynamic and network heterogeneities in the spread of sexually transmitted disease
a stochastic model for extinction and recurrence of epidemics: estimation and inference for measles outbreaks
metapopulation dynamics as a contact process on a graph
stochastic resonance without external periodic force
(meta) population dynamics of infectious diseases
a test of heterogeneous mixing as a mechanism for ecological persistence in a disturbed environment
some epidemiological models with nonlinear incidence
non-linear transmission rates and the dynamics of infectious disease
a simple model of recurrent epidemics
correlation models for childhood epidemics
the effects of local spatial structure on epidemiological invasions
the implications of network structure for epidemic dynamics
a contribution to the mathematical theory of epidemics
disentangling extrinsic from intrinsic factors in disease dynamics: a nonlinear time series approach with an application to cholera
modeling infection transmission
small world effect in an epidemiological model
fast response and temporal coherent oscillations in small-world networks
influence of nonlinear incidence rates upon the behavior of sirs epidemiological models
dynamical behavior of epidemiological models with non-linear incidence rate
how should pathogen transmission be modelled?
relating heterogeneous mixing models to spatial processes in disease epidemics
transmission of severe acute respiratory syndrome in dynamical small-world networks
data driven network models for the spread of disease
mathematical biology
the spread of epidemic disease on networks
scaling and percolation in the small-world network model
are predators good for your health? evaluating evidence for top-down regulation of zoonotic disease reservoirs
biologically generated spatial pattern and the coexistence of competing species
keeping the herds healthy and alert: impacts of predation upon prey with specialist pathogens
cholera dynamics and el niño-southern oscillation
simple temporal models for ecological systems with complex spatial patterns
epidemic spreading in scale-free networks
correlation equations and pair approximations for spatial ecologies
persistence and dynamics in lattice models of epidemic spread
percolation on heterogeneous networks as a model for epidemics
generalizations of some stochastic epidemic models
the impacts of network topology on disease spread
ecological theory to enhance infectious disease control and public health policy
contact networks and the evolution of virulence
recurrent epidemics in small world networks
small worlds
collective dynamics of small-world networks
appendix a. it is important to distinguish among the different types of random networks that are used frequently in the literature. one is the random network with f = 1 that is generated using the small-world algorithm as described in section 2 (fig. 1c), which has a total of n n_0/2 distinct connections, where n_0 is the original neighborhood size (= 8 here) in the regular grid and n is the size of the network. each individual in this random network has a binomially distributed number of contacts around a mean n_0. there is also the homogeneous random network discussed in relation to the mean-field eq. (4), which by definition has fixed n_0 random contacts per individual (keeling, 1999). these two networks are, however, different from the random network of erdős and rényi (albert and barabási, 2002), generated by randomly creating connections with a probability p among all pairs of individuals in a population. the expected number of distinct connections in the population is then p n(n − 1)/2, and each individual has a binomially distributed number of connections with mean p(n − 1). for moderate values of p and large population sizes, the erdős–rényi network is much more densely connected than the first two types. all three of them, however, have negligible clustering c and path length l, since the individuals do not retain any local connections (all connections are short-cuts). appendix b. an appropriate network is constructed with a given f, and the stochastic sirs dynamics are implemented on this network using the rules described in section 2. for the initial conditions, we start with a random distribution of a small number of infective individuals, only 0.5% of the total population (= 0.005n) unless otherwise stated, in a pool of susceptible individuals. all generated time series used for least-squared fitting of the transmission rate have a length of 20,000 time units. the structure of the network remains fixed during the entire stochastic run. stochastic simulations were carried out with a series of network sizes ranging from n = 10^4 to 10^6. the results presented here are those for n = 160,000 and are representative of other sizes. the values for the epidemic rate parameters b, g and d are chosen so that the disease successfully establishes in the population (a finite fraction of the population remains infected at all times).
key: cord-327651-yzwsqlb2 authors: ray, bisakha; ghedin, elodie; chunara, rumi title: network inference from multimodal data: a review of approaches from infectious disease transmission date: 2016-09-06 journal: j biomed inform doi: 10.1016/j.jbi.2016.09.004 sha: doc_id: 327651 cord_uid: yzwsqlb2 abstract: network inference problems are commonly found in multiple biomedical subfields such as genomics, metagenomics, neuroscience, and epidemiology. networks are useful for representing a wide range of complex interactions ranging from those between molecular biomarkers, neurons, and microbial communities, to those found in human or animal populations. recent technological advances have resulted in an increasing amount of healthcare data in multiple modalities, increasing the preponderance of network inference problems. multi-domain data can now be used to improve the robustness and reliability of recovered networks from unimodal data. for infectious diseases in particular, there is a body of knowledge that has been focused on combining multiple pieces of linked information.
combining or analyzing disparate modalities in concert has demonstrated greater insight into disease transmission than could be obtained from any single modality in isolation. this has been particularly helpful in understanding incidence and transmission at early stages of infections that have pandemic potential. novel pieces of linked information in the form of spatial, temporal, and other covariates including high-throughput sequence data, clinical visits, social network information, pharmaceutical prescriptions, and clinical symptoms (reported as free-text data) also encourage further investigation of these methods. the purpose of this review is to provide an in-depth analysis of multimodal infectious disease transmission network inference methods with a specific focus on bayesian inference. we focus on analytical bayesian inference-based methods as this enables recovering multiple parameters simultaneously; for example, not just the disease transmission network, but also parameters of epidemic dynamics. our review studies their assumptions, key inference parameters and limitations, and ultimately provides insights about improving future network inference methods in multiple applications. dynamical systems and their interactions are common across many areas of systems biology, neuroscience, healthcare, and medicine. identifying these interactions is important because they can broaden our understanding of problems ranging from regulatory interactions in biomarkers, to functional connectivity in neurons, to how infectious agents transmit and cause disease in large populations. several methods have been developed to reverse engineer, or identify cause and effect pathways of, target variables in these interaction networks from observational data [1-3]. in genomics, regulatory interactions such as disease phenotype-genotype pairs can be identified by network reverse engineering [1,4]. molecular biomarkers or key drivers identified can then be used as targets for therapeutic drugs and directly benefit patient outcomes. in microbiome studies, network inference is utilized to uncover associations amongst microbes and between microbes and ecosystems or hosts [2,5,6]. this can include insights about taxa associations, phylogeny, and evolution of ecosystems. in neuroscience, there is an effort towards recovering brain-connectivity networks from functional magnetic resonance imaging (fmri) and calcium fluorescence time series data [3,7]. identifying structural or functional neuronal pairs illuminates understanding of the structure of the brain, can help better understand animal and human intelligence, and inform treatment of neuronal diseases. infectious disease transmission networks are widely studied in public health. understanding disease transmission in large populations is an important modeling challenge because a better understanding of transmission can help predict who will be affected, and where or when they will be. network interactions can be further refined by considering multiple circulating pathogenic strains in a population along with strain-specific interventions, such as during influenza and cold seasons. thus, network interactions can be used to inform interventional measures in the form of antiviral drugs, vaccinations, quarantine, prophylactic drugs, and workplace or school closings to contain infections in affected areas [8-11].
developing robust network inference methods to accurately and coherently map interactions is, therefore, fundamentally important and useful for several biomedical fields. as summarized in fig. 1, many methods have been used to identify pairwise interactions in genomics, neuroscience [12,13] and microbiome research [14], including correlation and information gain-based metrics for association, inverse covariance for conditional independence testing, and granger causality for causation from temporal data. further, multimodal data integration methods such as horizontal integration, model-based integration, kernel-based integration, and non-negative matrix factorization have been used to combine information from multiple modalities of 'omics' data such as gene expression, protein expression, somatic mutations, and dna methylation with demographic, diagnosis, and phenotypical clinical data. bayesian inference has been used to analyze changes in gene expression from microarray data, as dna measurements can have several unmeasured confounders and thereby incorporate noise and uncertainty [15]. multi-modal integration can be used for classification tasks, to predict clinical phenotypes such as tumor stage or lymph node status, for clustering of patients into subgroups, and to identify important regulatory modules [16-20]. in neuroscience, not just data integration, but multimodal data fusion has been performed by various methods such as linear regression, structural equation modeling, independent component analysis, principal component analysis, and partial least squares [21]. multiple modalities such as fmri, electroencephalography, and diffusion tensor imaging (dti) have been jointly analyzed to uncover more details than could be captured by a single imaging technique [21]. in metagenomics, network inference from microbial data has been performed using methods such as inverse covariance and correlation [2]. in evolutionary biology, the massive generation of molecular data has enabled bayesian inference of phylogenetic trees using markov chain monte carlo (mcmc) techniques [22,23]. in infectious disease transmission network inference, bayesian inference frameworks have been primarily used to integrate data such as dates of pathogen sample collection and symptom report date, pathogen genome sequences, and locations of patients [24-26]. this problem remains challenging as the data generative processes and scales of heterogeneous modalities may be widely different, transformations applied to separate modalities may not preserve the interactions between modalities, and separately integrated models may not capture interaction effects between modalities [27]. as evidence mounts regarding the complex combination of biological, environmental, and social factors behind disease, emphasis on the development of advanced modeling and inference methods that incorporate multimodal data into singular frameworks has increased. these methods are becoming more important to consider given that the types of healthcare data available for understanding disease pathology, evolution, and transmission are numerous and growing. for example, internet and mobile connectivity have enabled mobile sensors, point-of-care diagnostics, web logs, and participatory social media data which can provide complementary health information to traditional sources [28,29].
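as a concrete instance of the inverse-covariance approach mentioned above, the following sketch recovers a conditional-independence network with scikit-learn's graphical lasso; the data, threshold, and variable count are stand-ins for illustration only.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))           # stand-in data: samples x variables

model = GraphicalLassoCV().fit(x)
precision = model.precision_             # estimated inverse covariance
adjacency = np.abs(precision) > 1e-3     # nonzero partial correlations
np.fill_diagonal(adjacency, False)       # drop self-edges
```

on real multivariate data, the nonzero off-diagonal entries of the precision matrix indicate pairs of variables that remain dependent after conditioning on all the others.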
in the era of precision medicine, it becomes especially important to combine clinical information with biomarker and environmental information to recover complex genotype-phenotype maps [30-33]. infectious disease networks are one area where the need to bring together data types has long been recognized, specifically to better understand disease transmission. data sources including high-throughput sequencing technologies have enabled genomic data to become more cost-effective, offering support for studying transmission by revealing pathways of pathogen introduction and evolution in a population. yet, genomic data in isolation are insufficient to obtain a comprehensive picture of disease in the population. while these data can provide information about pathogen evolution, genetic diversity, and molecular interaction, they do not capture other environmental, spatial, and clinical factors that can affect transmission. for infectious disease surveillance, this information is usually conveyed through epidemiological data, which can be collected in various ways such as in clinical settings from the medical record, or in more recent efforts through web search logs or participatory surveillance. participatory surveillance data types typically include age, sex, date of symptom onset, and diagnostic information such as severity of symptoms. in clinical settings, epidemiological data are generally collected from patients reporting illness. this can include, for example, age at diagnosis, sex, race, family history, diagnostic information such as severity of symptoms, and phenotypical information such as presence or absence of disease, which may not be standardized. high-throughput sequencing of pathogen genomes, along with linked spatial and temporal information, can advance surveillance by increasing granularity and leading to a better understanding of the spread of an infectious disease [37]. considerable efforts have been made to unify genomic and epidemiologic information from traditional clinical forms into singular statistical frameworks to refine understanding of disease transmission [24-26,34-36]. one approach to design and improve disease transmission models has been to analytically combine multiple, individually weak predictive signals in the form of sparse epidemiological, spatial, pathogen genomic, and temporal data [24,25,34,35,38]. molecular epidemiology is the evolving field wherein the above data types are considered together; epidemiological models are used in concert with pathogen phylogeny and immunodynamics to uncover disease transmission patterns [39]. pathogen genomic data can capture within-host pathogen diversity (the product of effective population size in a generation and the average pathogen replication time [25,26]) and dynamics, or provide information critical to understanding disease transmission such as evidence of new transmission pathways that cannot be inferred from epidemiological data alone [40,41]. genomic data can also rule out unlikely transmission pathways; the remaining possibilities can then be examined using any available epidemiological data. as molecular epidemiology and infectious disease transmission are areas in which network inference methods have been developed for bringing together multimodal data, we use this review to investigate the foundational work in this specific field. the data types, relevant questions, and purpose of such studies are summarized in fig. 2, and we further articulate the approaches below.
in molecular epidemiology, several approaches have been used to overlay pathogen genomic information on traditionally collected epidemiologic information to recover transmission networks. additional modeling structure is needed in these problems because infectious disease transmission occurs through contact networks of heterogeneous individuals, which may not be captured by compartmental models such as susceptible-infectious-recovered (sir) and susceptible-latent-infectious-recovered (slir) models [42] (a minimal sir sketch is given below). as well, for increased utility in epidemiology, there is a need to estimate epidemic parameters in addition to the transmission network. unlike other fields wherein recovery of just the topology of the network is desired, in molecular epidemiology bayesian inference is commonly used to reverse engineer infectious disease transmission networks in addition to estimating epidemic parameters (fig. 2). while precise features can be extracted from observed data, there are latent variables, not directly measured, which must simultaneously be considered to provide a complete picture. thus, bayesian inference methods have been used to simultaneously infer epidemic parameters and the structure of the transmission network in a single framework. instead of capturing pairwise interactions, such as correlations or inverse covariance, bayesian inference is capable of considering all nodes and inferring a global network and transmission parameters [7]. moreover, bayesian inference is capable of modeling noisy, partially sampled, realistic outbreak data while incorporating prior information. while this review focuses on infectious disease transmission, network inference methods have implications in many areas. modeling network diffusion and influence, identifying important nodes, link prediction, estimating influence probabilities, and detecting community topology and parameters are key questions in several fields, ranging from genomics to social network analysis [43]. analogous frameworks can be developed with different modalities of observational genomic or clinical data to model information propagation and to capture the influence of nodes, identify nodes that are more influential than others, and describe the temporal dynamics of information diffusion. for modeling information spread in such networks, the influence and susceptibility of nodes can serve as analogues of epidemic transmission parameters. however, these modified methods should also account for the differences between information propagation in such networks and infectious disease transmission, by incorporating constraints in the form of temporal decay of infection, strengths of ties measured from biological domain knowledge, and multiple pathways of information spread. to identify the studies most relevant for this focused review, we queried pubmed. for practicality and relevance, our search, summarized in fig. 3, was limited to papers from the last ten years. as our review is focused on infectious disease transmission network inference, we started with the keywords 'transmission' and 'epidemiological'. to ensure that we captured studies that incorporate pathogen genomic data, we added the keywords 'genetic', 'genomic' and 'phylogenetic', giving 5557 articles in total. next, to narrow the results to those comprising a study of multi-modal data, we found that the keywords 'combining' or 'integrating' alongside 'bayesian inference' or 'inference' were comprehensive. these filters yielded 73 and 61 articles in total.
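as context for the compartmental models mentioned above, the sketch below integrates the standard sir equations under assumed parameter values; it is a textbook illustration, not code from the reviewed studies, and it shows the homogeneous-mixing simplification that network methods relax (every individual is treated as identically mixed):

```python
# standard sir compartmental model: fractions s, i, r evolve under mass-action
# mixing with transmission rate beta and recovery rate gamma.
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

t = np.linspace(0, 160, 400)
beta, gamma = 0.3, 0.1                       # assumed rates (per day), r0 = 3
s, i, r = odeint(sir, [0.99, 0.01, 0.0], t, args=(beta, gamma)).T
print(f"epidemic peak fraction infectious: {i.max():.3f}")
```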
we found that some resulting articles focused on outbreak detection, sexually transmitted diseases, laboratory methods, and phylogenetic analysis. also, the focus of several articles was to either overlay information from different modalities or to sequentially analyze them to eliminate unlikely transmission pathways. after a full-text review to exclude these and focus on methodological approaches, 8 articles resulted which use bayesian inference to recover transmission networks from multimodal data for infectious diseases, and which represent the topic of this review. these included bayesian likelihood-based methods for integrating pathogen genomic information with temporal, spatial, and epidemiological characteristics for infectious diseases such as foot-and-mouth disease (fmd) and respiratory illnesses, including influenza. as incorporating genomic data simultaneously in analytical multimodal frameworks is a relatively novel idea, the literature on this is limited. recent unified platforms have been made available to the community for the analysis of outbreaks and the storing of outbreak data [44]. thus, it is essential to review the available literature on this novel and burgeoning topic. for validation, we repeated our queries on google scholar. although google scholar generated a much broader range of papers, based on the types of papers indexed, we verified that it also yielded the articles selected from pubmed. we are confident in our choice of articles for review as we have used two separate publication databases. below we summarize the theoretical underpinnings of the likelihood-based framework approaches, the inference parameters, and the assumptions of each of these studies, and articulate the limitations, which can motivate future research. infectious disease transmission study is a rapidly developing field, given the recent advent of widely available epidemiological, social contact, social networking and pathogen genomic data. in this section we briefly review multimodal integration methods for combining pathogen genomic data and epidemiological data in a single analysis, for inferring infection transmission trees and epidemic dynamic parameters. advances in genomic technology, such as the sequencing of whole genomes of rna viruses and the identification of single nucleotide variations using sensitive mass spectrometry, have enabled the tracing of transmission patterns and mutational parameters of the severe acute respiratory syndrome (sars) virus [45]. in this study, phylogenetic trees were inferred using phylogenetic analysis using parsimony (paup*) under a maximum likelihood criterion [46]. mutation rate was then inferred based on a model which assumes that the number of mutations observed between an isolate and its ancestor is proportional to the mutation rate and their temporal difference [47]. [fig. 3: study design and inclusion-exclusion criteria. a decision tree showing the searches and selection criteria for both pubmed and google scholar; only genomic epidemiology methods utilizing bayesian inference for infectious disease transmission were included.] their estimated mutation rate was similar to existing literature on mutation rates of other viral pathogens. phylogenetic reconstruction revealed three major branches, in taiwan, hong kong, and china. gardy et al. [29] analyzed a tuberculosis outbreak in british columbia in 2007 using whole-genome pathogen sequences and contact tracing using social network information.
epidemiological information collection included completing a social network questionnaire to identify contact patterns, high-risk behaviors such as cocaine and alcohol usage, and possible geographical regions of spread. pathogen genomic data consisted of restriction-fragment-length polymorphism analysis of tuberculosis isolates. phylogenetic inference of genetic lineage, based on single nucleotide polymorphisms from the genomic data, was performed. their method demonstrated that transmission inference, such as identifying a possible source patient from contact tracing by epidemiological investigation, can be refined by adding ancestral and diversity information from genomic data. in one of the earliest attempts to study genetic sequence data, as well as the dates and locations of samples, in concert, jombart et al. [38] proposed a maximal spanning tree graph-based approach that went beyond existing phylogenetic methods. this method was utilized to uncover the spatiotemporal dynamics of influenza a (h1n1) in 2009 and to study its worldwide spread. a total of 433 gene sequences of hemagglutinin (ha) and of neuraminidase (na) were obtained from genbank. classical phylogenetic approaches fail to capture the hierarchical relationship between ancestors and descendants sampled at the same time. using their algorithm called seqtrack [48], the authors constructed ancestries in the samples based on a maximal spanning tree. seqtrack [38] utilizes the facts that, in the absence of recombination and reverse mutations, strains will have unique ancestors characterized by the fewest possible mutations, that no sample can be the ancestor of a sample which temporally preceded it, and that the likelihood of ancestry can be estimated from the genomic differentiation between samples. seqtrack was successful in reconstructing the transmission trees in both completely and incompletely sampled outbreaks, unlike phylogenetic approaches, which failed to capture ancestral relationships between the tips of trees. however, this method cannot capture the underlying within-host virus genetic parameters. moreover, mutations generated once can be present in different samples, and transmission likelihood based on genetic distance may not be reliable. the above methods exploit information from different modalities separately. recent methodological advancements have seen the simultaneous integration of multiple modalities of data in singular bayesian inference frameworks. in the following section we discuss state-of-the-art approaches based on bayesian inference to reconstruct partially-observed transmission trees and multiple origins of pathogen introduction in a host population [25,34,35,49,50]. we specifically focus on bayesian likelihood-based methods, as these methods consider heterogeneous modalities in a single framework and simultaneously infer the transmission network and epidemic parameters, such as the rate of infection transmission and the rate of recovery. infectious disease transmission network inference is one problem area wherein there is a foundational literature of bayesian inference methods; reviewing them together allows the understanding and comparison of specific related features across models. the methods are summarized in table 1. in bayesian inference, information recorded before the study is included as a prior in the hypothesis. based on bayes' theorem as shown below, this method incorporates prior information and likelihoods from the sample data to compute a posterior probability distribution, or p(hypothesis | data).
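the equation this sentence points to is missing from the extracted text; the standard statement of bayes' theorem being described is:

```latex
P(\mathrm{hypothesis} \mid \mathrm{data}) =
  \frac{P(\mathrm{data} \mid \mathrm{hypothesis})\, P(\mathrm{hypothesis})}
       {P(\mathrm{data})}
```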
the denominator is a normalization constant, or the marginal probability density of the sample data computed over all hypotheses [51]. the hypothesis for this problem can be expressed in the form of a transmission network over individuals, locations, or farms, parameters such as the rates of infectiousness and recovery, or the mutation probability of pathogens. the posterior probability distribution can then be estimated as p(tree, parameters | data) ∝ p(data | tree, parameters) × p(tree, parameters). the posterior probability is then a measure that the inferred transmission tree and parameters are correct. it can be extremely difficult to analytically compute the posterior probability distribution, as it involves iterating over all possible combinations of branches of such a transmission tree and parameter values. however, it is possible to approximate the posterior probability distribution using mcmc [52] techniques. in mcmc, a markov chain is constructed which is described by the state space of the parameters of the model and which has the posterior probability distribution as its stationary distribution. for an iteration of the mcmc, a new tree is proposed by stochastically altering the previous tree. the new tree is accepted or rejected based on a probability computed from a metropolis-hastings or gibbs update (a toy implementation of this propose/accept loop is sketched below). the quality of the results from the mcmc approximation can depend on the number of iterations it is run for, the convergence criterion and the accuracy of the update function [22]. cottam et al. [40] developed one of the earliest methods to address this problem, studying foot-and-mouth disease (fmd) in twenty farms in the uk. in this study, fmd virus genomes (the fmd virus has a positive-strand rna genome and is a member of the genus aphthovirus in the family picornaviridae) were collected from clinical samples from the infected farms. the samples were chosen so that they could be used to study variation within the outbreak and the time required for the accumulation of genetic change, and to study transmission events. total rna was extracted directly from epithelial suspensions, blood, or esophageal suspensions. sanger sequencing was performed on 42 overlapping amplicons covering the genome [53]. as the rna virus has a high substitution rate, the number of mutations was sufficient to distinguish between different farms. they designed a maximum likelihood-based method incorporating complete genome sequences, the date at which infection on a farm was identified, and the date of culling of the animals. the goal was to trace the transmission of fmd in durham county, uk during the 2001 outbreak, to infer the date of infection of animals and the most likely period of their infectiousness. in their approach, they first generated the phylogenies of the viral genomes [54,55]. once the tips of the trees were generated, they constructed possible transmission trees by recursively working backwards to identify a most recent common ancestor (mrca) in the form of a farm, and assigned each haplotype to a farm. the likelihood of each tree was then estimated using epidemiological data. their study assumed the mean incubation time prior to infectiousness to be five days, the distribution of incubation times to follow a discrete gamma distribution, the most likely date of infection to be the date of reporting minus the age of the oldest reported lesion on the farm minus the mean incubation time, and the farms to be sources of infection immediately after being identified as infected up to the day of culling.
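to make the propose/accept loop described above concrete, here is a toy metropolis-hastings sampler over infector assignments; the likelihood (an assumed per-unit genetic-distance penalty) and the data are hypothetical simplifications, not the model of any reviewed paper, and real samplers must additionally enforce tree constraints (no cycles) and time ordering, which this toy omits:

```python
# toy metropolis-hastings over "who infected whom" (a parent vector), with a
# likelihood that decays in pairwise genetic distance.
import numpy as np

rng = np.random.default_rng(1)
n = 6
d = rng.integers(0, 10, size=(n, n))          # hypothetical pairwise genetic distances
d = np.triu(d, 1) + np.triu(d, 1).T           # symmetrize, zero diagonal

def log_lik(parents, mu=0.5):
    # cases 1..n-1 each have an infector; case 0 is the index case
    return sum(-mu * d[i, parents[i]] for i in range(1, n))

parents = np.zeros(n, dtype=int)              # start: everyone infected by case 0
cur = log_lik(parents)
for step in range(5000):
    prop = parents.copy()
    i = rng.integers(1, n)                    # pick a case, propose a new infector
    prop[i] = rng.choice([j for j in range(n) if j != i])
    new = log_lik(prop)
    if np.log(rng.random()) < new - cur:      # metropolis acceptance rule
        parents, cur = prop, new
print(parents)                                # one sample from the (toy) posterior
```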
returning to cottam et al.'s analysis, spatial dependence in the transmission events was determined from the transmission tree by studying the mean transmission distance. ypma et al. [25] developed a bayesian likelihood-based framework integrating genetic and epidemiological data. this method was tested on an epidemic dataset of 241 poultry farms in an epidemic of avian influenza a (h7n7) in the netherlands in 2003, consisting of geographical, genomic, and date-of-culling data. consensus sequences of the ha, na and polymerase pb2 genes were derived by pooling sequence data from five infected animals for 185 of the 241 farms analyzed. the likelihood of one farm infecting another increased if the former was not culled at the time of infection of the latter, if they were in geographical proximity, or if the sampled pathogen genomic sequences were related. their model included several assumptions, such as non-correlation of genetic distance, time of infection, and geographical distance between host and target farms. the likelihood function was generated as follows: for the temporal component, a farm could infect another if its infection time was before the infection time of the target farm, or if the infection time of the latter was between the infection and culling times of the former. if a farm was already culled, its infectiousness decayed exponentially. for the geographical component, two farms could infect each other with likelihood equal to the inverse of the distance between them; this likelihood varied according to a spatial kernel. for the genomic component, the probabilities of transitions and transversions, and the presence or absence of a deletion, were considered. if there was no missing data, the likelihood function was simply a product of independent geographical, genomic, and temporal components (a schematic sketch of this product form is given below). this method also allowed missing data, by assuming that all the links to a specific missing data type are in one subtree. mcmc [52] was performed to sample all possible transmission trees and parameters. marginalizing over a large number of subtrees over all possible values can also prove computationally expensive. mutations were assumed to be fixed in the population before or after an infection, ignoring a molecular clock. in the method by morelli et al. [24], the authors developed a likelihood-based function that inferred the transmission trees and infection times of the hosts. the authors assumed that a premise or farm can be infected at a certain time, followed by a latency period, a period from infectiousness to detection, and a time of pathogen collection. this method utilized the fmd dataset from the study by cottam et al. in order to simplify the posterior distribution further, latent variables denoting unobserved pathogens were removed and a pseudo-distribution incorporating the genetic distance between the observed and measured consensus sequences was generated. the posterior distribution corresponded to a pseudo-posterior distribution because the pathogens were sampled at observation time and not at infection time. the genetic distance was measured by the hamming distance between sequences in isolation, without considering the entire genetic network. several assumptions were made, including independence of the latency time and infectiousness period. in determining the interval from the end-of-latency period to detection, the informative prior was centered on lesion age, making this inference technique sensitive to veterinary estimates of lesion age.
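a schematic sketch of the product-form likelihood just described follows; the functional forms below are simplified stand-ins chosen for illustration, not the exact kernels of ypma et al. [25]:

```python
# pairwise infection likelihood as a product of temporal, geographical and
# genetic components; all parameter values are hypothetical.
import numpy as np

def pair_likelihood(dt, dist_km, gen_dist, time_since_cull,
                    decay=0.2, mu=0.3):
    if dt <= 0:
        return 0.0                                    # source infected after target
    temporal = np.exp(-decay * time_since_cull)       # infectiousness decays after culling
    spatial = 1.0 / max(dist_km, 1e-6)                # inverse-distance kernel
    genetic = np.exp(-mu * gen_dist)                  # related sequences score higher
    return temporal * spatial * genetic

# hypothetical farm pair: infected 4 days apart, 2.5 km apart, 3 mutations,
# source not yet culled at the target's infection time
print(pair_likelihood(dt=4, dist_km=2.5, gen_dist=3, time_since_cull=0))
```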
morelli et al.'s study considered a single source of viral introduction into the population, which is feasible if the population size considered is small. the technique did not incorporate unobserved sources of infection and assumed that all hosts were sampled. the authors also assumed that each host had the same probability of being infected. teunis et al. [56] developed a bayesian inference framework to infer transmission probability matrices. the authors assumed that the likelihood of infection transmission over all observed individuals would be equal to the product of conditional probability distributions between each pair of individuals i and j, and the corresponding entry from the transition probability matrix representing any possible transmissions from ancestors to i. the inferred matrices could be utilized to identify network metrics such as the number of cases infected by each infected source, and transmission patterns could be detected by analyzing pairwise observed cases during an outbreak (a minimal normalization sketch is given below). the likelihood function could be generated from observed times of onset, genetic distance, and geographical locations. their inferred parameters were the transmission tree and the reproductive number. their method was applied to a norovirus outbreak in a university hospital in the netherlands. in a method developed by ypma et al. [34], the statistical framework for inferring the transmission tree simultaneously generated the phylogenetic tree. this method also utilized the fmd dataset from the study by cottam et al. their approach for generating the joint posterior probability of the transmission tree differed from existing methods in including the simultaneous estimation of the phylogenetic tree and within-host dynamics. the posterior probability distribution defined a sampling space consisting of the transmission tree, epidemiological parameters, and within-host dynamics, which were inferred from the measured epidemiological data, and the phylogenetic tree and mutation parameters, which were inferred from the pathogen genomic data. the posterior probability distribution was estimated using the mcmc technique. the performance of the method was evaluated by measuring the probability assigned to actual transmission events. the assumptions made were that all infected hosts were observed, the time of onset was known, sequences were sampled from a subpopulation of the infected hosts, and a single source/host introduced the infection into the population. in going beyond existing methods, the authors did not assume that events in the phylogenetic tree coincide with actual transmission events. a huge sampling fraction would be necessary to capture such microscale genetic diversity. this method works best when all infected hosts are observed and sampled. mollentze et al. [49] have used multimodal data in the form of genomic, spatial and temporal information to address the problems of unobserved cases, an existing disease well established in a population, and multiple introductions of pathogens. their method estimated the effective size of the infected population, thus being able to provide insight into the number of unobserved cases. the authors modified morelli et al.'s method described above by replacing the spatial kernel with a spatial power transmission kernel, to accommodate a wider variety of transmission. in addition, the substitution model used by morelli et al. was replaced with a kimura three-parameter model [57]. this method was applied to a partially-sampled rabies virus dataset from south africa.
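at their core, the per-case infector probabilities in such matrices come from normalizing pairwise transmission likelihoods over candidate infectors; a minimal sketch with hypothetical values:

```python
# normalize pairwise transmission likelihoods into per-case infector
# probabilities; values are illustrative only.
import numpy as np

# lik[j, i]: likelihood that case j infected case i; case 0 is treated as the
# index case, so its column stays all-zero.
lik = np.array([[0.0, 0.6, 0.2],
                [0.0, 0.0, 0.5],
                [0.0, 0.1, 0.0]])
col_sums = lik.sum(axis=0)
col_sums[col_sums == 0] = 1.0            # avoid dividing the index-case column by zero
p_infector = lik / col_sums              # each remaining column now sums to 1
print(np.round(p_infector, 2))
```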
in mollentze et al.'s application, the separate transmission trees from partially-observed data could be grouped into separate clusters, with most transmissions in the under-sampled dataset being indirect transmissions. reconstructions were sensitive to the choice of priors for the incubation and infectious periods. in a more recent approach to studying outbreaks and possible transmission routes, jombart et al. [35], in addition to reconstructing the transmission tree, addressed important issues such as inferring possible infection dates, secondary infections, mutation rates, multiple pathways of pathogen introduction, foreign imports, unobserved cases, the proportion of infected hosts sampled, and superspreading in a bayesian framework. jombart et al. tested their algorithm, outbreaker, on the 2003 sars outbreak in singapore using 13 known cases of primary and secondary infection [35,45,58]. in this study, 13 genome sequences of severe acute respiratory syndrome (sars) were downloaded from genbank and analyzed. their method relies on pathogen genetic sequences and collection dates. similar to their previous approach [50], their method assumed mutations to be parameters of transmission events. the epidemiological pseudo-likelihood was based on collection dates. the genomic pseudo-likelihood was computed based on genetic distances between isolates. this method would benefit from known transmission pathways and mutation rates, and is specifically suitable for densely sampled outbreaks. their method assumed that the generation time (the time from primary to secondary infections) and the time from infection to collection were available. their method ignored within-host diversity of pathogens. instead of a strict molecular clock, this method used a generational clock. didelot et al. [26] developed a framework to examine whether whole-genome sequences were enough to capture transmission events. unlike other existing studies, the authors took into account within-host evolution and did not assume that branches in phylogenetic trees correspond to actual transmission events. the generation time corresponds to the time between a host being infected and infecting others. for pathogens with short generation times, genetic diversity may not accrue to a very high degree and one can ignore within-host diversity. however, for diseases with long latency times, and ones in which the host remains asymptomatic, there is scope for the accumulation of considerable within-host genetic diversity. their method used a timed phylogenetic tree, from which a transmission tree is inferred on its own or combined with any available epidemiological support. their simulations revealed that considering within-host pathogen generation intervals resulted in more realistic phylogenies between infector and infected. the method was tested on simulated datasets and on a real-world tuberculosis dataset with a known outbreak source, first with only genomic data and then modified using the available epidemiological data. the latter modified network resembled the actual transmission activity more closely, having a web-like layout and fewer bidirectional links. their approach would work well for densely sampled outbreaks. some of the most common parameters inferred for infectious disease transmission in these bayesian approaches are the transmission tree between infected individuals or animals, the mutation rates of different pathogens, the phylogenetic tree, within-host diversity, the latency period, and infection dates [24,26,34,40].
additional parameters in recent work are the reproductive number [26], foreign imports, superspreaders, and the proportion of infected hosts sampled [35]. several simplifying assumptions have been made in the reviewed bayesian studies, limiting their applicability across different epidemic situations. in cottam's [40] approach, the phylogenetic trees generated from the genomic data are weighed by epidemiological factors to limit the analysis to possible transmission trees. however, sequential approaches may not be ideal for reconstructing transmission trees, and a method that combines all modalities in a single likelihood function may be necessary. ypma et al. [25] assumed that pathogen mutations emerge in the host population immediately before or following infections. moreover, the approach weighed each data type via its likelihood function and considered each data type independent of the others, which may not be a realistic assumption. jombart et al. [38] also inferred ancestral relationships to the most closely sampled ancestor, as all ancestors may not be sampled. morelli et al. [24] assumed flat priors for all model parameters. however, the method was estimated with the prior for the duration from latency to infection centered on the lesion age, making the method sensitive to it and to veterinary assessment of infection age. the method developed by mollentze et al. [49] required epidemiological knowledge of the infection and incubation periods. identifying parents of infected nodes, as proposed by teunis et al. [56], assumes that all infectious cases were observed, which may not be true in realistic, partially-observed outbreaks. didelot et al. [26] developed a framework based on a timed phylogenetic tree, which infers within-host evolutionary dynamics assuming a constant population size and densely-sampled outbreaks. several of these approaches rely on assumptions of densely-sampled outbreaks, a single pathogen introduction into the population, single infected index cases, samples capturing the entire outbreak, all cases comprising the outbreak being observed, the existence of single pathogen strains, and all nodes in the transmission network having constant infectiousness and the same rate of transmission. however, in real situations the nodes will have different infectiousness and rates of spreading from animal to animal, or human to human. moreover, the use of clinical data only is nonrepresentative of how infection transmits through a population, as it generally captures only the most severely affected cases. our literature review is summarized in table 1. as large-scale and detailed genomic data become more available, analyses of the existing bayesian inference methods described in our review will inform their integration into epidemiological and other biomedical research. as larger quantities of diverse data become available, bayesian inference frameworks will be the favored tool for integrating information and drawing inferences about transmission and epidemic parameters simultaneously. the specific focus in this review on the application of network inference to infectious disease transmission enables us to consider and comment on common parameters, data types and assumptions (summarized in table 1). novel data sources have increased the resolution of information as well as enabled closer monitoring and study of interactions; the spatial and genomic resolution of the bayesian network-inference studies reviewed is summarized in fig. 4 to illustrate the scope of current methods.
further, we have added suggestions for addressing the identified challenges in these methods, regarding their common assumptions and parameters, in table 2. given the increasing number and types of biomedical data available, we also discuss how models can be augmented to harness added value from these multiple and higher-granularity modalities, such as minor variant identification from deep sequencing data or community-generated epidemiological data. existing methods are based on pathogen genome sequences which may largely be consensus in nature, where the nucleotide or amino acid residue at any given site is the most common residue found at each position of the sequence. other recent approaches have reconstructed epidemic transmission using whole-genome sequencing. detailed viral genomic sequence data can help distinguish pathogen variants and thus augment the analysis of transmission pathways and host-infectee relationships in the population. highly parallel sequencing technology is now available to study rna and dna genomes at greater depth than was previously possible. using advanced deep sequencing methods, minor variations that describe transmission events can be captured, and they must also then be represented in models [59,60]. models can also be encumbered with considerable selection bias by being based on clinical or veterinary data representative of a subsample of only the most severely infected hosts who access clinics. existing multi-modal frameworks are designed based on clinical data such as sequences collected from cases of influenza [35,38] or veterinary assessment of fmd [24,53], which generally represent the most severe cases with access to traditional healthcare institutions and automatically inherit considerable selection bias. models to date do not consider participatory surveillance data that has become increasingly available via mobile and internet accessibility (e.g., data from web logs, search queries, web survey-based participatory efforts such as goviral, with linked symptomatic, immunization, and molecular information [61], and online social networks and social network questionnaires). another approach to improving the granularity of collected data could be community-generated data. these data can be fine-grained and can capture information on a wide range of cases, from asymptomatic to mildly infectious to severe. these data can be utilized to incorporate additional transmission parameters of a community, which can be more representative of disease transmission. as exemplified in fig. 4a, community-generated data can be collected at the fine-grained spatial level of households, schools, workplaces, or zip codes, and models must then also accommodate these spatial resolutions. studies to date have also generally depended on the small sample sizes available, and some are specifically tailored to a specific disease or pathogen such as sars, avian influenza, or fmd [34,35,40]. [fig. 4 caption fragment: hiseq platform with the m. tuberculosis cdc1551 reference sequence, aligned using the burrows-wheeler aligner algorithm; sars dna sequences were obtained from genbank and aligned using muscle; for avian influenza, rna consensus sequences of the haemagglutinin, neuraminidase and polymerase pb2 genes were sequenced; for h1n1 influenza, isolates were typed for hemagglutinin (ha) and neuraminidase (na) genes.] methods will have to handle missing data and unobserved and unsampled hosts to be applicable to realistic scenarios.
in simpler cases, assumptions of single introductions of infection, with single strains being passed between hosts, may be adequate. however, robust frameworks will have to consider multiple introductions of pathogens into the host population, with multiple circulating strains and co-infections in hosts. in order to be truly useful, frameworks have to address questions regarding rapid mutations of certain pathogens, phylogenetic uncertainty, recombination and reassortment, population stochasticity, superspreading, exported cases, multiple introductions of pathogens into a population, within- and between-host pathogen evolution, and phenotypic information. methods will also need to scale up to advances in next-generation sequencing technology capable of producing large amounts of genomic data inexpensively [62,63]. in the study of infectious diseases, the challenge remains to develop robust statistical frameworks that take into account the relationship between epidemiological data and phylogeny and utilize it to infer pathogen transmission, while accounting for realistic evolutionary times and the accumulation of within-host diversity. moreover, to benefit public health, inference methods need to uncover generic transmission patterns, a wider range of infections and risks (including asymptomatic to mildly infectious cases), clusters and specific environments, and host types. network inference frameworks from the study of infectious diseases can be analogously modified to incorporate diverse forms of multimodal data and to model information propagation and interactions in diverse applications such as drug-target pairs, neuronal connectivity, or social network analysis. the detailed examination of models, data sources and parameters performed here can inform inference methods in different fields, and bring to light the ways that new data sources can augment these approaches. in general, this will enable the understanding and interpretation of influence and information propagation by mapping relationships between nodes in other applications.
references:
review of multimodal integration methods for transmission network inference
a comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks
sparse and compositionally robust inference of microbial ecological networks
model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals
dialogue on reverse-engineering assessment and methods
molecular ecological network analyses
marine bacterial, archaeal and protistan association networks reveal ecological linkages
network modelling methods for fmri
modeling the worldwide spread of pandemic influenza: baseline case and containment interventions
a 'small-world-like' model for comparing interventions aimed at preventing and controlling influenza pandemics
reducing the impact of the next influenza pandemic using household-based public health interventions
estimating the impact of school closure on influenza transmission from sentinel data
a bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes
mvda: a multi-view genomic data integration methodology
information content and analysis methods for multi-modal high-throughput biomedical data
a novel computational framework for simultaneous integration of multiple types of genomic data to identify microrna-gene regulatory modules
a kernel-based integration of genome-wide data for clinical decision support
predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks
a review of multivariate methods for multimodal fusion of brain imaging data
bayesian inference of phylogeny and its impact on evolutionary biology
mrbayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space
a bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data
unravelling transmission trees of infectious diseases by combining genetic and epidemiological data
bayesian inference of infectious disease transmission from whole-genome sequence data
methods of integrating data to uncover genotype-phenotype interactions
why we need crowdsourced data in infectious disease surveillance
whole-genome sequencing and social-network analysis of a tuberculosis outbreak
novel clinico-genome network modeling for revolutionizing genotype-phenotype-based personalized cancer care
integrative, multimodal analysis of glioblastoma using tcga molecular data, pathology images, and clinical outcomes
an informatics research agenda to support precision medicine: seven key areas
the foundation of precision medicine: integration of electronic health records with genomics through basic, clinical, and translational research
relating phylogenetic trees to transmission trees of infectious disease outbreaks
bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data
extracting transmission networks from phylogeographic data for epidemic and endemic diseases: ebola virus in sierra leone, h1n1 pandemic influenza and polio in nigeria
the role of pathogen genomics in assessing disease transmission
reconstructing disease outbreaks from genetic data: a graph approach
molecular epidemiology: application of contemporary techniques to the typing of microorganisms
integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus
the distribution of pairwise genetic distances: a tool for investigating disease transmission
the mathematics of infectious diseases
dynamics and control of diseases in networks with community structure
outbreaktools: a new platform for disease outbreak analysis using the r software
mutational dynamics of the sars coronavirus in cell culture and human populations isolated in 2003
phylogenetic analysis using parsimony (and other methods), version 4, sinauer associates
molecular evolution and phylogenetics
adegenet: a r package for the multivariate analysis of genetic markers
a bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data
bayesian inference in ecology
an introduction to mcmc for machine learning
molecular epidemiology of the foot-and-mouth disease virus outbreak in the united kingdom in 2001
tcs: a computer program to estimate gene genealogies
a cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and dna sequence data. iii. cladogram estimation
infectious disease transmission as a forensic problem: who infected whom?
estimation of evolutionary distances between homologous nucleotide sequences
comparative full-length genome sequence analysis of 14 sars coronavirus isolates and common mutations associated with putative origins of infection
extensive geographical mixing of 2009 human h1n1 influenza a virus in a single university community
quantifying influenza virus diversity and transmission in humans
surveillance of acute respiratory infections using community-submitted symptoms and specimens for molecular diagnostic testing
eight challenges in phylodynamic inference
sequencing technologies: the next generation
the authors declare no conflict of interest.
key: cord-319055-r16dd0vj title: development of an acoustic system for uav detection † date: 2020-08-28 journal: sensors (basel) doi: 10.3390/s20174870 sha: doc_id: 319055 cord_uid: r16dd0vj
the purpose of this paper is to investigate the possibility of developing and using an intelligent, flexible, and reliable acoustic system, designed to discover, locate, and transmit the position of unmanned aerial vehicles (uavs). such an application is very useful for monitoring sensitive areas and land territories subject to privacy. the software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (conns). an analysis of the detection and tracking performance for remotely piloted aircraft systems (rpas), measured with a dedicated spiral microphone array with mems microphones, was also performed. the detection and tracking algorithms were implemented based on spectrogram decomposition and adaptive filters. in this research, spectrograms with cohen class decomposition, log-mel spectrograms, harmonic-percussive source separation and raw audio waveforms of the audio samples collected from the spiral microphone array were used as inputs to the concurrent neural networks, in order to determine and classify the number of detected drones in the perimeter of interest. in recent years, the use of small drones has increased dramatically.
illegal activity with these uavs has also increased, or at least become more evident than before. recently, it has been reported that such vehicles have been employed to transport drugs across borders, to smuggle goods into prisons, to breach the security perimeters of airports and to create aerial images of sensitive facilities. to help protect against these activities, a drone detection product could warn of a security breach in time to take action. this article tries to answer the following questions: (1) is it possible to build an audio detection, recognition, and classification system able to detect the presence of several drones in the environment, using relatively cheap commercial off-the-shelf (cots) equipment? (2) assuming that it can function as a prototype, what challenges could be raised when scaling the prototype for practical use? the questions will be approached in the context of a comparison between the performance of systems using concurrent neural networks and the algorithm proposed by the authors. the proposed solution employs, for the acoustic drone detector, competing neural networks with spectrogram variants on both frequency and psychoacoustic scales, and increased performance for the neural network architectures. two concepts are investigated in this work: (i) the way that a concept of competition in a collection of neural networks can be implemented, and (ii) how different input data can influence the performance of the recognition process in some types of neural networks. the subject of this article is in the pattern recognition domain, which offers a very broad field of research. recognition of acoustic signatures is a challenging task, grouping a variety of issues, which include the recognition of isolated characteristic frequencies and the identification of unmanned aerial vehicles based on their acoustic signatures. neural networks represent a tool that has proven its effectiveness in solving a wide range of applications, including automated speech recognition. most neural models approach pattern recognition as a unitary, global problem, without distinguishing between different inputs. it is a known fact that the performance of neural networks may be improved via modularity and by applying the "divide et impera" principle. in this paper, the identification and classification of uavs is performed by means of two neural networks: the self-organizing map (som) and the concurrent neural network (conn). the newly introduced conn model combines supervised and unsupervised learning paradigms and provides a solution to the first problem. a process of competition is then employed in a collection of neural networks that are independently trained to solve different sub-problems. this process is accomplished by identifying the neural network which provides the best response (sketched below). as the experimental results demonstrate, higher accuracy may be obtained when employing this proposed algorithm, compared to the non-competitive cases. several original recognition models have been tested, and the theoretical developments and experimental results demonstrate their viability. the obtained databases are diverse, comprising both standard collections of soundings for different types of uavs and sets made specifically for the experiments in this paper, containing acoustic signatures of proprietary drones.
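a minimal sketch of this best-response selection (the winner-takes-all rule detailed below) follows; the module internals are stubbed out with a distance-to-prototype score, standing in for a trained mlp, tdnn, or som:

```python
# winner-takes-all decision over a collection of per-class modules: the module
# with the strongest response labels the input.
import numpy as np

class Module:
    def __init__(self, label, prototype):
        self.label = label
        self.prototype = prototype            # stand-in for a trained network

    def score(self, x):
        # higher score = stronger response; here, negative distance to prototype
        return -np.linalg.norm(x - self.prototype)

def conn_classify(modules, x):
    winner = max(modules, key=lambda m: m.score(x))   # winner-takes-all rule
    return winner.label

modules = [Module("drone_a", np.array([1.0, 0.2])),
           Module("drone_b", np.array([0.1, 1.1])),
           Module("background", np.array([0.0, 0.0]))]
print(conn_classify(modules, np.array([0.9, 0.3])))   # -> "drone_a"
```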
based on the tests performed on some models and standard pattern recognition data sets, it can be shown that these may also be used in contexts other than the recognition of acoustic signals generated by drones. in order to reduce the complexity of recognizing the entire collection of isolated acoustic frequencies of all drones through a single neural network, a modular neural network solution was chosen, consisting of neural networks specialized on subproblems of the initial problem. concurrent neural network classification has been introduced as a collection of low-volume neural networks working in parallel, where the classification is made according to the winner-takes-all rule sketched above. the training of competing neural networks starts from the assumption that each module is trained with its own data set. the system is made up of neural networks with various architectures. multi-layered perceptron, time-delayed and self-organizing neural network types have been used for this particular case, but other variants may also be employed. the recognition scheme consists of a collection of modules, each trained on a subproblem, and a module that selects the best answer. the training and recognition algorithms implement these two techniques, customized for the multilayer perceptron (mlp), time-delayed neural networks (tdnn) and self-organizing maps (som). mlp-conn and tdnn-conn use supervised trained modules, and their training sets contain both positive and negative examples. in contrast, som-conn consists of modules that are trained by an unsupervised algorithm, and the data consist only of positive examples. the remainder of this article is organized as follows: section 2 presents a selective study of similar scientific works (related work); section 3, the problem definition and the proposed solution; section 4, employing conns in the uav recognition process; section 5, our experimental results; and section 6, the discussion and conclusions. physical threats that may arise from unauthorized flying of uavs over forbidden zones are analyzed by other researchers [7], along with a review of various uav detection techniques based on ambient radio frequency signals (emitted from drones), radars, acoustic sensors, and computer vision techniques for the detection of malicious uavs. in a similar work [8], the detection and tracking of multiple uavs flying at low altitude is performed with the help of a heterogeneous sensor network consisting of acoustic antennas, small frequency modulated continuous wave (fmcw) radar systems and optical sensors. the researchers applied acoustics, radar and lidar to monitor a wide azimuthal area (360°) and to simultaneously track multiple uavs, and optical sensors for sequential identification with a very narrow field of view. in [9] the team presents an experimental system dedicated to the detection and tracking of small aerial targets, such as unmanned aerial vehicles (uavs), in particular small drones (multi-rotors). a system for acoustic detection and tracking of small moving objects, such as uavs or terrestrial robots, using acoustic cameras is introduced in [10].
in their work, the authors deal with the problem of tracking drones in outdoor scenes scanned by a lidar sensor placed at ground level. for detecting uavs, the researchers employ a convolutional neural network approach. afterwards, kalman filtering algorithms are used together with cross-correlation filtering, and then a 3d model is built for determining the velocity of the tracked object. other technologies involved in countering unauthorized flying of drones over restricted areas include passive bistatic radar (pbr) employing a multichannel system [11]. in what concerns the usage of deep neural networks in this field of activity, aker and kalkan [12] present a solution using an end-to-end object detection model based on convolutional neural networks employed for drone detection. the authors' solution is based on a single-shot object detection model, yolov2 [13], which is the follow-up study of yolo. for a better selection of uavs from the background, the model is trained to separate these flying objects from birds. in the conclusion section, the authors state that by using this method drones can be detected and distinguished from birds using an object detection model based on a cnn. further on, liu et al. [14] employ an even more complex system for drone detection, composed of a modular camera array system with audio assistance, which consists of several high-definition cameras and multiple microphones, with the purpose of monitoring uavs. in the same area of technologies, popovic et al. employ a multi-camera sensor design acquiring the near-infrared (nir) spectrum for detecting mini-uavs in a typical rural environment. they notice that the detection process needs detailed pixel analysis between two consecutive frames [15]. similarly, anwar et al. perform drone detection by extracting the required features from adr sound, using mel frequency cepstral coefficients (mfcc) and implementing linear predictive cepstral coefficients (lpcc). classification is performed after the feature extraction, and support vector machines (svm) with various kernels are also used for improving the classification of the received sound waves [16]. additionally, the authors state that ". . . the experimental results verify that svm cubic kernel with mfcc outperform lpcc method by achieving around 96.7% accuracy for adr detection". moreover, the results verified that the proposed ml scheme has more than 17% higher detection accuracy compared with a correlation-based drone sound detection scheme that ignores ml prediction. a study on cheap radiofrequency techniques for detecting drones is presented by nguyen et al. [17], where they focus on the autonomous detection and characterization of unauthorized drones via radio frequency wireless signals, using two combined methods: sending a radiofrequency signal and analyzing its reflection, and passively listening to radio signals, a process subjected to a second filtration analysis. an even more complex solution for drone detection using radio waves is presented by nuss et al. in [18], where the authors employ a system setup based on mimo ofdm radar that can be used for detection and tracking of uavs over wider areas. keeping the research in the same field, the authors of [19] present an overview of passive drone detection with a software defined radio (sdr), using two scenarios. the authors state that "operation of a non-los environment can pose a serious challenge for both passive methods". it has been shown that the drone flight altitude may play a significant role in determining the rician factor and los probability, which in turn affects the received snr. several other approaches are presented in similar works [20,21]. in what concerns acoustic signature recognition, the scientific literature is comparatively rich. bernadini et al. obtained an accuracy of 98.3% for drone recognition [22]. yang et al.
also propose a uav detection system with multiple acoustic nodes using machine learning models, with an empirically optimized configuration of the nodes for deployment. features including mel-frequency cepstral coefficients (mfcc) and the short-time fourier transform (stft) were used by these researchers for training. support vector machines (svm) and convolutional neural networks (cnn) were trained with data collected in person. the purpose was to determine the ability of this setup to track the trajectories of flying drones [23]. in noisy environments, the sound signature of uavs is more difficult to recognize. moreover, different environments have specific background sounds. lin shi et al. deal with this challenge and present an approach to recognize drones via the sounds emitted by their propellers. in their paper, the authors state that experimental results validate the feasibility and effectiveness of their proposed method for uav detection based on sound signature recognition [24]. similar work is described in papers [25] and [26]. finally, it can be concluded that this research field is very active and there are several issues that have not yet been fully addressed, such as the separation of the uav from the environment (birds, obstructing trees, background mountains, etc.), an issue that depends very much on the technology chosen for drone detection. however, one approach has proven its reliability: the usage of multisensory constructions, where the weaknesses of some technologies can be compensated by others. therefore, we consider that employing a multisensory approach has more chances of success than using a single technology. classification of environmental sound events is a sub-field of computational analysis of auditory scenes, which focuses on the development of intelligent detection, recognition, and classification systems. detecting the acoustic fingerprints of drones is a difficult task because the specific acoustic signals are masked by the noises of the detection environment (wind, rain, waves, sound propagation in open field/urban areas). unlike naturally occurring sounds, drones have distinctive sound characteristics. taking advantage of this aspect, the first part of the article focuses on building an audio detection, recognition, and classification system for the simultaneous detection of several drones in the scene. as presented in the initial part of this work, the main task of the proposed system is to detect unauthorized flying of uavs over restricted areas, by locating these vehicles and tracking them. the difficulty of the process resides in the environmental noise and the visibility at the moment of detection. different types of microphones and a specific arrangement are used for improving the performance of the acoustic detection component. thus, the system employed for the detection, recognition and automatic classification of drones using the acoustic fingerprint is composed of a hardware-software assembly, as shown in figures 1 and 2. the first functional component to be addressed is the sensing block, composed of an array of mems microphones with a spiral arrangement in the acoustic field, shown in figure 2. the microphone array is composed of 30 spiral-arranged mems digital microphones, so as to achieve adaptive multi-channel weights with variable pitch.
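as an illustration of how such per-microphone weights steer an array, the following delay-and-sum beamformer sketch scans for the direction of a tone; a uniform linear array and a single 1 khz source are assumed for brevity, rather than the 30-element spiral geometry of the real system:

```python
# delay-and-sum beamforming: per-channel phase weights steer the array; output
# power peaks when the steering angle matches the arrival direction.
import numpy as np

c = 343.0                                  # speed of sound, m/s
f = 1000.0                                 # assumed tone frequency, hz
n_mics, spacing = 8, 0.04                  # hypothetical array geometry, m
rng = np.random.default_rng(0)

def steering(theta_deg):
    delays = spacing * np.arange(n_mics) * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * f * delays)

# simulate a tone arriving from 30 degrees, plus sensor noise
t = np.arange(1024) / 48000.0
signal = np.exp(2j * np.pi * f * t)
X = np.outer(steering(30.0), signal) + 0.1 * rng.normal(size=(n_mics, t.size))

angles = np.linspace(-90, 90, 181)
power = [np.mean(np.abs(steering(a).conj() @ X) ** 2) for a in angles]
print(f"estimated bearing: {angles[int(np.argmax(power))]:.0f} degrees")
```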
the following components have been employed for the microphone array: knowles (knowles electronics, llc, itasca, il, usa) mems microphones with good acoustic response (typically rated 20 hz to >20 khz, +/− 2 db). the system allows the detection of the presence of acoustic signals of reduced complexity. for improving the quality of the received signal, adaptive methods to cancel the acoustic feedback, as well as adaptive methods to reduce the acoustic noise, were also used. for the protection of the restricted area, the developed acoustic system was configured as a network composed of at least eight microphone array modules, arranged on the perimeter of the protected area. to increase the detection efficiency, the number of microphone arrays may also be increased, and the network of acoustic sensors can be configured both linearly and in depth, thus forming a safety zone around the protected area. acoustic measurements highlight the presence of a tonal component at frequencies of 200-5000 hz (small and medium multicopter drones) and in the frequency range 200-10,000 hz (medium and large multicopter drones), which is the typical sound emission of a uav in the flight phase of operation. for medium and large multicopter drones, harmonics of the characteristic frequencies are also found above 10 khz (16-24 khz). the identification of these frequencies is a sign of the presence of a uav in the environment. figure 3 presents the spiral microphone array simulation along with the beamforming analysis using multiple signal classification and direction of arrival (music doa). doa denotes the direction from which a propagating wave arrives at a point where a set of sensors is placed. the image in the right section shows the energetic detection of the acoustic signal generated by the drone's engines and rotors, detecting the location (azimuth and elevation) for the two acoustic frequencies characteristic of drones (white color), represented on the frequency spectrum (bottom right). using the application in figure 3, we tested the beamforming capabilities of the system, as well as its directivity, using the spiral microphone array. in this simulation, the atmospheric conditions (turbulence) that may affect the propagation of sound were not taken into account. employing a set of multiple microphones with beamforming, together with signal-processing-based filtering for better signal reception, increased the maximum detection distance in the presented mode. the process that is common to all acoustic signal recognition systems is the extraction of characteristic vectors from uniformly distributed time segments of the sampled sound signal. prior to the extraction of these features, the uav-generated signal must undergo the following processes: (a) filtering: the detector's input sound needs filtering to get rid of unwanted frequencies. on the other hand, the filter must not affect the reflection coefficients. in the experiments, an adaptive iir notch filter has been used. (b) segmentation: the acoustic signal is non-stationary over long observation times, but quasi-stationary over short time periods, i.e., 10-30 ms; therefore the acoustic signal is divided into fixed-length segments, called frames. for this particular case, the size of a frame is 20 ms, with a generation period of 10 ms, so that a 10 ms overlap occurs from one window to the next.
(c) attenuation: each frame is multiplied by a window function, usually hamming, to mitigate the edge effects introduced at the frame boundaries by segmentation. (d) mel frequency cepstrum coefficient (mfcc) parameters: to recognize an acoustic pattern generated by the uav, it is important to extract specific features from each frame. many such features have been investigated, such as linear prediction coefficients (lpcs), which are derived directly from the speech production process, as well as perceptual linear prediction (plp) coefficients, which are based on the auditory system. however, in the last two decades, spectrum-based features have become popular, especially because they come directly from the fourier transform. the spectrum-based mel frequency cepstrum coefficients are employed in this research; their success is due to a filter bank that processes the fourier spectrum on a perceptual (mel) scale similar to that of the human auditory system. these coefficients are also robust to noise and flexible, due to the cepstrum processing. with the help of the mfcc coefficients specific to the uav-generated sound, recognition dictionaries for the training of the neural networks are then shaped. (e) feature extraction for mfcc. the extraction algorithm for the mfcc parameters is shown in figure 4. the calculation steps are the following: • an fft is performed for each frame of the utterance and half of it is removed. the spectrum of each frame is warped onto the mel scale, and thus mel spectral coefficients are obtained. • a discrete cosine transform is performed on the mel spectral coefficients of each frame, hence obtaining the mfcc. • the first two coefficients of the obtained mfcc are removed, as they varied significantly between different utterances of the same word. liftering is done by replacing all mfcc except the first 14 by zero. the first mfcc coefficient of each frame is replaced by the log energy of the corresponding frame. delta and acceleration coefficients are computed from the mfcc to increase the dimension of the feature vector of the frames, thereby increasing the accuracy. • delta cepstral coefficients add dynamic information to the static cepstral features. for a short-time sequence c[n], the delta-cepstral features are typically defined as d[n] = c[n + m] − c[n − m], where n is the index of the analysis frame and in practice m is approximately 2 or 3. coefficients describing acceleration are found by replacing the mfcc in the above equation by the delta coefficients. • the feature vector is normalized by subtracting the mean from each element. • thus, each mfcc acoustic frame is transformed into a characteristic vector of size 35 and used to build learning dictionaries for feature training of the concurrent neural networks (feature matching). (f) linear prediction. the role of the adaptive filter is to best approximate the value of a signal at a given moment, based on a finite number of previous values. the linear prediction method allows very good estimates of signal parameters, as well as the possibility of obtaining relatively high computing speeds. predictor analysis is based on the fact that a sample can be approximated as a linear combination of the previous samples. by minimizing the sum of squared differences over a finite interval, between real signal samples and those obtained by linear prediction, a single set of coefficients, called prediction coefficients, can be determined.
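as an illustrative sketch of this feature extraction (not the authors' exact pipeline; librosa and the parameter values shown are assumptions, and the liftering/log-energy substitutions described above are not replicated), the mfcc, delta, and acceleration features could be computed as follows:

```python
import numpy as np
import librosa

y, sr = librosa.load("drone_recording.wav", sr=44_000)  # hypothetical file

# 20 ms frames with a 10 ms hop, matching the segmentation step above
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14,
                            n_fft=int(0.02 * sr), hop_length=int(0.01 * sr))

delta = librosa.feature.delta(mfcc, order=1)   # dynamic (delta) coefficients
accel = librosa.feature.delta(mfcc, order=2)   # acceleration coefficients

# stack static + dynamic features, then subtract the mean from each element
features = np.vstack([mfcc, delta, accel])
features -= features.mean(axis=1, keepdims=True)
print(features.shape)  # (42, n_frames); the paper uses size-35 vectors
```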
the estimation of the model parameters according to this principle leads to a set of linear equations, which can be solved efficiently to obtain the prediction coefficients. equations (2) and (3) are considered: h(z) = g / a(z) (2), where h(z) is the z transfer function of the linear model of the acoustic environment feedback, and a(z) = 1 − ∑_{k=1}^{p} α_k z^{−k} (3) is the z transfer function model of the reverberations and multipath reflections of the environment. it is possible to establish a connection between the constant gain factor g, the excitation signal, and the prediction error. when the coefficients of the real predictor and of the model are identical (a_k = α_k), e(n) = g·s(n). this means that the input signal is proportional to the error signal. practically, it is assumed that the energy of the error signal is equal to that of the input signal: ∑_n e²(n) = g² ∑_n s²(n). it should be noted, however, that for the uav-specific audio signal, if s(n) = δ(n), the order p of the predictor must be large enough to account for all the effects, including the occurrence of transient waves. in the case of sounds without a specific uav source, the signal s(n) is assumed to be white gaussian noise with unit variance and zero mean. (g) time-frequency analysis. the analysis of the acoustic signals can be performed by one-dimensional or two-dimensional methods. one-dimensional methods analyze the signal only in the time domain or only in the frequency domain and generally have a low degree of complexity. although they offer, in many cases, a quick first evaluation and analysis of the signals, in many situations, especially when analyzing the transients that appear in the acoustic signals generated by drones, the information obtained about their shape and parameters is limited and has a low degree of approximation. the second category of methods, the two-dimensional representations in the time-frequency domain, are powerful signal analysis tools, and it is therefore advisable to use them, if the situation allows, as a pre-processing of the signals in order to identify transient waves. these representations have the advantage of emphasizing certain "hidden" properties of the signals. from the point of view of acoustic systems for detecting and analyzing the sound signals generated by drones, it is of interest to analyze the signals at the lowest level, close to the noise floor of the device. therefore, time-frequency analyses should be performed on signals affected by noise, the signal-to-noise ratio being of particular importance in assessing transient waves. a comparison is shown below in table 1. table 1 compares the properties satisfied by several time-frequency representations in cohen's class. the cohen class method involves the selection of the kernel function closest to the fundamental waveform that describes the acoustic signatures specific to drones. thus, the shape of the kernel must be chosen based on the peak values (localization) and the amplitude of a "control" function. the frequency resolution corresponding to the spectrum analysis, which varies over time, is equal to the nyquist frequency divided by 2^n (n = 8). the resolution in the time domain is 2^n ms (n = 4), as required by the applied method. the class of time-frequency representations, in its most general form, has been described by cohen: c(t, ω) = (1/4π²) ∭ s(u + τ/2) s*(u − τ/2) φ(θ, τ) e^{−j(θt + τω − θu)} du dτ dθ, where φ is an arbitrary function called the kernel function.
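as a brief sketch of the linear prediction step (an illustration under the assumption that a standard autocorrelation-method lpc is adequate; librosa's lpc routine and the order value are conveniences, not the authors' choices), the prediction coefficients and the prediction-error signal e(n) can be obtained as follows:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

frame = np.random.randn(880)        # stand-in for one 20 ms frame at 44 khz

p = 12                               # predictor order (illustrative value)
a = librosa.lpc(frame, order=p)      # a = [1, -alpha_1, ..., -alpha_p]

# e(n) = A(z) applied to the signal: the prediction-error (residual) signal
e = lfilter(a, [1.0], frame)

# for a well-fitted model, the residual energy is much lower than the
# signal energy (not the case for this random stand-in frame)
print(np.sum(e**2) / np.sum(frame**2))
```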
after the choice of this function, several specific cases are obtained, corresponding to certain distributions c(t, ω). the time-frequency representations in cohen's class must fulfill certain properties, and compliance with these properties is materialized by imposing certain conditions on the kernel function. the first two properties relate to time and frequency shifts (compatibility with the filtering and modulation operations): p1: if y(t) = x(t − t₀), then c_y(t, ω) = c_x(t − t₀, ω); p2: if y(t) = x(t)e^{jω₀t}, then c_y(t, ω) = c_x(t, ω − ω₀). for these conditions to be met, it may be observed that the kernel function φ must be independent of t and ω, i.e., φ = φ(θ, τ). two other properties that must characterize time-frequency representations refer to the conservation of the marginal laws: ∫ c(t, ω) dω = |s(t)|² and ∫ c(t, ω) dt = |ŝ(ω)|². the restrictions corresponding to these properties that the kernel function must fulfill are φ(θ, 0) = 1 and φ(0, τ) = 1. for time-frequency representations to be real, the condition c(t, ω) = c*(t, ω) must be met, and this happens only if φ(θ, τ) = φ*(−θ, −τ). the most representative time-frequency distributions in cohen's class are presented in table 2. according to tables 1 and 2, it is easy to note that the wigner-ville transform satisfies the highest number of properties, which justifies the special attention given to it hereafter. the wigner-ville distribution. the wigner-ville cross-distribution of two signals is defined by w_{x,y}(t, ω) = ∫ x(t + τ/2) y*(t − τ/2) e^{−jωτ} dτ, and the wigner-ville self-distribution of a signal is given by w_x(t, ω) = ∫ x(t + τ/2) x*(t − τ/2) e^{−jωτ} dτ. the wigner-ville distribution can be regarded as a short-time fourier transform in which the window continuously adapts to the signal, because this window is nothing but the signal itself, reversed in time. the wigner-ville transform is thus obtained as a result of the following operations: (a) at any moment t, multiply the signal with its conjugate "mirror image" relative to the moment of evaluation, r(t, τ) = x(t + τ/2) x*(t − τ/2); (b) calculate the fourier transform of the result of this multiplication with respect to the lag variable τ. one of the properties of this time-frequency representation is that it can also be defined starting from the spectral functions; it is thus obtained: w_x(t, ω) = (1/2π) ∫ x̂(ω + θ/2) x̂*(ω − θ/2) e^{jθt} dθ, where x̂ denotes the fourier transform of x. using the application presented in figure 5, the spectrograms of the sounds produced by uavs are obtained, and the results are used to build the neural network training files. for training, 30 files with wigner-ville spectrograms were produced, each file containing 200 spectrogram images of 128 × 128 pixels; in total, some 6000 training spectrograms were employed for the neural network. the quadratic representations presented here, which are part of the broader category described by cohen's class, provide excellent time-frequency analysis properties for acoustic signals. following the experiments carried out, some important aspects can be emphasized regarding the analysis of the acoustic signals generated by drones using the wigner-ville time-frequency distributions of cohen's class, namely: the energy structure of the analyzed signals can be identified and located with good accuracy in the time-frequency plane. when the type, duration, frequency, and temporal arrangement of the signals are not known a priori, they can be estimated using time-frequency distributions. the possibility of implementing these analysis algorithms in systems for analyzing the transient acoustic signals generated by drones thus becomes available. useful databases can be created to identify the transient acoustic signals generated by the drones detected in the environment, as their "signature" can be individualized using the wigner-ville time-frequency representations.
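a minimal numpy sketch of the discrete wigner-ville computation is given below (an illustration, not the authors' application from figure 5; the discrete form x(t+τ)x*(t−τ) is used, so the frequency axis is scaled by a factor of two relative to the continuous definition):

```python
import numpy as np
from scipy.signal import hilbert

def wigner_ville(x):
    """discrete wigner-ville distribution of an analytic signal x.
    returns a (len(x), len(x)) array: rows = time, columns = frequency."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    wvd = np.zeros((n, n))
    for t in range(n):
        # largest lag keeping both t+tau and t-tau inside the signal
        taumax = min(t, n - 1 - t)
        tau = np.arange(-taumax, taumax + 1)
        # instantaneous autocorrelation r(t, tau) = x(t+tau) * conj(x(t-tau))
        r = np.zeros(n, dtype=complex)
        r[tau % n] = x[t + tau] * np.conj(x[t - tau])
        # fft over the lag variable gives the distribution at time t;
        # r is conjugate-symmetric in tau, so the fft is real-valued
        wvd[t] = np.real(np.fft.fft(r))
    return wvd

frame = np.random.randn(128)          # stand-in for a 128-sample audio frame
wv = wigner_ville(hilbert(frame))     # 128 x 128 time-frequency image
```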
this algorithm implements the concept of competition at the level of a collection of neural networks and determines the importance of the inputs that influence the performance of acoustic fingerprint recognition using neural networks. it is known that modularity and the "divide and conquer" principle applied to neural networks can improve their performance [27]. the algorithm employs the model of concurrent neural networks (conn), which combines the paradigms of supervised and unsupervised learning and offers an optimal solution for detecting acoustic fingerprints specific to uavs. the principle of competition is used within a collection of neural networks that are independently trained to solve different subproblems. the conn training was performed offline using the system in figure 1, and the training data were collected for 3 available drones, corresponding to different flying distances: (0 to 25 m), (25 to 50 m), (50 to 100 m), (100 to 200 m), and (200 to 500 m). three quadcopter models were tested: a dji phantom 2 (mini class), a dji matrice 600 (medium class), and a homemade drone (medium class). the first training data set was gathered in an anechoic chamber, for the drones specified in the article, at different engine speeds, with a sampling frequency of 44 khz. the second training data set consisted of drone sound recorded in a quiet outdoor place (a real-life environment without the polyphonic soundscape typical of outside areas, such as the rooftop of a building in a calm or isolated spot) at successive distances of 20, 60, and 120 m for two types of behavior, hovering and approaching, with a total duration of 64 s. exact labeling of the drone sound was achieved by starting the recording after the drone was activated and stopping it before deactivation. recognition is performed by identifying the neural network that provides the best response. the experiments performed demonstrated that, compared to the cases where competition is not used at all, the recognition accuracy obtained was higher when employing the model proposed in the present solution. the system consists of neural networks with various architectures: multilayer perceptron modules were employed, along with time delay neural network and self-organizing map types. the recognition scheme consists of a collection of modules, each trained on a subproblem, and a module that selects the best answer. the training and recognition algorithms presented for modular/concurrent neural networks are based on the idea that, in addition to the hierarchical levels of organization of artificial neural networks (synapses, neurons, neuron layers, and the network itself), a new level can be created by combining several neural networks. the model proposed in this article, called concurrent neural networks, introduces a neural recognition technique based on the idea of competition between multiple modular neural networks working in parallel using the ni board controller [27, 28]. the number of networks used is equal to the number of classes in which the vectors are grouped, and the training is supervised. each network is designed to correctly recognize vectors of a single class, so that the best answers appear only when vectors from the class with which it was trained are presented. this model is in fact a framework that offers architectural flexibility, because the modules can be represented by different types of neural networks.
starting from the conn model proposed in this work, the concurrent self-organizing maps (csom) model has been introduced, which stands out as a technique with excellent performance when implemented on fpga and on an intel core i7 processor (3.1 ghz). the general scheme used to train concurrent neural networks is presented in figure 6. in this scheme, n represents the number of neural networks working in parallel, which is also equal to the number of classes in which the training vectors are grouped [29, 30]. the set of vectors x is obtained from the preprocessing of the acquired audio signals for the purpose of network training. from this set, the sets of vectors x_j, j = 1, 2, ..., n, are extracted, with which the n neural networks will be trained. following the learning procedure, each neural network must respond positively to a single class of vectors and give negative responses to all other vectors. the training algorithm for the concurrent network is as follows: step 1. create the database containing the training vectors obtained from the preprocessing of the acoustic signal. step 2. the sets of vectors specific to each neural network are extracted from the database; if necessary, the desired outputs are set. step 3. apply the training algorithm to each neural network using the vector sets created in step 2. recognition and classification using conn are performed in parallel, using the principle of competition, according to the diagram in figure 7. it is assumed that the neural networks were trained by the algorithm described above. when the test vector is applied, the networks generate individual responses, and the selection consists of choosing the network that generated the strongest response; the network selected by this winner-takes-all rule is declared the winner. the index of the winning network will be the index of the class to which the test vector is assigned. this method of recognizing features therefore implies that the number of classes with which the concurrent network will work is known a priori and that there are sufficient training vectors for each class. the recognition algorithm consists of the following steps: step 1. the test vector is created by preprocessing the acoustic signal. step 2. the test vector is transmitted in parallel to all the previously trained neural networks. step 3. the selection block determines the index of the network with the best answer; this will be the index of the class to which the vector is assigned. the models can be customized by placing different architectures in the place of the neural networks. multilayer perceptrons (mlp), time delay neural networks (tdnn), and kohonen self-organizing maps (som) were used for this work, thus obtaining three different types of concurrent neural networks. this section deals with the experiments performed on the problem of multiple drone detection with the custom collected dataset. the experiments are organized in the following order: (1) concurrent neural networks (conn) with the wigner-ville spectrogram class; (2) concurrent neural networks (conn) with the mfcc dictionary class; (3) concurrent neural networks (conn) with the mif class. to establish the performance values it is necessary to calculate the confusion matrix. the confusion matrix has the true labels on one dimension and the predicted labels on the second dimension, and each class corresponds to one row and one column. the diagonal elements of the matrix represent the correctly classified results.
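the training and recognition algorithms above can be summarized in a minimal python sketch (an illustration of the winner-takes-all scheme, not the authors' implementation; the per-class model objects and their fit/score interface are assumptions):

```python
import numpy as np

class ConcurrentNetworks:
    """minimal sketch of the concurrent neural network (conn) scheme:
    one expert model per class, winner-takes-all selection at recognition."""

    def __init__(self, models):
        # models[j] is any object with fit(X) and score(x) -> scalar response,
        # trained to respond strongly only to vectors of class j
        self.models = models

    def fit(self, X, y):
        # steps 2-3 of the training algorithm: split the training set
        # per class and train each network on its own subset
        for j, model in enumerate(self.models):
            model.fit(X[y == j])
        return self

    def predict(self, x):
        # recognition: present the test vector to all networks in parallel
        # and return the index of the strongest response (the winner)
        responses = [model.score(x) for model in self.models]
        return int(np.argmax(responses))
```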
the values calculated from the confusion matrix are precision, recall, and the f1-score, which is the harmonic average of precision and recall, together with the classification accuracy. precision is defined as the fraction of samples predicted to contain a drone that actually contain one, tp / (tp + fp). recall is defined as the fraction of samples that actually contain a drone that are correctly detected, tp / (tp + fn). the f-measure is defined as the harmonic average of precision and recall. f1 scores are calculated for each class, followed by a weighted average of the scores, where the weights are derived from the number of samples in each class. the spectrograms extracted in the experiment are transformed into the logarithmic domain, and the transformed features are used as input to the model. the network is trained for 200 epochs with a batch size of 16. this experiment resulted in a classification accuracy of 91 percent, a recall of 0.96, and a micro-average f1-score of 0.91. the confusion matrix is shown in table 3 and the classification report in table 4. for the mfcc dictionary extracted in the experiment, a mel filter bank with 128 mel filters is applied. the conn is trained for 200 epochs with a batch size of 128 samples. this experiment resulted in a classification accuracy of 87 percent, a recall of 0.95, and a micro-average f1-score of 0.86. the confusion matrix is shown in table 5 and the classification report in table 6. for the mif class, the confusion matrix is shown in table 7 and the classification report in table 8. after observing that the conn model shows remarkable improvements in the recognition rates of acoustic fingerprints compared to the classic models, this section focuses on the recognition and identification of the uavs' specific acoustic signals. a training database was created using the wigner-ville spectrogram, mfcc, and mif dictionaries corresponding to the acoustic signals of 6 multirotor drones. we tested six multirotor models: (1) dji matrice 600 (medium), (2)-(4) three homemade drones (medium and large), (5) dji phantom 4 (mini), and (6) parrot ar drone 2 (mini). the drones were tracked outdoors on a test field between buildings, with a street with pedestrian and car/tram traffic nearby (urban conditions). the atmospheric conditions for the real-time tests were sunny weather, temperature 30-35 degrees celsius, precipitation 5%, humidity 73%, wind 5 km/h, atmospheric pressure 1013 hpa (101.3 kpa), and the presence of noise specific to urban conditions (source: national agency for the weather). each of these drones was tested ten times. for each iteration, the training vectors of the recognition system were extracted from the first five data sets, keeping the next five data sets for testing. in this way, 200 training sets were obtained for the preparation of the system and another two hundred for its verification. in addition to speaker recognition, a set of experiments was performed using first a single neural network to recognize the model and then a conn. the results obtained in the real-time tests are presented in figures 8-18 (for example, model (5), the dji phantom 4, in the small class). for this stage only the kohonen network was tested, given the results obtained in speaker recognition and their behavior compared to that of a conn. for the variant that uses a single som, the network was trained with the whole sequence of vectors obtained after preprocessing the selected acoustic signals.
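the reported measures can be reproduced from the true and predicted labels with standard tooling; a minimal sketch follows (scikit-learn is an assumption, and the label arrays below are hypothetical stand-ins for the per-sample drone classes):

```python
from sklearn.metrics import confusion_matrix, classification_report

# hypothetical per-sample class labels (one class index per drone type)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# rows: true classes, columns: predicted classes; diagonal = correct results
print(confusion_matrix(y_true, y_pred))

# per-class precision, recall, and f1-score, plus weighted averages,
# matching the measures reported in tables 3-8
print(classification_report(y_true, y_pred))
```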
a kohonen network with 10 × 15 nodes was trained in two stages using the self-organizing feature map (sofm) algorithm. in the first stage, the organization of the clusters took place over 1000 steps, with the neighborhood gradually shrinking to a single neuron. in the second stage, the training was performed over 10,000 steps with the neighborhood fixed at the minimum size. following the training and calibration of the neural network with the training vectors, we obtained a set of labeled (annotated) prototypes whose structure is that of table 6. the technique applied for recognition is described below. the acoustic frequencies identified in the test signal are preprocessed by means of a window from which a vector of the component parts is calculated. the window moves with a step of 50 samples and a collection of vectors is obtained, whose sequence describes the evolution of the acoustic signal specific to the drones. for each vector, the position corresponding to the signal is kept, together with the minimum quantization error, i.e., the tag that the neural network calculates. experimentally, a maximum threshold for the quantization error was set, in order to eliminate frequencies that are presumed not to belong to any class. through this process, a sequence of class labels was obtained showing how the acoustic signals specific to the drones were recognized by the system. in table 9 the experimental results are presented as percentages of recognition and identification of the drones with som and conn. in table 10, when we refer to the "accuracy" of "conn", we refer to a different top-level architecture that: (1) takes raw audio data and creates features for each of the three mentioned networks; (2) runs the data through each network and obtains an "answer" (a probability distribution over the class predictions); (3) selects as "correct" the output of the network with the highest response (highest class confidence); this architecture is explained in figure 7. the general classifier based on concurrent neural networks, providing the same test framework for all 30 training files, has been tested. using the maximum-win strategy, the output tag was identified, with a resulting drone recognition precision of 96.3%. the time required to extract the characteristics of a 256 × 250 spectrogram image using conn is 1.26 s, while the time required to extract the characteristics of an mfcc and mif sample from audio samples is 0.5 s. the total training time required by the model for the spectrogram image data set was 18 min, while the model training time for the mfcc and mif audio samples was 2.5 min. the time required to train the combined model data set was 40 min. the trained model classifies objects in 3 s. comparing the method proposed in this article with similar methods presented in the literature for drone detection using the acoustic signature, which use supervised machine learning, the authors of those studies report detection accuracies between 79% and 98.5%, without mentioning the detection distance of the acoustic signals generated by the drones [31-35]. the method proposed by us has an average accuracy of almost 96.3% for detecting the sounds generated by a drone, at distances of up to 150 m for small-class drones and up to 500 m for medium- and large-class drones.
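as a minimal sketch of the labeling step with quantization-error rejection (an illustration; the codebook, labels, and threshold below are hypothetical stand-ins for the trained 10 × 15 som):

```python
import numpy as np

def classify_frame(vec, prototypes, labels, qe_max):
    """nearest-prototype lookup on a trained som codebook, with a maximum
    quantization-error threshold for rejecting out-of-class frames."""
    errors = np.linalg.norm(prototypes - vec, axis=1)
    best = int(np.argmin(errors))
    if errors[best] > qe_max:
        return None            # frame rejected: belongs to no known class
    return labels[best]

# hypothetical trained codebook: 150 nodes (10 x 15 map), 35-d feature vectors
prototypes = np.random.rand(150, 35)
labels = np.random.randint(0, 6, size=150)  # one of six drone classes per node

# sliding the feature window in 50-sample steps yields a label sequence
sequence = [classify_frame(v, prototypes, labels, qe_max=1.5)
            for v in np.random.rand(20, 35)]
```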
our tests were performed on a test range with a maximum length of 380 m, but the results shown in figures 8-18 indicate that the detection distance of the acoustic signals from the drones reaches approximately 500 m for certain classes of drones. the proposed conn model classifies objects in about 4 s, this time being sufficient for warning because the network of microphone arrays is distributed in width and depth, thus creating a safety zone. this paper investigates the effectiveness of machine learning techniques in addressing the problem of detecting unauthorized uav flights, in the context of the protection of critical areas. to extract the acoustic fingerprint of a uav, a time-frequency analysis using the wigner-ville distribution is adopted by the warning system to recognize the specific acoustic signals. dictionaries of mfcc and mif (mean instantaneous frequency) coefficients specific to each type of drone have also been added to this process to improve the recognition precision of the conn. the contributions of the proposed solution are the following: - development of a spiral microphone array, combining microphones in the audible and ultrasonic fields, set in an interchangeable configuration with multichannel adaptive weights. - introduction of the possibility of detecting the low-intensity acoustic signals specific to multirotor mini drones at a distance of ~120 m; the recognition scheme consists of a collection of models, each trained on a subproblem, and a module that selects the best answer. - tests have shown that large multirotors (diameter 1.5 m) can be detected at a distance of ~500 m, and medium multirotors (diameter less than 1 m) can be detected at a distance of at least 380 m. - the possibility of integrating the microphone array into a network structure (scalability), which can be controlled by a single crio system by integrating several acquisition boards; the placement of the acoustic sensors within the network can be done linearly and in depth, so that a safety zone can be created around the perimeter restricted for the flight of drones. from the results obtained in the experiments performed, the engineered features employed with the conn proved to have better performance. conn architectures resulted in better generalization performance and faster convergence for spectro-temporal data. the wigner-ville spectrograms show improved performance over other spectrogram variants (for example, the short-time fourier transform, stft). the results obtained with both datasets lead to the conclusion that multiple drone detection employing audio analysis is possible. in future work, as presented in [36], a video camera for drone detection and recognition will be integrated into the microphone array. the two modules, acoustic and video, will work in parallel, and the results will be integrated to increase the drone recognition and classification capacity. a radio frequency (rf) detection module will also be integrated into the final architecture, and the results will be displayed in a command and control system. part of this research has been previously tested for developing a method for the anonymous collection of traveler flows in a public transport system and resulted in a patent application: ro a/00493, "method and system for anonymous collection of information regarding position and mobility in public transportation, employing bluetooth and artificial intelligence", in 2019.
results of this research culminated in a patent application: ro a/00331, "system and method for detecting active aircraft (drone) vehicle by deep learning analysis of sound and capture images", in 2020.
references:
- advances in intelligent systems and computing
- investigating cost-effective rf-based detection of drones
- drone detection systems. u.s. patent no. us 2017/0092.138a1; application no. us 15/282,216, publication of us 20170092138a1
- drone detection and classification methods and apparatus
- based small drone detection in augmented datasets for 3d ladar
- detection, tracking, and interdiction for amateur drones
- multi-sensor field trials for detection and tracking of multiple small unmanned aerial vehicles flying at low altitude
- ghz fmcw drone detection system
- detection and tracking of drones using advanced acoustic cameras
- digital television based passive bistatic radar system for drone detection
- using deep networks for drone detection
- yolo9000: better, faster, stronger
- acoustic detection of low flying aircraft
- near-infrared high-resolution real-time omnidirectional imaging platform for drone detection
- machine learning inspired sound-based amateur drone detection for public safety applications
- micro-uav detection and classification from rf fingerprints using machine learning techniques
- mimo ofdm radar system for drone detection
- low-complexity portable passive drone surveillance via sdr-based signal processing
- drones: detect, identify, intercept and hijack
- a new feature vector using selected bi-spectra for signal classification with application in radar target recognition
- is&t international symposium on electronic imaging 2017, imaging and multimedia analytics in a web and mobile world. electron. imaging
- uav detection system with multiple acoustic nodes using machine learning models
- adaptive noise cancellation using labview
- empirical study of drone sound detection in real-life environment with deep neural networks
- convolutional neural networks for analyzing unmanned aerial vehicles sound
- efficient classification for multiclass problems using modular neural networks
- an overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes
- application of the wavelet transform in machine-learning
- malicious uav detection using integrated audio and visual features for public safety applications
- hidden markov model-based drone sound recognition using mfcc technique in practical noisy environments
- svm-based drone sound recognition using the combination of hla and wpt techniques in practical noisy environment
- drone sound detection by correlation
- drone detection based on an audio-assisted camera array
- uav detection employing sensor data fusion and artificial intelligence
funding: this research received no external funding. the authors declare no conflict of interest.
key: cord-318716-a525bu7w authors: van den oord, steven; vanlaer, niels; marynissen, hugo; brugghemans, bert; van roey, jan; albers, sascha; cambré, bart; kenis, patrick title: network of networks: preliminary lessons from the antwerp port authority on crisis management and network governance to deal with the covid‐19 pandemic date: 2020-06-02 journal: public adm rev doi: 10.1111/puar.13256 sha: doc_id: 318716 cord_uid: a525bu7w in this article we describe and illustrate what we call a network of networks perspective and map the development of a lead network of the antwerp port authority that governs various organizations and networks in the port community before and during the covid‐19 pandemic. we find that setting a collective focus and selective integration are crucial in the creation and reproduction of an effective system to adequately deal with a wicked problem like the covid‐19 pandemic. we use the findings on crisis management and network governance to engage practitioners and public policy planners to revisit the current design and governance of organizational networks within organizational fields that have been hit by the covid‐19 pandemic. in line with the recently introduced exogenous perspective on whole networks, the notion of network of networks is further elaborated in relation to the scope and nature of the problem faced by that organizational field. by changing the structure and network governance mode from a lead/network administration organization into a lead network, the antwerp port authority was able to install institutions and structures of authority and collaboration to deal with the scale and complexity of the covid-19 pandemic. a network is a system of three or more organizations (incepted either voluntarily or by mandate) that work together to achieve a purpose that none of the participating organizations can achieve independently by themselves (provan, fish, and sydow 2007; provan and kenis 2008). networks are distinct entities with unique identities that require examination as a whole (provan, fish, and sydow 2007; provan and kenis 2008; raab and kenis 2009). despite the prominence of networks in practice, their popularity as a research subject, and their relevance for society (raab and kenis 2009), we still tend to study individual organizations to understand the collective behavior of networks. studies on networks from multiple disciplines predominantly focus on organizations and their relations (ego-networks), potentially neglecting, or even misinterpreting, the relationship between the details of a network and the larger view of the whole (bar-yam 2004; provan, fish, and sydow 2007). in addition, we sometimes forget that studying networks from a whole network perspective is necessary, but not sufficient alone, to understand 'such issues as how networks evolve, how they are governed, and, ultimately, how collective outcomes might be generated' (provan, fish, and sydow 2007, p.480). recently, for instance, it has been argued by nowell, hano, and yang (2019, p.2) that an external (outside-in) perspective should accompany our dominant internal focus on networks to explain 'the forces that may shape and constrain action in network settings'. inspired by this so-called network of networks perspective, this article shows how such a perspective allows for a better grasping of (and hence, dealing with) wicked problems (cartwright 1987), a type of problem that an organizational field such as a port or city frequently encounters.
the covid-19 pandemic can be understood as a wicked problem because there are no quick fixes and simple solutions to the problem, every attempt to solve the issue is a "one-shot operation", and the nature of the problem is not understood until after the formulation of a solution (conklin 2005). wicked problems are 'defined [1] by a focus, rather than a boundary' (cartwright 1987, p.93), and successfully managing such problems therefore requires a reassessment of how a group of organizations and networks temporally makes sense of and structures a wicked problem. the covid-19 pandemic has therefore directed our attention to a pivotal point in network governance: the connection between complexity and scale (bar-yam 2004). it has led us to acknowledge that an appreciation for the scope and detailed nature of a wicked problem is essential, while simultaneously pairing it with a network solution that matches in scale and complexity (bar-yam 2004). to understand how to deal with the covid-19 pandemic, then, one needs to comprehend the relation between a larger, complex system and the scope and nature of the problem. we call this larger, complex system an organizational field (kenis and knoke 2002). in the classic public administration literature, the relationship between an organization and its environment has been studied from a variety of perspectives, focusing, for example, on selection or adaptation to institutional pressures and resource dependence (aldrich and pfeffer 1976; oliver 1991). an emphasis on environments is therefore not new. interestingly, however, network scholars in public administration have only recently intensified their efforts to use concepts of the environment as an explanatory factor of the creation, reproduction, or dissolution of networks (raab, mannak, and cambré 2015; lee, rethemeyer, and park 2018; nowell, hano, and yang 2019). building on dimaggio and powell's (1983) understanding of an organizational field, kenis and knoke (2002, p.275) link interorganizational relationships and mechanisms such as tie formation and dissolution to define an "organizational field-net" as 'the configuration of interorganizational relationships among all the organizations that are members of an organizational field.' the key issue here is on which scale and at what level of detail we should examine intersections of organizations and networks embedded in a certain environment, since environmental dynamics are crucial to our understanding of the creation and reproduction of both the systems within as well as the larger system as a whole (cf. mayntz 1993). however, in order to define and examine such larger, complex systems like organizational fields, we need to understand "why" organizations and networks come together, cooperate, and consequently create and reproduce such a larger, complex system (kenis and knoke 2002; provan, fish, and sydow 2007; nowell, hano, and yang 2019). we therefore propose that instead of focusing on an organizational network as the unit of analysis (provan, fish, and sydow 2007; provan and kenis 2008), a shift to a collective of networks that is embedded in an organizational field is instructive (cf. nowell, hano, and yang 2019).
this means our unit of observation shifts from one network as a separate entity with a unique identity to a collection of interdependent networks. building upon maier's (1998) system of systems approach and using nowell, hano, and yang's (2019) notion of network of networks, we accordingly define a network of networks as an assemblage of networks, which individually may be regarded as subsystems that are operationally and managerially autonomous, but which are part of a larger, complex organizational field through many types of connections and flows (maier 1998; provan, fish, and sydow 2007; nowell, hano, and yang 2019). --- figure 1 around here --- in this article, we adopt a set-theoretic approach to networks of networks, in line with the long-standing recommendation by christopher alexander (2015). in such an approach, a network of networks is 'best understood as clusters of interconnected structures and practices' of various networks being distinct entities and having unique identities (fiss 2007, p.1180; provan and kenis 2008; raab et al. 2013). this means a clean break from the predominant linear paradigm, and instead adopting a systemic view in which we assume that 'patterns of attributes will exhibit different features and lead to different outcomes depending on how they are arranged' (fiss 2007: 1181; provan et al. 2007; provan and kenis 2008). moreover, we note that assumptions often used about the structure and governance of networks are suspect at best for dealing with the complexity that networks bring (rethemeyer and hatmaker 2008; raab, lemaire, and provan 2013). further, most network studies only employ an endogenous perspective on networks, which in some cases is bound to the performance of an individual organization, a network cluster, or a certain organizational domain (i.e., health or social care), despite the fact that networks by nature are multilevel, multidisciplinary, and interdependent (provan and milward 2001; provan, fish, and sydow 2007; raab, mannak, and cambré 2015). in particular, scholars often tend to ignore the specific nature of the problems that networks face in their environments (raab and milward 2003; mcchrystal, collins, silverman, and fussell 2015). this is an issue, because not fully understanding the interdependence of a collection of smaller systems, nor understanding what the larger, complex system is up against, makes dealing with a wicked problem like the covid-19 pandemic very difficult. as part of a larger applied research project of a collaboration between the fire and emergency services (antwerp fire service, antwerp port authority, police, and municipality, among others) in the port of antwerp and antwerp management school (van den oord, vriesacker, brugghemans, albers, and marynissen 2019), we focus in this article on how the port authority of the port of antwerp (belgium) dealt with the covid-19 pandemic. in particular, we examine the network structure and the embeddedness of individual actors of both the crisis management team and the leadership team of the antwerp port authority (apa) to describe how this network managed the crisis and governed the port community, composed of various organizations and networks, before and during the covid-19 pandemic.
by providing descriptive evidence concerning the development of the overall network structure and the embeddedness of individual actors before and during the covid-19 pandemic, we aim to ground the notion of network of networks and hope to engage practitioners and public administration scholars to rethink the current design and governance of organizational networks within their respective organizational fields that have been hit by the covid-19 pandemic. for this article, we narrowed the scope of the analysis to two levels of analysis to describe the interdependence between crisis management and network governance on the operational and policy level of the antwerp port authority [2]. as alter and hage (1993) suggested, we need to make a minimum distinction between the policy level and the administrative level, because the coordination of joint efforts tends to transcend organizational hierarchical levels as well as involve multiple different functional units like divisions or departments. the data allowed us to differentiate between these two levels, having access to the crisis management team (operations, or in alter and hage's terms administration) and the leadership team (policy). in order to understand how apa attempted to manage the crisis throughout the covid-19 pandemic, we conduct a network analysis based on three sources of data. for our primary data source we draw upon the records and minutes of three types of meetings: the crisis management team meetings (cmt), the nautical partners meetings (np), and the leadership team meetings (lt). the data cover a period of 12 weeks (20/01-12/04), including 53 meetings mentioning 73 unique actors involved in the port community. the data records are based on 26 crisis management team meetings with a total estimated duration of 20 hrs (66 logbook pages), 16 leadership team meetings with a total estimated duration of 10 hrs (19 logbook pages), and 11 nautical partners meetings with a total estimated duration of 12 hrs. in addition, we consulted data from sciensano, the belgian institute for health responsible for the epidemiological follow-up of the covid-19 epidemic in collaboration with its partners and other healthcare actors. these data provide insight into the dynamics of the pandemic. the third source of data were co-authors two, four, and five, who managed the pandemic in the port of antwerp. the second author attended all the crisis management team meetings and participated in various leadership team meetings, while the fourth author was present in some of the task force meetings (not examined in this article). by collaborating with these practitioners, we were able to go back and forth to the data during these periods, allowing for the interpretation of relationships between apa and actors in the port community, as well as for building rich narratives of the issues discussed in these meetings. in table 1, we portray descriptive measures of the data on the meetings. for data analysis, meetings were grouped among four phases of the covid-19 crisis, for each of which a network structure was created: (1) pre-crisis network (20/01-01/03, 6 weeks), (2) pre-lock-down network (02/03-15/03, 2 weeks), (3) lock-down network (16/03-29/03, 2 weeks), and (4) crisis network (30/03-12/04, 2 weeks). --- table 1 around here ---
the reason why we opted for six rather than two weeks in the first period is to illustrate what we observed as a "slow start of the covid-19 pandemic that increased exponentially" (sciensano 2020). this aligns with the metadata of the statistical reports of sciensano, which started issuing data only from the 1st of march, providing a daily report from march 14th onwards [3]. a total of four network plots (and one overview plot) are presented to provide descriptive evidence concerning the overall network structures and the embeddedness of individual actors in the four phases. for each phase, we present a one-mode matrix based on the actors' list, in which we weight the tie between two actors based on the frequency with which they are mentioned together in the records of the logbook and/or the minutes of the meetings. three rounds of coding were executed in an iterative manner, in which we went back and forth between the data and the codes of the various issues and actors involved in each crisis management and leadership team meeting reported in the data. in appendix c we provide an excerpt of the data cleaning and coding process. we aimed to minimize bias by having the first and second author agree on codes and accordingly discuss the application of the codes with the third author to agree on the content of the issues and the involvement of the actors reported in the various meetings. simultaneously with the coding process, an actor list of apa departments was developed, indexed, and pseudonymized (differentiating between operations and policy, n=18), as well as a list of actors of the port community (n=55). the coding process was performed in microsoft excel. to calculate the centralization and density scores reported in table 1 we used ucinet 6 (borgatti, everett, and freeman 2002). to develop the network plots, we used the node and centrality layout based on degree centrality analysis in the network visualization tool visone 2.7.3 (http://visone.info/; brandes and wagner 2004). the remainder of the article is organized in three sections. in the first results section we present the findings on the structure and governance of the network of networks. we display in five figures an overall overview (figure 2), as well as more detailed views for each period (figures 3-6), to describe the development of the network structure and the embeddedness of individual actors of both the crisis management team and the leadership team of the antwerp port authority (apa) during the covid-19 pandemic. in the second results section, we elaborate on the findings of how this network of networks managed the crisis before and during the covid-19 pandemic. we close with a discussion and conclusion section in which we present recommendations for future research and practice. due to space limitations, readers can find more detail on the broader research project online. in appendix a, we provide a background on the port of antwerp, a description of the antwerp port authority, and a more detailed account of the two levels of analysis.
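as an illustrative sketch of this tie-weighting step (not the authors' ucinet/visone workflow; networkx and the toy meeting records below are assumptions), a weighted one-mode actor network and the reported density and degree-centrality measures could be derived from co-mentions in meeting records:

```python
import itertools
import networkx as nx

# hypothetical records: actors mentioned together in each meeting's minutes
meetings = [
    {"apa_ops", "pilots", "tugboats"},
    {"apa_ops", "apa_policy", "pilots"},
    {"apa_policy", "municipality"},
]

g = nx.Graph()
for actors in meetings:
    for a, b in itertools.combinations(sorted(actors), 2):
        # tie weight = frequency with which two actors are co-mentioned
        w = g.get_edge_data(a, b, default={"weight": 0})["weight"]
        g.add_edge(a, b, weight=w + 1)

print(nx.density(g))             # density, as reported in table 1
print(nx.degree_centrality(g))   # degree centrality, used for the plot layout
```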
however, in the course of merely four weeks (march 1 -29), the number of actors doubled, the number of initial ties multiplied more than five times, and together with its partners apa had to deal with 195 issues in just two weeks at its peak. in the last phase the situation stabilized (march 30 -april 4). in those periods each network structure shows various links between apa actors (purple and red nodes) and external actors (yellow nodes) from which apa derives its legitimacy (human and provan 2000) . the goal of apa was to avoid a legitimacy crisis in which they could potentially lose its formal authority of the port of antwerp (human and provan 2000) . as such, they aimed to prevent at all cost to close the port due to the covid-19 pandemic. the apa followed what human and provan (2000) term a dual legitimacy-building strategy in which personnel were provided resources and support to arrange institutions and structures of authority and collaboration such as a crisis management team on the level of operations, and a task force and nautical partners meeting on the policy level. the leadership team compared to the crisis management team was more externally focused on building outside-in legitimacy, through solidifying relationships with important stakeholders from the port community. the antwerp port authority can be considered as successful in managing the pandemic in the sense that the port kept fully operational throughout the four phases on all terminals. in this article is protected by copyright. all rights reserved. when comparing the four network structures to each other, we find that after the first six weeks a core group of actors from apa assembled. together these seven actors-purple and red nodes displayed in the center of the figures, formed what we coin a lead network (cf. provan and kenis 2008) . in analogy to provan and kenis' lead organization-a mode of governance involving a single organization acting as a highly centralized network broker to govern all major network-level activities and key decisions, a lead network from a network of networks point of view, represents a single network composed of multiple functional units from various organizations and networks that differed in (lateral) position, categories of relevant resources, knowledge, or experience, and in proportion of socially valued tangible and intangible assets or resources within the organizational field of the port antwerp (cf. harrison and klein 2007) . the comparison of structures with the issues being dealt before and during the covid-19 pandemic (table 1 ), shows that in the pre-crisis network ( figure 2 ) the departments and divisions of apa acted under business as usual. before the pandemic started, the mode of governance of apa was best described as a brokered governance which both governs the sustainment of the port community and its activities as well as participates as a broker in major port activities and key decisions (see appendix a). this corresponds with elements from a lead organization (level of operations) as well as a network administrative organization (policy level) as governance modes (cf. provan and kenis 2008). when we examine the development of apa evolving from a -pre-lock-down network‖, to a -lock-down network‖, to a -crisis network‖, we observe that the governance of the network developed into a lead network that in its core was composed of apa actors (cf. nowell and steelman, velez, and yang 2018). 
over the course of the pandemic, its structure evolved from a state of loosely coupled links that were targeted and appropriate (figures 3-4) , towards a state of this article is protected by copyright. all rights reserved. tightly coupled links that were stronger and more intense based on the frequency of interactions . note that although, the -crisis network‖ structure in figure 6 is highly similar in number of actors and amount of ties compared with the -lock down network‖ in figure 5 , the number of issues to be resolved by apa in collaboration with others dropped significantly from 195 to 97 in two weeks' time. once the port of antwerp entered the phase of lock-down and subsequently crisis, the functional units of apa took the lead, enacted by multiple brokerage roles allowing them pooling information and resources and working together with port community actors to guarantee the operability of the port as well as safeguard that the main sea gateway remained open for shipping. we found incidence for five types of brokerage (gould and roberto 1989 ; see appendix b) with the lead apa network selectively integrating various overlapping subgroups both within, as well as in a later stage, between functional units of various organizations and networks. the strategic orientation in brokering followed by the lead network was to collaborate in order to achieve the field-level goal: keeping the port open (cf. soda, tortoriello, and iorio 2018). during the pandemic this was exemplified by three distinct brokering behaviors: separating, mediating, and joining (stratification of) organizations and networks (grosser, obstfeld, labianca, and borgatti 2019; gulati, puranam, and tushman 2012) . the network analysis also showed that another network was created by the lead apa network to safeguard, monitor, and control the sea gate to the port during -the lock-down‖ period ( figure 5 ). in figure 5 , this network is difficult to isolate due to the density and multiplexity of the network structure (table 1) , however, in figure 6 the network is more evident in the left top corner. the inception of the network can be derived from the nautical partners meeting initiated this article is protected by copyright. all rights reserved. by apa on march 12 th . the inception of this network as an institution and structure of authority and collaboration is interesting because of several reasons. the network was highly selective in its member base, representing a limited number of actors responsible for the nautical operations in relation to the flemish and dutch ports of the scheldt estuary, including the port authorities, tugboat companies, and pilots. in line with the shared-participant network governance mode (provan and kenis, 2008) , this network of a small group of actors aligned together around one common purpose: keeping the sea gate to the ports in the scheldt river open at all costs, despite the fact that these actors are historically in competition with each other. the priority of keeping each port in the scheldt estuary open during the pandemic likely explains why they were willing to redistribute operational resources among each other as long as this safeguarded the attainment of not closing down their port. although the network was originally incepted as a temporary information diffusion network, its function altered over the course from sharing information, to problem solving, to building (inter)national capacity to address future community needs which might arise. 
the presidency of the meeting was handed over to the transnational nautical authority over the scheldt estuary as from april 1 st in order to be consistent with its formal authority towards external parties (i.c. shipping) and to further enhance consensus and power symmetry between the actors. by stepping down as chair, the lead apa network safeguarded that competitors remained working together. this article is protected by copyright. all rights reserved. in table 2 , we provide an overview of what apa did in terms of crisis management differentiating between the level of operations and policy. when we look at what issues were addressed in the apa meetings, we discovered a shift in attention from covid-19 as a public health issue, towards the effects of this pandemic on economy and society. further the type of issues addressed in the various meetings over the four analyzed periods suggests that covid-19 as a wicked problem was mostly perceived as a problem of -information provision‖, -decision-making‖, and to a lesser degree -sense-making‖ of the current situation apa was in (see table 1 for a summary and table 2 for details). when we retrospectively examined how apa managed the crisis in the port of antwerp throughout the covid-19 pandemic, we found several interesting matters that highlighted the idiosyncrasies of this information problem. both on the level of operations as well as policy, apa acquired, distributed, interpreted, and integrated information (flores, zheng, rau, and thomas 2012). this suggests that covid-19 was mostly perceived as an information problem because both a lack and abundance of information lead to not fully understanding the nature of the problem, which made covid-19 wicked. information was transferred through various means of communication. at some point, apa even organized webinars to ensure national and international partners of the operational readiness and continuation of the port. however, in most cases feedback and updates were exchanged within apa and with actors from the port community. apa made sure that information was present at those that needed to execute particular tasks or coordinated crisis management and communication (puranam, alexy, and reitzig 2014). this article is protected by copyright. all rights reserved. another emerging topic was their operational method resulting into a clear collective focus on the tasks at hand. this helped apa to get some kind of grip on the crisis situation. related to this was how apa developed a collective focus within the port community. internally, the crisis management team (operations) reported daily updates to the leadership team (policy) on various issues. they made sure that they collected the perception of parties that were not involved in the task force. the task force was assembled by apa to have policy level meetings with the representative bodies of the main industry and shipping stakeholders, public actors such as the federal police, the fire department, the federal public service of health and representatives of the municipality, province and regional government to ensure alignment over the whole logistic chain and the environment in which it acts. externally, apa detected early (warning) signals i.e., from the evolving situation in china due to their national and international network of ports. after verifying the signals received, apa could take informed measures to contain and manage the crisis. --table 2 . around here. 
based on being informed quickly and accurately, apa was able to take the lead and act proactively. as a response to numerous inquiries on dealing with inland navigation barges within the covid-19 context, a procedure was drafted by apa, shared for consent with the other ports in the scheldt estuary and with the authorities responsible for inland navigation on march 21st, and released on april 8th after final verification with the inland navigation representative bodies. this extensive, but delaying, consent seeking led to a unified approach towards a highly scattered subgroup, the inland navigation industry, which fully embraced it. another example is how apa prepared for and dealt with the lock-down. belgium went into lock-down on the 18th of march, but on the 16th apa was already defining which essential functions of the port needed to remain operational for traveling and transportation. apa's high performance can be (at least) partially attributed to it being principle-driven rather than rule-driven. this likely explains why we found that from the 23rd of march the crisis was contained, and consequently from the 2nd of april apa decided to reduce the frequency of meetings. interestingly, when we took into account what actual solutions apa had devised in response to the pandemic, we found that it made a public poster, initiated a digital campaign, launched two websites, and arranged a call center to provide a hotline to help personnel. our findings inform further research on networks of networks in public administration in three ways. although this article examined the notion of network of networks only from an egocentric point of view (apa within the port of antwerp community), we gained a first glimpse of the scale and complexity involved in the covid-19 pandemic. future research could in particular build on and extend this exogenous network-of-networks perspective, focusing on a collection of multiple networks that are in some way interdependent within an organizational field, to explain why and how they might come together to create a larger, more complex system like a port or city. based on the preliminary evidence presented here and building on the work of others, we propose two governance mechanisms that can be crucial in these explanations: first, how a network of networks provides and motivates a collective focus by an organizational field on the problem being faced (cf. kenis); second, how it selectively integrates organizations and networks with adequate monitoring and control. third, the findings shed some preliminary evidence for addressing anticipatory and mitigative actions among a network of high reliability organizations, i.e., fire and emergency services, police, and municipality (weick and sutcliffe 2007), and networks, i.e., the lead apa network and the network of nautical partners (berthod, grothe-hammer, müller-seitz, raab, and sydow 2016). the concept of high reliability organizations (hro) gives directions for anticipating and containing incidents within a single organization, and focuses on maintaining a high degree of operational safety that can be achieved by 'a free flow of information at all times' (rochlin, 1999, p. 1554). this research helps us to understand the response to crisis in a very specific case and context.
nevertheless, several preliminary findings may be generalizable to other organizational fields such as (air)ports, cities, safety regions, health- and social-care systems, or innovation regions like the brainport region. for instance, one important aspect we found to be crucial was the consistency of communication and the selective integration of organizations and networks with adequate monitoring and control, while avoiding the imposition of strong constraints that limit cooperation or minimize the independence of the various subsystems. in some contexts, however, such as safety regions, this may be at odds with common practices in crisis management among public organizations, which are dominated by a strong command and control approach (groenendaal and helsloot, 2016; moynihan 2009). moreover, as we are increasingly dealing not with one specific organization but with multiple organizational networks involved in "taming" a wicked problem, the findings suggest that network managers (brokers) and public policy planners (designers) need to think together about how a network of a collection of organizational networks can create, selectively integrate, and reproduce an effective, larger complex system that offers functionality and performance adequate to the scope and detailed nature of a problem facing an organizational field. future research needs to determine which configuration of structure and governance of a network of networks consistently achieves which field-level outcomes given the context of an organizational field. when limited diversity is present among various organizational fields, we can then start by revisiting the preliminary theorems introduced by keith provan and colleagues. this calls for further investigating various wicked problems as co-evolutionary patterns of interaction between networks and organizations as separate from, yet part of, an environment external to these networks and organizations themselves (alter 1990). to stimulate fresh thinking in practice and spur on empirical research on networks of networks, our viewpoint is that the key to dealing with a wicked problem is to structure a system in such a way that it provides appropriate incentives for collective focus and selective integration with adequate monitoring and control, while avoiding the imposition of strong constraints that might limit cooperation or minimize the operational and managerial independence of the various subsystems that make up this larger, complex system. in this article, we reported on how the antwerp port authority (apa) dealt with the covid-19 pandemic by examining the network structure and the embeddedness of individual actors of both the crisis management team and the leadership team. we drew upon the records and minutes of three types of meetings: the crisis management team meetings (cmt), the nautical partners meetings (nm), and the leadership team meetings (lt). the data covered a period of 12 weeks (20/01-12/04), comprising 53 meetings and mentioning 73 unique actors. the network analysis revealed how the structure of the lead apa network developed during the various phases of the crisis. we found various indications of interdependence and emergence between apa as a lead network and the network of networks within the port community. in addition, the results show how the lead apa network governed organizations and networks in the port community.
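to make the network construction concrete, the following is a minimal, hypothetical sketch (not the authors' code or data) of how a weighted actor network can be derived from meeting minutes, with tie strength given by the frequency of co-occurrence across meetings, in the spirit of the analysis above. the actor labels and meeting lists are illustrative assumptions only.

```python
# hypothetical sketch: weighted actor network from meeting attendance records.
from itertools import combinations
import networkx as nx

# each entry lists the actors mentioned in one meeting's minutes (hypothetical)
meetings = [
    ["op/hss", "op/vt", "ad/sh"],             # crisis management team (cmt)
    ["op/hss", "pilots", "tugboat company"],  # nautical partners meeting (nm)
    ["op", "op/hss", "ad/sh"],                # leadership team (lt)
]

G = nx.Graph()
for attendees in meetings:
    for a, b in combinations(sorted(set(attendees)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1            # tie weight = co-occurrence frequency
        else:
            G.add_edge(a, b, weight=1)

# standardized degree centrality, as used for the centrality layouts in the figures
print(nx.degree_centrality(G))
```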
practitioners and scholars should be tentative in generalizing the preliminary findings presented here, because the data only allowed us to employ an egocentric network perspective based on the apa lead network. by providing descriptive evidence concerning the development of the structure, governance, and crisis management of the apa lead network before and during the covid-19 pandemic, we hope to engage practitioners and network scholars to rethink the current design and governance of organizational networks within organizational fields that have been hit by the covid-19 pandemic. it would be very promising for policy and practice to be able, in the near future, to identify which factors of a wicked problem facing an organizational field determine which combination of structure and governance arrangement we need to employ, when, and why.

[1] with defining a focus, we mean a temporally unfolding and contextualized process of input regulation and interaction articulation in which a "wicked problem" is scanned, given meaning to, and structured by decomposing it into a set of tasks that can be divided and allocated (faraj and xiao 2006; daft and weick 1984; puranam, alexy, and reitzig 2014). note that this process of defining or making sense is temporal and subjective in nature; a key challenge is organizing solutions to problems whose scope and detailed nature are not fully understood, which requires that we make sense of how we understand problems (weick 1995). [2] in future research, we aim to broaden this scope by expanding the periods studied as well as by triangulating the egocentric perspective of the port authority with other perspectives from actors of the port community. [3] sciensano is the belgian institute for health responsible for the epidemiological follow-up of the covid-19 epidemic in collaboration with its partners and other healthcare actors. data can be accessed here: https://www.sciensano.be/en [4] port of houston, incident report, march 19th 2020; retrieved from: https://porthouston.com/wpcontent/uploads/covid19_2020_03_18_bct_bpt_incident_report.pdf

tables. table 1: types of data and descriptive measures of four networks (full table omitted). by comparing the semilattice structure (a) to a tree structure (b), it becomes clear that 'one is wholly contained in the other or else they are wholly disjoint' (alexander 2015, p. 6-7).

legend: centrality layout based on node value (degree standardized, link strength = frequency); tie shading scales with frequency (light grey 1-3, dark grey 4-6, black 7-9, larger black 10-12, largest black 13+); red nodes = apa leadership, purple nodes = apa operations, yellow nodes = poa partners.
legend: centrality layout based on node value (degree standardized, link strength uniform); 28 actors with 89 ties displayed (frequency 1-3); red nodes = apa leadership, purple nodes = apa operations, yellow nodes = poa partners.
legend: centrality layout based on node value (degree standardized, link strength uniform); 32 actors with 146 ties displayed (light grey = frequency 1-3, dark grey = 4-6); red nodes = paln leadership, purple nodes = paln operations, yellow nodes = poa partners.
legend: centrality layout based on node value (degree standardized, link strength uniform); 57 actors with 482 ties displayed (light grey = frequency 1-3, dark grey = 4-6, black = 7-9, larger black = 10-12, largest black = 13+); red nodes = apa leadership, purple nodes = apa operations, yellow nodes = poa partners.
figure 5: crisis network. centrality layout based on node value (degree standardized, link strength uniform); 53 actors with 416 ties displayed (light grey = frequency 1-3, dark grey = 4-6, black = 7-9); red nodes = apa leadership, purple nodes = apa operations, yellow nodes = poa partners.

since 2018, the antwerp fire service (brandweer zone antwerpen, bza) and antwerp management school (ams) have been joining forces in an applied research project to develop a future vision on the organization of emergency services in antwerp (hence the involvement of the fourth and fifth authors, being the fire chief and company commander). in this project, we examine how to organize larger, complex systems for collective, field-level behavior; how organizations and networks in an organizational field can be selectively integrated, so that those networks that need to work together do so, while those that do not need to work together do not; and what institutions and structures of authority and collaboration are needed to provide network managers (system designers) with the means to create, collectively define, integrate, and dissolve a network solution to organizational field challenges such as public safety and health. with 235 million tons of maritime freight volume (in 2019), the port of antwerp is the second largest port in europe. stretching over a surface of 120 km² and centrally positioned some 80 km inland, it has 60% of europe's production and purchasing power within a 500 km radius. under the authority of the federal and flemish governments, the port area stretches over two provinces with the scheldt river in between. apart from the port of antwerp, the scheldt estuary houses another three seaports, of different size and government structure, and their ancillary services. nearly 62,000 persons work in the port community, which also contains the second largest petrochemical cluster in the world. this positions the port of antwerp as one of the main gateways to europe. not only is the port critical to the logistics required to support governments in their attempts to reduce the effects of the covid-19 pandemic, it is also a vital infrastructure that allows for continued economic activity in europe in general and belgium in particular.
the antwerp port authority (apa) is a public limited company under public law, fully owned by the municipality of antwerp. apa plays a key role in the port's day-to-day operations. on the level of operations, we found three actors that played a central role in the network of networks: the department of harbor, safety and security (op/hss), the vessel traffic department (op/vt), and the safety and health department (ad/sh). these three departments acted as the executives of the port by controlling and monitoring the port community. based on participant observations by the second author, we also had access to data on the operations director (op), who has been heavily involved in managing these departments. on the policy level, four actors were found to be involved with the port governance. the director of operations (op) oversees the nautical operations department, which is responsible for the fleet and the above-mentioned shipping traffic management and harbor safety and security. asset management and port projects respectively deal with the development and management of the dry and maritime structure and with technical projects that have an impact on port infrastructure. one federal agency is responsible for a sustainable and balanced market for goods and services; this agency was more centrally involved with apa as it is in charge of the quality control and distribution of medical equipment and personal protective equipment (ppe). the second category involves the port stakeholders. in general, we identified five types of stakeholders in the port: industry, shipping, services, road transportation, and rail transportation. based on table 1 and figure 1, we found that the industry stakeholders and the shipping stakeholders were mentioned more often in the covid-19 crisis meetings. note that next to those two stakeholder types, there are also the inland navigation owners, operators, and representatives, who are more scattered and are often smaller businesses. in belgium, this segment represents some 1,150 vessels with a total capacity of 1.8 million tons and around 1,800 persons, the majority self-employed. apa alone already handles about 99.3 million tons of goods on over 52,000 inland navigation vessels annually (2019), which emphasizes the international context of this stakeholder segment. the industry stakeholders comprise all companies that are based in the port (approximately 900 companies). these include terminal operators (containers, liquid, dry bulk, etc.) and chemical production companies, often subsidiaries of multinational companies within the port area. they have a commercial relationship with apa, being concessionaires with apa as the landlord. the shipping stakeholders, on the other hand, are those that own, manage, operate, or represent the shipping lines. this includes shipping companies but also agencies and representative bodies. their commercial relationship with apa takes the form of port dues. the final category represents the nautical service providers that act as the pilots for the different sections of the river scheldt, the dock pilots, helmsmen, boatmen, and other supportive services that ensure safe navigation from sea to port and vice versa. these service providers closely collaborate with the operations department of apa.
we provide for each brokerage type (structural position) of gould and fernandez (1989) a short qualitative account to illustrate this dynamic process of brokerage conducted by the apa lead network. the strategic orientation in brokering followed by the lead network was to collaborate in order to achieve the field-level goal: keeping the port open (cf. soda et al., 2018). during the pandemic this was exemplified by three distinct brokering behaviors: separating, mediating, and joining organizations and networks (cf. grosser et al., 2019). (2) itinerant broker between subgroups within the port community: mediating. in this role, functional units of the lead apa network acted as a mediator between two subgroups of the community. in one example they mediated a concern regarding a parking lot dedicated to trucks: whereas the parking lot was closed in agreement with the port police force on march 20th, it was reopened with additional enforcement measures on march 26th after extensive dialogue between representative bodies of the road transportation industry and the police force. this could not have happened without mediation by the lead apa network between conflicting subgroups of the port community. (3) gatekeeper of the port community: mediating. in this brokerage role the lead apa network, in close collaboration with the federal agency saniport (responsible for sanitary control of international traffic), determined which ships were allowed to enter. they acted as a go-between controlling access from the sea to the land. apa preventively assigned a lay-by berth as quarantine area for suspected ships, and the apa harbour master office played a key role in authorizing suspected vessels to berth and under what conditions, informing the right parties, and providing conditions to leave port after being cleared following an infection. in its gatekeeper role, the lead apa network on several occasions needed to switch to mediation when disputes between the different actors (ships, service providers, shore industry, etc.) over preventive measures needed to be reconciled, always with the aim of safeguarding port operations whilst protecting the health of those involved. (4) representative of the port community: mediating. although during the covid-19 pandemic no additional formal authority was assigned to apa, its primary legitimacy base derived from its central position as a core provider of services to the industry and as a safeguard of the shipping interests reaching far beyond the local and regional economy of antwerp. this also means that apa as a broker represents the port community, as illustrated for instance by a press release of the task force on april 2nd: "at the moment port of antwerp is not experiencing any fall in the volume of freight. in fact there is a noticeable increase in the volume of pharmaceuticals and e-commerce. the supply of foodstuffs is also going smoothly. on the other hand there has been a fall in imports and exports of cars and other industrial components due to various industries closing down." by limiting the number of network participants, the lead apa network was able to create a narrow orientation for the network purpose (cf. provan and kenis, 2008): safeguarding access to the seaport for shipping.
in addition, to increase effectiveness the lead apa network mediated by handing over the presidency of the meeting to the transnational nautical authority over the scheldt estuary from april 1st, in order to be consistent with its formal authority towards external parties (in this case, shipping) and to further enhance consensus and power symmetry between the actors.

references:
strategic alliance structures: an organization design perspective
environments of organizations
a city is not a tree. 50th anniversary edition
an exploratory study of conflict and coordination in interorganizational service delivery systems
making things work: solving complex problems in a complex world
from high-reliability organizations to high-reliability networks: the dynamics of network governance in the face of emergency
ucinet for windows: software for social network analysis
analysis and visualization of social networks
the lost art of planning
a search for beauty / a struggle with complexity: christopher alexander
structures of mediation: a formal approach to brokerage in transaction networks
a preliminary examination of command and control by incident commanders of dutch fire services during real incidents
measuring mediation and separation brokerage orientations: a further step toward studying the social network brokerage process
meta-organization design: rethinking design in interorganizational and community contexts
what's the difference? diversity constructs as separation, variety, or disparity in organizations
wicked problems in public policy
legitimacy building in the evolution of small-firm multilateral networks: a comparative study of success and demise
how organizational field networks shape interorganizational tie-formation rates

appendix c note: due to the sensitivity of the data we do not show raw data and cleaned data.

key: cord-322815-r82iphem authors: zhang, weiping; zhuang, xintian; wang, jian; lu, yang title: connectedness and systemic risk spillovers analysis of chinese sectors based on tail risk network date: 2020-07-04 journal: nan doi: 10.1016/j.najef.2020.101248 sha: doc_id: 322815 cord_uid: r82iphem

abstract: this paper investigates the systemic risk spillovers and connectedness in the sectoral tail risk network of the chinese stock market, and explores the transmission mechanism of systemic risk spillovers by block models. based on the conditional value at risk (covar) and single-index model (sim) quantile regression technique, we analyse tail risk connectedness and find that during market crashes the stock market is exposed to more systemic risk and more connectedness. further, the orthogonal pulse function shows that the herfindahl-hirschman index (hhi) of edges has a significant positive effect on systemic risk, but the impact shows a certain lag. besides, the directional connectedness of sectors shows that systemic risk receivers and transmitters vary across time, and we adopt the pagerank index to identify systemically important sectors, led by the utilities and financial sectors. finally, by block model we find that the tail risk network of chinese sectors can be divided into four different spillover-function blocks. the roles of the blocks and the spatial spillover transmission paths between risk blocks are time-varying. our results provide useful and positive implications for market participants and policy makers dealing with investment diversification and tracing the paths of risk shock transmission.
in recent years, financial markets have become extremely volatile, most notably during the global financial crisis in 2008 and the sustained plunge in global stock markets caused by covid-19 in 2020. this has drawn considerable attention from academia to measuring systemic risk and understanding how systemic risk spreads across sectors or markets. there is evidence that financial systemic risk threatens "the function of a financial system" (european central bank (ecb), 2010) and impairs "the public confidence or the financial system stability" (billio et al., 2012). it is widely observed that systemic risk spillovers follow a significant "production-contagion-re-contagion" pattern. given the interconnectedness within a market, once one sector encounters a risk shock, the risk will affect other sectors through strong linkages and contagion mechanisms, and may even spread to the entire financial market. in this context, investigating the connectedness among financial markets and the systemic risk spillover contagion mechanism across sectors or markets becomes important and necessary: it helps regulators identify sources of risk and formulate intervention strategies, and it helps investors make smarter portfolio decisions. the complex relationships between financial markets and their internal elements are the carrier of systemic risk transmission, and their connectedness patterns and structures play an important role in the formation and contagion of systemic risk. moreover, the concept of systemically important financial institutions (sifis) can be extended to broader markets or sectors. some scholars find that sectors respond differently to shocks owing to their own market and sectoral heterogeneity and risk features (ewing, 2002; ranjeeni, 2014; yang et al., 2016; wu et al., 2019). for stock market participants, sectoral indexes can be used as a significant indicator to assess portfolio performance. identifying which sector is the most influential and how systemic risks spill over among sectors is essential for effective risk management and optimal portfolios. therefore, in addition to cross-institution risk spillovers, cross-sector and cross-market risk transfer has become increasingly prominent: it not only greatly increases the probability of cross-contagion of financial risks, but may also trigger broader systemic risks. drawing on the sifi literature, this paper measures the systemic risk of each sector in the chinese stock market, analyzes the spatial connectedness of the sectors to determine which sectors play a leading role in risk spillovers or market co-movements, and explores the risk spillover transmission paths. the results of this study can help manage systemic risk and preserve financial stability, which in turn contributes to the smooth functioning of the real economy. despite its great importance, the existing literature on this cross-sector topic is relatively scarce. in this study we first develop and apply the tail risk network based on the single-index generalized quantile regression model of härdle et al. (2016), which accounts for non-linear relationships and variable selection. further, we investigate the tail risk network topology and its dynamic changes to analyze the spatial connectedness of 24 chinese sectors during 2007-2018. to understand the impact of network connectivity on systemic risk, we draw on the orthogonal pulse function and find that the hhi of edges has a significant positive effect on systemic risk.
second, we adopt the pagerank index to identify systemically important sectors. we observe that although the systemically important sectors are time-varying, the utilities and financial sectors (including banks, insurance, and diversified finance) should still receive more attention. finally, we innovatively use block models to assess the roles of different spillover blocks and to excavate the transmission paths of risk spillovers in different blocks. the remainder of the paper is organized as follows. section 2 outlines the literature related to our study. section 3 presents the data and methodology. section 4 reports the empirical results. section 5 concludes. systemic risk threatens the stability and functioning of financial markets when the stock market is confronted with a sharp downtrend and reduced market confidence and willingness of risk taking. it is considered the risk of causing a large number of participants in the market to suffer serious losses at the same time, spreading quickly through the system (benoit et al., 2017). a number of researchers have discussed the measurement of the financial system's systemic risk and macroprudential risk management approaches (laeven et al., 2016; acharya et al., 2012). the relevant literature in this field can be roughly divided into four categories. the first is the conditional value-at-risk (covar). adrian and brunnermeier (2007) put forward the covar, defined as the var of the financial system conditional on a single market or sector encountering some specific event. they then proposed a systemic risk measure, δcovar (adrian and brunnermeier, 2016), defined as the difference between the covars when a sector is and is not under distress. girardi and ergun (2013) used the covar approach to measure the systemic risk contribution of four financial sectors composed of many institutions, and investigated the relationship between institutional characteristics and systemic risk contributions. the second approach emphasizes the default probability of financial institutions through the interrelationships between financial assets, for example principal-component-based analysis (e.g., kritzman et al., 2011; bisias et al., 2012; rodriguez-moreno et al., 2013; and others) and cross-correlation-based analysis (e.g., huang et al., 2009; patro et al., 2013). the third category uses the copula function to calculate systemic risk in the fat left tail of the stock market. krause et al. (2012) adopted the copula function to calculate the nonlinear correlation of time series and established an interbank lending network; the results show that externally failed banks can trigger a potential banking crisis, and they analyze the spread of risks within the banking system. the last category looks at an institution's expected equity loss when the financial system is suffering losses. acharya et al. (2017) proposed the marginal expected shortfall (mes) and systemic expected shortfall (ses), two systemic risk measures. further approaches take into account market capitalization and liability information, such as the srisk (brownlees et al., 2017) and the component expected shortfall (ces) (banulescu et al., 2015). both the srisk and ces measures focus on the interdependence between a financial institution and the financial system, and ignore the interconnectedness among financial agents from a whole-system perspective. however, as pointed out by bluhm et al.
(2014), macro-prudential monitoring is still at a very early stage; quantifying the magnitude of systemic risk and identifying the transmission paths require more scientific analysis. to do so we apply network methodology to quantify the interconnectedness among sectors in the financial system. network theory has long been a leading tool for analyzing intricate connectedness relationships because it can overcome the "dimension barrier" of multivariate econometric models and simplify complex financial systems (acemoglu et al., 2015; battiston et al., 2016; huang et al., 2018). in a financial network, financial entities (e.g., institutions, sectors, and markets) are abstracted as nodes, and correlations among agents are abstracted as edges. the early literature on classic network construction methods concerns correlation-based networks, such as the minimal spanning tree (mst) (mantegna, 1999), the planar maximally filtered graph (pmfg) (tumminello et al., 2005), and the partial correlation-based network. the main disadvantage of correlation-based networks is that the economic or statistical meaning of their topological constraints is unclear (onnela et al., 2003; kenett et al., 2015; zhang et al., 2019). more recently, several econometric networks have been constructed to uncover information spillover paths and contagion sources (výrost et al., 2015; lyócsa et al., 2017; belke et al., 2018). the extensively used econometric networks can be classified into three groups: (i) mean-spillover networks (also called granger-causality networks), proposed by billio et al. (2012); (ii) volatility spillover networks, e.g., the variance-decomposition-based network of diebold and yilmaz (2014) and the garch-model-based network of liu et al. (2017); (iii) risk spillover networks, which mainly include the tail-risk-driven networks of hautsch et al. (2015) and härdle et al. (2016), and the extreme risk network of wang et al. (2017). of course, many studies have discussed the application of spillover networks. this study is distinguished from the existing information spillover literature by focusing on systemic risk spillovers, especially tail risk spillovers. the last strand is associated with the tail risk spillover network and its applications. hautsch et al. (2015) used the least absolute shrinkage and selection operator (lasso) method to build a tail risk network for the financial system and evaluated the systemic importance of financial firms. the tenet framework is proposed by härdle et al. (2016) and is built on a semiparametric quantile regression framework that considers non-linearity and variable selection; they discovered the asymmetric and non-linear dependency structure between financial institutions and identified systemically important institutions. wang et al. (2017) applied the caviar tool and granger causality test to measure systemic risk spillovers, and then proposed an extreme risk spillover network for studying the interconnections among financial institutions. our work contributes to the literature in three major aspects. first, we analyze the characteristics of spatial connectedness and systemic risk spillovers of the tail risk network using sectoral data from the chinese stock market; we extend the literature on interconnectedness and systemic risk to sector-level data, while the extant literature generally focuses on financial institution data. second, we innovatively adopt the orthogonal pulse function to explore the impact of network connectivity on the systemic risk of the financial system.
besides, we employ the pagerank index to identify systemically important sectors that spread systemic risk spillovers to the entire system. third, we apply the block model to assess the roles of the different spillover blocks of the 24 sectors in the risk contagion process, and to excavate the tail risk transmission paths and contagion mechanisms. the existing literature focuses more on network topology and the identification of important financial institution nodes in financial institution networks, but lacks an analysis of the risk propagation mechanism. importantly, it is necessary to clarify how systemic risk transfers across sectors. in order to analyze the systemic risk spillovers and interconnectedness across chinese sectors, we select the weekly closing prices of 24 sectors in china's stock market (name abbreviations of the 24 industries are given in appendix table a1). the sample ranges from january 4, 2007 to december 31, 2018 (613 trading weeks in total), and the industry classification data are available from the wind database. our analysis centers on the weekly log returns of each sector, defined as $r_{i,t} = \ln(p_{i,t}/p_{i,t-1})$, where $p_{i,t}$ is the closing price of sector $i$ in week $t$. table 1 presents the descriptive statistics of the weekly returns of the 24 sectors over the sample period. note that the maximum of each return series, except for pbls and df, is less than the absolute value of the minimum, implying that there is extreme risk in the left tail of the return distribution. besides, the jarque-bera (jb) statistic of each sector is significant at the 1% level, rejecting the null hypothesis of a gaussian distribution for the series. we can therefore use single-index model (sim) quantile regressions to estimate the covar. apart from the closing price data, motivated by prior studies, we also collect five macro state variables and four internal variables. the macro state variables comprise the weekly market returns, the market volatility, the real estate sector returns, the credit spread, and the liquidity spread, which depict the economic situation. the internal variables comprise the size, the turnover rate, the p/e ratio, and the p/bv, which reflect the influence of an industry's fundamental characteristics. the detailed definitions of these variables are given in table 2. notes (table 1): *** denotes significance at the 1% level.

table 2: variable definitions.
market return: weekly market returns calculated from the shanghai securities composite index.
market volatility: the conditional variance of the shanghai composite index returns estimated by a garch(1,1) model.
real estate sector returns: the weekly logarithmic return of the real estate index.
credit spread: difference between the 10-year treasury bond yield and the aaa-rated corporate bond yield.
liquidity spread: difference between the three-month shanghai interbank offered rate and the three-month treasury bond yield.
size: defined as the turnover, which equals the volume multiplied by the average price.
turnover rate: the weekly turnover rate, available from wind info.
p/e ratio: the weekly price-earnings ratio.
p/bv: calculated as price/book value.

in this paper, we adopt the novel tenet framework proposed by härdle et al. (2016) to measure the tail risk interconnectedness among industries and to build a dynamically evolving tail risk network for china's stock market. as is well known, adrian and brunnermeier (2016) only model linear interaction between two financial institutions; however, chao et al. (2015) find that any two interacting financial assets show non-linear dependency, especially in uncertain economic periods. therefore, accounting for non-linear dependency, härdle et al.
(2016) extend the bivariate model to a high-dimensional setting and solve the variable selection problem with single-index quantile regressions. accordingly, we carry out three estimation steps to construct the tail risk network.

first step: the var of each industry $i$ at quantile level $\tau \in (0,1)$ is estimated by linear quantile regression. given the return of industry $i$ at time $t$,
$$ r_{i,t} = \alpha_i + \gamma_i m_{t-1} + \varepsilon_{i,t}, \qquad \widehat{var}_{i,t}(\tau) = \hat{\alpha}_i + \hat{\gamma}_i m_{t-1}, $$
where $m_{t-1}$ represents the lagged macro state variables and $\alpha_i$, $\gamma_i$ are the estimated parameters. (a minimal code sketch of this step is given at the end of this subsection.)

second step: the covar is the basic element of the network, and it reflects the systemic (tail) risk interconnections of sectors. the tail risk interconnectedness from one industry to another in the tail risk network stands for systemic risk contagion and network spillovers; thus the covar is estimated next. we perform a risk connectedness analysis accounting for non-linear dependency in high-dimensional variables and adopt the sim quantile regression technique to obtain the systemic risk contribution due to a change in the relevant industry. it is obtained via
$$ r_{j,t} = g\big(\beta_j^{\top} z_{j,t}\big) + \varepsilon_{j,t}, \qquad \widehat{covar}_{j,t}(\tau) = \hat{g}\big(\hat{\beta}_j^{\top} \tilde{z}_{j,t}\big), $$
where $z_{j,t} = \{r_{-j,t},\, m_{t-1},\, b_{j,t-1}\}$ collects the independent variables: $r_{-j,t} = \{r_{1,t}, r_{2,t}, \ldots, r_{n,t}\} \setminus \{r_{j,t}\}$ are the returns of all industries apart from industry $j$, $n$ denotes the number of sectors, and $b_{j,t-1}$ are the internal features of industry $j$ (size, turnover rate, p/e ratio, p/bv); $\beta_j$ are the parameters. the covar represents the network risk triggered by a tail event, which includes the effect of all other relevant industries on industry $j$ and the non-linearity reflected by the link function $g(\cdot)$. the marginal effect of the covariates is quantified by the componentwise gradient
$$ \hat{d}_{j|i} = \left. \frac{\partial \hat{g}\big(\hat{\beta}_j^{\top} z_{j}\big)}{\partial r_{i}} \right|_{z_{j} = \tilde{z}_{j}}, $$
where $\hat{d}_{j|i}$ reflects the risk spillover effect from industry $i$ to industry $j$. note that we only consider the partial derivatives of the other industries with respect to industry $j$ in a given network. additionally, rolling window estimation is used to estimate all parameters.

last step: the directed tail risk network is constructed. it is denoted as a graph $g(v, e)$ with a set of nodes $v = \{v_1, v_2, \ldots, v_n\}$ and a set of edges $e$. the adjacency matrix with all linkages at window $s$ is
$$ a^{s} = \Big[\, |\hat{d}^{s}_{j|i}| \cdot \mathbb{1}(i \neq j) \,\Big]_{i,j=1,\ldots,n}, \tag{7} $$
where $v_i$ denotes industry $i$. the absolute value $|\hat{d}^{s}_{j|i}|$ is the element of the weighted matrix and measures the risk connectedness from industry $i$ to industry $j$.

(1) network concentration. concentration is an important indicator of network structure and represents the density of the linkages. following fang et al. (2019), we apply the herfindahl-hirschman index (hhi), which is generally used to measure the extent of concentration in an industry: it equals the sum of the squared market shares of the firms and measures the degree of monopoly. analogously, the hhi can reflect the degree of risk network concentration, consistent with our definition. it is calculated as
$$ hhi^{s} = \sum_{i=1}^{n} \left( \frac{k_i^{s}}{k^{s}} \right)^{2}, $$
where $k_i^{s}$ is the number of edges connected to node $i$ at window $s$ and $k^{s}$ is the total number of network edges, so that $k_i^{s}/k^{s}$ denotes the proportion of connected edges of node $i$ in window $s$, i.e., the degree of node $i$'s relative linkages.

(2) node strength. the node strength considers not only the number of directly connected edges but also the weights of the edges. it can be seen from the adjacency matrix in formula (7) that risk spillover is directional. therefore, this article pays more attention to the sectors that spread or absorb risk; that is, the out-strength (in-strength) is used to measure the risk contagion (absorption) ability of each sector.
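before formalizing these strength measures, the following is a minimal sketch of the first estimation step above (linear quantile regression of sector returns on lagged macro state variables), assuming hypothetical data frames; it illustrates the idea and is not the authors' exact code. the second (sim) step requires a single-index quantile regression for which no off-the-shelf statsmodels routine exists, so it is omitted here.

```python
# minimal sketch of step 1: VaR via linear quantile regression (hypothetical data).
import pandas as pd
import statsmodels.api as sm

TAU = 0.05  # quantile level used in the paper

def estimate_var(sector_returns: pd.Series, macro: pd.DataFrame) -> pd.Series:
    """fit r_{i,t} = a + g * M_{t-1} + e at quantile TAU; the fitted values
    give the VaR series of the sector."""
    X = sm.add_constant(macro.shift(1)).dropna()   # lagged macro state variables
    y = sector_returns.loc[X.index]                # align the samples
    res = sm.QuantReg(y, X).fit(q=TAU)
    return res.fittedvalues                        # VaR_{i,t}(tau)
```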
now we introduce two directional measures of sector strength, the out-strength and the in-strength, which measure each sector's outgoing and incoming connectedness, respectively. the out-strength (os) of sector $i$ is the sum of the weights of the outgoing edges from sector $i$ to the other sectors:
$$ os_i^{s} = \sum_{j \neq i} \big|\hat{d}^{s}_{j|i}\big|. $$
the in-strength (is) of sector $i$ is the sum of the weights of the incoming edges from the other sectors to sector $i$:
$$ is_i^{s} = \sum_{j \neq i} \big|\hat{d}^{s}_{i|j}\big|. $$

(3) pagerank. assume that node $i$ has a direct link to node $j$: the more important node $i$ is, the higher its contribution to the value of node $j$. the pagerank thereby reflects the connectedness between one industry and another while taking the influence of its neighbors into account. the pagerank algorithm is a variant of eigenvector centrality on the adjacency matrix. as in wang et al. (2019), we compute the pagerank (pr) indicator through an iterative method that introduces a dynamic process of information spread. first, we compute the effect weights from the risk network matrix (eq. (7)), normalized as
$$ w_{i,j}^{s} = \frac{a_{i,j}^{s}}{\sum_{j} a_{i,j}^{s}}, $$
where $w_{i,j}^{s}$ denotes the effect weight of sector $i$ on sector $j$ at window $s$. second, we adopt the pagerank algorithm proposed by page et al. (1999):
$$ pr_i = \frac{1-d}{n} + d \sum_{j} w_{j,i}^{s} \, pr_j, $$
where $d$ is a damping factor (generally set to 0.85) and $pr_i$ is the pagerank of sector $i$, whose value is always positive. a higher value means that sector $i$ contributes more to the systemic risk of the network.

the block model is the main method for spatial cluster analysis of complex financial networks. it was first proposed by white et al. (1976) as a method of studying network location modules, viewing social life as an interconnected system of roles. later scholars conducted in-depth research on and promotion of this concept from many angles, and many have used the block model to study specific issues, such as the scientific community (breiger, 1976), the world economy (snyder et al., 1979), organizational issues, and regional contagion effects (shen et al., 2019). in short, the concept and method of the block model have been widely applied. the block model identifies aggregation features among individuals in order to divide the network into location blocks. this method not only determines the members included in each block, but also analyzes the role played by each block in the risk propagation process and explores the spatial risk propagation paths (li et al., 2014; zhang et al., 2020). there are four role blocks: (i) main benefit: members of this block receive links both from external members and from their own members; the proportion of internal relations is large, while the proportion of external relations is small. in the extreme case it is called an isolated block, i.e., the block has no connection with the outside. (ii) main spillover: members of this block send more links to other blocks, send fewer links to inside members, and receive few links from outside. (iii) bilateral spillover: members send many links to their own members and to other blocks' members, but receive very few external links from other blocks. (iv) brokers: members both send and receive external relationships, while there are fewer connections among their internal members. motivated by wasserman et al. (1994), we analyze the relationship of each member of a block using the evaluation indicators shown in table 3.
suppose block $b_k$ contains $n_k$ nodes; then the number of possible relationships inside $b_k$ is $n_k(n_k-1)$. the entire network contains $n$ nodes, so the members of $b_k$ can be involved in $n_k(n-1)$ possible relationships in total. in this way, the expected internal relationship ratio of the block is $n_k(n_k-1)/[n_k(n-1)] = (n_k-1)/(n-1)$.

table 3: four types of blocks.
internal linkages ratio ≥ (n_k−1)/(n−1) and received external linkages ratio ≈ 0: bilateral spillover.
internal linkages ratio ≥ (n_k−1)/(n−1) and received external linkages ratio > 0: main benefit.
internal linkages ratio < (n_k−1)/(n−1) and received external linkages ratio ≈ 0: main spillover.
internal linkages ratio < (n_k−1)/(n−1) and received external linkages ratio > 0: brokers.

in this part, we apply sliding windows to estimate the time-varying var and covar and to construct the dynamically evolving tail risk network. we use linear or non-linear quantile regression models at the quantile level $\tau = 0.05$, and the sliding window size is set to $w = 50$ trading weeks (corresponding to one year of weekly data). in this way, we obtain $s = 563$ windows in total. as a preliminary analysis of the whole sample, we present the log returns and covar of the 24 sectors, and the dynamic evolution of the total connectedness and average lambda of the tail risk network, from 2008-01-04 to 2018-12-31 (window size w = 50, s = 563). fig. 1 (bottom) shows the dynamic evolution of the total connectedness of the tail risk network. we observe that the average lambda value has two obvious peaks, corresponding to the us subprime mortgage crisis in 2008 and the domestic stock market turmoil in 2015. the total connectedness of the tail risk network, however, had at least five peaks, corresponding to the us subprime mortgage crisis in 2008, the european debt crisis in 2011, the money shortage in 2013, the stock market turmoil in 2015, and the us-china trade friction in 2018. this phenomenon reflects that the total connectedness of the tail risk network is more sensitive to shocks to the chinese stock market, and may serve as an alarm before market turmoil. in this section we first measure the network edge concentration to reflect the overall connectivity of the chinese sectoral tail risk network, and investigate the impact of edge concentration on systemic risk at the global level. fig. 2 shows the dynamic evolution of the edge concentration of the sectoral tail risk spillover network. from fig. 2 we can see that the edge concentration (hhi) displays apparent periodic variation, a finding that is basically consistent with the periodic evolution of systemic risk over time. the first and last peaks are the most notable, corresponding to the 2008 financial crisis and the 2015 domestic stock market turbulence. we take the period of the last peak (2015/1/30-2016/12/30) as an example. in this period, the most significant change in the hhi value is a rapid climb from 0.187 (may 2015) to 0.228 (july 2015). as potential risks continued to accumulate, the concentration of edges reached a maximum of 0.232 (january 2016). in the early stage, the chinese government issued a series of economic reform measures, which stimulated investors' blindly optimistic expectations. besides, large-scale funds of financial institutions entered the stock market through highly leveraged off-market margin allocation, and the excessive risk-taking behavior of different types of firms in the stock market led to an increase in indirect correlation.
gradually, owing to the downward pressure on china's macro economy and the strict investigation of off-market margin allocation by the china securities regulatory commission, a large-scale withdrawal of credit funds and an avalanche-like chain reaction led to the 2015-2016 stock market crash. this shows that as market turbulence intensifies, the concentration of the risk network increases and the edges of the entire network become concentrated in a few highly central sectors. at such times the stability of the network structure is very poor: if these nodes encounter a risk shock or infection, systemic risk will quickly spread throughout the network, and the risk spillover effects between sectors will be significantly strengthened. conversely, as the risk is released and the market gradually stabilizes or rises, the hhi value becomes smaller. this phenomenon indicates that the tail risk network then exhibits a multi-center rather than single-center structure; a multi-center network structure facilitates the dispersal of risk information through multiple channels and is conducive to maintaining the stability of the stock market network. here, we adopt the orthogonal pulse function to test the short-term dynamic relationship between network edge concentration (the hhi, which serves as the source of the shock) and systemic risk. this method is widely used to analyze the relationships between variables (pradhan, 2015; berument et al., 2009). the pulse function not only presents the direction of the influence, but also reflects the significance level and the time lags. fig. 3 depicts the response of systemic risk to network edge concentration; the vertical axis denotes systemic risk, while the horizontal axis denotes the time lag (in months) after the shock. it is observed that the hhi of edges initially has no significant positive effect on systemic risk. with the accumulation of risks, the hhi begins to show a positive effect on systemic risk from the second month, reaching its maximum in the fourth month. the effect then gradually fades and becomes insignificant after nine months. the result shows that the hhi has a significant positive impact on systemic risk, but the impact exhibits a certain lag. a reasonable explanation for this phenomenon is that as the hhi value increases, the connected edges in the network are increasingly controlled by a few central nodes, so the systemic risks of the network are cumulatively amplified; the "slow accumulation, rapid release" character of systemic risk and the shortcomings of the chinese financial market under severe macro-regulation are important reasons for the lagged effect. of course, this provides strong evidence supporting the results in fig. 1, which suggest that the concentration of the risk network is more sensitive to cumulative systemic risks (a minimal code sketch of this impulse-response exercise is given below). in addition to analyzing the overall connectedness of the tail risk network, we also analyze the weighted and directed edges of individual industry nodes. figs. 4 and 5 reflect the dynamic evolution of the risk propagation and risk absorption of each sector over the entire period, respectively. first of all, we can see that both the risk propagation and the risk absorption ability of each sector change over time. many in-strength values are less than one, and only a few sectors have larger values (see fig. 4), suggesting that these few sectors are seriously infected by external shocks and receive the highest tail risk.
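before turning to the period-by-period analysis of figs. 4 and 5, the following minimal sketch (hypothetical data, not the authors' code) illustrates the impulse-response exercise of fig. 3: a bivariate var of edge concentration and systemic risk, with the orthogonalized response of systemic risk to an hhi shock. the column names and lag settings are illustrative assumptions.

```python
# hypothetical sketch of the orthogonalized impulse-response exercise (fig. 3).
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["hhi", "systemic_risk"])

res = VAR(df).fit(maxlags=4, ic="aic")   # lag order chosen by aic
irf = res.irf(12)                        # responses over 12 periods
# orth_irfs[h, i, j] = response of variable i at horizon h to an
# orthogonalized one-s.d. shock to variable j (here: hhi -> systemic risk)
response = irf.orth_irfs[:, 1, 0]
print(response)
```

with real series, a significantly positive `response` over intermediate horizons would correspond to the lagged positive effect of edge concentration on systemic risk reported above.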
in the first shock event period (2008/1/25-2009/12/31), four sectors, i.e., business and professional services (bps), media (med), home and personal items (hpi), and healthcare equipment and services (hes), have the largest in-strength values and are the top receivers of tail risk. the results show that the systemic risk from the us subprime mortgage crisis seriously shocked china's real-economy sectors, and these industries accumulated more tail risk. in the second extreme event period (2010/1/29-2012/12/28), which covers the european sovereign debt crisis, healthcare equipment and services (hes), software and services (ss), and semiconductor and semiconductor production equipment (sspe) receive the largest incoming links. during the chinese stock market turbulence period (2015/1/30-2016/12/30), the strong incoming links accrue to business and professional services (bps), media (med), software and services (ss), and semiconductor and semiconductor production equipment (sspe). in the us-china trade friction period (2017/1/26-2018/12/28), five sectors, i.e., software and services (ss), semiconductor and semiconductor production equipment (sspe), insurance (ins), utilities (ut), and business and professional services (bps), have strong incoming links, showing that these sectors are the most affected by tail risk. this finding is consistent with the fact that in 2018 the us imposed trade sanctions on commodities of various chinese industries, including communications, electronics, machinery and equipment, automobiles, and furniture, which correspond to the above industry classifications. hence, the greater the in-strength value, the deeper the adverse impact on a sector from the other sectors, and the more serious the damage.

fig. 5: out-strength of each sector for the dynamic tail risk network. notes: the horizontal axis (x) denotes time windows, and the vertical axis (y) denotes the abbreviation code of sectors (the corresponding full name of each code is presented in appendix table a1).

as can be seen from fig. 5, the distribution of the out-strength differs from that of the in-strength and is relatively even. many sectors have low out-strength values, and only a few out-strength values are large, indicating that a few sectors emit the highest systemic risk. in the first event period (2008/1/25-2009/12/31), the strongly connected sectors with outgoing links are energy (ene), diversified finance (df), insurance (ins), and utilities (ut); affected by the us subprime mortgage crisis, these industries were the main senders of tail risk. in the third event period (2013/1/25-2014/12/31), which covers the money shortage of 2013, home and personal items (hpi), media (med), and diversified finance (df) send the largest outgoing links to the others. one reason for the home and personal items sector having a high level of outgoing links is that the reduction of currency circulation in the market directly reduced consumers' daily consumption. in the fourth event period (2015/1/30-2016/12/30), which covers the 2015-2016 china stock market turbulence, two financial sectors, bank (bank) and diversified finance (df), together with media (med), have strong outgoing links and are involved in most risk spillovers. this phenomenon corroborates that financial institutions (especially the securities sector) triggered the recent bear market.
overall, the greater the value of the out-strength, the stronger the ability of a sector to spread tail risk to other sectors, and the greater its impact on them. connectedness alone, however, cannot represent the systemic importance of an individual sector. we therefore calculate the pagerank index, since it considers both the interconnectedness and the influence of neighboring nodes. to obtain a comprehensive picture of the systemic importance of each sector, we draw the heatmap of the pagerank values shown in fig. 6. obviously, the influence of different industries varies greatly across periods. from fig. 6, we observe that the pagerank value of most industries is less than 0.05, while only a few sectors have high values, showing that these sectors act as influential sectors in the chinese stock market. for example, in the first risk event period, the top three sectors are utilities (ut), diversified finance (df), and media (med), which are thus systemically important. utilities (ut) and insurance (ins) are the systemically important sectors in the second risk event period. the most important reason why ut becomes a systemically important industry is that the utilities industry provides infrastructure support for the development of other industries. furthermore, in the third risk event period, home and personal items (hpi) and diversified finance (df) have larger pagerank values and are thus the largest tail risk contributors during that period. in the fourth risk event period, only diversified finance (df) consistently presents a higher pagerank value. one major reason for diversified finance being the influential sector is that large-scale abnormal securities margin transactions caused a surge in systemic risk under unregulated conditions, which in turn affected many associated industries through asset-liability relationships or high leverage. at the end of 2017, utilities (ut) and energy (ene) are the systemically important sectors. as mentioned above, the utilities and financial sectors should receive more attention over the whole period from both regulators and investors, as they become systemically important in many risk event periods. in the near future the dependence on the utilities industry is unlikely to decline significantly, which may reinforce utilities stocks. besides, for the financial sectors, the development of the whole industry depends much on balancing the financial structure, strengthening financial regulation, and improving financial innovation. the most fundamental reason is that the financial sector is an important sector in the national economy: it has high industry linkage and strong driving ability, and it provides financial support for the development of enterprises in many sectors. once the financial industry is in a downturn, it affects the development of the entire industry chain.

fig. 6: heat map of the pagerank value of each sector for the dynamic tail risk network. notes: the horizontal axis (x) denotes time windows, and the vertical axis (y) denotes the abbreviation code of sectors (the corresponding full name of each code is presented in appendix table a1).
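before turning to the block-model analysis, the following minimal sketch (hypothetical adjacency matrix, not the authors' code) shows how the window-level measures behind figs. 2, 4, 5, and 6 — edge-concentration hhi, out-/in-strength, and pagerank — can be computed from one window's estimated spillover matrix.

```python
# hypothetical sketch: window-level network measures from a spillover matrix.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(24, 24)))  # hypothetical |D_{j|i}| estimates, one window
np.fill_diagonal(A, 0.0)               # no self-loops, cf. eq. (7)

out_strength = A.sum(axis=1)           # risk transmitted by each sector
in_strength = A.sum(axis=0)            # risk received by each sector

G = nx.from_numpy_array(A, create_using=nx.DiGraph)

# hhi over each node's share of connected edges (degree shares)
deg = np.array([d for _, d in G.degree()], dtype=float)
hhi = float(np.sum((deg / deg.sum()) ** 2))

# pagerank on the weighted digraph, damping factor d = 0.85 as in the paper
pr = nx.pagerank(G, alpha=0.85, weight="weight")
print(hhi, sorted(pr.items(), key=lambda kv: -kv[1])[:3])
```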
this section divides the 24 sectors into different blocks through the block model, identifies which sectors are likely to cluster in the same community, and then further examines the relative roles of each block in the sectoral tail risk network. this method reflects more simply and clearly the function of each industry and the risk propagation paths in the risk spillover process, and it helps the regulatory authorities and investors to grasp the risk transmission mechanism, formulate risk prevention measures and optimize asset allocation strategies. here, we conduct a segmented sample study covering five sub-samples: period 1 is the us subprime crisis from 2008 to 2009; period 2 is the european debt crisis from 2010 to 2012; period 3 is the money shortage period from 2013 to 2014; period 4 is the 2015-2016 chinese stock market turbulence; and period 5 is the trade friction between the us and china from 2017 to 2018. following existing research practices (chen et al., 2019; zhang et al., 2020), we used the ucinet software to partition the adjacency matrix of the tail risk network into block positions, setting the maximum separation depth to 2 and the convergence criterion to 0.2. this yields four risk spillover blocks in each of the five sub-samples. table 4 presents the spatial connectedness and role analysis of the risk blocks of sectors in the five sub-samples. from table 4, it is observed that there are significant differences in the roles played by the four major blocks, and the features of the blocks vary across time. we now take period 1 and period 5 as examples to analyze the risk spatial linkages of the 24 sectors. specifically, in periods 1 and 5, the internal linkages within the four blocks number 69 and 28, respectively, while the cross-linkages between the four blocks number 67 and 88, respectively, indicating that the spatial spillovers between the four blocks are very pronounced. in period 1, the number of sending relations in the first block is 13, of which 5 are within the block, and the block receives 19 relations from other blocks; the expected internal relation ratio is 13.04% and the actual ratio is 38.46%, so it is called the "main benefit block". members of the first block are ene, ut, bank and re, indicating that the tail risk spillovers between these sectors are closely linked and that they are easily affected by external risk shocks. furthermore, the number of sending relations in the second block is 22, of which 9 are within the block, and the block receives 12 relations from other blocks; the expected internal relation ratio is 13.04% and the actual ratio is 40.91%, so it is called a "bilateral spillover block". similarly, the third and fourth blocks are also "bilateral spillover blocks". members of the second block are tsp, cs, df and ins, showing that fluctuations originating in these sectors generate large subsequent fluctuations in other sectors; e.g., the transportation industry (tsp) is an upstream industry for many industries, and when risks occur they are transferred to other industries through sector linkages. overall, the internal links ratio of the first and second blocks is low, while the ratio of the third and fourth blocks is high, and the third and fourth blocks emit more links to each other. in period 5, the number of sending relations in the first block is 46, of which 16 are within the block, and the block receives 17 relations from other blocks, including 10 links from the fourth block; the expected internal relation ratio is 26.09% and the actual ratio is 34.78%, so it is a "bilateral spillover block".
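before turning to the remaining blocks of period 5, the role-assignment bookkeeping can be sketched as follows. the expected internal relation ratio (g_k - 1)/(g - 1) reproduces the 13.04% (a 4-member block among 24 sectors) and 21.74% (a 6-member block) figures quoted in the text; the decision thresholds separating the sparse-block roles are simplified assumptions, not the exact ucinet procedure.

```python
# a minimal sketch of the block-role taxonomy used above; the cutoffs for
# "net benefit" / "broker" / "main spillover" are illustrative assumptions.
def expected_internal_ratio(block_size: int, n_sectors: int = 24) -> float:
    # a block member can reach block_size - 1 internal partners out of
    # n_sectors - 1 partners overall, e.g. (4 - 1) / (24 - 1) = 13.04%
    return (block_size - 1) / (n_sectors - 1)

def block_role(sent_total, sent_internal, received_external, block_size, n_sectors=24):
    expected = expected_internal_ratio(block_size, n_sectors)
    actual = sent_internal / sent_total if sent_total else 0.0
    sent_external = sent_total - sent_internal
    if actual >= expected:
        # internally dense: a net absorber of external links is "main benefit"
        return "main benefit" if received_external > sent_external else "bilateral spillover"
    if received_external > 2 * sent_external:
        return "net benefit"      # mostly absorbs external risk
    if sent_external > 2 * received_external:
        return "main spillover"   # mostly emits external risk
    return "broker"               # passes risk through, bridging blocks

# period-1 first block from the text: 13 sent (5 internal), 19 received, 4 members
role = block_role(13, 5, 19, block_size=4)  # -> "main benefit"
```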
the second block sends 4 relations, with 0 internal links and only 1 link sent to the fourth block; the actual relation ratio of the second block is therefore 0%, and it is called the "net benefit block". members of the second block are re, bank, tsp, mat and the, indicating that these industries are more sensitive to external risk shocks and are the largest systemic risk receivers during the us-china trade friction period. for example, one likely reason for the prominent position of the real estate industry (re) in the risk network is its high degree of industry connectivity: it drives the development of materials, manufacturing, banking, home and personal items and other industries, and once the real estate industry is sluggish, turmoil propagates along the entire industry chain. the third block sends 20 links, of which 3 are internal, and it mainly receives relations from the fourth block; the expected internal relation ratio is 26.09% and the actual ratio is 15%, so it is called the "broker block", playing a "bridge" role in systemic risk transmission. importantly, strong spillover transmission between blocks may depend on the functions of the "broker block"; the reason may be the mutual linkages and bidirectional economic or financial effects between its members and the members of other blocks. the fourth block sends 45 links, of which 9 are internal, and it mainly sends relations to the third block; the expected internal relation ratio is 21.74% and the actual ratio is 20%, so it is called the "main spillover block". members of the fourth block are bps, dcgc, df, ins, ss and sspe, on which the us levies high tariffs, and they therefore become the spillover engine. overall, the internal links ratio of the first block is high, while the ratios of the second and third blocks are low. the detailed analysis of periods 2-4 is not presented due to space limitations; the detailed results are shown in table 4. in order to more clearly reveal the spillover distribution and relative roles of the tail risk relationships between the 24 sectors, we calculate the density matrix and image matrix of each block (shown in table 5). the overall density values of the tail risk network in the five periods are 0.246, 0.257, 0.219, 0.230 and 0.210, respectively. the overall network density is used as the critical value: if a block's density is greater than the overall network density, the corresponding position in the image matrix is assigned 1; otherwise, the value is 0. for example, in period 1, the density of the first block is 0.417, which is greater than the overall network density (0.246), showing that the block's density exceeds the whole-network average and that the risk spillover linkages within the block have a significant tendency to concentrate.
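the density-to-image-matrix step just described is a simple thresholding operation; a minimal sketch follows, in which only the period-1 overall density (0.246) and first-block density (0.417) are taken from the text, and the remaining matrix entries are hypothetical.

```python
# a minimal sketch: binarize block-to-block densities against the overall
# network density of a period to obtain the image matrix.
import numpy as np

def image_matrix(density: np.ndarray, overall_density: float) -> np.ndarray:
    """assign 1 where a block-pair density exceeds the overall density."""
    return (density > overall_density).astype(int)

density_p1 = np.array([   # row i, col j: density of links from block i to j
    [0.417, 0.10, 0.30, 0.05],   # 0.417 is quoted in the text
    [0.28,  0.35, 0.08, 0.12],   # other entries are hypothetical
    [0.05,  0.30, 0.33, 0.40],
    [0.30,  0.27, 0.31, 0.29],
])
img = image_matrix(density_p1, overall_density=0.246)
# diagonal ones indicate dense internal spillovers (the "rich-club" effect)
```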
from the image matrix in table 5, taking period 1 as an example, we find that: (i) the diagonal elements of the image matrix are 1 for all four blocks, showing that risk spillovers within each block are closely related, which indicates an obvious "rich-club" effect; (ii) the first block receives risk spillover connections from the second and fourth blocks; (iii) the second block receives risk spatial connections mainly from the third block, and it plays the role of a "bridge", realizing the interconnection of risk spatial spillovers between the first and third blocks; and (iv) the fourth block realizes correlation and interaction with the other blocks through its risk spillover association with the first block. the results show that the interconnections between the different blocks mostly do not occur directly, but mainly through the transfer of the first and second blocks. in the following, we continue to analyze tail risk spillovers across blocks. in this context, fig. 7 displays the dynamic evolution of the risk spillover transmission mechanism between the four blocks. it is observed that the spatial connectedness between the risk spillover blocks is time-varying, since the members of the blocks are also time-varying; the blocks' features in the risk transmission process therefore differ across periods. from fig. 7, it is easy to see that the risk transmission path across the blocks is more complicated during the first, second and fifth periods, and simpler in the third and fourth periods. the most likely reason is that the sources of risk differ: the first, second and fifth periods were triggered by turmoil in foreign markets, which caused changes in the relevant industries in china, while the third and fourth periods were caused by domestic macroeconomic regulation or by certain sectors with higher levels of accumulated systemic risk. thus, risk shocks originating in a particular sector spread to the sectors of other blocks in a more or less homogeneous way, even though some blocks are not directly related to each other. for example, in period 3, the source of infection among the risk spillover blocks is the second block, which spreads systemic risk shocks to the first, third and fourth blocks simultaneously; however, there is no significant transmission channel of risk between the first and third blocks. members of the second block are cs, fbt, hes, pbls and bank, which should receive more attention and supervision from the regulatory authorities, and investors should avoid investment in these industries. in addition, in period 5, the fourth block acts as the risk spillover engine and directly transmits risk shocks to the first, second and third blocks. the members of the fourth block include bps, dcgc, df, ins, ss and sspe, of which dcgc, ss and sspe are subject to high tariffs under the us trade policy toward china; in this vein, the export of related products in these sectors is seriously affected, which in turn can easily trigger systemic risk. simultaneously, both the first and third blocks, acting as distinct bridges and hubs, pass the tail risk received from the fourth block on to the second block. therefore, the second block is the most sensitive block, since it receives risk spillovers from all blocks. due to space limitations, the analysis of risk transmission paths in the other periods is not repeated.
this paper applies a single-index model in a generalized quantile regression framework to assess non-linear relationships and perform variable selection, and on this basis we construct a dynamic tail risk network for 24 chinese sectors from 2007 to 2018. at the global level, we first analyze the connectedness of systemic risk spillovers in the tail risk network and investigate the impact of network concentration on systemic risk. at the individual sector level, we calculate the risk contagion or absorption intensity of each sector and adopt the pagerank method to identify systemically important sectors. finally, we use the block model to study the spillover distribution and relative roles of the tail risk relationships between the 24 sectors and to understand the financial risk transmission process across sectors. we report the following findings. first, there is a tail risk network that connects all sectors in the chinese stock market, and it is exposed to more systemic risk and higher total connectedness during market distress. further, the edge concentration of the risk network (hhi) is used to measure risk network interconnectedness and concentration, and it exhibits obvious cyclical features. during tail event (market downside) periods, the hhi index increases significantly; the risk network then has a relatively single-centered node structure, and network stability is poor. the results show that a multi-centered financial network, rather than a single pivotal center, can maintain financial market stability. second, the directional connectedness of sectors shows that systemic risk receivers and transmitters vary across time, providing evidence for "too linked to fail". besides, we identify the utilities and financial sectors as influential sectors, which should receive more attention over the whole sample period from both regulators and investors. finally, we find that the sectoral tail risk network can be divided into four spillover function blocks by the block model, which more clearly reflects the risk spillover distribution and the roles of the relevant industries in the process of systemic risk transmission. the roles of the blocks and the spatial spillover transmission paths between risk blocks are time-varying. this study has important policy implications for cross-sector linkages and systemic risk spillovers in the chinese stock market. first, it is necessary for the government to issue favorable policies, such as sectoral development policies or macro-control policies, in a timely manner, which will promote the influence of relevant industries in the stock market and thus create a multi-centered network that maintains financial market stability. second, investors should pay more attention to the systemically important sectors and build reversal strategies around these sectors to configure their assets and portfolios for risk minimization. supervision departments may consider the features of the four blocks and their spillover paths to formulate differentiated financial regulatory policies that improve the macro-prudential framework during stock market recession and instability periods. a thorough analysis of sectoral tail risk spillovers and their spatial connectedness can help monitor systemic risks and keep the financial system stable, which in turn contributes to the smooth functioning of the real economy.
references:
systemic risk and stability in financial networks
capital shortfall: a new approach to ranking and regulating systemic risk
measuring systemic risk
covar. federal reserve bank of new york staff report
which are the sifis? a component expected shortfall approach to systemic risk
complexity theory and financial regulation
international spillover in global asset markets
where the risks lie: a survey on systemic risk
monetary policy and u.s. long-term interest rates: how close are the linkages?
a survey of systemic risk analytics
econometric measures of connectedness and systemic risk in the finance and insurance sectors
systemic risk in an interconnected banking system with endogenous asset markets
career attributes and network structure: a block model study of a biomedical research specialty
srisk: a conditional capital shortfall measure of systemic risk
quantile regression in risk calibration
cross-border linkages of chinese banks and dynamic network structure of the international banking industry
on the network topology of variance decompositions: measuring the connectedness of financial firms
the transmission of shocks among s&p indexes
interconnectedness and systemic risk: a comparative study based on systemically important regions
systemic risk measurement: multivariate garch estimation of covar
tenet: tail-event driven network risk
financial network systemic risk contributions
financial network linkages to predict economic output
a framework for assessing the systemic risk of major financial institutions
co-movement of coherence between oil prices and the stock market from the joint time-frequency perspective
partial correlation analysis: applications for financial markets
interbank lending and the spread of bank failures: a network model of systemic risk
principal component as a measure of systemic risk
bank size, capital, and systemic risk: some international evidence
study on the spatial correlation and explanation of regional economic growth in china
features of spillover networks in international financial markets: evidence from g20 countries
return spillovers around the globe: a network approach. economic modelling
hierarchical structure in financial markets
dynamics of market correlations: taxonomy and portfolio analysis
the pagerank citation ranking: bringing order to the web
a simple indicator of systemic risk
orthogonal pulse based wideband communication for high speed data transfer in sensor applications
sectoral and industrial performance during a stock market crisis
systemic risk measures: the simpler the better?
china's regional financial risk spatial correlation network and regional contagion effect
structural position in the world system and economic growth, 1955~1970: a multiple network analysis of transnational interactions
a tool for filtering information in complex systems
analysing the systemic risk of indian banks
granger causality stock market networks: temporal proximity and preferential attachment
extreme risk spillover network: application to financial institutions
correlation structure and evolution of world stock markets: evidence from pearson and partial correlation-based networks
interconnectedness and systemic risk of china's financial
identifying influential energy stocks based on spillover network
social network analysis: methods and application
social structure from multiple networks i: block models of roles and positions
connectedness and risk spillover in china's stock market: a sectoral analysis
study on the contagion among american industries
spatial spillover effects and risk contagion around g20 stock market based on volatility network
spatial connectedness of volatility spillovers in g20 stock markets: based on block models analysis

key: cord-317435-4yuw7jo3 authors: zhou, yadi; hou, yuan; shen, jiayu; huang, yin; martin, william; cheng, feixiong title: network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2 date: 2020-03-16 journal: cell discov doi: 10.1038/s41421-020-0153-3 sha: doc_id: 317435 cord_uid: 4yuw7jo3

human coronaviruses (hcovs), including severe acute respiratory syndrome coronavirus (sars-cov) and 2019 novel coronavirus (2019-ncov, also known as sars-cov-2), lead to global epidemics with high morbidity and mortality. however, there are currently no effective drugs targeting 2019-ncov/sars-cov-2. drug repurposing, an effective drug discovery strategy that starts from existing drugs, could shorten the time and reduce the cost compared to de novo drug discovery. in this study, we present an integrative, antiviral drug repurposing methodology implementing a systems pharmacology-based network medicine platform, quantifying the interplay between the hcov-host interactome and drug targets in the human protein-protein interaction network. phylogenetic analyses of 15 hcov whole genomes reveal that 2019-ncov/sars-cov-2 shares the highest nucleotide sequence identity with sars-cov (79.7%). specifically, the envelope and nucleocapsid proteins of 2019-ncov/sars-cov-2 are two evolutionarily conserved regions, having sequence identities of 96% and 89.6%, respectively, compared to sars-cov. using network proximity analyses of drug targets and hcov-host interactions in the human interactome, we prioritize 16 potential anti-hcov repurposable drugs (e.g., melatonin, mercaptopurine, and sirolimus) that are further validated by enrichment analyses of drug-gene signatures and hcov-induced transcriptomics data in human cell lines. we further identify three potential drug combinations (e.g., sirolimus plus dactinomycin, mercaptopurine plus melatonin, and toremifene plus emodin) captured by the "complementary exposure" pattern: the targets of the drugs both hit the hcov-host subnetwork, but target separate neighborhoods in the human interactome network.
in summary, this study offers powerful network-based methodologies for rapid identification of candidate repurposable drugs and potential drug combinations targeting 2019-ncov/sars-cov-2. coronaviruses (covs) typically affect the respiratory tract of mammals, including humans, and lead to mild to severe respiratory tract infections 1. in the past two decades, two highly pathogenic human covs (hcovs), severe acute respiratory syndrome coronavirus (sars-cov) and middle east respiratory syndrome coronavirus (mers-cov), emerging from animal reservoirs, have led to global epidemics with high morbidity and mortality 2. for example, 8098 individuals were infected and 774 died in the sars-cov pandemic, which cost the global economy an estimated $30 to $100 billion 3,4. according to the world health organization (who), as of november 2019, mers-cov had a total of 2494 diagnosed cases causing 858 deaths, the majority in saudi arabia 2. in december 2019, the third pathogenic hcov, named the 2019 novel coronavirus (2019-ncov/sars-cov-2) and the cause of coronavirus disease 2019 (abbreviated as covid-19) 5, was found in wuhan, china. as of 24 february 2020, there have been over 79,000 cases with over 2600 deaths in the 2019-ncov/sars-cov-2 outbreak worldwide; furthermore, human-to-human transmission has occurred among close contacts 6. however, there are currently no effective medications against 2019-ncov/sars-cov-2. several national and international research groups are working on the development of vaccines to prevent and treat 2019-ncov/sars-cov-2, but effective vaccines are not yet available. there is an urgent need for the development of effective prevention and treatment strategies for the 2019-ncov/sars-cov-2 outbreak. although investment in biomedical and pharmaceutical research and development has increased significantly over the past two decades, the annual number of new treatments approved by the u.s. food and drug administration (fda) has remained relatively constant and limited 7. a recent study estimated that pharmaceutical companies spent $2.6 billion in 2015, up from $802 million in 2003, on the development of an fda-approved new chemical entity drug 8. drug repurposing, an effective drug discovery strategy that starts from existing drugs, could significantly shorten the time and reduce the cost compared to de novo drug discovery and randomized clinical trials 9-11. however, experimental approaches to drug repurposing are costly and time-consuming 12. computational approaches offer novel testable hypotheses for systematic drug repositioning 9-11,13,14. however, traditional structure-based methods are limited when three-dimensional (3d) structures of proteins are unavailable, which, unfortunately, is the case for the majority of human and viral targets. in addition, targeting single virus proteins often carries a high risk of drug resistance owing to the rapid evolution of virus genomes 1. viruses (including hcovs) require host cellular factors for successful replication during infection 1. systematic identification of virus-host protein-protein interactions (ppis) offers an effective way toward elucidating the mechanisms of viral infection 15,16. subsequently, targeting cellular antiviral targets, such as the virus-host interactome, may offer a novel strategy for the development of effective treatments for viral infections 1, including sars-cov 17, mers-cov 17, ebola virus 18, and zika virus 14,19-21.
we recently presented an integrated antiviral drug discovery pipeline that incorporated gene-trap insertional mutagenesis, a known functional drug-gene network, and bioinformatics analyses 14. this methodology allowed us to identify several candidate repurposable drugs for ebola virus 11,14. our work over the last decade has demonstrated how network strategies can, for example, be used to identify effective repurposable drugs 13,22-27 and drug combinations 28 for multiple human diseases. for example, network-based drug-disease proximity sheds light on the relationship between drugs (e.g., drug targets) and disease modules (molecular determinants in disease pathobiology modules within the ppis), and can serve as a useful tool for efficient screening of potentially new indications for approved drugs, as well as drug combinations, as demonstrated in our recent studies 13,23,27,28. in this study, we present an integrative antiviral drug repurposing methodology that combines a systems pharmacology-based network medicine platform quantifying the interplay between the virus-host interactome and drug targets in the human ppi network. the basis for these experiments rests on the notions that (i) the proteins that functionally associate with viral infection (including hcov) are localized in the corresponding subnetwork within the comprehensive human ppi network, and (ii) proteins that serve as drug targets for a specific disease may also be suitable drug targets for potential antiviral infection owing to common ppis and functional pathways elucidated by the human interactome (fig. 1). we follow this analysis with bioinformatics validation of drug-induced gene signatures and hcov-induced transcriptomics in human cell lines to inspect the postulated mechanism-of-action in a specific hcov for which we propose repurposing (fig. 1). to date, seven pathogenic hcovs (fig. 2a, b) have been found 1,29: (i) 2019-ncov/sars-cov-2, sars-cov, mers-cov, hcov-oc43, and hcov-hku1 belong to the β genus, and (ii) hcov-nl63 and hcov-229e belong to the α genus. we performed phylogenetic analyses using the whole-genome sequence data of 15 hcovs to inspect the evolutionary relationship of 2019-ncov/sars-cov-2 with other hcovs. we found that the whole genomes of 2019-ncov/sars-cov-2 had ~99.99% nucleotide sequence identity across three diagnosed patients (supplementary table s1). 2019-ncov/sars-cov-2 shares the highest nucleotide sequence identity (79.7%) with sars-cov among the six other known pathogenic hcovs, revealing a conserved evolutionary relationship between 2019-ncov/sars-cov-2 and sars-cov (fig. 2a). hcovs have five major protein regions for virus structure assembly and viral replication 29, including the replicase complex (orf1ab), spike (s), envelope (e), membrane (m), and nucleocapsid (n) proteins (fig. 2b). the orf1ab gene encodes the non-structural proteins (nsp) of the viral rna synthesis complex through proteolytic processing 30. nsp12 is a viral rna-dependent rna polymerase that, together with the co-factors nsp7 and nsp8, possesses high polymerase activity. the 3d protein structure of sars-cov nsp12 contains a large n-terminal extension (which binds to nsp7 and nsp8) and a polymerase domain (fig. 2c). the spike is a transmembrane glycoprotein that plays a pivotal role in mediating viral infection through binding the host receptor 31,32.
figure 2d shows the 3d structure of the spike protein bound to the host receptor angiotensin converting enzyme 2 (ace2) in sars-cov (pdb id: 6ack). a recent study showed that 2019-ncov/sars-cov-2 is able to utilize ace2 as an entry receptor in ace2-expressing cells 33, suggesting potential drug targets for therapeutic development. furthermore, the cryo-em structure of the spike and biophysical assays reveal that the 2019-ncov/sars-cov-2 spike binds ace2 with higher affinity than that of sars-cov 34. in addition, the nucleocapsid is an important subunit for packaging the viral genome through protein oligomerization 35; the single nucleocapsid structure is shown in fig. 2e. protein sequence alignment analyses indicated that 2019-ncov/sars-cov-2 is most evolutionarily conserved with sars-cov (supplementary table s2). specifically, the envelope and nucleocapsid proteins of 2019-ncov/sars-cov-2 are two evolutionarily conserved regions, with sequence identities of 96% and 89.6%, respectively, compared to sars-cov (supplementary table s2). however, the spike protein exhibits the lowest sequence conservation (sequence identity of 77%) between 2019-ncov/sars-cov-2 and sars-cov, and the spike protein of 2019-ncov/sars-cov-2 has only 31.9% sequence identity compared to mers-cov.

fig. 1 overall workflow of this study. our network-based methodology combines a systems pharmacology-based network medicine platform that quantifies the interplay between the virus-host interactome and drug targets in the human ppi network. a human coronavirus (hcov)-associated host proteins were collected from the literature and pooled to generate a pan-hcov protein subnetwork. b network proximity between drug targets and hcov-associated proteins was calculated to screen for candidate repurposable drugs for hcovs under the human protein interactome model. c, d gene set enrichment analysis was utilized to validate the network-based predictions. e top candidates were further prioritized for drug combinations using a network-based method captured by the "complementary exposure" pattern: the targets of the drugs both hit the hcov-host subnetwork, but target separate neighborhoods in the human interactome network. f overall hypothesis of the network-based methodology: (i) the proteins that functionally associate with hcovs are localized in the corresponding subnetwork within the comprehensive human interactome network; and (ii) proteins that serve as drug targets for a specific disease may also be suitable drug targets for potential antiviral infection owing to common protein-protein interactions elucidated by the human interactome.

to depict the hcov-host interactome network, we assembled the cov-associated host proteins from four known hcovs (sars-cov, mers-cov, hcov-229e, and hcov-nl63), one mouse mhv, and one avian ibv (n protein) (supplementary table s3). in total, we obtained 119 host proteins associated with covs, supported by various types of experimental evidence. specifically, these host proteins are either the direct targets of hcov proteins or are involved in crucial pathways of hcov infection. the hcov-host interactome network is shown in fig. 3a. we identified several hub proteins, including jun, xpo1, npm1, and hnrnpa1, with the highest numbers of connections within the 119 proteins. kegg pathway enrichment analysis revealed multiple significant biological pathways (adjusted p value < 0.05), including measles, rna transport, nf-kappa b signaling, epstein-barr virus infection, and influenza (fig. 3b).
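identifying hub proteins such as jun within the 119-protein subnetwork amounts to ranking nodes by their intra-subnetwork degree; a minimal sketch with hypothetical edges is given below.

```python
# a minimal sketch: spot hub proteins inside the hcov-host subnetwork by
# intra-subnetwork degree; the edges here are hypothetical placeholders.
import networkx as nx

subnetwork = nx.Graph([
    ("jun", "xpo1"), ("jun", "npm1"), ("jun", "hnrnpa1"), ("xpo1", "npm1"),
])
degree = dict(subnetwork.degree())
hubs = sorted(degree, key=degree.get, reverse=True)[:4]  # most-connected first
```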
gene ontology (go) biological process enrichment analysis further confirmed multiple viral infection-related processes (adjusted p value < 0.001), including viral life cycle, modulation by virus of host morphology or physiology, viral process, positive regulation of viral life cycle, transport of virus, and virion attachment to host cell (fig. 3c). we then mapped the known drug-target network (see materials and methods) onto the hcov-host interactome to search for druggable cellular targets. we found that 47 human proteins (39%, blue nodes in fig. 3a) can be targeted by at least one approved drug or experimental drug under clinical trials; for example, gsk3b, dpp4, smad3, parp1, and ikbkb are the most targetable proteins. the high druggability of the hcov-host interactome motivated us to develop a drug repurposing strategy that specifically targets cellular proteins associated with hcovs for potential treatment of 2019-ncov/sars-cov-2. the basis of the proposed network-based drug repurposing methodology rests on the notion that the proteins that associate with and functionally govern viral infection are localized in the corresponding subnetwork (fig. 1a) within the comprehensive human interactome network. for a drug with multiple targets to be effective against an hcov, its target proteins should be within or in the immediate vicinity of the corresponding subnetwork in the human protein-protein interactome (fig. 1), as we have demonstrated for multiple diseases 13,22,23,28 using this network-based strategy. we used a state-of-the-art network proximity measure to quantify the relationship between the hcov-specific subnetwork (fig. 3a) and drug targets in the human interactome. we constructed a drug-target network by assembling target information for more than 2000 fda-approved or experimental drugs (see materials and methods). to improve the quality and completeness of the human protein interactome network, we integrated ppis with five types of experimental data: (1) binary ppis from 3d protein structures; (2) binary ppis from unbiased high-throughput yeast-two-hybrid assays; (3) experimentally identified kinase-substrate interactions; (4) signaling networks derived from experimental data; and (5) literature-derived ppis with various types of experimental evidence (see materials and methods). we used a z-score (z) measure and a permutation test to reduce study bias in the network proximity analyses (including the bias toward hub nodes in the human interactome network introduced by literature-derived ppi data), as described in our recent studies 13,28. in total, we computationally identified 135 drugs that were significantly associated (z < -1.5 and p < 0.05, permutation test) with the hcov-host interactome (fig. 4a, supplementary tables s4 and s5). to assess the bias of pooling cellular proteins from six covs, we further calculated the network proximities of all the drugs for the four covs with large numbers of known host proteins, namely sars-cov, mers-cov, ibv, and mhv, separately. we found that the z-scores were consistent between the pooled 119 hcov-associated proteins and the four individual covs (fig. 4b): the pearson correlation coefficients of the proximities of all the drugs for the pooled hcov are 0.926 vs. sars-cov (p < 0.001, t distribution), 0.503 vs. mers-cov (p < 0.001), 0.694 vs. ibv (p < 0.001), and 0.829 vs. mhv (p < 0.001). these network proximity analyses offer putative repurposable candidates for potential prevention and treatment of hcovs.
to further validate the 135 repurposable drugs against hcovs, we first performed gene set enrichment analysis (gsea) using transcriptome data of mers-cov- and sars-cov-infected host cells (see methods). these transcriptome data were used as gene signatures for hcovs. additionally, we downloaded the gene expression data of drug-treated human cell lines from the connectivity map (cmap) database 36 to obtain drug-gene signatures. we calculated a gsea score (see methods) for each drug and used this score as an indication of bioinformatics validation of the 135 drugs. specifically, an enrichment score (es) was calculated for each hcov data set, and es > 0 and p < 0.05 (permutation test) were used as the cut-off for a significant association of gene signatures between a drug and a specific hcov data set. the gsea score, ranging from 0 to 3, is the number of data sets that met these criteria for a specific drug. mesalazine (an approved drug for inflammatory bowel disease), sirolimus (an approved immunosuppressive drug), and equilin (an approved agonist of the estrogen receptor for menopausal symptoms) achieved the highest gsea scores of 3, followed by paroxetine and melatonin with gsea scores of 2.

fig. 4 discovered drug-hcov network. a a subnetwork highlighting network-predicted drug-hcov associations connecting 135 drugs and hcovs. of the 2938 drugs evaluated, 135 achieved significant proximities between drug targets and the hcov-associated proteins in the human interactome network. drugs are colored by the first level of their anatomical therapeutic chemical (atc) classification system code. b a heatmap highlighting network proximity values for sars-cov, mers-cov, ibv, and mhv, respectively. the color key denotes the network proximity (z-score) between drug targets and the hcov-associated proteins in the human interactome network. the p value was computed by permutation test.

we next selected 16 high-confidence repurposable drugs (fig. 5a and table 1) against hcovs using subject matter expertise, based on a combination of factors: (i) the strength of the network-predicted associations (a smaller network proximity score in supplementary table s4); (ii) validation by gsea analyses; (iii) literature-reported antiviral evidence; and (iv) fewer clinically reported side effects. we showcase several of the selected repurposable drugs with literature-reported antiviral evidence below. overexpression of the estrogen receptor has been shown to play a crucial role in inhibiting viral replication 37. selective estrogen receptor modulators (serms) have been reported to play a broader role in inhibiting viral replication through non-classical pathways associated with the estrogen receptor 37. serms interfere at the post-viral-entry step and affect the triggering of fusion, as the serms' antiviral activity can still be observed in the absence of detectable estrogen receptor expression 18. toremifene (z = -3.23, fig. 5a), a first-generation nonsteroidal serm, exhibits potential effects in blocking various viral infections, including mers-cov, sars-cov, and ebola virus, in established cell lines 17,38. compared to the classical esr1-related antiviral pathway, toremifene prevents fusion between the viral and endosomal membranes by interacting with and destabilizing the virus membrane glycoprotein, eventually inhibiting viral replication 39.
as shown in fig. 5b, toremifene potentially affects several key host proteins associated with hcov, such as rpl19, hnrnpa1, npm1, eif3i, eif3f, and eif3e 40,41. equilin (z = -2.52 and gsea score = 3), an estrogenic steroid produced by horses, has also been shown to have moderate activity in inhibiting the entry of zaire ebola virus glycoprotein and human immunodeficiency virus (zebov-gp/hiv) 18. altogether, network-predicted serms (such as toremifene and equilin) offer candidate repurposable drugs for 2019-ncov/sars-cov-2. angiotensin receptor blockers (arbs) have been reported to be associated with viral infection, including hcovs 42-44. irbesartan (z = -5.98), a typical arb, was approved by the fda for treatment of hypertension and diabetic nephropathy. here, network proximity analysis shows a significant association between irbesartan's targets and hcov-associated host proteins in the human interactome. as shown in fig. 5c, irbesartan targets slc10a1, which encodes the sodium/bile acid cotransporter (ntcp) protein that has been identified as a functional pres1-specific receptor for the hepatitis b virus (hbv) and the hepatitis delta virus (hdv). irbesartan can inhibit ntcp, thus inhibiting viral entry 45,46. slc10a1 interacts with c11orf74, a potential transcriptional repressor that interacts with nsp-10 of sars-cov 47. several other network-predicted candidates (such as eletriptan, frovatriptan, and zolmitriptan) also have targets that are potentially associated with hcov-associated host proteins in the human interactome. previous studies have confirmed the mammalian target of rapamycin complex 1 (mtorc1) as a key factor in regulating the replication of various viruses, including andes orthohantavirus and coronavirus 48,49. sirolimus (z = -2.35 and gsea score = 3), an inhibitor of mammalian target of rapamycin (mtor), was reported to effectively block viral protein expression and virion release 50. indeed, a recent study revealed its clinical potential: sirolimus reduced mers-cov infection by over 60% 51. moreover, sirolimus usage in managing patients with severe h1n1 pneumonia and acute respiratory failure can significantly improve those patients' prognosis 50. mercaptopurine (z = -2.44 and gsea score = 1), an antineoplastic agent with immunosuppressant properties, has been used to treat cancer since the 1950s, and its application has expanded to several autoimmune diseases, including rheumatoid arthritis, systemic lupus erythematosus, and crohn's disease 52.

fig. 5 discovered drug-protein-hcov network for 16 candidate repurposable drugs. a network-predicted evidence and gene set enrichment analysis (gsea) scores for 16 potential repurposable drugs for hcovs. the overall connectivity of the top drug candidates to the hcov-associated proteins was examined. most of these drugs indirectly target hcov-associated proteins via the human protein-protein interaction networks. all the drug-target-hcov-associated protein connections were examined, and those proteins with at least five connections are shown. the box heights for the proteins indicate the number of connections. gsea scores for eight drugs were not available (na) due to the lack of transcriptome profiles for those drugs. b-e inferred mechanism-of-action networks for four selected drugs: b toremifene (a first-generation nonsteroidal selective estrogen receptor modulator), c irbesartan (an angiotensin receptor blocker), d mercaptopurine (an antimetabolite antineoplastic agent with immunosuppressant properties), and e melatonin (a biogenic amine for treating circadian rhythm sleep disorders).
mechanistically, mercaptopurine potentially targets several host proteins in hcovs, such as jun, pabpc1, npm1, and ncl 40,55 (fig. 5d). inflammatory pathways play essential roles in viral infections 56,57. as a biogenic amine, melatonin (n-acetyl-5-methoxytryptamine) (z = -1.72 and gsea score = 2) plays a key role in various biological processes and offers a potential strategy for the management of viral infections 58,59. viral infections are often associated with immune-inflammatory injury, in which the level of oxidative stress increases significantly and negatively affects the function of multiple organs 60. the antioxidant effect of melatonin makes it a putative candidate drug for relieving patients' clinical symptoms in antiviral treatment, even though melatonin cannot eradicate or even curb viral replication or transcription 61,62. in addition, the application of melatonin may prolong patients' survival time, which may give patients' immune systems a chance to recover and eventually eradicate the virus. as shown in fig. 5e, melatonin indirectly targets several hcov cellular targets, including ace2, bcl2l1, jun, and ikbkb. eplerenone (z = -1.59), an aldosterone receptor antagonist, is reported to have an anti-inflammatory effect similar to that of melatonin. by inhibiting mast-cell-derived proteinases and suppressing fibrosis, eplerenone can improve the survival of mice infected with encephalomyocarditis virus 63. in summary, our network proximity analyses offer multiple candidate repurposable drugs that target diverse cellular pathways for potential prevention and treatment of 2019-ncov/sars-cov-2. however, further preclinical experiments 64 and clinical trials are required to verify the clinical benefits of these network-predicted candidates before clinical use. drug combinations, offering increased therapeutic efficacy and reduced toxicity, play an important role in treating various viral infections 65. however, our ability to identify and validate effective combinations is limited by a combinatorial explosion, driven by the large number of both drug pairs and dosage combinations. in our recent study, we proposed a novel network-based methodology to identify clinically efficacious drug combinations 28. relying on approved drug combinations for hypertension and cancer, we found that a drug combination was therapeutically effective only if it was captured by the "complementary exposure" pattern: the targets of the drugs both hit the disease module, but target separate neighborhoods (fig. 6a). here we sought to identify drug combinations that may provide a synergistic effect in potentially treating 2019-ncov/sars-cov-2, with well-defined mechanisms-of-action identified by network analysis. for the 16 potential repurposable drugs (fig. 5a, table 1), we showcase three network-predicted candidate drug combinations for 2019-ncov/sars-cov-2; all predicted combinations can be found in supplementary table s6. sirolimus, an inhibitor of mtor with both antifungal and antineoplastic properties, has been demonstrated to improve outcomes in patients with severe h1n1 pneumonia and acute respiratory failure 50.
mtor signaling plays an essential role in mers-cov infection 66. dactinomycin, also known as actinomycin d, is an approved rna synthesis inhibitor for treatment of various cancer types. an early study showed that dactinomycin (1 μg/ml) inhibited the growth of feline enteric cov 67. as shown in fig. 6b, our network analysis shows that sirolimus and dactinomycin synergistically target the hcov-associated host protein subnetwork in the "complementary exposure" pattern, offering a potential combination regimen for treatment of hcov. specifically, sirolimus and dactinomycin may inhibit both mtor signaling and the rna synthesis pathway (including dna topoisomerase 2-alpha (top2a) and dna topoisomerase 2-beta (top2b)) in hcov-infected cells (fig. 6b). toremifene is among the approved first-generation nonsteroidal serms for the treatment of metastatic breast cancer 68. serms (including toremifene) inhibited ebola virus infection 18 by interacting with and destabilizing the ebola virus glycoprotein 39. in vitro assays have demonstrated that toremifene inhibited the growth of mers-cov 17,69 and sars-cov 38 (table 1). emodin, an anthraquinone derivative extracted from the roots of rheum tanguticum, has been reported to have various antiviral effects. specifically, emodin inhibited the sars-cov-associated 3a protein 70 and blocked an interaction between the sars-cov spike protein and ace2 (ref. 71). altogether, network analyses and published experimental data suggest that combining toremifene and emodin offers a potential therapeutic approach for 2019-ncov/sars-cov-2 (fig. 6c). as shown in fig. 5a, the targets of both mercaptopurine and melatonin show strong network proximity with hcov-associated host proteins in the human interactome network. recent in vitro and in vivo studies identified mercaptopurine as a selective inhibitor of both sars-cov and mers-cov by targeting papain-like protease 53,54. melatonin has been reported to have potential antiviral effects via its anti-inflammatory and antioxidant actions 58-62. melatonin indirectly regulates the expression of ace2, a key entry receptor involved in viral infection by hcovs, including 2019-ncov/sars-cov-2 (ref. 33). specifically, melatonin was reported to inhibit calmodulin, and calmodulin interacts with ace2 by inhibiting the shedding of its ectodomain, a key infectious process of sars-cov 72,73. jun, also known as c-jun, is a key host protein involved in infection by the hcov infectious bronchitis virus 74. as shown in fig. 6d, mercaptopurine and melatonin may synergistically block c-jun signaling by acting on multiple cellular targets. in summary, the combination of mercaptopurine and melatonin may offer a potential combination therapy for 2019-ncov/sars-cov-2 by synergistically targeting papain-like protease, ace2, c-jun signaling, and anti-inflammatory pathways (fig. 6d). however, further experimental study of melatonin's effects on ace2 pathways in 2019-ncov/sars-cov-2 is highly warranted. in this study, we presented a network-based methodology for systematic identification of putative repurposable drugs and drug combinations for potential treatment of 2019-ncov/sars-cov-2. the integration of drug-target networks, hcov-host interactions, hcov-induced transcriptomes in human cell lines, and the human protein-protein interactome network is essential for such identification. based on comprehensive evaluation, we prioritized 16 candidate repurposable drugs (fig. 5) and 3 potential drug combinations (fig. 6) for targeting 2019-ncov/sars-cov-2.
however, although the majority of predictions have been validated by various literature data (table 1), all network-predicted repurposable drugs and drug combinations must be validated in various 2019-ncov/sars-cov-2 experimental assays 64 and randomized clinical trials before being used in patients. we acknowledge several limitations in the current study. although 2019-ncov/sars-cov-2 shares high nucleotide sequence identity with other hcovs (fig. 2), our predictions are not 2019-ncov/sars-cov-2 specific, owing to the lack of known host proteins for 2019-ncov/sars-cov-2. we used a low binding affinity value of 10 μm as a threshold to define a physical drug-target interaction; however, a stronger binding affinity threshold (e.g., 1 μm) may be a more suitable cut-off in drug discovery, although it would generate a smaller drug-target network. although sizeable efforts were made in assembling large-scale, experimentally reported drug-target networks from publicly available databases, the network data may be incomplete, and some drug-target interactions may be functional associations instead of physical bindings. for example, silvestrol, a natural product from the flavagline family, was found to have antiviral activity against ebola 75 and coronaviruses 76. after adding its target, the rna helicase enzyme eif4a 76, silvestrol was predicted to be significantly associated with hcovs (z = -1.24, p = 0.041) by network proximity analysis. to increase the coverage of drug-target networks, we may use computational approaches to systematically predict additional drug-target interactions 25,26. in addition, the collected virus-host interactions are far from complete, and their quality can be influenced by multiple factors, including different experimental assays and human cell line models. we may computationally predict a new virus-host interactome for 2019-ncov/sars-cov-2 using sequence-based and structure-based approaches 77. drug targets representing nodes within cellular networks are often intrinsically coupled with both therapeutic and adverse profiles 78, as drugs can inhibit or activate protein functions (including antagonists vs. agonists). the current systems pharmacology model cannot separate therapeutic (antiviral) effects from adverse ones in these predictions, owing to the lack of detailed pharmacological effects of drug targets and the unknown functional consequences of virus-host interactions. comprehensive identification of the virus-host interactome for 2019-ncov/sars-cov-2, with specific biological effects measured by functional genomics assays 79,80, will significantly improve the accuracy of the proposed network-based methodologies.

fig. 6 network-based rational design of drug combinations for 2019-ncov/sars-cov-2. a the possible exposure modes of the hcov-associated protein module to the pairwise drug combinations. an effective drug combination is captured by the "complementary exposure" pattern: the targets of the drugs both hit the hcov-host subnetwork, but target separate neighborhoods in the human interactome network. z_ca and z_cb denote the network proximity (z-score) between the targets of drugs a and b, respectively, and a specific hcov. s_ab denotes the separation score (see materials and methods) between the targets of drug a and drug b.
b-d inferred mechanism-of-action networks for three selected pairwise drug combinations: b sirolimus (a potent immunosuppressant with both antifungal and antineoplastic properties) plus dactinomycin (an rna synthesis inhibitor for treatment of various tumors), c toremifene (a first-generation nonsteroidal selective estrogen receptor modulator) plus emodin (an experimental drug for the treatment of polycystic kidney disease), and d melatonin (a biogenic amine for treating circadian rhythm sleep disorders) plus mercaptopurine (an antimetabolite antineoplastic agent with immunosuppressant properties).

owing to a lack of complete drug-target information (such as the molecular "promiscuity" of drugs), the dose-response and dose-toxicity effects for both repurposable drugs and drug combinations cannot be identified in the current network models. for example, mesalazine, an approved drug for inflammatory bowel disease, is a top network-predicted repurposable drug associated with hcovs (fig. 5a); yet, several clinical studies showed potential pulmonary toxicities (including pneumonia) associated with mesalazine usage 81,82. integration of lung-specific gene expression 23 of 2019-ncov/sars-cov-2 host proteins and physiologically based pharmacokinetic modeling 83 may reduce the side effects of repurposable drugs or drug combinations. preclinical studies are warranted to evaluate in vivo efficacy and side effects before clinical trials. furthermore, we limited our predictions to pairwise drug combinations, based on our previous network-based framework 28; however, we expect that our methodology remains a useful network-based tool for predicting combinations of multiple drugs, by exploring the network relationships of multiple drugs' targets with the hcov-host subnetwork in the human interactome. finally, we aimed to systematically identify repurposable drugs by specifically targeting ncov host proteins only; thus, our current network models cannot predict repurposable drugs from among the existing antiviral drugs that target virus proteins only. combining existing antiviral drugs (such as remdesivir 64) with the network-predicted repurposable drugs (fig. 5) or drug combinations (fig. 6) may improve the coverage of the current network-based methodologies by utilizing a multi-layer network framework 16. in conclusion, this study offers a powerful, integrative network-based systems pharmacology methodology for rapid identification of repurposable drugs and drug combinations for the potential treatment of 2019-ncov/sars-cov-2. our approach can minimize the translational gap between preclinical testing results and clinical outcomes, which is a significant problem in the rapid development of efficient treatment strategies for the emerging 2019-ncov/sars-cov-2 outbreak. from a translational perspective, if broadly applied, the network tools developed here could help develop effective treatment strategies for other emerging viral infections and other complex human diseases as well. in total, we collected dna sequences and protein sequences for 15 hcovs, including the three most recent 2019-ncov/sars-cov-2 genomes, from the ncbi genbank database (28 january 2020, supplementary table s1). whole-genome alignment and protein sequence identity calculations were performed with the embl-ebi multiple sequence alignment tools (https://www.ebi.ac.uk/) using default parameters. the neighbor-joining (nj) tree was computed from the pairwise phylogenetic distance matrix using mega x 84 with 1000 bootstrap replicates. the protein alignment and phylogenetic tree of hcovs were also constructed with mega x 84.
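as a rough illustration of the distance-based neighbor-joining step, the sketch below uses biopython instead of mega x; the alignment file name is a hypothetical placeholder, and bootstrap replication is omitted.

```python
# a minimal sketch, assuming a precomputed multiple sequence alignment is
# available; identity-based distances feed a neighbor-joining construction.
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("hcov_whole_genomes.aln", "clustal")  # hypothetical file
calculator = DistanceCalculator("identity")        # pairwise identity distances
distance_matrix = calculator.get_distance(alignment)
nj_tree = DistanceTreeConstructor().nj(distance_matrix)  # neighbor-joining tree
```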
we collected hcov-host protein interactions from the literature through sizeable curation efforts. the hcov-associated host proteins of several hcovs, including sars-cov, mers-cov, ibv, mhv, hcov-229e, and hcov-nl63, were pooled. these proteins were either the direct targets of hcov proteins or were involved in critical pathways of hcov infection identified by multiple experimental sources, including high-throughput yeast-two-hybrid (y2h) systems, viral protein pull-down assays, in vitro co-immunoprecipitation, and rna knockdown experiments. in total, the virus-host interaction network included 6 hcovs with 119 host proteins (supplementary table s3). next, we performed kyoto encyclopedia of genes and genomes (kegg) and gene ontology (go) enrichment analyses to evaluate the biological relevance and functional pathways of the hcov-associated proteins. all functional analyses were performed using enrichr 85. we collected drug-target interaction information from the drugbank database (v4.3) 86, the therapeutic target database (ttd) 87, the pharmgkb database, chembl (v20) 88, bindingdb 89, and the iuphar/bps guide to pharmacology 90. the chemical structure of each drug in smiles format was extracted from drugbank 86. drug-target interactions meeting the following three criteria were used: (i) binding affinities, including k_i, k_d, ic_50, or ec_50, each ≤10 μm; (ii) the target was marked as "reviewed" in the uniprot database 91; and (iii) the human target was represented by a unique uniprot accession number. the details for building the experimentally validated drug-target network are provided in our recent studies 13,23,28. to build a comprehensive list of human ppis, we assembled data from a total of 18 bioinformatics and systems biology databases with five types of experimental evidence: (i) binary ppis tested by high-throughput yeast-two-hybrid (y2h) systems; (ii) binary, physical ppis from protein 3d structures; (iii) kinase-substrate interactions from literature-derived low-throughput or high-throughput experiments; (iv) signaling networks derived from literature-derived low-throughput experiments; and (v) literature-curated ppis identified by affinity purification followed by mass spectrometry (ap-ms), y2h, or literature-derived low-throughput experiments. all inferred data, including evolutionary analysis, gene expression data, and metabolic associations, were excluded. the genes were mapped to their entrez ids based on the ncbi database 92, as well as to their official gene symbols based on genecards (https://www.genecards.org/). in total, the resulting human protein-protein interactome used in this study includes 351,444 unique ppis (edges or links) connecting 17,706 proteins (nodes), representing a 50% increase in the number of ppis compared with what we have used previously. detailed descriptions for building the human protein-protein interactome are provided in our previous studies 13,23,28,93. we posit that the human ppis provide an unbiased, rational roadmap for repurposing drugs for the potential treatment of hcovs for which they were not originally approved. given c, the set of host genes associated with a specific hcov, and t, the set of drug targets, we computed the network proximity of c with the target set t of each drug using the "closest" method:

$$d(C,T) = \frac{1}{\|C\| + \|T\|}\left(\sum_{c \in C}\min_{t \in T} d(c,t) + \sum_{t \in T}\min_{c \in C} d(c,t)\right),$$

where d(c, t) is the shortest distance between gene c and gene t in the human protein interactome.
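a minimal sketch of this "closest" proximity on a toy interactome is given below; the protein identifiers are hypothetical placeholders.

```python
# a minimal sketch of the "closest" network proximity d(C, T) defined
# above, on a toy undirected interactome with hypothetical proteins.
import networkx as nx

def closest_distance(graph, c_set, t_set):
    """average over both sets of each node's shortest-path distance
    to the nearest node of the other set."""
    spl = dict(nx.all_pairs_shortest_path_length(graph))
    total = sum(min(spl[c][t] for t in t_set) for c in c_set)
    total += sum(min(spl[t][c] for c in c_set) for t in t_set)
    return total / (len(c_set) + len(t_set))

ppi = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
hcov_hosts = {"a", "b"}    # hcov-associated host proteins
drug_targets = {"d", "e"}  # targets of one candidate drug
d_ct = closest_distance(ppi, hcov_hosts, drug_targets)  # smaller = closer
```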
the network proximity was converted to a z-score based on permutation tests:

$z = \frac{d_{C,T} - \bar{d}_r}{\sigma_r}$

where $\bar{d}_r$ and $\sigma_r$ are the mean and standard deviation of the permutation test repeated 1000 times, each time with two randomly selected gene lists with degree distributions similar to those of c and t. the corresponding p value was calculated based on the permutation test results. a z-score < −1.5 and p < 0.05 were considered to indicate significantly proximal drug-hcov associations. all networks were visualized using gephi 0.9.2 (https://gephi.org/). for this network-based approach to drug combinations to be effective, we need to establish whether the topological relationship between two drug-target modules reflects biological and pharmacological relationships, while also quantifying the network-based relationship between drug targets and hcov-associated host proteins (drug-drug-hcov combinations). to identify potential drug combinations, we combined the top lists of drugs. then, the "separation" measure $s_{AB}$ was calculated for each pair of drugs a and b:

$s_{AB} = \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2}$

where each $\langle d_{\cdot \cdot} \rangle$ was calculated based on the "closest" method. our key assumption is that a drug combination is therapeutically effective only if it follows a specific relationship to the disease module, as captured by complementary exposure patterns in the targets' modules of both drugs, without overlapping toxic mechanisms 28. we performed gene set enrichment analysis as an additional prioritization method. we first collected three differential gene expression data sets of hosts infected by hcovs from the ncbi gene expression omnibus (geo). among them, two transcriptome data sets were from sars-cov-infected samples, from patient peripheral blood 94 (gse1739) and calu-3 cells 95 (gse33267), respectively. one transcriptome data set was from mers-cov-infected calu-3 cells 96 (gse122876). genes with an adjusted p value less than 0.01 were defined as differentially expressed. these data sets were used as hcov-host signatures to evaluate the treatment effects of drugs. differential gene expression in cells treated with various drugs was retrieved from the connectivity map (cmap) database 36 and used as the gene profiles for the drugs. for each drug that was in both the cmap data set and our drug-target network, we calculated an enrichment score (es) for each hcov signature data set based on previously described methods 97:

$a = \max_{1 \le j \le s}\left(\frac{j}{s} - \frac{V(j)}{r}\right), \qquad b = \max_{1 \le j \le s}\left(\frac{V(j)}{r} - \frac{j-1}{s}\right)$

computed separately over the up- and down-regulated genes of the signature, where j = 1, 2, …, s are the genes of the hcov signature data set sorted in ascending order by their rank in the gene profile of the drug being evaluated. the rank of gene j is denoted by v(j), where 1 ≤ v(j) ≤ r, with r being the number of genes (12,849) in the drug profile. then, es_up/down was set to a_up/down if a_up/down > b_up/down, and was set to −b_up/down if b_up/down > a_up/down. permutation tests repeated 100 times, using randomly generated gene lists with the same number of up- and down-regulated genes as the hcov signature data set, were performed to measure the significance of the es scores. drugs were considered to have a potential treatment effect if es > 0 and p < 0.05, and the number of such hcov signature data sets was used as the final gsea score, which ranges from 0 to 3.
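a minimal sketch of the permutation z-score and the separation measure follows. it assumes a connected interactome graph and, for brevity, samples random node sets uniformly instead of preserving the degree distributions of c and t as in the actual analysis.

import random
import statistics
import networkx as nx

def closest_distance(G, C, T):
    # Same "closest" measure as in the sketch above.
    return sum(min(nx.shortest_path_length(G, t, c) for c in C)
               for t in T) / len(T)

def proximity_z(G, C, T, n_perm=1000, seed=0):
    # Permutation z-score for the closest-distance proximity d(C, T).
    rng = random.Random(seed)
    nodes = list(G.nodes)
    d_obs = closest_distance(G, C, T)
    null = [closest_distance(G, rng.sample(nodes, len(C)),
                             rng.sample(nodes, len(T)))
            for _ in range(n_perm)]
    z = (d_obs - statistics.mean(null)) / statistics.stdev(null)
    p = sum(d <= d_obs for d in null) / n_perm   # one-sided empirical p value
    return z, p

def separation(G, A, B):
    # s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2, where <d_XY> averages each
    # node's distance to the nearest node of the other (or same) set.
    # Requires |A| >= 2 and |B| >= 2 for the within-set terms.
    def nearest(X, Y, exclude_self=False):
        return [min(nx.shortest_path_length(G, x, y)
                    for y in Y if not (exclude_self and x == y))
                for x in X]
    d_AB = statistics.mean(nearest(A, B) + nearest(B, A))
    d_AA = statistics.mean(nearest(A, A, exclude_self=True))
    d_BB = statistics.mean(nearest(B, B, exclude_self=True))
    return d_AB - (d_AA + d_BB) / 2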
references:
1. coronaviruses-drug discovery and therapeutic options
2. coronavirus infections-more than just the common cold
3. sars and mers: recent insights into emerging coronaviruses
4. host factors in coronavirus replication
5. epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study
6. early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
7. putting the patient back together-social medicine, network medicine, and the limits of reductionism
8. the $2.6 billion pill-methodologic and policy considerations
9. in silico oncology drug repositioning and polypharmacology
10. individualized network-based drug repositioning infrastructure for precision oncology in the panomics era
11. drug repurposing: new treatments for zika virus infection?
12. a comprehensive map of molecular drug targets
13. network-based approach to prediction and population-based validation of in silico drug repurposing
14. systems biology-based investigation of cellular antiviral drug targets identified by gene-trap insertional mutagenesis
15. understanding human-virus protein-protein interactions using a human protein complex-based analysis framework. msystems
16. computational network biology: data, models, and applications
17. repurposing of clinically developed drugs for treatment of middle east respiratory syndrome coronavirus infection
18. fda-approved selective estrogen receptor modulators inhibit ebola virus infection
19. repurposing of the antihistamine chlorcyclizine and related compounds for treatment of hepatitis c virus infection
20. a screen of fda-approved drugs for inhibitors of zika virus infection
21. identification of small-molecule inhibitors of zika virus infection and induced neural cell death via a drug repurposing screen
22. prediction of drug-target interactions and drug repositioning via network-based inference
23. a genome-wide positioning systems network algorithm for in silico drug repurposing
24. deepdr: a network-based deep learning approach to in silico drug repositioning
25. target identification among known drugs by deep learning from heterogeneous networks
26. network-based prediction of drug-target interactions using an arbitrary-order proximity embedded deep forest
27. network-based translation of gwas findings to pathobiology and drug repurposing for alzheimer's disease
28. network-based prediction of drug combinations
29. molecular evolution of human coronavirus genomes
30. structure of the sars-cov nsp12 polymerase bound to nsp7 and nsp8 co-factors
31. structure of sars coronavirus spike receptor-binding domain complexed with receptor
32. genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
33. a pneumonia outbreak associated with a new coronavirus of probable bat origin
34. cryo-em structure of the 2019-ncov spike in the prefusion conformation
35. transient oligomerization of the sars-cov n protein-implication for virus ribonucleoprotein packaging
36. the connectivity map: using gene-expression signatures to connect small molecules, genes, and disease
37. a structure-informed atlas of human-virus interactions
38. screening of an fda-approved compound library identifies four small-molecule inhibitors of middle east respiratory syndrome coronavirus replication in cell culture
39. toremifene interacts with and destabilizes the ebola virus glycoprotein
40. the cellular interactome of the coronavirus infectious bronchitis virus nucleocapsid protein and functional implications for virus biology
41. determination of host proteins composing the microenvironment of coronavirus replicase complexes by proximity-labeling
42. the central role of angiotensin i-converting enzyme in vertebrate pathophysiology
43. effect of the angiotensin ii receptor blocker olmesartan on the development of murine acute myocarditis caused by coxsackievirus b3
44. the impact of statin and angiotensin-converting enzyme inhibitor/angiotensin receptor blocker therapy on cognitive function in adults with human immunodeficiency virus infection
45. irbesartan, an fda approved drug for hypertension and diabetic nephropathy, is a potent inhibitor for hepatitis b virus entry by disturbing na(+)-dependent taurocholate cotransporting polypeptide activity
46. the fda-approved drug irbesartan inhibits hbv-infection in hepg2 cells stably expressing sodium taurocholate co-transporting polypeptide
47. identification of a novel transcriptional repressor (hepis) that interacts with nsp-10 of sars coronavirus
48. host mtorc1 signaling regulates andes virus replication
49. host cell mtorc1 is required for hcv rna replication
50. adjuvant treatment with a mammalian target of rapamycin inhibitor, sirolimus, and steroids improves outcomes in patients with severe h1n1 pneumonia and acute respiratory failure
51. middle east respiratory syndrome and severe acute respiratory syndrome: current therapeutic options and potential targets for novel therapies
52. thiopurines in current medical practice: molecular mechanisms and contributions to therapy-related cancer
53. thiopurine analogue inhibitors of severe acute respiratory syndrome-coronavirus papain-like protease, a deubiquitinating and deisgylating enzyme
54. thiopurine analogs and mycophenolic acid synergistically inhibit the papain-like protease of middle east respiratory syndrome coronavirus
55. interaction of the coronavirus nucleoprotein with nucleolar antigens and the host cell
56. "bird flu", inflammation and anti-inflammatory/analgesic drugs
57. the development of anti-inflammatory drugs for infectious diseases
58. melatonin: its possible role in the management of viral infections-a brief review
59. melatonin in bacterial and viral infections with focus on sepsis: a review
60. ebola virus disease: potential use of melatonin as a treatment
61. one molecule, many derivatives: a never-ending interaction of melatonin with reactive oxygen and nitrogen species?
62. on the free radical scavenging activities of melatonin's metabolites, afmk and amk
63. anti-inflammatory effects of eplerenone on viral myocarditis
64. remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-ncov) in vitro
65. systematic identification of synergistic drug pairs targeting hiv
66. antiviral potential of erk/mapk and pi3k/akt/mtor signaling modulation for middle east respiratory syndrome coronavirus infection as identified by temporal kinome analysis
67. differential in vitro inhibition of feline enteric coronavirus and feline infectious peritonitis virus by actinomycin d
68. toremifene is an effective and safe alternative to tamoxifen in adjuvant endocrine therapy for breast cancer: results of four randomized trials
69. mers-cov pathogenesis and antiviral efficacy of licensed drugs in human monocyte-derived antigen-presenting cells
70. emodin inhibits current through sars-associated coronavirus 3a protein
71. emodin blocks the sars coronavirus spike protein and angiotensin-converting enzyme 2 interaction
72. calmodulin interacts with angiotensin-converting enzyme-2 (ace2) and inhibits shedding of its ectodomain
73. modulation of intracellular calcium and calmodulin by melatonin in mcf-7 human breast cancer cells
74. activation of the c-jun nh2-terminal kinase pathway by coronavirus infectious bronchitis virus promotes apoptosis independently of c-jun
75. the natural compound silvestrol is a potent inhibitor of ebola virus replication
76. broad-spectrum antiviral activity of the eif4a inhibitor silvestrol against corona- and picornaviruses
77. review of computational methods for virus-host protein interaction prediction: a case study on novel ebola-human interactions
78. pleiotropic effects of statins: new therapeutic targets in drug design
79. integrative functional genomics of hepatitis c virus infection identifies host dependencies in complete viral replication cycle
80. crispr-cas9 genetic analysis of virus-host interactions
81. acute eosinophilic pneumonia related to a mesalazine suppository
82. mesalamine induced eosinophilic pneumonia
83. translational high-dimensional drug interaction discovery and validation using health record databases and pharmacokinetics models
84. mega x: molecular evolutionary genetics analysis across computing platforms
85. enrichr: a comprehensive gene set enrichment analysis web server 2016 update
86. drugbank 4.0: shedding new light on drug metabolism
87. therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information
88. chembl: a large-scale bioactivity database for drug discovery
89. bindingdb: a web-accessible database of experimentally determined protein-ligand binding affinities
90. the iuphar/bps guide to pharmacology: an expert-driven knowledgebase of drug targets and their ligands
91. uniprot: the universal protein knowledgebase
92. database resources of the national center for biotechnology information
93. conformational dynamics and allosteric regulation landscapes of germline pten mutations associated with autism compared to those associated with cancer
94. expression profile of immune response genes in patients with severe acute respiratory syndrome
95. cell host response to infection with novel human coronavirus emc predicts potential antivirals and important differences with sars coronavirus
96. srebp-dependent lipidomic reprogramming as a broad-spectrum antiviral target
97. discovery and preclinical validation of drug indications using compendia of public gene expression data

this work was supported by the national heart, lung, and blood institute of the national institutes of health (nih) under award number k99 hl138272 and r00 hl138272 to f.c. the content of this publication does not necessarily reflect the views of the cleveland clinic.
key: cord-276178-0hrs1w7r authors: bangotra, deep kumar; singh, yashwant; selwal, arvind; kumar, nagesh; singh, pradeep kumar; hong, wei-chiang title: an intelligent opportunistic routing algorithm for wireless sensor networks and its application towards e-healthcare date: 2020-07-13 journal: sensors (basel) doi: 10.3390/s20143887 sha: doc_id: 276178 cord_uid: 0hrs1w7r

the lifetime of a node in wireless sensor networks (wsn) is directly responsible for the longevity of the wireless network. the routing of packets is the most energy-consuming activity for a sensor node. thus, finding an energy-efficient routing strategy for the transmission of packets becomes of utmost importance. the opportunistic routing (or) protocol is one of the newer routing protocols that promise reliability and energy efficiency during the transmission of packets in wsn. in this paper, we propose an intelligent opportunistic routing protocol (iop) using a machine learning technique to select a relay node from the list of potential forwarder nodes, to achieve energy efficiency and reliability in the network. the proposed approach might have applications including e-healthcare services: the proposed method might achieve reliability in the network because it can connect several healthcare network devices in a better way, so that good healthcare services might be offered. in addition to this, the proposed method saves energy and, therefore, helps the remote patient to connect with healthcare services for a longer duration with the integration of iot services.

figure 1. sensor node architecture with application in e-healthcare.

with the ever-increasing use of the term green computing, the energy efficiency of wsn has seen a considerable rise. recently, an approach for green computing towards iot for energy efficiency has been proposed, which enhances the energy efficiency of wsn [4]. different types of methods and techniques were proposed and developed in the past to address the issue of energy optimization in wsn. another approach, which regulates the challenge of energy optimization in sensor-enabled iot with the use of quantum-based green computing, makes routing efficient and reliable [5]. the problem of energy efficiency during the routing of data packets from source to target in iot-oriented wsn is significantly addressed by another network-based routing protocol known as greedi [6]. it is imperative to mention here that iot is composed of energy-hungry sensor devices. the constraint of energy in sensor nodes has affected the transmission of data from one node to another and, therefore, requires boundless methods, policies, and strategies to overcome this challenge [7].
the focus of this paper is to put forward an intelligent opportunistic routing protocol, so that the consumption of resources, particularly during communication, can be optimized, because the path taken to transmit a data packet from a source node to the target node is determined by the routing protocol. routing is a complex task in wsn because it differs from designing a routing protocol in traditional networks. in wsn, the important concern is to create an energy-efficient routing strategy to route packets from source to destination, because the nodes in a wsn are always energy-constrained. the problem of energy consumption while routing is managed with the use of a special type of routing protocol known as the opportunistic routing protocol. opportunistic routing (or), also known as any-path routing, has gained huge importance in recent years of research on wsn [8]. this protocol exploits the basic feature of wireless networks, i.e., the broadcast transmission of data. earlier routing strategies considered this broadcasting property a disadvantage, as it induces interference. the focal notion behind or is to take advantage of the broadcast behavior of wireless networks, such that a broadcast from one node can be listened to by numerous nodes. rather than selecting the next forwarder node in advance, or chooses the next forwarder node dynamically at the time of data transmission. it has been shown that or gives better performance results than traditional routing. in or, the best opportunities are searched to transmit the data packets from source to destination [9]. the hop-by-hop communication pattern is used in or even when there is no source-to-destination linked route. the or protocols proposed in recent times by different researchers still struggle with concerns pertaining to energy efficiency and the reliable delivery of data packets. the or protocol given in this paper is specifically meant for wsn, taking into account the problems that surface during the selection of relay candidates and the execution of the coordination protocol. the proposed protocol intelligently selects the relay candidates from the forwarder list by using a machine learning technique, to achieve energy efficiency. the potential relay node selection is a multi-class, multi-feature probabilistic problem, where the inherent selection of the relay node depends upon each node's characteristics. the selection of a node with various characteristics is a supervised, multiclass, non-linearly separable problem. in this paper, the relay node selection algorithm is given using the naïve bayes machine learning model. the organization of this paper is as follows. section 2 presents the related work in the literature regarding or and its protocols. the various types of routing protocols are given in section 3.
section 4 describes or with examples, followed by the proposed intelligent or algorithm for forwarder node selection in section 5. section 6 depicts the simulation results of the proposed protocol, showing latency, network lifetime, throughput, and energy efficiency. section 7 presents a proposed framework for integrating iot with wsn for e-healthcare; this architecture can be useful in many e-healthcare applications. section 8 presents the conclusion and future work. achieving reliable delivery of data and energy efficiency are two crucial tasks in wsns. as the sensor nodes are mostly deployed in an unattended environment and the likelihood of any node going out of order is high, the maintenance and management of the topology is a rigorous task. therefore, the routing protocol should accommodate the dynamic nature of wsns. opportunistic routing protocols developed in the recent past provided trustworthy data delivery, but they are still deficient in providing energy-efficient data transmission between the sensor nodes. some of the latest research on or experimented with the formerly suggested routing metrics and concentrated on mutual cooperation among nodes. geraf (geographic random forwarding) [10] described a novel forwarding technique based on the geographical location of the nodes involved and the random selection of the relaying node via contention among receivers. exor (extremely opportunistic routing) [11] is an integrated routing and mac protocol for multi-hop wireless networks, in which the best of multiple receivers forwards each packet. this protocol is based on the expected transmission count (etx) metric; etx was measured by the hop count from the source to the destination, and the data packet traveled through the minimum number of hops. exor achieves higher throughput than traditional routing algorithms, but it still has a few limitations. exor considers only the information accessible at the time of transmission, and any stale information due to recent updates could worsen its performance and lead to packet duplication. besides this, exor always seeks coordination among nodes, which causes overhead in the case of large networks. the minimum transmission scheme (mts) for optimal forwarder list selection in opportunistic routing [12] is another routing protocol, which uses mts instead of etx as in exor; the mts-based algorithm gives fewer transmissions as compared to etx-based exor. simple, practical, and effective opportunistic routing [13] was proposed for short-haul multi-hop wireless networks; in this protocol, the packet duplication rate was decreased. it is a simple algorithm and can be combined with other opportunistic routing algorithms. spectrum-aware opportunistic routing (saor) [14] is another routing protocol, designed for cognitive radio networks. it uses optimal link transmission (olt) as a cost metric for positioning the nodes in the forwarder list. saor gives better qos, reduced end-to-end delay, and improved throughput. energy-efficient opportunistic routing (eeor) [15] calculates the cost for each node to transfer the data packets; eeor takes less time than exor for sending and receiving the data packets. the trusted opportunistic routing algorithm for vanet (tmcor) [16] gives a trust mechanism for an opportunistic routing algorithm; it also defines the trade-off between the cost metric and the safety factor.
a novel socially aware opportunistic routing algorithm in mobile social networks [17] considered three parameters, namely social profile matching, social connectivity matching, and social interaction; this gives a high probability of packet delivery and routing efficiency. ensor, an opportunistic routing algorithm for relay node selection in wsns, is another algorithm in which the concept of an energy-efficient node is implemented [18]; the packet delivery rate of ensor is better than that of geraf. economy, a duplicate-free protocol [19], is the only or protocol that uses token-based coordination; this algorithm ensures the absence of duplicate packet transmissions. with the advent of the latest network technologies, the virtualization of networks, along with their related resources, has made networks more reliable and efficient. virtual network functions are used to solve the problems related to service function chains in cloud-fog computing [20]. further, iot works with multiple network domains, and the possibility of compromising the security and confidentiality of data can never be ruled out. therefore, the use of virtual networks for service function chains in cloud-fog computing under multiple network domains leads to saving network resources [21]. in recent times, the cloud of things (cot) has gained immense popularity due to its ability to offer an enormous amount of resources to wireless networks and heterogeneous mobile edge computing systems. the cot enables opportunistic decision-making during the online processing of tasks for load sharing, and makes the overall network reliable and efficient [22]. the cloud of things framework can significantly bridge communication gaps between cloud resources and other mobile devices; in that work, the authors proposed a methodology for offloading computation from mobile devices, which reduces failure rates by improving the control policy. in recent times, wsn has used virtualization techniques to offer energy-efficient and fault-tolerant data communication to the immensely growing service domain of iot [23]. with the application of wsn in e-healthcare, the wireless body area network (wban) has gained a huge response in the healthcare domain. the wban is used to monitor patient data by using body sensors, and transmits the acquired data, based on the severity of the patients' symptoms, by allocating a channel with or without contention [24]. eeor [15] is an energy-efficient protocol that works on transmission power as a major parameter. this protocol discussed two cases that involved constant and dynamic power consumption models; these models are known as the non-adjustable and adjustable power models. in the first model, the algorithm calculated the expected cost at each node and built a forwarder list on the source node based on this cost. the forwarder list was sorted in increasing order of expected cost, and the first node on the list became the next-hop forwarder. as eeor is an opportunistic routing protocol, broadcasting is utilized, and the transmitted packets might be received by each node on the forwarder list. the authors propose algorithms for fixed-power, adjustable-power, and opportunistic power calculation. this algorithm was compared with exor [11] by simulation in the tossim simulator. the results showed that eeor always calculated the end-to-end cost based on links from the source to the destination.
eeor followed distance vector routing for storing the routing information inside each sensor node. the expected energy consumption cost was updated inside each node after each round of packet transmission. data delivery was guaranteed in this protocol. additionally, according to the simulation results, packet duplication was significantly decreased. the mdor protocol [25,26] worked on the distance between the source and relay nodes. in this, the authors proposed an algorithm that calculated the distance to each neighbor from the source node and found the average-distance node, which was then used by the source as the next-hop forwarder. the authors also stated that, to increase the speed and reliability of transmission, the strength of the signal is very important; the signal power depends on the distance between the sender and receiver. if a node sends a packet to the nearest node, then it might take more hops, and this would decrease the lifetime of the network. another problem addressed in this protocol was reducing energy consumption at each node through a dynamic energy consumption model. this model consumed energy according to the packet size, and transmitted the packet by amplifying it according to the distance between the source and the relay nodes. mdor always chose the middle-position node to optimize the energy consumed in amplifying the packets. the mdor simulation results showed that the energy consumption was optimized and that it was suitable for certain applications of wsn like environment monitoring, forest fire detection, etc. opportunistic routing introduced the concept of reducing the number of retransmissions to save energy while taking advantage of the broadcast nature of wireless networks. with broadcasting, the routing protocol can discover as many paths in the network as possible, and data transmission can take place on any of these paths. if a particular path fails, the transmission can be completed by using some other path, via the forwarder list whose nodes hold the same data packet. the protocols responsible for data transmission in wsn are broadly classified into two sets [2], namely, (i) old-fashioned (traditional) routing, and (ii) opportunistic routing. in traditional routing, the focus is on finding the route with the minimum number of intermediate nodes from the source to the destination, without taking into consideration important factors like throughput, quality of links, reliability, etc. a small comparison [27] of the routing categories is shown in table 1. as it is clear from the literature that the energy consumption of a sensor node has a considerable impact on the lifetime and quality of the wireless sensor network, it becomes vital to design energy-efficient opportunistic routing protocols to maximize the overall lifetime of the network and to enhance the quality of the sensor network. there are a few methods in the literature, listed below, that might be useful to extend the life of the sensor network:
• scheduling of the duty cycle
• energy-efficient medium access control (ee-mac)
• energy-efficient routing
• node replacement (not possible in unattended environments)
• energy balance
of the above-mentioned methods for energy saving, energy-efficient routing is the most central to the vitality of the wsn. as this method involves the transmission of signals, i.e., receiving and sending, it takes about 66.66 percent of the total energy of the network [28].
therefore, it became relevant that an opportunistic routing protocol that enhances the vitality of the sensor network be designed, to extend the overall life span of the sensor network. or broadcasts a data packet to a set of relay candidates, and the packet is overheard by the neighboring nodes, whereas in traditional routing a node is (pre-)selected for each transmission. then, the relay candidates that are part of the forwarder list and have successfully acknowledged the data packet run a protocol, called the coordination protocol, between themselves, to choose the best relay candidate to forward the data packet. in other words, or abstractly comprises these three steps:
step 1: broadcast a data packet to the relay candidates (this prepares the forwarder list).
step 2: select the best relay by using a coordination protocol among the nodes in the forwarder list.
step 3: forward the data packet to the selected relay node.
consider the example shown in figure 2, where the source node s sends a packet to the destination node d through nodes r1, r2, r3, r4, and r5. first, s broadcasts a packet. the relay nodes r1, r2, and r3 might become the forwarder nodes. further, if r2 is chosen as a potential forwarder, then r4 and r5 might become relay nodes. similarly, if r5 is the forwarder node, then it forwards the data packets to the destination node d. opportunistic routing derives the following rewards:
• the escalation in reliability. by using this routing strategy, the reliability of wsn increases significantly, as this protocol transmits the data packet through any possible link rather than a pre-decided link. therefore, this routing protocol provides additional links that can act as backup links, and thus reduces the chances of transmission failure.
• the escalation in transmission range. with this routing protocol, the broadcast nature of the wireless medium provides an upsurge in the transmission range, as data packets are received over all links irrespective of their location and quality. hence, the data transmission can reach the farthest relay node successfully.
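the three steps above can be condensed into a toy forwarding loop; the node structure, the link-loss probability, and the priority rule (smallest remaining distance to the sink) are illustrative assumptions for this sketch, not the coordination protocol of any specific or variant.

import random
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    dist_to_sink: float   # used here as the coordination priority

def opportunistic_hop(neighbors, p_receive=0.8, rng=random.Random(1)):
    # Step 1: broadcast. Every neighbor may overhear the packet.
    forwarder_list = [n for n in neighbors if rng.random() < p_receive]
    if not forwarder_list:
        return None           # nobody received it; retransmission needed
    # Step 2: coordination. The highest-priority receiver wins.
    relay = min(forwarder_list, key=lambda n: n.dist_to_sink)
    # Step 3: forward. The chosen relay takes over the packet.
    return relay

relay = opportunistic_hop([Node("r1", 22.0), Node("r2", 14.0), Node("r3", 25.0)])
print(relay.name if relay else "no receiver")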
in wsn, the sensor nodes can be deployed in two ways, randomly or manually. most applications require the random deployment of nodes in the area under consideration. initially, each node is loaded with the same amount of battery power. as soon as the network starts functioning, the nodes start consuming energy. to make the network energy efficient, the protocol used for transmitting data packets must consume less battery power, and the network model and energy consumption model should be formulated. in the upcoming subsection, these two models are discussed; they are depicted as assumptions to allow smooth working of the protocol. the n sensors are distributed in a square area of size 500 * 500 square meters. this network forms a graph g = (n, m), with the following properties:
• n = {n_1, n_2, . . . , n_n} is the set of vertices representing sensor nodes.
• m is the set of edges representing the node-to-node links.
the neighbor list nbl(n_i) consists of the nodes that have a direct link to n_i. the data traffic is assumed to travel from the sensor nodes toward the base station. if a packet delivery is successful, then the acknowledgment (ack) for the same is considered to travel the same path back to the source. the lifespan of a wsn depends on the endurance of each node while performing network operations; the sensor nodes rely on battery life to perform these operations. the energy cost model considered here is the first-order energy model for wsn [25]. various terms used in equations (1)-(3) are defined in table 2. the energy consumed in the transmission of an n-bit packet up to a distance l is

$e_{tx}(n, l) = n \cdot e_{elec} + n \cdot \varepsilon_{amp} \cdot l^2$ (1)

the energy consumed in the reception of an n-bit packet is

$e_{rx}(n) = n \cdot e_{elec}$ (2)

and the combined energy cost of the radio board of a sensor for the communication of a data packet is

$e_{radio}(n, l) = e_{tx}(n, l) + e_{rx}(n)$ (3)

per table 2, the sensor board and radio board are assumed to be in full operation, while the cpu board sleeps and wakes up only for creating messages. the proposed protocol uses these assumptions as preliminaries. a new algorithm is proposed in the next section for solving the issue of energy efficiency and the reliability of opportunistic routing in wsn.
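the first-order model of equations (1)-(3) above translates directly into code; the constants below are common textbook values and are not necessarily those listed in table 2.

E_ELEC = 50e-9      # J/bit consumed by the transmit/receive electronics
EPS_AMP = 100e-12   # J/bit/m^2 consumed by the transmit amplifier

def e_tx(n_bits, l_m):
    # Equation (1): transmit an n-bit packet over distance l (l^2 path loss).
    return n_bits * E_ELEC + n_bits * EPS_AMP * l_m ** 2

def e_rx(n_bits):
    # Equation (2): receive an n-bit packet.
    return n_bits * E_ELEC

# Equation (3): radio cost of one 4000-bit packet over a 100 m hop.
print(e_tx(4000, 100) + e_rx(4000))   # ~0.0044 J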
let there be n nodes in the wsn, where each node has k neighbors, i.e., n_1, n_2, . . . , n_k, and each neighbor node is represented by x_1, x_2, . . . , x_n attributes. in this case, the number of neighbors (k) might vary for different nodes at a particular instance. additionally, it was assumed that the wireless sensor network is spread over an area of 500 × 500 square meters. let us assume that a node a ∈ n has neighbors na_1, na_2, . . . , na_k, with respective features like node id, location, prr (packet reception ratio), residual energy (re), and distance (d), which are represented by x_1, x_2, . . . , x_n, respectively. the goal is to intelligently find a potential relay node of a, say a_r, such that a_r ∈ {na_1, na_2, . . . , na_k}. in the proposed machine learning-based protocol for the selection of the potential forwarder, the packet reception ratio, distance, and outstanding energy of a node are taken into consideration. the packet reception ratio (prr) [29] is also sometimes referred to as the psr (packet success ratio); it is computed as the ratio of the successfully received packets to the sent packets. a similar metric to the prr is the per (packet error ratio), which can be computed as (1 − prr). a node loses a particular amount of energy during the transmission and reception of packets; accordingly, the residual energy of a node decreases [30]. the distance (d) is the distance between the source node and each respective sensor node in the forwarder list. the potential relay node selection is a multi-class, multi-feature probabilistic problem, where the inherent selection of the relay node depends upon each node's features. the underlying probabilistic relay node selection problem can be addressed intelligently by building a machine learning model. the selection of a node with n characteristics for a given node a can be considered a supervised multiclass non-linearly separable problem. in this algorithm, the naïve bayes classifier is used to find the probability of node a reaching one of its neighbors, i.e., {na_1, na_2, . . . , na_k}. we compute the probability p(na_1, na_2, . . . , na_k | a), and the node with the maximum probability is selected. the probability of selecting an individual relay node of the given node a can be computed individually for each node, as shown in equation (4):

$p(na_1 \mid a),\; p(na_2 \mid a),\; \ldots,\; p(na_k \mid a)$ (4)

where $p(na_k \mid a)$ denotes the probability of selecting node na_k given node a. furthermore, the probability computation from node a to na_1 is such that na_1 is represented by the corresponding characteristics x_1, x_2, . . . , x_n, which means finding the probability of selecting the relay node na_1 given feature x_1, na_1 given feature x_2, na_1 given feature x_3, and so on. the individual probability of relay node selection, given the node characteristics, can be computed using naïve bayes conditional probability, as shown in equation (5):

$p(a \mid x_i) = \frac{p(x_i \mid a)\, p(a)}{p(x_i)}$ (5)

where i = 1, 2, 3, . . . , n; p(x_i | a) is called the likelihood, p(a) is the prior probability of the event, and p(x_i) is the prior probability of the consequence. the underlying problem is to find the relay node that has the maximum probability, as shown in equation (6):

$a_r = \arg\max_{k}\, p(na_k \mid a)$ (6)

tables 3a-x represent the neighbor sets {na_1, na_2, . . . , na_k}, along with their feature attributes {x_1, x_2, x_3, . . . , x_n}, of node a.
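a loose, illustrative reading of equations (4)-(6) in python: each neighbor is scored by a naive-bayes-style product of per-feature terms, and the arg-max neighbor is picked as the relay. the per-feature normalization (each feature scaled to a probability across the candidates) and the numbers below are assumptions made for this sketch, not the paper's exact estimator.

def relay_scores(neighbors):
    # neighbors: dict name -> feature tuple (prr, residual energy, 1/distance).
    names = list(neighbors)
    n_feats = len(next(iter(neighbors.values())))
    totals = [sum(neighbors[n][i] for n in names) for i in range(n_feats)]
    scores = {}
    for n in names:
        p = 1.0
        for i in range(n_feats):
            p *= neighbors[n][i] / totals[i]   # naive independence assumption
        scores[n] = p
    return scores

nbrs = {"r1": (0.5, 0.6, 1 / 15), "r2": (0.8, 0.9, 1 / 10), "r3": (0.6, 0.5, 1 / 18)}
scores = relay_scores(nbrs)
print(max(scores, key=scores.get))   # -> "r2" on these hypothetical numbers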
the working of iop comprises two phases, i.e., phase i (forwarder_set_selection) and phase ii (forwarder_node_selection). in phase i, algorithm 1 is used for the forwarder set selection (a small runnable sketch of this phase is given after the algorithm 2 listing below). in this step, the information collection task is initiated after the nodes are randomly deployed in the area of interest with specific dimensions. the phase starts with the broadcast of a "hello" packet, which contains the address and the location of the sending node. if any node receives this packet, it sends an acknowledgment to the source and is added to the neighbor list. this process is repeated, but not more than a threshold number of times, to calculate the prr of each node; the neighbor list is formed using this procedure. from the neighbor list and the value of prr, the forwarder set is extracted. the prerequisite for the working of the second phase is the output of the first phase. the forwarder set generated by algorithm 1 is the set of all nodes that have the potential to forward the data packets. however, not all nodes in the set can be picked for transmission, as this would lead to duplication of packets in the network. to tackle this situation, only one node from the forwarder list should be selected to transmit the packet to the next hop toward the destination. this is accomplished using algorithm 2, which takes a forwarder node list as input and selects a single node as the forwarder. algorithm 2 uses a machine-learning technique, the naïve bayes classifier, to select the forwarder node intelligently. the proposed method of relay node selection using iop can be understood by considering the example wsn shown in figure 2 and applying the naïve bayes algorithm to the generic data available in table 4, to find the optimal path in terms of energy efficiency and reliability from source node s to destination node d. using the proposed naïve bayes classifier method, the probability of selecting a relay node r1, r2, or r3 from source node s is denoted by p(r1, r2, r3 | s), which can be calculated using equation (7):

$p(r1, r2, r3 \mid s) = \max(p(r1 \mid s),\, p(r2 \mid s),\, p(r3 \mid s))$ (7)

putting the values from table 4 into the above equations yields the individual probabilities p(r1 | s), p(r2 | s), and p(r3 | s), computed feature by feature (equations (8)-(25)). algorithm 2 (forwarder_node_selection) proceeds as follows:
input: forwarder list fl(s) of source node s.
1. declare three float variables x_1, x_2, and x_3 to represent the properties of r_i, i.e., prr (packet reception ratio), re (residual energy), and d (distance), respectively.
2. for each node r_i ∈ fl(s), repeat steps 3 and 4.
3. compute p(r_i | s), the probability of selection of r_i given s, i.e., p_k = p(r_i | s) for i = 1, 2, . . . , n, and assign k ← i.
4. compute p(r_i | s) by computing the probability of each parameter separately, given s.
5. make an unsorted array of the probability values of the n nodes r1, r2, . . . , rn from step 3: for i = 1 to n and k = i, arrprob[r_i] ← p_k (to find the node with the maximum probability).
6. select the first element of the array as the current maximum, pmax ← arrprob[0].
7. go through the rest of the elements of the array, i.e., from the 2nd element to the last element, for i = 1 to n − 1.
8. if arrprob[i] > pmax, then pmax ← arrprob[i].
9. when the end of the array is reached, the current value of pmax is the greatest value in the array.
10. the node r_i with the pmax value is selected as the relay node from the forwarder list, as the node with the highest probability. the node with the next highest probability acts as the relay node in case the first selected relay node fails to broadcast.
11. broadcast transmission of the data packet as {r_i, coordinates, data}.
12. if the destination node d is reached, go to step 13; else, apply algorithm 1 on r_i, set s ← r_i, and go to step 2.
13. end.
output: a potential forwarder node is selected from the list of forwarder nodes.
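as noted above, a minimal sketch of phase i follows: repeated "hello" broadcasts, prr estimated as acknowledged/sent, and neighbors kept only above a prr threshold. the loss model, the number of hello rounds, and the threshold are hypothetical choices for the sketch.

import random

def build_forwarder_set(link_quality, n_hello=20, prr_min=0.5,
                        rng=random.Random(42)):
    # link_quality: dict neighbor name -> true hello/ack delivery probability.
    forwarder_set = {}
    for name, q in link_quality.items():
        acked = sum(rng.random() < q for _ in range(n_hello))  # hello/ack rounds
        prr = acked / n_hello
        if prr >= prr_min:            # keep only usable links
            forwarder_set[name] = prr
    return forwarder_set

print(build_forwarder_set({"r1": 0.9, "r2": 0.7, "r3": 0.2}))  # weak links fall below the threshold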
again, putting the values into the above equations and using the proposed method of relay node selection with the naïve bayes algorithm, we can finally compute the probability p(r1, r2, r3 | s) using equation (26):

$p(r1, r2, r3 \mid s) = \max(p(r1 \mid s),\, p(r2 \mid s),\, p(r3 \mid s)) = \max(0.001,\, 0.002,\, 0.001)$ (26)

thus, node r2 would be selected as the relay node from the forwarder list of r1, r2, and r3 for source node s. similarly, the process is followed again for the neighbors of s, which consequently checks the neighbors of r1, r2, and r3. tables 5-7 describe the features of the neighboring nodes of r1, r2, and r3, respectively (for example, table 7 lists node r5 with node_id r20005, location (49, 79), prr 0.6, residual energy 0.7 j, distance 11 m, and next hop d). after the execution of phase i and phase ii on the above example, the final route is intelligently selected for the onward transmission of the data packet from source node s to destination node d, using the naïve bayes algorithm, as shown in figure 3. figure 3 gives the details about the route selected using the iop. the source node s broadcasts the data packet among its neighboring nodes, using algorithm 1 to create a forwarder list. the nodes r1, r2, and r3 in the figure were selected as the nodes in the forwarder list; these were the potential nodes used for the selection of a potential forwarder node. here, r2 was selected as the potential forwarder using algorithm 2. the same procedure was adopted again until the data reached its final destination. the final route selected intelligently using iop is s→r2→r5→d.
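the arg-max in equation (26) reduces to a one-line check in python, reproducing the numbers quoted above:

# The worked example of equation (26), reproduced directly.
p = {"r1": 0.001, "r2": 0.002, "r3": 0.001}
print(max(p, key=p.get))   # -> "r2", matching the chosen hop s -> r2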
with the end goal of examination and comparison of the proposed or protocol, the simulation was performed in matlab. matlab provides a good environment to simulate computer networks and other networks, to design a network of sensor nodes, and to define a sensor node and its characteristics. the simulation results were compared with the results of the eeor [15] and mdor [25,26] protocols. table 8 below shows the parameter settings of the network. the motes are haphazardly deployed in a 500 × 500 m field, in such a way that they can approximately cover the whole application area. the base station position is (250, 250) m in the field. the field area was considered a physical-world environment. the proposed or protocol started working immediately after the deployment process was complete. figure 4 below represents the unplanned deployment of the nodes in the area of consideration. energy efficiency was the main objective of the proposed algorithm. it can be calculated as the overall energy consumption in the network for the accomplishment of diverse network operations. in matlab, the simulation works in simulation rounds; a simulation round is defined as the transmission of packets from a single source to a single destination. when the simulation starts, a random source is chosen to start transmission; this node makes a forwarder list and starts executing the proposed protocol. one round of simulation represents successful or unsuccessful transmissions of packets from one source in the network. for each round, different source and relay nodes are selected. this process continues until at least one node is out of energy. the energy efficiency was calculated as the total energy consumption after each round in the network. after the operation of the network starts, the sensors' energy starts decaying. this energy reduction was due to network operations like setting up the network, transmission, reception, and acknowledgment of data packets, processing of data, and sensing of data. as the nodes decayed, their energy consumption kept increasing per round, as can be seen in figure 5 below. it can be seen in the figure that the energy consumption of the proposed or protocol was less than that of the other two algorithms. this was because the proposed or protocol distributed energy consumption equally to all nodes, so that every node could survive up to its maximum lifetime. hence, the proposed or protocol was more energy-efficient than mdor and eeor.
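the round-based simulation loop described above can be sketched as follows; the node count, the energy budget, the per-round cost, and the energy-aware relay pick are toy placeholders rather than the actual matlab setup.

import random

def simulate(n_nodes=20, e_init=0.5, e_per_round=0.002, rng=random.Random(7)):
    energy = [e_init] * n_nodes
    rounds = 0
    while all(e > 0 for e in energy):            # stop when the first node dies
        src = rng.randrange(n_nodes)             # new random source each round
        relay = max(range(n_nodes), key=lambda i: energy[i])  # energy-aware pick
        for i in {src, relay}:
            energy[i] -= e_per_round             # tx/rx cost for this round
        rounds += 1
    return rounds

print(simulate())   # network lifetime in rounds under these toy settings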
latency can be measured as the time elapsed between sending a packet and receiving it at the base station. this is also called the end-to-end delay for the packets to reach the destination. the communication in wireless sensor networks is always from source nodes to the sink station. in the random deployment of nodes, some nodes are able to communicate directly with the base station, while some nodes follow multi-hop communication, i.e., source nodes have to go through relay nodes to forward the data packet toward the base station. hence, in some cases the network delay can be very low, and in some cases it can be high. in figure 6, the values of end-to-end delay after each communication in each round are plotted. it can be seen that the proposed or protocol has good latency compared to the other two protocols. the throughput of a network can be measured in different ways. throughput is calculated as the average number of packets received successfully at the base station per second in each round. figure 7 represents the throughput for each round.
the proposed or protocol has good throughput compared to the other two. as the proposed or protocol is efficient in energy consumption, the sensor nodes are able to survive and communicate for a long time in the network; as long as the communication goes on, the base station continues to receive packets. network lifetime for wireless sensor networks is dependent upon the energy consumption in the network. when the energy of the network is 100 percent, the network lifetime is also 100 percent. however, as the nodes start operating in the network, the network lifespan starts to reduce. figure 8 represents the percentage of lifetime remaining after each round of simulation. the proposed or protocol has a good network lifetime due to the lower energy consumption in the network. packet loss refers to the number of packets that are not received at the destination. to calculate the number of packets lost during each round of the simulation, packet sequence numbers are used. whenever a source sends packets to a destination, it inserts a sequence number. later, on packet reception, these sequence numbers are checked for continuity; if a certain sequence number is missing, it is counted as packet loss. packet loss was recorded per round of simulation and is presented in figure 9. it can be seen from the figure that packet loss for the proposed protocol is less than for eeor and mdor. this is because the forwarder node selection algorithm runs on each relay and source node and calculates the probability of successful transmission through a neighbor node. this also increases the reliability of the protocol and provides accurate transmissions.
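the sequence-number bookkeeping for packet loss can be sketched in a few lines; the received stream below is hypothetical, and only gaps between the first and last received numbers are counted.

def count_lost(received_seq):
    # Count the sequence numbers missing between the first and last received.
    received = sorted(set(received_seq))
    expected = received[-1] - received[0] + 1
    return expected - len(received)

print(count_lost([1, 2, 3, 5, 6, 9]))   # -> 3 (packets 4, 7, and 8 missing)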
a significant improvement could be seen in the graphs once the simulation was complete. figure 5 shows the total energy consumption after each round of packet transmission; here, a round is defined as the packet transmissions between a single source and destination. mdor showed the highest energy consumption, followed by eeor and the proposed protocol, because mdor wasted more energy in the initial setup. however, the dynamic energy consumption considerations led the network to survive for a long time, as shown in figure 8. in the case of eeor in figure 5, it consumed less energy in transmission, and its initial setup for the opportunistic selection of relay nodes was based on the power level. when it comes to lifetime, however, eeor failed to perform better, as it considered the network dead when any one of the nodes ran out of energy. eeor chose one node as a source and continued transmissions opportunistically, which resulted in a significant reduction in the power level of a single node. the proposed protocol gave the best results because, in each round, the source node relied on the intelligent model to change the next-hop relay node. figure 6 presents the average end-to-end delay per round generated by the simulation; the proposed protocol worked significantly better, as the next-hop selection was based on an intelligent algorithm, which helped to significantly reduce average end-to-end delays. figures 7 and 9 showed the reliability and availability performance of all protocols, with the proposed protocol performing significantly better. this suggests that the proposed protocol is a new-generation protocol with potential in many applications of wsn. in recent years, wsn has seen its applications grow exponentially with the integration of iot, which has given a new purpose to the overall utility of data acquisition and transmission.
with the integration of wsn and iot, the iot is making a big impact in diverse areas of life, e.g., e-healthcare, smart farming, traffic monitoring and regulation, weather forecasting, automobiles, smart cities, etc. all these applications depend heavily on the availability of real-time, accurate data. healthcare with iot is one such area that involves critical decision making [31] [32] [33]. the proposed approach makes use of intelligent routing and would therefore help in the reliable and accurate delivery of data to the integrated healthcare infrastructure, for the proper care of patients. the proposed framework for e-healthcare is shown in figure 10. as the proposed algorithm saves energy, sensor-enabled healthcare devices can work for longer durations, and easy deployment and data analysis are possible due to iot integration [34] [35] [36] [37] [38]. according to the proposed architecture, there can be many different kinds of sensor nodes, such as smart wearables and sensors collecting health data like temperature, heartbeat, number of steps taken every day, sleep patterns, etc. these factors correlate with different existing diseases. the best part of the integration of iot and wsn is that data are collected with the help of sensors and stored in the cloud through iot integration. this health-record cloud may belong to a specific hospital or sit in the public domain. the cloud data can be accessed by healthcare professionals in different ways, to analyze the data and provide feedback to a specific patient or group of patients.
in the recent epidemic of covid-19, telemedicine became one of the most popular uses of this platform. doctors also started offering e-consultations to patients and accessing their health records through the patients' smart wearables. still, there are many challenges, and many improvements are required. the proposed work adds to the energy efficiency of sensors, so that they can work for longer durations; thereafter, these sensor data can be integrated using iot and the cloud, as per the proposed approach shown in figure 10.

in this paper, we proposed a new routing protocol (iop) for intelligently selecting the potential relay node using a naïve bayes classifier to achieve energy efficiency and reliability among sensor nodes. residual energy and distance were used to find the probability of a node becoming a next-hop forwarder. simulation results showed that the proposed iop improved the network lifetime, stability, and throughput of the sensor networks. the proposed protocol ensured that nodes far away from the base station become relay nodes only when they have sufficient energy for this duty. additionally, a node in the middle of the source and destination has the highest probability of becoming a forwarder in a round. the simulation results showed that the proposed or scheme was better than mdor and eeor in energy efficiency and network lifetime. future work will examine the possibility of ensuring secure data transmission intelligently over the network. the authors declare no conflict of interest.
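the forwarder selection summarized above can be illustrated with a naïve-bayes-style score over residual energy and distance to the base station; the discretisation and likelihood tables below are invented for illustration, since the learned parameters are not reproduced here.

```r
# hedged sketch: energy and distance are treated as conditionally independent
# evidence that a neighbour is a good next-hop forwarder. all numbers are
# made up; a real deployment would learn them from transmission outcomes.
score_forwarder <- function(energy_level, dist_level) {
  p_good     <- 0.5                                  # prior p(good forwarder)
  lik_energy <- c(low = 0.2, high = 0.8)             # p(energy level | good)
  lik_dist   <- c(near = 0.7, mid = 0.9, far = 0.3)  # p(distance band | good)
  p_good * lik_energy[[energy_level]] * lik_dist[[dist_level]]  # unnormalised
}
neighbours <- data.frame(id     = 1:3,
                         energy = c("high", "low", "high"),
                         dist   = c("far",  "mid", "mid"),
                         stringsAsFactors = FALSE)
scores <- mapply(score_forwarder, neighbours$energy, neighbours$dist)
neighbours$id[which.max(scores)]  # relay chosen under these toy tables
```

note how the middle-distance, high-energy neighbour wins under these toy tables, matching the observation that a node midway between source and destination with sufficient energy is the most probable forwarder.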
an overview of evaluation metrics for routing protocols in wireless sensor networks
comparative study of opportunistic routing in wireless sensor networks
opportunistic routing protocols in wireless sensor networks
towards green computing for internet of things: energy oriented path and message scheduling approach. sustain.
toward energy-oriented optimization for green communication in sensor enabled iot environments
greedi: an energy efficient routing algorithm for big data on cloud. ad hoc netw.
an investigation on energy saving practices for 2020 and beyond
opportunistic routing: a review and the challenges ahead
a revised review on opportunistic routing protocol
geographic random forwarding (geraf) for ad hoc and sensor networks: multihop performance
opportunistic multi-hop routing for wireless networks
optimal forwarder list selection in opportunistic routing
simple, practical, and effective opportunistic routing for short-haul multi-hop wireless networks
spectrum aware opportunistic routing in cognitive radio networks
energy-efficient opportunistic routing in wireless sensor networks
a trusted opportunistic routing algorithm for vanet
a novel socially-aware opportunistic routing algorithm in mobile social networks
opportunistic routing algorithm for relay node selection in wireless sensor networks
economy: a duplicate free opportunistic routing
mobile-aware service function chain migration in cloud-fog computing
service function chain orchestration across multiple domains: a full mesh aggregation approach
online learning offloading framework for heterogeneous mobile edge computing system
virtualization in wireless sensor networks: fault tolerant embedding for internet of things
traffic priority aware medium access control protocol for wireless body area network
an energy efficient opportunistic routing metric for wireless sensor networks
middle position dynamic energy opportunistic routing for wireless sensor networks
an intelligent opportunistic routing protocol for big data in wsns
recent advances in energy-efficient routing protocols for wireless sensor networks: a review
radio link quality estimation in wireless sensor networks: a survey
futuristic trends in network and communication technologies
futuristic trends in networks and computing technologies. communications in computer and information
handbook of wireless sensor networks: issues and challenges in current scenario's. lecture notes in networks and systems 121
proceedings of icric 2019
introduction on wireless sensor networks issues and challenges in current era. in handbook of wireless sensor networks: issues and challenges in current scenario's
congestion control for named data networking-based wireless ad hoc network
deployment and coverage in wireless sensor networks: a perspective

key: cord-332313-9m2iozj3 authors: yang, hyeonchae; jung, woo-sung title: structural efficiency to manipulate public research institution networks date: 2016-01-13 journal: technol forecast soc change doi: 10.1016/j.techfore.2015.12.012 sha: doc_id: 332313 cord_uid: 9m2iozj3

with the rising use of network analysis in the public sector, researchers have recently begun paying more attention to the management of entities from a network perspective. however, guiding elements in a network is difficult because of their complex and dynamic states. in a bid to address the issues involved in achieving network-wide outcomes, our work sheds new light on quantifying the structural efficiency of controlling inter-organizational networks maintained by public research institutions. in doing so, we draw attention to the set of subordinates suitable as change initiators for influencing the entire research profiles of subordinates from three major public research institutions: the government-funded research institutes (gris) in korea, the max-planck-gesellschaft (mpg) in germany, and the national laboratories (nls) in the united states.
building networks on research similarities in portfolios, we investigate these networks with respect to their structural efficiency and topological properties. according to our estimation, fewer than 30% of nodes are sufficient to initiate a cascade of changes throughout the network across institutions. the subunits that drive the network exhibit an inclination neither toward retaining a large number of connections nor toward having a long academic history. our findings suggest that this structural-efficiency indicator helps assess structural development or improvement plans for networks inside a multiunit public research institution.

in contrast to industrial research, public research is more inclined to distribute its findings than to commercialize them (geffen and judd, 2004). in general, institutes conducting public research are largely government funded and target the public domain (bozeman, 1987). because of their national orientation and stable funding sources, public research institutes conduct cutting-edge research in at least one academic field through long-term plans (greater than three years) (bozeman, 1987). a public research institution often develops as an association of research institutes rather than as a single organization. research entities within a public research institution enjoy institutional autonomy in their choice of subjects, notwithstanding that they operate under the same umbrella of governance. naturally, research organizations have different characteristics depending on national circumstances. some public research institutions, such as the max planck gesellschaft (mpg) in germany, are faithful to pure research (philipps, 2013), while others have significance within a particular national context: part of the national laboratories (nls) in the united states (us) addresses defense-related technologies (jaffe and lerner, 2001), and the government-funded research institutes (gris) in korea attempt to assist in the country's economic development by promoting indigenous public research (mazzoleni and nelson, 2005; arnold, 1988; lee, 2013). with recent advances in our understanding of networks, it is possible to apply novel network knowledge to manage public research institutions in response to internal and external changes. for example, entities in national innovation systems (freeman, 2004) or the triple helix models (phillips, 2014; leydesdorff, 2003) can be external factors affecting the research of public research institutions. the notion of national innovation systems provides a framework to explain, from a network perspective on public and private organizations, the underlying incentive structures for technological development at a national level and international differences in competence (patel and pavitt, 1994). the triple helix model considers the coevolution of academia, industry, and government, which drives the techno-economic development of a country (leydesdorff et al., 2013). in these systems, public research institutes provide fiscal and technical assistance to other organizations. kondo (2011) pointed out that public research institutes are dedicated to transferring technologies to industry by means of consulting, licensing, and spinning off. by doing so, they contribute to promoting integration and coordination within the system (provan and milward, 1995). system organizers can thus formulate policies and procedures to steer the entire system by properly guiding public research institutes.
in this context, control of those key agencies is important to achieving desirable outcomes. moreover, there is a growing need for efficient implementation throughout public research institutions composed of multiple sub-organizations in order to deal with internal controls (yang and jung, 2014). for example, most public research institutions have undergone transformations in recent years due to modernization, imperatives for efficiency, and the promotion of collaboration with industry (buenstorf, 2009; cohen et al., 2002; simpson, 2004; senker, 2001). in unfavorable economic conditions, declining government funding causes the restructuring of research areas (malakoff, 2013; izsak et al., 2013), or the government demands more practical outputs, such as conducting applied research and setting standards (oecd, 2011). in an attempt to harness technology for socio-economic development, governments often prioritize future research through foresight activities (priedhorsky and hill, 2006) and accordingly assign new academic missions to public research institutions. in particular, developing countries have lately been paying more attention to the technology-driven development model under government supervision (arnold, 1988). in such cases, controlling every entity would enable the institution to fully guide these internal changes, but it entails great expense.

from 1935 to 1945, public research institutions engaged in national strategic areas, including the exploration of mineral resources, industrial development, and military research and development (r&d) (oecd, 2011). after the end of world war ii, the establishment of public research institutions grew in many countries in an effort to advance military technology. moreover, at that time, public research institutions extended to almost all areas with which governments were associated, such as economic and social issues. they continued growing until the 1960s. in the 1970s and 1980s, many countries expressed doubts about their contributions to innovation. however, as the understanding of national innovation systems and the triple helix model deepened, public research institutions came to be seen in a new light. in these models, public research institutions have played an indispensable role in preventing systemic failures, which reduce the overall efficiency of r&d (lundvall, 2007; sharif, 2006), owing to their relations with external collaborators (klijn and koppenjan, 2000; mcguire, 2002). the importance of public research institutions is still emphasized, in particular for scientific innovation (cabanelas et al., 2014). in this regard, a network approach is necessary to efficiently implement transformations throughout sub-organizations, and academic interest in the effective operation of such networks is also growing (cabanelas et al., 2014; jiang, 2014). there is, however, a lack of empirical research on managing public research institutions as a network system. hence, in this paper, we conceptualize three major public research institutions (the mpg, the nls in the us, and the gris in korea) as networks, identify the sub-organizational network structure of each, and examine its structural efficiency. a collaborative research network is one of the most prevalent inter-organizational configurations (shapiro, 2015). however, we deem topical similarity between research institutes suitable for representing a relation between their research interests.
most transformations involve changes in research areas, and changes in organizational research topics frequently occur when governments prioritize specific research fields or delegate new roles to institutes (wang and hicks, 2013). prior studies likewise emphasized the importance of similarity in knowledge content among entities for effectively managing inter-organizational networks (tsai, 2001; hansen, 2002). for these reasons, a network here is formed by pairs of subunits having the most similar research profiles. with the addition of temporal dynamics to inter-organizational relations, a chain of networks over time allows the description of the structural evolution of public research institutions. based on the revealed networks, we determined the structural efficiency with which network-wide actions can influence entities within finite time periods. no matter what measure is put in place, all members of the network need to adopt it to achieve collective action. in the early stages of change implementation, network organizers select initiators of change among the entities. as the change initiators propagate control actions to the remaining entities, a public research network can be steered in the desired direction like a car. we can derive a minimum number of suitable initiators from the theory of "structural controllability" (yuan et al., 2013). in this theory, change initiators refer to the injection points of the external energy used to steer the network, which are theoretically selected depending on network structure. in this process, structural efficiency is obtained by calculating the share of change initiators in the network: the lower the value, the smaller the number of entities the network manager is required to handle. therefore, by comparing efficiencies with structural properties over time, we can estimate network characteristics specific to institutions. in this study, we divided institutional research portfolios into six time periods based on scientific output over eighteen years (1995-2012), and estimated the structural efficiencies of the research similarity networks. considering structural efficiency, we can observe that the networks in all three research institutions can be managed with fewer than 30% of sub-organizations, and the values reflect the changes that have occurred in the research institutions. each research institution has some sub-organizations consistently selected as suitable change initiators over a period of time. our results primarily highlighted young subordinates as appropriate change initiators, which means that information blockades in the network might occur unless the selected units are properly managed. moreover, the estimated change initiators tend to have lower connectivity in the network than the rest of the nodes. we expect our work to have implications for decision-making bodies and network managers seeking an efficient way to impose their intentions on a network of public research institutes. the remainder of this paper is structured as follows: in section 2, we briefly describe the impact of structure on network effectiveness associated with public research institutions based on past research. section 3 is devoted to an explanation of data sources, network construction processes, and the calculation of structural controllability in a network. we discuss the results of our experiments in sections 4 and 5, and offer our conclusions in section 6.
methods for the utilization and development of networks have grown in an attempt to address complex problems that require collective effort. when the purpose of the network is to deliver public services, independent organizations are generally involved in the process, and interdependency between participants facilitates the formation of links (kickert et al., 1997). by exchanging knowledge through a network, public research organizations attain a higher level of performance and, at the same time, create a greater ability to innovate (morillo et al., 2013). goldsmith and eggers (2004) claimed that using networks as a vehicle is favorable to organizations that require flexibility, rapidly changing technology, and diverse skills, because actors can exchange goals, information, and resources while interacting with each other. resources usually refer to units of transposable value, such as money, materials, and customers, and information signifies exchangeable units between agencies, such as reports, discussions, and meetings. with regard to the goods exchanged between organizations, van de ven (1976) underlined the importance of information and resources as "the basic elements of activity in organized forms of behavior." in research systems, organizations can take advantage of network participation to improve their chances of funding, to broaden their research spectrum, or to reduce the risk of failure (beaver, 2001). therefore, networks are beneficial because they can pool resources, permit the mutual exploration of opportunities, and create new knowledge (priedhorsky and hill, 2006). however, strategies are needed to coordinate interactions while managing networks, because different actors have different goals and preferences concerning a given problem (kickert et al., 1997; o'mahony and ferraro, 2007). the capability of network management is also necessary to promote innovations (pittaway et al., 2004), but there remain questions as to how to manage such organizational interactions, as beaver (2001) pointed out. orchestrating activities may seem unnecessary given the interactions between autonomous organizations, but addressing conflicts keeps agencies cooperating toward the goal of the network, thereby facilitating the effective allocation and efficient utilization of network resources. furthermore, a network sometimes needs to be intentionally formed to boost management by governing parties, which may be either an external organization or one or more network participants (provan and kenis, 2007). public research institutions can be said to be governed by external organizations, considering that separate entities, such as ministries, research councils, and other steering bodies, are generally in charge of their administration. both the mpg and the korean gris are apparently steered by a single entity. the fundamental management policy of the nls in the us also originates in a federal agency, although several laboratories are operated by contract partners. by frequently repeating interactions among actors, networks produce certain outcomes. the performance of a network is evaluated according to whether the network effectively attains its goal. the outcome varies depending on governing strategies, and the course of attainment can be enhanced by taking advantage of the structural properties of the network (kickert et al., 1997; goldsmith and eggers, 2004).
provan and milward (2001) argued that the assessment of network effectiveness should involve consideration not only of beneficiaries, but also of the administrative entities and the participants of the network. nevertheless, the literature on networks has paid more attention to evaluating effectiveness by treating networks as a whole, such that network-level accomplishment is assessed primarily against the common goal (provan and milward, 1995; möller and rajala, 2007). there remain difficulties in determining network effectiveness. the problem primarily resides in the impossibility of quantifying the exact network outcome (provan and lemaire, 2012). as agranoff (2006) claimed, networks are not always directly related to policy adjustments, because some interactions are forged by voluntary information exchange or educational services. in a public research institution, researchers engaged in specialized fields have the opportunity to share ideas across administrative boundaries, given that they share the goal and the intention of generating public knowledge. outcomes of research networks can be approximated by proxy variables, such as patent and paper citations, innovation counts, new product sales, and productivity growth (council, 1997). furthermore, such networks also indirectly affect subsequent movements and policies. thus, network efficiency needs to be measured for various types of networks, considering factors beyond collaborations. in order to increase network effectiveness, structural efficiency in networks is important: since all entities are connected, damage to one part can cause the collapse of the entire system through a cascade of failures. in this regard, considerable research on networks has focused on deliberately building efficiently manageable networks (cabanelas et al., 2014; kickert et al., 1997; van de ven, 1976; provan and kenis, 2007). certain network structures can affect innovation performance by catalyzing knowledge exchange (valero, 2015). enemark et al. (2014) argued for the importance of network structure to collective action via experimental tests demonstrating that structural variations in a network can either improve or degrade network outcomes. however, there is ambiguity about the network structures appropriate for achieving effective control. pittaway et al. (2004) suggested that longitudinal network dynamics need to be taken into account when designing network topologies. a network is required to change its members or structure in order to adapt to environmental changes. much of the literature on networks has emphasized that instability is an opportunity for transformation (hicklin, 2004). although the capability of flexible response is one of the strongest features of a network model, such network dynamics make networks challenging to manage effectively. with regard to network size, it is widely known that the greater the number of actors involved, the more difficult it becomes for the network to achieve collective cooperation (kickert et al., 1997). increasing the number of participants results in more complex network governance, because the number of potential interactions also escalates rapidly. however, prior research found that research networks evolved to be more centralized as they grew (ferligoj et al., 2015; hanaki et al., 2010). the growth patterns of research networks imply that adding an entity does not always increase the complexity of network management.
theorists have instead claimed that the introduction of a new node can improve the efficiency of controlling networks (klijn and koppenjan, 2000). centralization captures the extent of inequality with which important nodes are distributed across the network, and is often measured in terms of freeman's centralities (freeman et al., 1979). a network centralized by degree (the number of connections) is known to readily coordinate across agencies and closely monitor services (provan and milward, 1995). in complex networks, a minority of nodes, referred to as hubs, dominates the connections, while the majority is connected to a small number of points (barabási and albert, 1999). research revealed that complex networks are robust against random failures (albert et al., 2000). hubs in research networks were not only empirically impressive in their performance (echols and tsai, 2005; dhanarag and parkhe, 2006), but also found it easy to access new knowledge developed by other entities (tsai, 2001). hanaki, nakajima, and ogura (hanaki et al., 2010) also found that r&d collaboration networks evolved toward more centralized structures, because organizations prefer to collaborate with reliable partners based on referrals obtained from former partners. however, a high degree of integration is not always desirable. provan and lemaire (2012) proposed that the connective intensity between organizations should be appropriately controlled for an effective network structure. cabanelas et al. (2014) also found that research networks producing high performance featured nodes with low degree centrality. no matter what types of networks develop out of interactions, goal achievement is possible only when the relevant information spreads throughout the network to encourage actors to conform. in recent years, for public research institutions, the controllability of organizational portfolios has been seen as constitutive of dynamic capabilities, meaning the "ability to integrate, build, and reconfigure internal and external competencies to address rapidly changing environments" (teece et al., 1997; floricel and ibanescu, 2008). in this sense, estimating the effort required to control the entities of public research institutions amounts to assessing the feasibility of research reorganization over networks. at the same time, the number of key points in the information flow within a network affects the burden on network administration. although earlier work emphasized that selectively activating critical actors is more effective for integration than full activation, the system must secure the capability to exercise influence across agencies (kickert et al., 1997; provan and lemaire, 2012). furthermore, the efficiency with which a network structure can be manipulated would be a suitable criterion to evaluate the built structure. this section is devoted to describing the methods of network construction based on the collected bibliographies, along with the analytical methods. we describe how structural efficiency is quantified for a given structure to control the whole network, and explain the structural properties used to explore their relation with structural efficiency. in the process of the efficiency calculation, we extract the organizations suitable to initiate transformation. this investigation was conducted in the r ver. 3.1.2 environment (r core team, 2015), and used the following add-on packages for convenience: ggplot2 (wickham, 2009) and igraph (csardi and nepusz, 2006).
we identified research portfolios based on scientific output, and gathered bibliographic data regarding the nls, the mpg, and the gris from the thomson reuters web of knowledge. academic output over eighteen years (1995-2012) was compiled according to institutional names and abbreviations in authors' affiliations. we only used affiliations in english for this study. the subordinate research institutes listed on official websites were considered, and their portfolios were tracked using at least twenty papers each. all disciplines, the constituent elements of a portfolio, need to be identified using the same classification system for ease of institutional comparison. we utilize the university of california-san diego (ucsd) map of science (borner et al., 2012) as a journal-level classification system. the map classifies documents into 554 sub-disciplines belonging to 13 disciplines on the basis of journal titles. naturally, a research portfolio has two levels of classification: discipline and sub-discipline. in this study, a discipline refers to the aggregate level of sub-disciplines in the hierarchical structure. fig. 1 shows an example of disciplinary mapping using sci2 (sci2 team, 2009). in order to analyze the thematic evolution of the network over time, we split the portfolios into time intervals. with regard to an adequate assessment period for representing the scientific output being measured, abramo et al. (2012) claimed that a three-year period is adequate to assess scientific outputs. accepting their recommendation, we observed the development of institutional portfolios over six consecutive time slices. as a well-known analytical method, a complex network is suitable for exploring dynamic topology changes (strogatz, 2001). here, an inter-organizational network is formed between subordinate institutes with similar research profiles. nodes represent sub-organizations and are connected by a link when two sub-organizations have similar research portfolios. in order to measure similarities, we used "inverse frequency factors" for the weighting system and "second-order cosine similarities" (garcía et al., 2012). the inverse frequency factor borrows from a term discrimination method for text retrieval (salton and yang, 1973; salton and buckley, 1988). the factors weight each sub-discipline in the research portfolio. the weight of sub-discipline $m$ for research institute $i$ is determined by $w_{m,i} = f_{m,i} \times \log(n / n_m)$, where $f_{m,i}$ denotes the number of articles and $\log(n / n_m)$ is the inverse frequency factor that filters out prevalent research (jones, 1972). the logarithmic frequency factor is calculated from the inverse of the ratio of the number of subunits $n_m$ that publish their achievements in sub-discipline $m$ to the total number $n$ of research institutes. as a result, the set of weights generates a 554 sub-disciplines-by-institutes matrix. the similarities between two institutional research portfolios primarily take the cosine measure (salton and mcgill, 1986; baeza-yates and ribeiro-neto, 1999). for the purpose of improving the accuracy of the similarity, we applied second-order approaches to the sub-discipline-by-institute matrix. colliander and ahlgren (2012) explained that first-order approaches directly reflect the similarity between only two profiles, whereas second-order similarities determine those between the two given portfolios and the other institutional portfolios.
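a minimal r sketch, with an invented toy counts matrix (the real one is 554 sub-disciplines by institutes), of the weighting, the first- and second-order cosine similarities, and the prim-style maximum-spanning-tree backbone described next; this is an illustration under the stated assumptions, not the authors' code.

```r
counts <- matrix(c(10, 0, 3,
                    2, 8, 0,
                    0, 5, 7),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("subdisc", 1:3), paste0("inst", 1:3)))
n  <- ncol(counts)              # total number of institutes
nm <- rowSums(counts > 0)       # institutes publishing in each sub-discipline
w  <- counts * log(n / nm)      # w_{m,i} = f_{m,i} * log(n / n_m), applied row-wise

cosine <- function(m) {         # cosine similarity between the columns of m
  s <- t(m) %*% m
  d <- sqrt(diag(s))
  s / (d %o% d)
}
first_order  <- cosine(w)            # direct portfolio-to-portfolio similarity
second_order <- cosine(first_order)  # similarity of similarity profiles

# backbone extraction (see the mst step described next): prim-style greedy
# growth that always adds the highest-similarity edge reaching a new institute.
max_spanning_tree <- function(sim) {
  n <- nrow(sim)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- NULL
  while (sum(in_tree) < n) {
    cand <- sim
    cand[!in_tree, ] <- -Inf    # candidate edges must start inside the tree...
    cand[, in_tree]  <- -Inf    # ...and end outside it
    ij <- which(cand == max(cand), arr.ind = TRUE)[1, ]
    edges <- rbind(edges, ij)
    in_tree[ij[2]] <- TRUE
  }
  edges                         # n - 1 links forming the tree backbone
}
max_spanning_tree(second_order)
```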
a large number of studies have confirmed the superior performance of the second-order approach as well (ahlgren and colliander, 2009; thijs et al., 2012). moreover, to ease structural analysis and network visualization, we strip weak similarities from the research similarity matrix. using the maximum spanning tree (mst) algorithm (kruskal, 1956), we extracted tree-like structures. the mst algorithm ensures that all institutes are connected with maximal similarity, which implies that the institutes are connected through the most relevant links. therefore, a linked pair of institutes indicates greater potential for common intellectual foundations. among the various well-known algorithms for detecting an mst, the backbone of the thematic networks is derived with prim's algorithm (prim, 1957). in order for the network structure to efficiently elicit the desired response from its elements, a certain amount of energy needs to be injected into the network to change the behavior of actors. thus, the selection of several agencies to initiate changes, depending on the network structure, is inevitable. at the same time, it is important to minimize the number of injection points due to management costs. studies on complex networks consider that nodes can dynamically make decisions or change their states in response to information received through links between nodes. as individual actors, nodes in research networks can be researchers or research institutions, and nodal states can be represented by individual research interests or disciplinary composition. here, we estimate the capability to control the behavior of such nodes in complex networks with minimal intervention, adopting the notion of structural controllability. in recent years, a number of studies have focused on driving networks to a predefined state by combining control theory and network science (liu et al., 2011; wang et al., 2012; lombardi and hörnquist, 2007; gu et al., 2014). according to network controllability, if a network system is controllable by imposing external signals on a subset of its nodes, called driver nodes, the system can be effectively driven from any initial state to the desired final state in finite time (kalman, 1963; lin, 1974). thus, network controllability depends on the number and placement of the control inputs. for this reason, structural efficiency refers to the share of driver nodes. in this study, the agencies found using structural controllability are the key locations from which to steer the entire inter-organizational research network. we applied the structural controllability for undirected networks introduced by yuan et al. (2013) to the matrix representations of our temporal msts. each temporal network $G(A)$ was considered a linear time-invariant system $\dot{x}(t) = A x(t)$, where the vector $x \in \mathbb{R}^{n}$ represents the state of the nodes at time $t$, and $A \in \mathbb{R}^{n \times n}$ denotes the research similarity matrix of the mst, such that the value $a_{ij}$ is the portfolio similarity between institutes $i$ and $j$ ($a_{ij} = a_{ji}$). the controlled network $G(A, B)$ corresponds to adding $m$ controllers via the ordinary differential equations $\dot{x}(t) = A x(t) + B u(t)$, where the vector $u(t) \in \mathbb{R}^{m}$ is the controller and $B \in \mathbb{R}^{n \times m}$ is the control matrix. the problem of finding the driver nodes of the system is solved by exact controllability theory, following the popov-belevitch-hautus (pbh) rank condition (hautus, 1969).
to ensure complete control, the control matrix $B$ should satisfy $\operatorname{rank}[\lambda^{\max} I_n - A,\; B] = n$, where $I_n$ is the identity matrix of dimension $n$ and $\lambda^{\max}$ denotes the eigenvalue with the maximum geometric multiplicity $\mu(\lambda_l) = n - \operatorname{rank}(\lambda_l I_n - A)$ among the distinct eigenvalues $\lambda_l$ of $A$. therefore, from a theoretical perspective, changes initiated from the drivers are likely to affect the entire structure. hence, driver institutes are crucial to the functioning of the networks of public research institutes. in this paper, we regard the share of drivers among all agencies as an efficiency indicator, in that the number of drivers is what matters for efficient control. network properties have been utilized in a considerable amount of literature in this area to better understand the structural features of networks (newman, 2003; albert and barabasi, 2002; woo-young and park, 2012). in order to understand the relation between efficiency and the inter-organizational research network, we extracted the major features across institutions based on structural properties such as network size and connectivity. the number of participants represents the network size, associated with network volume. centrality is one of the most studied indicators in network analysis; we measure the influence of a node in a network using degree centrality (freeman et al., 1979; borgatti et al., 2009; freeman, 1978), and examine the degree of the driver nodes. as a nodal attribute, we assign research experience, in time periods, to nodes in order to characterize the driver nodes. this section contains the major results of our investigation of the structural features of the inter-organizational networks. to form our desired skeletal network, we extracted pairs of academically close institutes based on the portfolio similarities among their participants, using the construction algorithm of the maximum spanning tree (mst). the results obtained from the backbone networks are related to structural controllability. in order to address the evolution of inter-organizational research, we assessed the structural features of the temporal msts. figs. 2-4 show the tree-like structures of the institutions over time. each node represents a sub-organization, and its size is proportional to the total number of documents published. the color filling a node was determined by the discipline in which the institute was found to be most productive. the portfolio similarities between pairs of linked institutes provided the weights of the network, and these weights also determined the width of the links. the descriptive statistics of portfolio similarity summarize the distribution of the skeletal relationships between subordinates, as listed in table 1. for all institutions, we found that the distributions were biased toward high similarities between research portfolios. for the nls networks in the us, averages that were overall greater, and standard deviations that were smaller, than those of the other two institutions indicated that most research units were connected with the smallest differences in their research areas. in the case of the gris, on the other hand, the lowest values of average similarity signified that each unit had a distinct research portfolio. the largest standard deviations and the low values of kurtosis for most time periods also showed that their research similarities were the most widely distributed. in order to represent the dynamic characteristics of the msts, their structural properties are listed in table 2.
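following the logic above, the minimum number of drivers for an undirected network equals the maximum geometric multiplicity over the distinct eigenvalues of $A$ (yuan et al., 2013); a small r sketch, with a toy path graph standing in for a real mst:

```r
# number of drivers = max over distinct eigenvalues l of (n - rank(l*I - a));
# the share of drivers is the structural-efficiency indicator used here.
n_drivers <- function(a, tol = 1e-8) {
  n <- nrow(a)
  lambdas <- unique(round(eigen(a, symmetric = TRUE)$values, 8))
  geo_mult <- sapply(lambdas, function(l) n - qr(l * diag(n) - a, tol = tol)$rank)
  max(geo_mult)
}
a <- matrix(0, 4, 4)
a[cbind(1:3, 2:4)] <- 1
a <- a + t(a)              # 4-node path graph (all eigenvalues distinct)
n_drivers(a)               # -> 1: a single driver suffices
n_drivers(a) / nrow(a)     # -> 0.25, the share-of-drivers efficiency value
```

the rounding step merely groups numerically equal eigenvalues; the tolerance values are arbitrary choices for this sketch.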
the number of nodes n increased and, accordingly, the number of links increased to n − 1, following the definition of an mst on a connected network. despite the sparsity of the networks, nodes with a relatively large number of links could be found in some institutions: in particular, the oak ridge national laboratory (ornl) within the nls, which was connected to approximately a quarter of the other organizations for four time periods, and the korea research institute of standards and science (kriss) and the korea institute of science and technology (kist), each of which appeared as the maximally connected node in one half of the dataset. in the mpg, however, there was stiff competition among institutes for the maximum number of connections. we note that the network density (2/n) can be obtained from the number of nodes, and that the transitivity always reduces to zero because an mst rules out cycles. we calculated the periodical change of structural efficiency, and then examined the relations between network efficiency and structural properties. following this, we investigated the features of the estimated driver nodes in terms of degree and period of appearance. note that although the number of driver nodes is theoretically fixed in a network, there can be multiple sets of drivers (jia and barabasi, 2013). we randomly selected a set where multiple driver sets existed. as an indicator of network efficiency obtained from structural controllability, fig. 5 shows the share of driver nodes over time. according to the graph in the figure, the proportion of drivers varied, but the institutions did not have to consider all their agencies for network-wide transformation. fewer than 30% of nodes were selected as suitable points at which to inject external information in all three institutions, since the maximum value of structural efficiency across the entire dataset was about 30%, reached in the second period (1998-2000) in the gris. in particular, the nls could be influenced with a relatively small share of drivers at all times, with the exception of the period 2004-2006, whereas the gris mostly needed the largest portion of nodes to initiate changes. the efficiency fluctuation of the mpg was more stable than that of the other two institutions over the time periods. an understanding of drivers enables administrators to take preemptive action to prevent information isolation, just as knowledge of the relation between the share of drivers and network efficiency can help plan structural development. the total number of driver appearances over the entire period was 13, 25, and 53 for the nls, the gris, and the mpg, respectively, but only 6, 15, and 31 agencies were selected as drivers. this is evidence of the existence of memory in the drivers. moreover, figs. 6 and 7 capture some features of the drivers. fig. 6 compares the average number of links of the drivers and of all nodes over the different periods. despite the common knowledge that nodes with large connectivity are influential, our results showed that drivers with low connectivity tended to determine collective agreement on the network. fig. 7 shows the average durations of appearance of the institutional drivers. based on the average durations, we see that the driver nodes were the ones that had newly entered the network. among the institutions, the research units of the nls showed the widest difference between drivers and non-drivers.
public research has contributed to major innovations by improving the competitiveness of existing industries and developing new ones. as prominent contributors to public research, governments have implemented a variety of support policies and programs for higher efficiency and excellence. among the actors involved in public research, public research institutions aim to disseminate their knowledge by providing various functions: priority-driven research to address national and academic agendas, or blue-skies research engaging large-scale research facilities to complement university research (pot and reale, 2000). to maintain such diversity, public research institutions seek to coordinate elements with varying specializations and missions while adapting to dynamic technological environments. as part of this effort, institutions occasionally attempt to restructure research portfolios or modify organizational placements in relation to other research units. in order to assess the development of public research institutions, we examined in this paper the structural evolution derived from research similarities in the context of networked organizations. more precisely, this study focused on public research institutions composed of several specialized research units, and extracted a network from the similarities between sub-organizational research portfolios over eighteen years. a pair of connected agencies would be most influenced by the same type of exertion on a specific research area. in addition, sub-organizations connected to each other can be potential partners for collaboration because they share similar academic backgrounds. for example, the similarity networks of the gris have implications for the inter-disciplinary research groups operated by the research council: in such a group, researchers working at different gris jointly seek solutions to technological difficulties, and research similarities can indicate the proper gris to resolve them. moreover, offering the advantage of predictable network controllability, network modeling helps in understanding the system's entire dynamics, which can be guided in finite time by controlling the initiators (liu et al., 2011). as a result of the modeling, we can measure the efficiency of the network, where network efficiency implies the proportion of elements required as initiators to change the states of all agencies. the lower the proportion, the greater the network efficiency, because the initiators are injection points for external information. we also revealed the structural properties of the estimated initiators. our research differs from other studies concerning network effectiveness in that it quantitatively estimated the effort required to control an entire inter-organizational network based on its structure. naturally, if we send control signals to every single node, the network is operated with high controllability, but this involves significant cost. thus, by employing the concept of structural controllability, we can theoretically detect the initial spreaders of information that need to be properly treated. otherwise, they would produce barriers to the exertion of authority; in extreme cases, such information blockades could cause network failure (klijn and koppenjan, 2000). however, handling these elements incurs extra cost, because of which it is important to build networks with the minimum possible number of initiators, to reduce the enforcement costs incurred for complete control (egerstedt, 2011).
the common structural features of the estimated initiators can direct the network management of public research institutions. we generated results that provide a clear idea of how the structural efficiency of a research network is related to structural properties such as size and nodal degree. previous work on network governance structures has provided recommendations on how to build and design inter-organizational networks to accelerate innovation. for example, with regard to the number of participants, it is natural to expect that the share of drivers would increase owing to a higher risk of informational insularity caused by increasing structural complexity. however, our findings suggest that this is not necessarily the case. each of the institutions considered here differed in size from the others: the mpg was the largest-scale organization, whereas the nls formed the smallest group in terms of numbers. however, according to our results, the size of the network did not seem to meaningfully affect the proportion of drivers in public research institutions. despite being a medium-sized institution, the networks of the gris were more likely to be inefficient than those of the mpg and the nls. we think this is because an institution more experienced in managing such a union has built more effective structures. we even found that the gris took advantage of the structural reorganization of the network, because additions improved their network efficiency. in this regard, kickert et al. (1997) claimed that the introduction of new actors can be a strategy to accomplish mutual adjustment, since a new institute causes structural changes within the network. proposition 1. a subset of nodes positioned in structurally important locations will have the ability to steer the whole network of a public research institution. our findings indicated that control actions applied to fewer than half of the research units can lead to changes across the entire system, and that the same units repeatedly appear over time. we suspect the reason is that public research institutions are designed to be cost-effective and resilient, as are their infrastructure networks. however, as national research structures can be affected by government policies (hossain et al., 2011), network efficiency also changes over time. a drastic fluctuation in the share of drivers would be related to changes in the relevant institution's strategy or operation. for example, the gris underwent a restructuring to remove redundancy, and began operating under the research councils after 1999. we can capture these drastic changes in our results, because the structural efficiency of the gris significantly increased between the second and third periods, that is, between 1998 and 2003. the results would imply that the organizational rearrangements in the gris worked well. besides, the research subjects of the nls were revamped in the 2001-2006 period due to several events, i.e., the september 11 attacks and the outbreak of the severe acute respiratory syndrome (sars). since the terrorist attacks of september 11, 2001, the nls have made greater efforts to reinforce national security by working on nuclear weapons and the intelligent detection of potentially dangerous events. moreover, the sudden epidemic of sars accelerated multidisciplinary research in the nls on vaccines, therapeutics, bioinformatics, and bioterrorism.
we also find that the structural efficiency of the nls was severely affected during the readjustment period. these changes in portfolio composition would cause temporary disarray in the structure of the networks. on the other hand, the stable fluctuations in the mpg would be attributable to internal transitions for scientific advancement rather than external impact. the mpg expands its research topics mostly by spinning off units, because each unit has its own research area. proposition 2. variations in the structural efficiency of research networks will reflect structural changes in research composition. another difference between past research and our work here is that degree centralization is not invariably recommendable. policy makers and network scientists have hitherto paid attention to highly connected institutes, because hubs are regarded as network facilitators. however, our findings indicated that most key elements were apt to have low degrees. our study focused on revealing the injection points that infuse their nearest neighbors with energy, regardless of the amount of energy required, and these nodes impart direction to their connected neighbors one at a time rather than exerting control forces over all their adjacencies simultaneously. obviously, a hub at which energy enters can effectively reach the agencies within its orbit, but its diffusion range is limited. thus, our observations suggest that network-wide influence depends on nodes with low connectivity. in this context, a network with moderately distributed focal points can be more effective at influencing all organizations than a thoroughly concentrated one. furthermore, emergent sub-organizations tend to have a greater effect on structural efficiency than sub-organizations with a long research history. we suspect this is because a new research institute is often derived from a larger unit in a public research institution, holds lower research similarity with units other than its parent, and takes a position at the border, beyond any energy ranges. another possibility is that a newly established research institute has unstable research portfolios, as braam and van den besselaar (2014) pointed out. the instability of a new research institute's research areas can increase uncertainty about the consequences of network-wide changes. therefore, network managers may need to monitor the degree of acceptance of a network-wide action, especially among emerging sub-organizations. this result is also consistent with recent observations whereby driver nodes in real-world networks tend to avoid high-degree nodes (liu et al., 2011). proposition 3. other things being equal, the possibility of controlling the whole research network will increase when control actions work properly at nodes with low connectivity and a short research history. we consider that the differences in network effectiveness between existing studies and our findings originate from whether the complete functioning of all elements was considered. previous studies on maximizing network effectiveness implicitly presupposed the complete performance of all entities a priori, despite conflicts between participants; but network managers at the very least need to ensure the complete operation of their network. for the full functioning of a network, all elements are required to be within the sphere of influence of the network manager for network-wide control.
we deal with the possibility of managing network behavior in public research institutions by quantifying the effort required to implement maneuvers. in order to avoid control blockades, we showed the importance of elements with, inter alia, low connectivity and brief experience in academia. this study provided theoretical results for structural controllability assuming some ideal situations, such as that measures were implemented on a network skeleton without redundant connectivity, that sufficient resources were provided to change the network, that all institutes respected the administrator's intention, and that there were no conflicts between pairs of connected institutes. the success or failure of such measures can only be determined once the processes are completely implemented, because the dynamic nature of inter-organizational networks raises difficulties in coordination. nevertheless, estimating the completion of network-wide objectives is still critical to network planning and design. our theoretical calculations here can assist decision making for structural improvement plans. moreover, the common features of the selected initiators are sufficient to suggest elements significant to attaining a synchronized response across an institutional network when reorganizing research portfolios. public research institutions continue to gain prominence in the development of national agendas of science and technology. institutions have their own strategies according to their values and interests in research trends, to a greater or lesser extent. governments and research councils significantly affect these institutes through policies, programs, funding, and financial support in an effort to better coordinate their research agencies (rammer, 2006). therefore, guiding the subunits of these institutes in a network is important to efficiently deliver managerial control. in doing so, administrators should be concerned with improving the network structure to enhance its outcomes. however, manipulating network structure is difficult because of the complex and dynamic states of the sub-organizations. in this study, we quantified the network structural efficiency of maneuvering a set of spontaneous elements toward network-wide goals by using the theory of structural controllability (yuan et al., 2013), and we tracked the efficiency of the networks of three public research institutions: the gris in korea, the nls in the us, and the mpg in germany. for the relevant calculations, we extracted a hidden network structure from each institution based on similarities between the profiles of their subordinate organizations. the resulting structural efficiencies enabled the assessment of the operational strategies of each institution over eighteen years. the elements selected by structural controllability implied suitable points at which to inject external energy for governing networks. revealing the injection points was important to prevent information blockages that hinder collective action. apparently, the greater the number of injection points required, the lower the efficiency of the network, owing to the increased burden of management. our findings indicate that structural efficiencies reflect changes in the research interests of an institution. in this sense, research institutions need to track structural controllability to assess structural changes, such as portfolio adjustments across all sub-organizations.
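to make the efficiency indicator concrete, a minimal python sketch of the driver-node count is given below; it follows the exact controllability result of yuan et al. (2013) for undirected networks, in which the minimum number of drivers equals the largest multiplicity among the adjacency-matrix eigenvalues. the function name driver_share and the tolerance used to group numerically close eigenvalues are our own illustrative choices, not part of the original analysis.

    import numpy as np
    import networkx as nx

    def driver_share(G, tol=1e-8):
        """minimum driver-node share n_D = N_D / N for an undirected network,
        using exact controllability (yuan et al., 2013): N_D equals the
        largest multiplicity among the adjacency-matrix eigenvalues (for a
        symmetric matrix, geometric and algebraic multiplicities coincide)."""
        eigvals = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(G)))
        best = mult = 1
        for i in range(1, len(eigvals)):
            mult = mult + 1 if eigvals[i] - eigvals[i - 1] < tol else 1
            best = max(best, mult)
        return best / G.number_of_nodes()

    # a star needs many drivers (its leaves are structurally equivalent),
    # whereas a path can be steered from a single node
    print(driver_share(nx.star_graph(9)))   # 0.8 -> low structural efficiency
    print(driver_share(nx.path_graph(10)))  # 0.1 -> high structural efficiency

in this reading, a lower driver share corresponds to a higher structural efficiency, which is how the indicator is used in the comparisons above.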
structural controllability can also identify suitable spots for intervention by a network manager (ministries, research councils, or steering bodies), namely the driver nodes associated with structural changes. according to our results, the proper intervention points tend to be sub-organizations with low connectivity as well as young ones. in spite of these implications for the managing strategies of inter-organizational networks, this study has shortcomings that limit the generalizability of our findings. scientific articles represent only part of an institute's capacity for research. major scientific outputs are classified into two types: scientific articles and patents. depending on their major research types, some institutes concentrate on patents instead of publications. as a result, research portfolios derived from richer data sources than those used here would more precisely depict institutional research capacity. another limitation of this study is that network properties other than those considered here, such as network density, clustering coefficient, and betweenness centrality, might also affect structural efficiency. furthermore, our findings raised several questions that suggest directions for future research. these include exploring the range of drivers' influence on structural efficiency, determining the optimal network structure to steer, and investigating diverse network properties with other types of players in innovation systems, e.g., academia and industry.
references
what is the appropriate length of the publication period over which to assess research performance?
inside collaborative networks: ten lessons for public managers
document-document similarity approaches and science mapping: experimental comparison of five approaches
statistical mechanics of complex networks
error and attack tolerance of complex networks
science and technology development in taiwan and south korea
modern information retrieval
emergence of scaling in random networks
reflections on scientific collaboration (and its study): past, present, and future
network analysis in the social sciences
design and update of a classification system: the ucsd map of science
all organizations are public: bridging public and private organizational theories
indicators for the dynamics of research organizations: a biomedical case study
is commercialization good or bad for science? individual-level evidence from the max planck society
influence of governance on regional research network performance
links and impacts: the influence of public research on industrial r&d
experimental comparison of first and second-order similarities in a scientometric context
industrial research and innovation indicators: report of a workshop
the igraph software package for complex network research
orchestrating innovation networks
niche and performance: the moderating role of network embeddedness
complex networks: degrees of control
knowledge and networks: an experimental test of how network knowledge affects coordination
scientific collaboration dynamics in a national scientific system
using r&d portfolio management to deal with dynamic risk
centrality in social networks: conceptual clarification
technological infrastructure and international competitiveness
centrality in social networks: ii. experimental results
mapping academic institutions according to their journal publication profile: spanish universities as a case study
innovation through initiatives-a framework for building new capabilities in public sector research organizations
governing by network: the new shape of the public sector
controllability of brain networks
the dynamics of r&d network in the it industry
knowledge networks: explaining effective knowledge sharing in multiunit companies
controllability and observability conditions of linear autonomous systems
network stability: opportunity or obstacles?
mapping the dynamics of knowledge base of innovations of r&d in bangladesh: triple helix perspective
the impact of the crisis on research and innovation policies, european commission dg research
reinventing public r&d: patent policy and the commercialization of national laboratory technologies
control capacity and a random sampling method in exploring controllability of complex networks
international student flows between asia, australia, and russia: a network analysis
a statistical interpretation of term specificity and its application in retrieval
mathematical description of linear dynamical systems
managing complex networks: strategies for the public sector
public management and policy networks
on the shortest spanning subtree of a graph and the traveling salesman problem
multidisciplinary team research as an innovation engine in knowledge-based transition economies and implication for asian countries
the mutual information of university-industry-government relations: an indicator of the triple helix dynamics
a routine for measuring synergy in university-industry-government relations: mutual information as a triple-helix and quadruple-helix indicator
structural controllability
controllability of complex networks
controllability analysis of networks
national innovation systems-analytical concept and development tool
as budgets tighten, washington talks of shaking up doe labs
the roles of research at universities and public labs in economic catch-up. columbia university, initiative for policy dialogue, working paper
managing networks: propositions on what managers do and why they do it
rise of strategic nets - new modes of value creation
do networking centres perform better? an exploratory analysis in psychiatry and gastroenterology/hepatology in spain
the structure and function of complex networks
public research institutions: mapping sector trends
the emergence of governance in an open source community
national innovation systems: why they are important, and how they might be measured and compared
mission statements and self-descriptions of german extra-university research institutes: a qualitative content analysis
triple helix and the circle of innovation
networking and innovation: a systematic review of the evidence
convergence and differentiation in institutional change among european public research systems: the decreasing role of public research institutes
identifying strategic technology directions in a national laboratory setting: a case study
shortest connection networks and some generalizations
modes of network governance: structure, management, and effectiveness
core concepts and key ideas for understanding public sector organizational networks: using research to inform scholarship and practice
a preliminary theory of interorganizational network effectiveness: a comparative study of four community mental health systems
do networks really work? a framework for evaluating public-sector organizational networks
r: a language and environment for statistical computing
trends in innovation policy: an international comparison
term-weighting approaches in automatic text retrieval
introduction to modern information retrieval
on the specification of term values in automatic indexing
science of science (sci2) tool
changing organisation of public-sector research in europe-implications for benchmarking human resources in rtd
establishing 'green regionalism': environmental technology generation across east asia and beyond
emergence and development of the national innovation systems concept
after the reforms: how have public science research organisations changed?
exploring complex networks
dynamic capabilities and strategic management
do second-order similarities provide added-value in a hybrid approach
knowledge transfer in intraorganizational networks: effects of network position and absorptive capacity on business unit innovation and performance
effective leadership in public organizations: the impact of organizational structure in asian countries
on the nature, formation, and maintenance of relations among organizations
detecting structural change in university research systems: a case study of british research policy
optimizing controllability of complex networks by minimum structural perturbations
the network structure of the korean blogosphere
a strategic management approach for korean public research institutes based on bibliometric investigation
exact controllability of complex networks
hyeonchae yang is a ph.d. candidate of the graduate program for technology and innovation management at pohang university of science and technology. the authors are grateful to the editors of the journal and the reviewers for their support and work throughout the process. this work is supported by the mid-career researcher program through the national research foundation of korea (nrf) grant funded by the ministry of science, ict and future planning (2013r1a2a2a04017095).
key: cord-285647-9tegcrc3 authors: estrada, ernesto title: fractional diffusion on the human proteome as an alternative to the multi-organ damage of sars-cov-2 date: 2020-08-17 journal: chaos doi: 10.1063/5.0015626 sha: doc_id: 285647 cord_uid: 9tegcrc3
the coronavirus 2019 (covid-19) respiratory disease is caused by the novel coronavirus sars-cov-2 (severe acute respiratory syndrome coronavirus 2), which uses the enzyme ace2 to enter human cells. this disease is characterized by important damage at a multi-organ level, partially due to the abundant expression of ace2 in practically all human tissues. however, not every organ in which ace2 is abundant is affected by sars-cov-2, which suggests the existence of other multi-organ routes for transmitting the perturbations produced by the virus. we consider here diffusive processes through the protein-protein interaction (ppi) network of proteins targeted by sars-cov-2 as an alternative route. we found a subdiffusive regime that allows the propagation of virus perturbations through the ppi network at a significant rate. by following the main subdiffusive routes across the ppi network, we identify proteins mainly expressed in the heart, cerebral cortex, thymus, testis, lymph node, kidney, among others of the organs reported to be affected by covid-19.
… hypothesized as a potential cause of the major complications of covid-19. 11, 12
however, it has been found that ace2 has abundant expression on the endothelia and smooth muscle cells of virtually all organs. 13 therefore, it should be expected that, once sars-cov-2 is present in the circulation, it can spread across all organs. in contrast, both sars-cov and sars-cov-2 are found specifically in some organs but not in others, as shown by in situ hybridization studies for sars-cov. this was already remarked by hamming et al. 13 by stressing that it "is remarkable that so few organs become virus-positive, despite the presence of ace2 on the endothelia of all organs and sars-cov in blood plasma of infected individuals." recently, gordon et al. 14 identified human proteins that interact physically with those of sars-cov-2, forming a high-confidence sars-cov-2-human protein-protein interaction (ppi) system. using this information, gysi et al. 15 discovered that 208 of the human proteins targeted by sars-cov-2 form a connected component inside the human ppi network. that is, these 208 proteins are not randomly distributed across the human proteome; they are closely interconnected by short routes that allow moving from one to another in just a few steps. these interdependencies of protein-protein interactions are known to enable perturbations on one interaction to propagate across the network and affect other interactions. [16] [17] [18] [19] in fact, it has been suggested that diseases are a consequence of such perturbation propagation. [20] [21] [22] it has been stressed that the protein-protein interaction process requires diffusion in its initial stages. 23 the diffusive processes occur when proteins, possibly guided by electrostatic interactions, need to encounter each other many times before forming an intermediate. 24 not surprisingly, diffusive processes have guided several biologically oriented searches in ppi networks. 25, 26 therefore, we assume here that perturbations produced by sars-cov-2 proteins on the human ppi network are propagated by means of diffusive processes. however, due to the crowded nature of the intra-cellular space and the presence in it of spatial barriers, subdiffusive processes, rather than normal diffusion, are expected for these protein-protein encounters. [27] [28] [29] this creates another difficulty, as remarked by batada et al., 23 which is that such (sub)diffusive processes alone are not sufficient for carrying out cellular processes at a significant rate in cells. here, we propose the use of a time-fractional diffusion model on the ppi network of proteins targeted by sars-cov-2. the goal is to model the propagation of the perturbations produced by the interactions of human proteins with those of sars-cov-2 through the whole ppi network. the subdiffusive process emerging from the application of this model to the sars-cov-2-human ppis has a very small rate of convergence to the steady state. however, this process produces a dramatic increment of the probability that certain proteins are perturbed at very short times. this kind of shock-wave effect in the transmission of perturbations occurs at much earlier times in the subdiffusive regime than in the normal diffusion one. therefore, we propose here a switch and restart process in which a subdiffusive process starts at a given protein of the ppi network and perturbs a few others, which then become the starting points of a new subdiffusive process. using this approach, we then analyze how the initial interaction of the sars-cov-2 spike protein with a human protein propagates across the whole network.
we discover some potential routes of propagation of these perturbations from proteins mainly expressed in the lungs to proteins mainly expressed in other tissues, such as the heart, cerebral cortex, thymus, lymph node, testis, prostate, liver, small intestine, duodenum, and kidney, among others.
a. settling a model
the problem we intend to model here is of large complexity, as it deals with the propagation of perturbations across a network of interacting proteins, each of which is located in a crowded intracellular space. therefore, we necessarily have to impose restrictions and make assumptions to settle our modeling framework. as we have mentioned in sec. i, protein encounters should necessarily occur in subdiffusive ways due to the crowded environment in which they are embedded, as well as the existence of immobile obstacles such as membranes. by a subdiffusive process, we understand that the mean square displacement of a protein scales as ⟨r²(t)⟩ ∼ t^κ, where 0 < κ < 1 is the anomalous diffusion exponent. as observed by sposini et al., 30 these anomalous diffusive processes can emerge from (i) continuous time random walk (ctrw) processes or (ii) viscoelastic diffusion processes. in the first case, the "anomaly" is created by power-law waiting times in between motion events. the second kind of processes is mainly accounted for by the generalized langevin equation with a power-law friction kernel as well as by fractional brownian motion (fbm). while the first processes are characterized by a stretched gaussian displacement probability density, weak ergodicity breaking, and aging, the second ones are ergodic processes characterized by a gaussian probability density distribution. therefore, our first task is to discern which of these two kinds of approaches is appropriate for the current scenario. we start by mentioning that weiss et al. 31 have analyzed data from fluorescence correlation spectroscopy (fcs) for studying subdiffusive biological processes. they have, for instance, reported that membrane proteins move subdiffusively in the endoplasmic reticulum and golgi apparatus in vivo. subdiffusion of cytoplasmatic macromolecules was also reported by weiss et al. 32 using fcs. then, guigas and weiss 27 simulated the way in which the subdiffusive motion of these particles should occur in a crowded intracellular fluid. they did so by assigning diffusive steps from a weierstrass-mandelbrot function yielding an fbm. they stated that ctrw was excluded due to its markovian nature. in another work, szymanski and weiss 33 used fcs and simulations to analyze the subdiffusive motion of a protein in a simulated crowded medium. first, they reported that crowding-induced subdiffusion is consistent with the predictions from fbm or obstructed (percolation-like) diffusion. second, they reported that ctrw does not explain the experimental results obtained by fcs and should not be appropriate for such processes. the time resolution of fcs is in the microsecond range, i.e., 10^-6 s. 34 however, an important question on biological subdiffusion may require higher time resolution to be solved. this is the question of how diffusive processes at short times, while the macromolecule has not yet felt the crowding of the environment, are related to the long-time diffusion. this particular problem was explored experimentally by gupta et al. 35
by using state-of-the-art neutron spin-echo (nse) and small-angle neutron scattering (sans), which have a resolution in the nanosecond range, i.e., 10^-9 s. their experimental setting was defined by the use of two globular proteins in a crowded environment formed by poly(ethylene oxide) (peo), which mimics a macromolecular environment. in their experiments, nse was used to tackle the fast diffusion process, which corresponds to a dynamics inside a trap built by the environment mesh. sans captures the slow dynamics, which corresponds to the long-time diffusion at macroscopic length scales. from our current perspective, the most important result of this work is that the authors found that in higher concentrations of polymeric solutions, as in the intracellular space, the diffusion is fractional in nature. they showed this by using the fractional fokker-planck equation with a periodic potential. according to gupta et al., 35 this fractional nature of the crossover from fast dynamics to slow macroscopic dynamics is due to the heterogeneity of the polymer mesh in the bulk sample, which may well resemble the intra-cellular environment. as proved by barkai et al., 36 the fractional fokker-planck equation can be derived from the ctrw, which clearly indicates that the results obtained by gupta et al. point to the classification of the subdiffusive dynamics into class (i). we should remark that, independently of these results by gupta et al., 35 shorten and sneyd 37 have successfully used the fractional diffusion equation to mimic protein diffusion in an obstructed medium like the interior of skeletal muscle. we notice in passing that the (fractional) diffusion equation can be obtained from the (fractional) fokker-planck equation in the absence of an external force. in closing, because here we are interested in modeling the diffusion of proteins in several human cells, which are highly crowded, and in which we should recover the same crossover between initial fast and later slow dynamics, we will consider a modeling tool of class (i). in particular, we will focus our modeling on the use of a time-fractional diffusion equation using caputo derivatives. another justification for the use of this model here is that interacting proteins can be in different kinds of cells. thus, we consider that the perturbation of one protein is not necessarily followed by the perturbation of one of its interactors, but a time may mediate between the two processes. this is exactly the kind of process that time-fractional diffusion captures. in this work, we always consider g = (v, e) to be an undirected finite network with vertices v representing proteins and edges e representing the interactions between pairs of proteins. let us consider 0 < α ≤ 1 and a function u : [0, ∞) → R; then we denote by D_t^α u the fractional caputo derivative of u of order α, which is given by 38
D_t^α u(t) = (g_{1-α} * u′)(t),
where * denotes the classical convolution product on (0, ∞) and g_γ(t) = t^{γ-1}/Γ(γ) for γ > 0, where Γ(·) is the euler gamma function. observe that the previous fractional derivative makes sense whenever the function is derivable and the convolution is defined (for example, if u is locally integrable). the notation g_γ is very useful in fractional calculus theory, mainly by the property g_γ * g_δ = g_{γ+δ} for all γ, δ > 0.
here, we propose to consider the time-fractional diffusion (tfd) equation on the network,
D_t^α x(t) = -c L x(t),
with the initial condition x(0) = x_0, where x_i(t) is the probability that protein i is perturbed at time t; c is the diffusion coefficient of the network, which we will hereafter set to unity; and L is the graph laplacian, i.e., L = K - A, where K is the diagonal matrix of node degrees and A is the adjacency matrix. this model was previously studied in distributed coordination algorithms for the consensus of multi-agent systems. [39] [40] [41] the use of fractional calculus in the context of physical anomalous diffusion has been reviewed by metzler and klafter. 42 a different approach has been developed by riascos and mateos. 43, 44 it is based on the use of fractional powers of the graph laplacian (see ref. 45 and references therein). that approach has been recently formalized by benzi et al. 46 this method cannot be used in the current framework because it generates only superdiffusive behaviors (see benzi et al. 46) and not subdiffusive regimes. another disadvantage of this approach is that it can only be applied to positive (semi)definite graph operators, such as the laplacian, but not to adjacency operators such as the one used in tight-binding quantum mechanical or epidemiological approaches (see sec. vi).
theorem 1. the solution of the time-fractional diffusion model on the network is
x(t) = E_{α,1}(-t^α L) x_0,
where E_{α,β}(·) is the mittag-leffler function of the laplacian matrix of the graph.
proof. we use the spectral decomposition of the network laplacian, L = U Λ U^{-1}, where U = [ψ_1 · · · ψ_n] and Λ = diag(µ_r). then, we can write D_t^α x(t) = -U Λ U^{-1} x(t). let us define y(t) = U^{-1} x(t), such that D_t^α y(t) = -Λ y(t). as Λ is a diagonal matrix, the equations decouple as D_t^α y_r(t) = -µ_r y_r(t), each of which has the solution y_r(t) = E_{α,1}(-t^α µ_r) y_r(0). we can replace x(t) = U y(t), which finally gives the result in matrix-vector form, x(t) = U E_{α,1}(-t^α Λ) U^{-1} x_0 = E_{α,1}(-t^α L) x_0.
when written for all the nodes, the solution can be expanded as x(t) = Σ_{j=1}^n E_{α,1}(-t^α µ_j) ψ_j φ_j^T x_0, where ψ_j and φ_j^T are the jth column of U and the jth row of U^{-1}, respectively. because µ_1 = 0 and 0 < µ_2 ≤ · · · ≤ µ_n for a connected graph, we have lim_{t→∞} x(t) = ψ_1 φ_1^T x_0, where φ_1^T ψ_1 = 1. let us take ψ_1 = 1 (the all-ones vector), such that φ_1^T = 1^T/n, and we have lim_{t→∞} x(t) = (1^T x_0 / n) 1. this result indicates that in an undirected and connected network, the diffusive process controlled by the tfd equation always reaches a steady state, which consists of the average of the values of the initial condition. in the case of directed networks (ppis are not directed by nature) or of disconnected networks (a situation that can be found in ppis), the steady state is reached in each (strongly) connected component of the graph. also, because the network is connected, µ_2 makes the largest contribution to E_{α,1}(-t^α L) among all the nontrivial eigenvalues of L. therefore, it dictates the rate of convergence of the diffusion process. we remark that in practice the steady state lim_{t→∞} |x_v(t) - x_w(t)| = 0, ∀v, w ∈ v, is very difficult to achieve. therefore, we use a threshold ε, e.g., ε = 10^-3, such that |x_v(t) - x_w(t)| ≤ ε, ∀v, w ∈ v, is achieved in a relatively small simulation time. due to its importance in this work, we remark the structural meaning of the mittag-leffler function of the laplacian matrix appearing in the solution of the tfd equation. that is, E_{α,1}(-t^α L) is a matrix function, which is defined as
E_{α,β}(-t^α L) = Σ_{k=0}^∞ (-t^α L)^k / Γ(αk + β),
where Γ(·) is the euler gamma function as before.
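to illustrate theorem 1 numerically, the following python sketch computes x(t) = E_{α,1}(-t^α L) x_0 through the spectral decomposition of the laplacian. it is a minimal illustration, not the original implementation: the truncated series used for the scalar mittag-leffler function is adequate only for the moderate arguments arising at the short times considered here, while large negative arguments would require a dedicated algorithm (e.g., garrappa's); the function names are our own.

    import numpy as np
    import networkx as nx
    from math import lgamma, gamma, log, exp

    def ml_scalar(z, alpha, beta=1.0, K=200):
        """E_{alpha,beta}(z) by truncated power series, summed in log space;
        reliable for moderate |z| only (large negative z suffers from
        catastrophic cancellation and needs a specialized algorithm)."""
        if z == 0.0:
            return 1.0 / gamma(beta)
        total = 0.0
        for k in range(K):
            term = exp(k * log(abs(z)) - lgamma(alpha * k + beta))
            total += term if (z > 0.0 or k % 2 == 0) else -term
        return total

    def tfd_solution(L, x0, t, alpha=0.75):
        """x(t) = E_{alpha,1}(-t^alpha L) x0 via L = U diag(mu) U^T (theorem 1)."""
        mu, U = np.linalg.eigh(L)
        e = np.array([ml_scalar(-(t ** alpha) * m, alpha) for m in mu])
        return U @ (e * (U.T @ x0))

    # toy run: perturbation initially localized at node 0 of a 6-node path
    G = nx.path_graph(6)
    L = nx.laplacian_matrix(G).toarray().astype(float)
    x0 = np.zeros(6); x0[0] = 1.0
    print(tfd_solution(L, x0, t=0.2))  # early-time profile, mass still near node 0
    print(tfd_solution(L, x0, t=1.0))  # later profile, slowly approaching 1/n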
we remark that for α = 1, we recover the diffusion equation on the network, dx(t)/dt = -cLx(t), and its solution E_{1,1}(-tcL) = exp(-tcL) is the well-known heat kernel of the graph. we define here a generalization of the diffusion distance studied by coifman and lafon. 47 that is, with f(τL) := E_{α,1}(-τL) and τ = t^α, we define the following quantity:
D_vw(τ) = f(τL)_vv + f(τL)_ww - 2 f(τL)_vw.
we have the following result.
lemma 2. D_vw(τ) is a squared euclidean distance between the nodes v and w.
proof. the matrix function f(τL) can be written as f(τL) = U f(τΛ) U^{-1}. let ϕ_u = (ψ_{1,u}, ψ_{2,u}, . . . , ψ_{n,u})^T. then, D_vw = (ϕ_v - ϕ_w)^T f(τΛ)(ϕ_v - ϕ_w). therefore, because f(τL) is positive definite, we can write D_vw = ‖f(τΛ)^{1/2}(ϕ_v - ϕ_w)‖². consequently, D_vw is a squared euclidean distance between v and w. in this sense, the vector f(τΛ)^{1/2} ϕ_v plays the role of a position vector for node v, and we have that D_vw generalizes the diffusion distance studied by coifman and lafon, which is the particular case α = 1. let µ_j be the jth eigenvalue and ψ_{j,u} the uth entry of the jth eigenvector of the laplacian matrix. then, we can write the time-fractional diffusion distance as
D_vw(τ) = Σ_j E_{α,1}(-t^α µ_j)(ψ_{j,v} - ψ_{j,w})².
it is evident that when α = 1, D_vw is exactly the diffusion distance previously studied by coifman and lafon. 47 the time-fractional diffusion distance between every pair of nodes in a network can be represented in matrix form as follows:
D(τ) = s 1^T + 1 s^T - 2 f(τL),
where s = (f(τL)_{11}, f(τL)_{22}, . . . , f(τL)_{nn})^T is a vector whose entries are the main diagonal terms of the mittag-leffler matrix function and 1 is an all-ones vector. using this matrix, we can build the diffusion distance-weighted adjacency matrix of the network,
W(τ) = A ◦ D(τ)^{◦1/2},
where ◦ indicates an entrywise operation. the shortest diffusion path between two nodes is then the shortest weighted path in W(τ).
lemma 3. the shortest (topological) path distance between two nodes in a graph is a particular case of the time-fractional shortest diffusion path for τ → 0.
proof. let us consider each of the terms forming the definition of the time-fractional diffusion distance and apply the limit of very small τ = t^α. that is, lim_{τ→0} f(τL)_vv = 1 and lim_{τ→0} f(τL)_ww = 1, and in a similar way, lim_{τ→0} f(τL)_vw = 0 for v ≠ w. therefore, lim_{τ→0} D_vw = 2 for every pair of distinct nodes, so that all edges of W(τ) acquire the same weight, which immediately implies that the time-fractional shortest diffusion path is identical to the shortest (topological) one in the limit of very small τ = t^α.
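the diffusion distance and the shortest diffusive paths can be computed along the same lines. the sketch below reuses ml_scalar from the previous sketch and assumes integer node labels 0, ..., n-1 (an assumption of this illustration): it weights each existing edge by D_vw^{1/2} and then applies dijkstra's algorithm.

    import numpy as np
    import networkx as nx

    def ml_matrix(L, t, alpha=0.75):
        """E_{alpha,1}(-t^alpha L) through the spectrum of L (requires the
        ml_scalar series from the previous sketch to be in scope)."""
        mu, U = np.linalg.eigh(L)
        e = np.array([ml_scalar(-(t ** alpha) * m, alpha) for m in mu])
        return U @ np.diag(e) @ U.T

    def shortest_diffusive_paths(G, t=0.5, alpha=0.75):
        """weight every existing edge (v, w) by the time-fractional diffusion
        distance D_vw^(1/2) and return all-pairs shortest weighted paths;
        relabel the graph with nx.convert_node_labels_to_integers if the
        nodes are not already 0..n-1."""
        L = nx.laplacian_matrix(G).toarray().astype(float)
        F = ml_matrix(L, t, alpha)
        s = np.diag(F)
        D = s[:, None] + s[None, :] - 2.0 * F   # squared diffusion distances
        H = G.copy()
        for v, w in H.edges():
            H[v][w]["weight"] = float(np.sqrt(max(D[v, w], 0.0)))
        return dict(nx.all_pairs_dijkstra_path(H, weight="weight"))

consistently with lemma 3, as τ → 0 all edge weights returned by this routine converge to the same constant, and the weighted shortest paths coincide with the topological ones.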
however, in reporting here, the tissues that were proteins are mainly expressed; we use the information reported in the human protein atlas 49 where we use information not only from gtex but also from hpa (see details at the human protein atlas webpage) and fantom5 50 datasets. the ppi network of human proteins targeted by sars-cov-2 is very sparse, having 360 edges, i.e., its edge density is 0.0167, 30% of nodes have a degree (number of connections per protein) equal to one, and the maximum degree of a protein is 14. the second smallest eigenvalue of the laplacian matrix of this network is very small; i.e., µ 2 = 0.0647. therefore, the rate of convergence to the steady state of the diffusion processes taking place on this ppi is very slow. we start by analyzing the effects of the fractional coefficient α on these diffusive dynamics. we use the normal diffusion α = 1 as the reference system. to analyze the effects of changing α over the diffusive dynamics on the ppi network, we consider the solution of the tfd equation for processes starting at a protein with a large degree, i.e., prkaca, degree 14, and a protein with a low degree, i.e., mrps5, degree 3. that is, the initial condition vector consists of a vector having one at the entry corresponding to either prkaca or mrps5 and zeroes elsewhere. in fig. 1 , we display the changes of the probability with the shortest path distance from the protein where the process starts. this distance corresponds to the number of steps that the perturbation needs to traverse to visit other proteins. for α = 1.0, the shapes of the curves in fig. 1 are the characteristic ones for the gaussian decay of the probability with distance. however, for α < 1, we observe that such decay differs from that typical shape showing a faster initial decay followed by a slower one. in order to observe this effect in a better way, we zoomed the region of distances from 2 to 4 [see figs. 1(b) and 1(d)]. as can be seen for distances below 3, the curve for α = 1.0 is on top of those for α < 1, indicating a slower decay of the probability. after this distance, there is an inversion, and the normal diffusion occurs at a much faster rate than the other two for the longer distances. this is a characteristic signature of subdiffusive processes, which starts at much faster rates than a normal diffusive process and then continue at much slower rates. therefore, here, we observe that the subdiffusive dynamics are much faster at earlier times of the process, which is when the perturbation occurs to close nearest neighbors to the initial point of perturbation. to further investigate these characteristic effects of the subdiffusive dynamics, we study the time evolution of a perturbation occurring at a given protein and its propagation across the whole ppi network. in fig. 2 , we illustrate these results for α = 1.0 (a), α = 0.75 (b), and α = 0.5 (c). as can be seen in the main plots of this figure, the rate of convergence of the processes to the steady state is much faster in the normal diffusion (a) than in the subdiffusive one (b) and (c). however, at very earlier times (see insets in fig. 2 ), there is a shock wave increase of the perturbation at a set of nodes. such kind of shock waves has been previously analyzed in other contexts as a way of propagating effects across ppi networks. 17 we have explored briefly about the possible causes of this increase in the concentration for a given subset of proteins. 
accordingly, it seems that the main reason for this is the connectivity provided by the network of interactions and not a given distribution of the degrees. for instance, we have observed such "shock waves" in networks with normal-like degree distributions as well as with power-law ones. however, it is possible that the extension and intensity of such effects depend on the degree distribution as well as on other topological factors. the remarkable finding here is, however, the fact that such a shock wave occurs at much earlier times in the subdiffusive regimes than in the normal diffusion. that is, while for α = 1.0 these perturbations occur at t ≈ 0.1-0.3, for α = 0.75 they occur at t ≈ 0.0-0.2, and for α = 0.5 they occur at t ≈ 0.0-0.1. seen in the light of what we observed in the previous paragraph, this phenomenon is not strange, due to the observation that such processes go at a much faster rate at earlier times, and at short distances, than the normal diffusion. in fact, this is a consequence of the existence of a positive scalar T for which E_{α,1}(-γ t^α) decreases faster than exp(-γ t) for t ∈ (0, T), for γ, α ∈ R₊ (see theorem 4.1 in ref. 39). hereafter, we will consider the value α = 0.75 for our experiments, due to the fact that it reveals a subdiffusive regime while the shock waves observed before do not occur in an almost instantaneous way, as when α = 0.5, which would be difficult to justify from a biological perspective. the previous results put us at a crossroads. on one hand, the subdiffusive processes that are expected due to the crowded nature of the intra-cellular space are too slow for carrying out cellular processes at a significant rate in cells. on the other hand, the perturbation shocks occurring at early times of these processes are significantly faster than in normal diffusion. to sort out these difficulties, we propose a switching back and restart subdiffusive process occurring on the ppi network. that is, a subdiffusive process starts at a given protein, which is directly perturbed by a protein of sars-cov-2. it produces a shock-wave increase of the perturbation in close neighbors of that protein. then, a second subdiffusive process starts at these newly perturbed proteins, which will perturb their nearest neighbors. the process is repeated until the whole ppi network is perturbed. this kind of "switch and restart" process has been proposed for engineering consensus protocols in multiagent systems 51 as a way to accelerate the algorithms using subdiffusive regimes. the so-called spike protein (s-protein) of sars-cov-2 interacts with only two proteins in the human host, namely, zdhhc5 and golga7. the first protein, zdhhc5, is not in the main connected component of the ppi network of sars-cov-2 targets. therefore, we will consider here how a perturbation produced by the interaction of the virus s-protein with golga7 propagates through the whole ppi network of sars-cov-2 targets. golga7 has degree one in this network, and its diffusion is mainly to close neighbors, namely, to proteins separated by two to three edges. when starting the diffusion process at the protein golga7, the main increase in the probability of perturbing another protein is reached for the protein golga3, which increases its probability up to 0.15 at t = 0.2, followed by prkar2a, with a small increase in its probability, 0.0081. then, the process switches and restarts at golga3, which mainly triggers the probability of the protein prkar2a, a major hub of the network.
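the cascade just described can be written as a simple schedule. the sketch below is a hypothetical scheme (not the paper's original code) that reuses tfd_solution from the earlier sketch; the parameters t_shock, rounds, and top are illustrative choices.

    import numpy as np

    def switch_and_restart(L, seed, t_shock=0.2, alpha=0.75, rounds=4, top=3):
        """switch-and-restart schedule: propagate the subdiffusion for a short
        time t_shock from the current frontier, keep the `top` most strongly
        perturbed proteins not seen before, and restart from them."""
        n = L.shape[0]
        perturbed = {seed}
        frontier = [seed]
        for _ in range(rounds):
            x0 = np.zeros(n)
            x0[frontier] = 1.0 / len(frontier)
            x = tfd_solution(L, x0, t_shock, alpha)  # from the earlier sketch
            x[list(perturbed)] = -np.inf             # mask already-perturbed nodes
            frontier = [int(i) for i in np.argsort(x)[-top:]]
            perturbed.update(frontier)
        return perturbed

each round exploits the early-time shock wave, so the schedule reaches the whole component much faster than a single slowly converging subdiffusive run.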
once we start the process at prkar2a, practically the whole network is perturbed, with probabilities larger than 0.1 for 19 proteins apart from golga3. these proteins are, in decreasing order of their probability of being perturbed: akap8, prkar2b, cep350, mib1, cdk5rap2, cep135, akap9, cep250, pcnt, cep43, pde4dip, prkaca, tubgcp3, tubgcp2, cep68, clip4, cntrl, plekha5, and ninl. notice that the number of proteins perturbed is significantly larger than the degree of the activator, indicating that not only nearest neighbors are activated. an important criterion revealing the role of the protein prkar2a as a main propagator in the network of proteins targeted by sars-cov-2 is its average diffusion path length. this is the average number of steps that a diffusive process starting at this protein needs to perturb all the proteins in the network. we have calculated this number to be 3.6250, which is only slightly larger than the average (topological) path length, which is 3.5673. that is, in less than four steps the whole network of proteins is activated by a diffusive process starting at prkar2a. it is also remarkable that the average shortest diffusive path length is almost identical to the shortest (topological) one. this means that this protein mainly uses shortest (topological) paths in perturbing other proteins in the ppi network. in other words, it is highly efficient in conducting such perturbations. we will analyze this characteristic of the ppi network of human proteins targeted by sars-cov-2 in a further section of this work. at this time, almost any protein in the ppi network is already perturbed. therefore, we can switch and restart the subdiffusion from practically any protein in the ppi network. we then investigate which are the proteins with the highest capacity of activating other proteins that are involved in human diseases. here, we use the database disgenet, 52 which is one of the largest publicly available collections of genes and variants associated with human diseases. we identified 38 proteins targeted by sars-cov-2 for which there is "definitive" or "strong" evidence of being involved in a human disease or syndrome (see table s1 in the supplementary material). these proteins participate in 70 different human diseases or syndromes, as given in tables s2 and s3 of the supplementary material. we performed an analysis in which a diffusive process starts at any protein of the network, and we calculated the average probability that all the proteins involved in human diseases are then perturbed. for instance, for a subdiffusive process starting at the protein arf6, we summed up the probabilities that the 38 proteins involved in diseases are perturbed at an early time of the process, t = 0.2. we then obtain a global perturbation probability of 0.874. by repeating this process for every protein as an initiator, we obtained the top disease activators. we have found that none of the 20 top activators is itself involved in any of the human diseases or syndromes considered here. they are, however, proteins that are important not because of their direct involvement in diseases or syndromes but because they propagate perturbations in a very effective way to those directly involved in such diseases/syndromes. among the top activators, we have found arf6, ecsit, retreg3, stom, hdac2, exosc5, and thtpa, among others shown in fig. 3, where we illustrate the ppi network of the proteins targeted by sars-cov-2, highlighting the top 20 disease activators.
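ranking candidate activators can be done in one pass, because the columns of the mittag-leffler kernel give the state at time t for every possible initiator at once. a minimal sketch follows, where disease_idx, standing for the indices of the 38 disease-associated proteins, is an assumption of this illustration; ml_matrix comes from the earlier sketch.

    import numpy as np

    def activator_scores(L, disease_idx, t=0.2, alpha=0.75):
        """score every candidate initiator j by the total early-time
        probability it injects into the disease-associated proteins:
        column j of F = E_{alpha,1}(-t^alpha L) is x(t) for a process
        started at node j, so we sum its disease-associated rows."""
        F = ml_matrix(L, t, alpha)  # from the earlier sketch
        return F[disease_idx, :].sum(axis=0)

    # top-20 activators, given the laplacian `L` and the index list:
    # top20 = np.argsort(activator_scores(L, disease_idx))[::-1][:20]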
we now consider how a perturbation produced by sars-cov-2 on a protein mainly expressed in the lungs can be propagated to proteins mainly located in other tissues (see table s4 in the supplementary material) by a subdiffusive process. that is, we start the subdiffusive process by perturbing a given protein, which is mainly expressed in the lungs. then, we observe the evolution of the perturbation at every one of the proteins mainly expressed in other tissues. we repeat this process for all the 193 proteins mainly expressed in the lungs. in every case, we record those proteins outside the lungs that are perturbed at very early times of the subdiffusive process. for instance, in fig. 4, we illustrate one example in which the initiator is the protein golga2, which triggers a shock wave on the proteins rbm41, tle5, and pkp2, which are expressed mainly outside the lungs. we consider such perturbations only if they occur at t < 1. not every one of the proteins expressed outside the lungs is triggered by such shock waves at a very early time of the diffusion. for instance, the proteins mark1 and slc27a2 are perturbed in very slow processes and do not produce the characteristic high peaks in the probability at very short times. on the other hand, there are proteins expressed outside the lungs that are triggered by more than one protein from the lungs. the case of golga2 is an example of a protein triggered by three proteins in the lungs. in table i, we list some of the proteins expressed mainly in tissues outside the lungs that are heavily perturbed by proteins in the lungs; the complete list is given in table s5 of the supplementary material. (table i. multi-organ propagation of perturbations. proteins mainly expressed outside the lungs are significantly perturbed during diffusive processes that started at other proteins expressed in the lungs. act. is the number of lung-protein activators, p_tot is the sum of the probabilities of finding the diffusive particle at this protein, and t_mean is the average time of activation; see the text for explanations. the tissues of main expression are selected among the ones with the highest consensus normalized expression (nx) levels by combining the data from the three transcriptomics datasets (hpa, gtex, and fantom5) using the internal normalization pipeline. 49 boldface denotes the highest value in each of the columns.) we give three indicators of the importance of the perturbation of these proteins. they are act., which is the number of proteins in the lungs that activate each of them; p_tot, which is the sum of the probabilities of finding the diffusive particle at this protein for diffusive processes that started at its activators; and t_mean, which is the average time required by the activators to perturb the corresponding protein. for instance, pkp2 is perturbed by 21 proteins in the lungs, which indicates that this protein, mainly expressed in the heart muscle, has a large chance of being perturbed by diffusive processes starting at proteins mainly located in the lungs. the protein prim2 is activated by 5 proteins in the lungs, but if all these proteins were acting at the same time, the probability that prim2 is perturbed would be very high, p_tot ≈ 0.536. finally, the protein tle5 is perturbed by 13 proteins in the lungs, which need on average t_mean ≈ 0.24 to perturb tle5. these proteins do not form a connected component among themselves in the network. the average shortest diffusion path between them is 5.286, with a maximum shortest subdiffusion path of 10.
on average, they are almost equidistant from the rest of the proteins in the network as they are among themselves. that is, the average shortest subdiffusion path between these proteins expressed outside the lungs and the rest of the proteins in the network is 5.106. therefore, these proteins can be reached from other proteins outside the lungs in no more than six steps in subdiffusive processes like the ones considered here. finally, we study here how the diffusive process determines the paths that the perturbation follows when diffusing from one protein to another not directly connected to it. the most efficient way of propagating a perturbation between the nodes of a network is through the shortest (topological) paths that connect them. the problem for a (sub)diffusive perturbation propagating between the nodes of a network is that it does not have complete information about the topology of the network, and so it cannot "know" its shortest (topological) paths. the network formed by the proteins targeted by sars-cov-2 is very sparse, and this indeed facilitates that the perturbations occur along the shortest (topological) paths most of the time. consider, for instance, a tree, which has the lowest possible edge density among all connected networks. in this case, the perturbation will always use the shortest (topological) paths connecting pairs of nodes. however, in the case of the ppi network studied here, a normal diffusive process, i.e., α = 1, does not always use the shortest (topological) paths. in this case, there are 1294 pairs of proteins for which the diffusive particle uses a shortest diffusive path that is one edge longer than the corresponding shortest (topological) path. this represents 6.11% of all pairs of proteins that are interconnected by a path in the ppi network of proteins targeted by sars-cov-2. however, when we have a subdiffusive process, i.e., α = 0.75, this number is reduced to 437, which represents only 2.06% of all pairs of proteins. therefore, the subdiffusion process studied here through the ppi network of proteins targeted by sars-cov-2 has an efficiency of 97.9% relative to a process that always uses the shortest (topological) paths in hopping between proteins. in fig. 5, we illustrate the frequency with which proteins not in the shortest (topological) paths are perturbed as a consequence of being in the shortest subdiffusive paths between other proteins. for instance, the following is a shortest diffusive path between its two end points: rhoa-prkaca-prkar2a-cep43-rab7a-atp6ap1. the corresponding shortest (topological) path is rhoa-mark2-ap2m1-rab7a-atp6ap1, which is one edge shorter. the proteins prkaca, prkar2a, and cep43 are those in the diffusive path that are not in the topological one. repeating this selection process for all the diffusive paths that differ from the topological ones, we obtained the results illustrated in fig. 5. as can be seen, there are 36 proteins visited by the shortest diffusive paths that are not visited by the corresponding topological ones. the average degree of these proteins is 7.28, and there is only a small positive trend between the degree of the proteins and the frequency with which they appear in these paths; e.g., the pearson correlation coefficient is 0.46.
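the comparison between diffusive and topological routes can be scripted directly on top of the shortest_diffusive_paths sketch given earlier. the fraction returned below corresponds to the 6.11% (α = 1) and 2.06% (α = 0.75) figures quoted above; the function name is illustrative.

    import networkx as nx

    def detour_fraction(G, t=0.5, alpha=0.75):
        """fraction of connected node pairs whose shortest diffusive path
        (see shortest_diffusive_paths above) is longer, in number of hops,
        than the shortest topological path."""
        diff_paths = shortest_diffusive_paths(G, t, alpha)
        topo = dict(nx.all_pairs_shortest_path_length(G))
        longer = total = 0
        for v, targets in diff_paths.items():
            for w, path in targets.items():
                if v == w:
                    continue
                total += 1
                if len(path) - 1 > topo[v][w]:
                    longer += 1
        return longer / total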
we have presented a methodology that allows the study of diffusive processes on ppi networks varying from normal to subdiffusive regimes. here, we have studied the particular case in which the time-fractional diffusion equation produces a subdiffusive regime, with the use of α = 3/4, on the network of human proteins targeted by sars-cov-2. a characteristic feature of this ppi network is that its second smallest laplacian eigenvalue is very small, i.e., µ_2 = 0.0647. as this eigenvalue determines the rate of convergence to the steady state, the subdiffusive process converges very slowly to that state. what has been surprising is that, even in these conditions of very slow convergence to the steady state, there is a very early increase of the probability at those proteins closely connected to the initiator of the diffusive process. that is, in a subdiffusive process on a network, a perturbation is transmitted from the initiator to any of its nearest neighbors at an earlier time than in normal diffusion. this is a consequence of the fact that E_{α,1}(-γ t^α) decreases very fast at small values of t^α, which implies that a perturbation occurring at a protein i at t = 0 is transmitted almost instantaneously to the proteins closely connected to i. this effect may explain why subdiffusive processes, which are so globally slow, can carry out cellular processes at a significant rate in cells. we have considered here a mechanism consisting of switching and restarting the process several times during the global cellular process. for instance, a subdiffusive process starting at a protein i perturbs its nearest neighbors at very early times, among which we can find a protein j. then, a new subdiffusive process can be restarted at the node j, and so on. one of the important findings of using the current model for the study of the ppi network of proteins affected by sars-cov-2 is the identification of those proteins expressed outside the lungs that can be most efficiently perturbed by those expressed in the lungs (see table i). for instance, the protein with the largest number of activators, pkp2, appears mainly in the heart muscle. it has been observed that the elevation of cardiac biomarkers is a prominent feature of covid-19, which in general is associated with a worse prognosis. 53 myocardial damage and heart failure were responsible for 40% of deaths in the wuhan cohort (see references in ref. 53). although the exact mechanism of the heart injury is not known, the hypothesis of direct myocardial infection by sars-cov-2 is a possibility, acting alone or in combination with the increased cardiac stress due to respiratory failure and hypoxemia, and/or with the indirect injury from the systemic inflammatory response. [53] [54] [55] [56] as can be seen in table i, the testis is the tissue where several of the proteins targeted by sars-cov-2 are mainly expressed, e.g., cep43, tle5, prim2, mipol1, reep6, hook1, cenpf, trim59, and mark1. currently, there is no conclusive evidence about testis damage by sars-cov-2. 57-60 however, the previous sars-cov, which appeared in 2003 and which shares 82% of its proteins with the current one, produced damage to the testis and to spermatogenesis, and it was concluded that orchitis was a complication of that previous sars disease. 57 we also detect a few proteins mainly expressed in different brain tissues, such as cep135, prim2, trim59, and mark1.
the implication of sars-cov-2 in cerebrovascular diseases has been reported, including neurological manifestations as well as cerebrovascular events such as ischemic stroke, cerebral venous thrombosis, and cerebral hemorrhage. [61] [62] [63] kidney damage in sars-cov-2 patients has been reported, 64-66 including signs of kidney dysfunction, proteinuria, hematuria, increased levels of blood urea nitrogen, and increased levels of serum creatinine. acute kidney injury has been reported in as much as 25% of sars-cov-2 patients in the clinical setting. one of the potential mechanisms for kidney damage is organ crosstalk, 64 such as the mechanism of diffusion from proteins in the lungs to proteins in the urinary tract and kidney proposed here. a very interesting observation from table i is the existence of several proteins expressed mainly in the thymus and t-cells, such as tle5, retreg3, rbm41, cenpf, and trim59. it has been reported that many of the patients affected by sars-cov-2 in wuhan displayed a significant decrease of t-cells. 67 the thymus is an organ that displays a progressive decline with age, with reductions of the order of 3%-5% a year until approximately 30-40 years of age and of about 1% per year after that age. consequently, it was proposed that the role of the thymus should be taken into account in order to explain why covid-19 appears to be so mild in children. 67 the protein tle5 is also expressed significantly in the lymph nodes. it was found by feng et al. 68 that sars-cov-2 induces lymph follicle depletion, splenic nodule atrophy, histiocyte hyperplasia, and lymphocyte reductions. the proteins hook1 and mipol1 are significantly expressed in the pituitary gland. some evidence of, and concerns that, covid-19 may also damage the hypothalamo-pituitary-adrenal axis have been expressed by pal, 69 which may be connected with the participation of the previously mentioned proteins. another surprising finding of the current work is the elevated number of subdiffusive shortest paths that coincide with the shortest (topological) paths connecting pairs of proteins in the ppi network of human proteins targeted by sars-cov-2. this means that the efficiency of the diffusive paths connecting pairs of nodes in this ppi network is almost 98% in relation to a hypothetical process that uses the shortest (topological) paths in propagating perturbations between pairs of proteins. the 437 shortest diffusive paths reported here contain one more edge than the corresponding shortest (topological) paths. the proteins appearing in these paths would never be visited in the paths connecting two other proteins if only the shortest (topological) paths were used. it is interesting to note that 6 out of the 15 proteins that are mainly expressed outside the lungs are among the ones "crossed" by these paths. they are tle5 (thymus, lymph node, testis), pkp2 (heart muscle), cep135 (skeletal muscle, heart muscle, cerebral cortex, cerebellum), cep43 (testis), rbm41 (pancreas, t-cells, testis, retina), and retreg3 (prostate, thymus). this means that the perturbation of these proteins occurs not only through the diffusion from other proteins in the lungs directly to them, but also through some "accidental" diffusive paths between pairs of proteins that are both located in the lungs. all in all, the use of time-fractional diffusive models to study the propagation of perturbations on ppi networks seems a very promising approach.
the model is not only biologically sound but also allows us to discover interesting hidden patterns of the interactions between proteins and of the propagation of perturbations among them. in the case of the ppi network of human proteins targeted by sars-cov-2, our current findings may help to understand potential molecular mechanisms for the multi-organ and systemic failures occurring in many patients. after this work was completed, qiu et al. 75 uploaded the manuscript entitled "postmortem tissue proteomics reveals the pathogenesis of multiorgan injuries of covid-19." the authors profiled the host responses to covid-19 by means of quantitative proteomics in postmortem samples of tissues from the lungs, kidney, liver, intestine, brain, and heart. they reported differentially expressed proteins (deps) for these organs as well as virus-host ppis between 23 virus proteins and 110 interacting deps differentially regulated in postmortem lung tissues. according to their results, most deps (70.5%) appear in the lungs, followed by the kidney (16.5%). additionally, qiu et al. 75 identified biological processes that were up- or down-regulated in the six postmortem tissue types. they found that most up-regulated processes in the lungs correspond to processes related to the response to inflammation and to the immune response. however, pathways related to cell morphology, such as the establishment of endothelial barriers, were down-regulated in the lungs, which was interpreted as a confirmation that the lungs are the main focus of virus-host fights. other fundamental processes in the six organs analyzed postmortem were significantly down-regulated, including processes related to organ movement, respiration, and metabolism. from the 59 proteins that we reported here as the ones with the largest effect on perturbing those 38 proteins identified in human diseases (see table s3 in the supplementary material), 18 were found to be down-regulated in the lungs by qiu et al. 75 if we make the corresponding adjustment, considering that qiu et al. 75 studied 110 instead of 209 proteins in the ppi network, the previous number represents 58.1% of the proteins predicted here and experimentally found to be down-regulated in the lungs. from the rest of the proteins, which were not found to have the largest effect on perturbing proteins identified in human diseases, only 29.1% were reported by qiu et al. 75 to be down-regulated in the postmortem analysis of patients' lungs. among the proteins reported in table s3 of the supplementary material and by qiu et al., we have arf6, rtn4, rab7a, gng5, reep5, vps11, rhoa, and rab5c, among others. finally, among the proteins mainly expressed outside the lungs that are predicted in this work to be significantly perturbed, we have five that were found by qiu et al. 75 to be up-regulated in the different organs analyzed by them. from the proteins included in table i, qiu et al. 75 reported the following ones as up-regulated: pkp2 (heart), reep6 (liver), hook1 (several organs), atp5me (heart), and slc27a2 (liver and kidney). they also reported cep43 (reported as fgfr1op) as down-regulated in the brain. we should remark that we have considered here many more organs than the six studied by qiu et al. 75 there is no doubt that, in considering a diffusive propagation of perturbations among proteins in a ppi network, we have made a few simplifications and assumptions. (fig. 6. the ppi metaplex. in the metaplex, every node of the ppi corresponds to a protein and its crowded intracellular space; there is an internal dynamics inside the nodes and an external dynamics between the nodes.)
every protein is embedded in an intracellular crowded environment, which drives its diffusive mechanism. nowadays, it is well established that this environment is conducive to molecular subdiffusive processes. as remarked by guigas and weiss, 27 far from obstructing cellular processes, subdiffusion increases the probability of finding a nearby target by a given protein, and therefore, it facilitates protein-protein interactions. the current approach can be improved using two recently developed theoretical frameworks: (i) metaplexes and (ii) d-path laplacian operators on graphs. a ppi metaplex, 70 illustrated in fig. 6, is a 4-tuple $\Upsilon = (V, E, \mathcal{I}, \Omega)$, where $(V, E)$ is a graph, $\Omega = \{\Omega_j\}_{j=1}^{k}$ is a set of locally compact metric spaces $\Omega_j$ with borel measures $\mu_j$, and $\mathcal{I} : V \to \Omega$. then, we define a dynamics on the metaplex $\Upsilon = (V, E, \mathcal{I}, \Omega)$ as a tuple $(H, T)$. here, $H = \{H_v : L^2(\mathcal{I}(v), \mu_{\mathcal{I}(v)}) \to L^2(\mathcal{I}(v), \mu_{\mathcal{I}(v)})\}_{v \in V}$ is a family of operators such that the initial value problem $\partial_t u_v = H_v(u_v)$, $u_v|_{t=0} = u_0$, is well-posed, and $T = \{T_{vw}\}_{(v,w) \in E}$ is a family of bounded operators $T_{vw} : L^2(\mathcal{I}(v), \mu_{\mathcal{I}(v)}) \to L^2(\mathcal{I}(w), \mu_{\mathcal{I}(w)})$. this means that inside a node of the metaplex, we consider one protein and its crowded intracellular space. inside the nodes, we can have a dynamics like a time-fractional diffusion equation, the fractional fokker-planck equation, or any other in a continuous space. the inter-node dynamics is then dominated by a graph-theoretic diffusive model like the one presented here. the second possible improvement to the current model can be made by introducing the possibility of long-range interactions in the inter-node dynamics in the ppi metaplex. that is, instead of considering the time-fractional diffusion equation, which only accounts for subdiffusive processes in the graph, we can use the following generalization, which incorporates the d-path laplacian operators, 71

$$D_t^{\alpha} u(t) = -\sum_{d=1}^{d_{\max}} d^{-s}\, L_d\, u(t),$$

where $L_d$ is a generalization of the graph laplacian operator that accounts for long-range hops between nodes in a graph, $d$ is the shortest path distance between two nodes, $s > 0$ is a parameter, and the sum runs over all shortest path distances up to the graph diameter $d_{\max}$ (a minimal construction of this weighted operator sum is sketched after the supplementary material description below). this equation has never been used before, except for the case $\alpha = 1$, where superdiffusive behavior was proved in 1- and 2-dimensional cases. 72, 73 other approaches have also been recently used for similar purposes in the literature. 74 we then hope that the combination of metaplexes and time- and space-fractional diffusive models will capture more of the details of protein-protein interactions in crowded cellular environments. see the supplementary material for a list of proteins targeted by sars-cov-2, which are found in the database disgenet as displaying "definitive" or "strong" evidence of participating in human diseases. the disease id is the code of the disease in disgenet. a list of proteins with the largest effect on perturbing the 38 proteins identified in human diseases is also given. $p_{tot}$ is the sum of the probabilities that the given protein activates those identified as having "definitive" or "strong" evidence of being involved in a human disease. there is an rna expression overview for proteins targeted by sars-cov-2 and mainly expressed outside the lungs. we select the top rna expressions in the four databases reported in the human protein atlas. there is also a list of proteins mainly expressed outside the lungs and their major activators, which are proteins mainly expressed in the lungs.
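as anticipated above, a minimal construction of the mellin-weighted sum of d-path laplacians can be sketched in r with igraph; this is our own illustrative rendering under the definitions just given, not code from the original study.

```r
library(igraph)

# build sum_d d^(-s) * L_d for a connected graph g, where L_d = K_d - A_d and
# A_d links every pair of nodes at shortest path distance exactly d
path_laplacian_sum <- function(g, s) {
  D <- distances(g)                    # matrix of shortest path distances
  n <- vcount(g)
  L <- matrix(0, n, n)
  for (d in 1:max(D)) {
    A_d <- (D == d) * 1                # d-path adjacency matrix
    L <- L + d^(-s) * (diag(rowSums(A_d)) - A_d)
  }
  L
}

g   <- make_ring(10)                   # toy example: a 10-node cycle
L_s <- path_laplacian_sum(g, s = 2.5)
```

the resulting operator could then replace the ordinary laplacian in the fractional diffusion sketch given earlier, combining long-range hops with subdiffusive waiting times.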
the author thanks dr. deisy morselli gysi for sharing data and information. the author is indebted to two anonymous referees whose clever comments helped in improving this work substantially. the data that support the findings of this study are available within the article and its supplementary material and also from the corresponding author upon reasonable request.

references:
- the species severe acute respiratory syndrome-related coronavirus: classifying 2019-ncov and naming it sars-cov-2
- a pneumonia outbreak associated with a new coronavirus of probable bat origin
- a new coronavirus associated with human respiratory disease in china
- review of the clinical characteristics of coronavirus disease 2019 (covid-19)
- coronavirus disease 2019 (covid-19): a clinical update
- covid-19 and multi-organ response
- genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting wuhan
- angiotensin-converting enzyme 2 is a functional receptor for the sars coronavirus
- sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor
- specific ace2 expression in cholangiocytes may cause liver damage after 2019-ncov infection
- single-cell rna expression profiling of ace2, the putative receptor of wuhan 2019-ncov
- tissue distribution of ace2 protein, the functional receptor for sars coronavirus. a first step in understanding sars pathogenesis
- a sars-cov-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing
- network medicine framework for identifying drug repurposing opportunities for covid-19
- specificity and stability in topology of protein networks
- perturbation waves in proteins and protein networks: applications of percolation and game theories in signaling and drug design
- predicting perturbation patterns from the topology of biological networks
- modeling and simulating networks of interdependent protein interactions
- network medicine: a network-based approach to human disease
- protein interaction networks in medicine and disease
- human diseases through the lens of network biology
- stochastic model of protein-protein interaction: why signaling proteins need to be colocalized
- accounting for conformational changes during protein-protein docking
- inferring novel tumor suppressor genes with a protein-protein interaction network and network diffusion algorithms
- information flow in interaction networks
- sampling the cell with anomalous diffusion-the discovery of slowness
- understanding biochemical processes in the presence of sub-diffusive behavior of biomolecules in solution and living cells
- protein motion in the nucleus: from anomalous diffusion to weak interactions
- random diffusivity from stochastic equations: comparison of two models of brownian yet non-gaussian diffusion
- anomalous protein diffusion in living cells as seen by fluorescence correlation spectroscopy
- anomalous subdiffusion is a measure for cytoplasmic crowding in living cell
- elucidating the origin of anomalous diffusion in crowded fluids
- in a mirror dimly: tracing the movements of molecules in living cells
- protein entrapment in polymeric mesh: diffusion in crowded environment with fast process on short scales
- from continuous time random walks to fractional fokker-planck equation
- a mathematical analysis of obstructed diffusion within skeletal muscle
- fractional calculus and waves in linear viscoelasticity. an introduction to mathematical models
- distributed coordination algorithms for multiple fractional-order systems
- distributed coordination of networked fractional-order systems
- consensus of networked multi-agent systems with delays and fractional-order dynamics
- the random walk's guide to anomalous diffusion: a fractional dynamics approach
- long-range navigation on complex networks using lévy random walks
- fractional dynamics on networks: emergence of anomalous diffusion and lévy flights
- fractional dynamics on networks and lattices
- nonlocal network dynamics via fractional graph laplacians
- diffusion maps
- the genotype-tissue expression (gtex) project
- tissue-based map of the human proteome
- a promoter-level mammalian expression atlas
- convergence speed of a fractional order consensus algorithm over undirected scale-free networks
- the disgenet knowledge platform for disease genomics: 2019 update
- covid-19 and the heart
- cardiac involvement in a patient with coronavirus disease 2019 (covid-19)
- covid-19 and the cardiovascular system
- coronavirus disease 2019 (covid-19) and cardiovascular disease
- sars-cov-2 and the testis: similarity to other viruses and routes of infection
- rising concern on damaged testis of covid-19 patients
- the need for urogenital tract monitoring in covid-19
- ace2 expression in kidney and testis may cause kidney and testis damage after 2019-ncov infection
- covid-19, angiotensin receptor blockers, and the brain
- pulmonary, cerebral, and renal thromboembolic disease associated with covid-19 infection
- a case of coronavirus disease 2019 with concomitant acute cerebral infarction and deep vein thrombosis
- kidney involvement in covid-19 and rationale for extracorporeal therapies
- acute kidney injury in sars-cov-2 infected patients
- caution on kidney dysfunctions of covid-19 patients
- additional hypotheses about why covid-19 is milder in children than adults
- the novel severe acute respiratory syndrome coronavirus 2 (sars-cov-2) directly decimates human spleens and lymph nodes
- covid-19, hypothalamo-pituitary-adrenal axis and clinical implications
- metaplex networks: influence of the exo-endo structure of complex systems on diffusion
- path laplacian matrices: introduction and application to the analysis of consensus in networks
- path laplacian operators and superdiffusive processes on graphs. i. one-dimensional case
- path laplacian operators and superdiffusive processes on graphs. ii. two-dimensional lattice
- hopping in the crowd to unveil network topology

key: cord-295307-zrtixzgu authors: delgado-chaves, fernando m.; gómez-vela, francisco; divina, federico; garcía-torres, miguel; rodriguez-baena, domingo s. title: computational analysis of the global effects of ly6e in the immune response to coronavirus infection using gene networks date: 2020-07-21 journal: genes (basel) doi: 10.3390/genes11070831 sha: doc_id: 295307 cord_uid: zrtixzgu

gene networks have arisen as a promising tool in the comprehensive modeling and analysis of complex diseases. particularly in viral infections, the understanding of the host-pathogen mechanisms, and the immune response to these, is considered a major goal for the rational design of appropriate therapies. for this reason, the use of gene networks may well encourage therapy-associated research in the context of the coronavirus pandemic, orchestrating experimental scrutiny and reducing costs.
in this work, gene co-expression networks were reconstructed from rna-seq expression data with the aim of analyzing the time-resolved effects of gene ly6e in the immune response against the coronavirus responsible for murine hepatitis (mhv). through the integration of differential expression analyses and the exploration of the reconstructed networks, significant differences in the immune response to the virus were observed in ly6e ∆hsc mice compared to wild type animals. results show that ly6e ablation at hematopoietic stem cells (hscs) leads to a progressively impaired immune response in both liver and spleen. specifically, depletion of the normal leukocyte mediated immunity and chemokine signaling is observed in the liver of ly6e ∆hsc mice. on the other hand, the immune response in the spleen, which seemed to be mediated by an intense chromatin activity in the normal situation, is replaced by ecm remodeling in ly6e ∆hsc mice. these findings, which require further experimental characterization, could be extrapolated to other coronaviruses and motivate the efforts towards novel antiviral approaches. the recent sars-cov-2 pandemic has exerted an unprecedented pressure on the scientific community in the quest for novel antiviral approaches. a major concern regarding sars-cov-2 is the capability of the coronaviridae family to cross the species barrier and infect humans [1] . this, along with the tendency of coronaviruses to mutate and recombine, represents a significant threat to global health, which ultimately has put interdisciplinary research on the warpath towards the development of a vaccine or antiviral treatments. given the similarities found amongst the members of the coronaviridae family [2, 3] , analyzing the global immune response to coronaviruses may shed some light on the natural control of viral infection, and inspire prospective treatments. this may well be achieved from the perspective of systems biology, in which the interactions between the biological entities involved in a certain process are represented by means of a mathematical system [4] . within this framework, gene networks (gns) have become an important tool in the modeling and analysis of biological processes from gene expression data [5] . gns constitute an abstraction of a given biological reality by means of a graph composed of nodes and edges. in such a graph, nodes represent the biological elements involved (i.e., genes, proteins or rnas) and edges represent the relationships between the nodes. in addition, gns are also useful to identify genes of interest in biological processes, as well as to discover relationships among these. thus, they provide a comprehensive picture of the studied processes [6, 7] . among the different types of gns, gene co-expression networks (gcns) are widely used in the literature due to their computational simplicity and good performance in the study of biological processes or diseases [8] [9] [10] . gcns usually compute pairwise co-expression indices for all genes. then, the level of interaction between two genes is considered significant if its score is higher than a certain threshold, which is set ad hoc. traditionally, statistical-based co-expression indices have been used to calculate the dependencies between genes [5, 7] . some of the most popular correlation coefficients are pearson, kendall or spearman [11] [12] [13] . despite their popularity, statistical-based measures present some limitations [14] .
for instance, they are not capable of identifying non-linear interactions, and parametric correlation coefficients depend on the underlying data distribution (a toy illustration of the non-linearity issue is sketched at the end of this introduction). in order to overcome some of these limitations, new approaches, e.g., the use of information theory-based measures or ensemble approaches, are receiving much attention [15] [16] [17] . gene co-expression networks (gcns) have already been applied to the study of dramatic impact diseases, such as cancer [18] , diabetes [19] or viral infections (e.g., hiv), in order to study the role of the immune response to these illnesses [20, 21] . genetic approaches are expected to be the best strategy to understand viral infection and the immune response to it, potentially identifying the mechanisms of infection and assisting the design of strategies to combat infection [22, 23] . the current gene expression profiling platforms, in combination with high-throughput sequencing, can provide time-resolved transcriptomic data, which can be related to the infection process. the main objective of this approach is to generate knowledge on the functioning of the immune system upon viral entry into the organism, which represents a perturbation to the system. in the context of viral infection, a first defense line is the innate response mediated by interferons, a type of cytokines which eventually leads to the activation of several genes of antiviral function [24] . globally, these genes are termed interferon-stimulated genes (isgs), and regulate processes like inflammation, chemotaxis or macrophage activation among others. furthermore, isgs are also involved in the subsequent acquired immune response, specific for the viral pathogen detected [25] . gene ly6e (lymphocyte antigen 6 family member e), which has been related to t cell maturation and tumorigenesis, is amongst the isgs [26] . this gene is transcriptionally active in a variety of tissues, including liver, spleen, lung, brain, uterus and ovary. its role in viral infection has been elusive due to contradictory findings [27] . for example, in liu et al. [28] , ly6e was associated with the resistance to marek's disease virus (mdv) in chickens. moreover, differences in the immune response to mouse adenovirus type 1 (mav-1) have been attributed to ly6e variants [29] . conversely, ly6e has also been related to an enhancement of human immunodeficiency virus (hiv-1) pathogenesis, by promoting hiv-1 entry through virus-cell fusion processes [30] . also, in the work by mar et al. [31] , the loss of function of ly6e due to gene knockout reduced the infectivity of influenza a virus (iav) and yellow fever virus (yfv). this enhancing effect of ly6e on viral infection has also been observed in other enveloped rna viruses such as west nile virus (wnv), dengue virus (den), zika virus (zikv), o'nyong nyong virus (onnv) and chikungunya virus (chikv) among others [32] . nevertheless, the exact mechanisms through which ly6e modulates viral infection virus-wise, and sometimes even cell type-dependently, require further characterization.
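as a toy illustration of the non-linearity limitation mentioned at the beginning of this introduction, and not an excerpt from the cited works, the following r sketch builds a non-monotonic dependence that both the pearson and spearman coefficients miss, while a simple histogram-based mutual information estimate detects it; the mi() helper is our own minimal estimator.

```r
set.seed(1)
x <- runif(500, -1, 1)
y <- x^2 + rnorm(500, sd = 0.05)    # non-linear, non-monotonic dependence

cor(x, y, method = "pearson")       # close to 0: dependence missed
cor(x, y, method = "spearman")      # also close to 0: dependence missed

mi <- function(x, y, bins = 10) {   # histogram-based mutual information
  cx <- cut(x, bins); cy <- cut(y, bins)
  p  <- table(cx, cy) / length(x)   # joint distribution estimate
  px <- rowSums(p); py <- colSums(p)
  sum(p * log(p / outer(px, py)), na.rm = TRUE)
}
mi(x, y)                            # clearly positive: dependence detected
```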
in this work we present a time-resolved study of the immune response of mice to a coronavirus, the murine hepatitis virus (mhv), in order to analyze the implications of gene ly6e. to do so, we have applied a gcn reconstruction method called engnet [33] , which performs an ensemble strategy combining three different co-expression measures, followed by a topology optimization of the final network. engnet has outscored other methods in terms of network precision and reduced network size, and has been proven useful in the modeling of disease, as in the case of human post-traumatic stress disorder. the rest of the paper is organized as follows. in the next section, we propose a description of related works. in section 3, we first describe the dataset used in this paper, and then we introduce the engnet algorithm and the different methods used to infer and analyze the generated networks. the results obtained are detailed in section 4, while, in section 5, we propose a discussion of the results presented in the previous section. finally, in section 6, we draw the main conclusions of our work. as already mentioned, gene co-expression networks have been extensively applied in the literature for the understanding of the mechanisms underlying complex diseases like cancer, diabetes or alzheimer's [34] [35] [36] . globally, gcns serve as an in silico genetic model of these pathologies, highlighting at the same time the main genes involved in them [37] . besides, the identification of modules in the inferred gcns may lead to the discovery of novel biomarkers for the disease under study, following the 'guilt by association' principle. along these lines, gcns are also considered suitable for the study of infectious diseases, such as those caused by viruses, the matter at hand [38] . to this end, multiple studies have analyzed the effects of viral infection on the organism, focusing on the immune response or tissue damage [39, 40] . for instance, the analysis of gene expression using co-expression networks is shown in the work by pedragosa et al. [41] , where the infection caused by lymphocytic choriomeningitis virus (lcmv) is studied over time in mice spleen using gcns. in ray et al. [42] , gcns are reconstructed from different microarray expression data in order to study hiv-1 progression, revealing important changes across the different infection stages. similarly, in the work presented by mcdermott et al. [43] , the over- and under-stimulation of the innate immune response to severe acute respiratory syndrome coronavirus (sars-cov) infection is studied. using several network-based approaches on multiple knockout mouse strains, the authors found that ranking genes based on their network topology made accurate predictions of the pathogenic state, thus solving a classification problem. in [39] , co-expression networks were generated by microarray analysis of pediatric influenza-infected samples. thanks to this study, genes involved in the innate immune system and defense against viruses were revealed. finally, in the work by pan et al. [44] , a co-expression network is constructed based on differentially-expressed micrornas and genes identified in liver tissues from patients with hepatitis b virus (hbv). this study provides new insights on how micrornas take part in the molecular mechanism underlying hbv-associated acute liver failure. the alarm posed by the covid-19 pandemic has fueled the development of effective prevention and treatment protocols for the 2019-ncov/sars-cov-2 outbreak [45] . due to the novelty of sars-cov-2, recent research takes similar viruses, such as sars-cov and the middle east respiratory syndrome coronavirus (mers-cov), as a starting point. other coronaviruses, like the mouse hepatitis virus (mhv), are also considered appropriate for comparative studies in animal models, as demonstrated in the work by de albuquerque et al. [46] and ding et al. [47] .
mhv is a murine coronavirus (m-cov) that causes an epidemic illness with high mortality, and has been widely used for experimentation purposes. works like the ones by case et al. [48] and gorman et al. [49] study the innate immune response against mhv arbitrated by interferons, and those interferon-stimulated genes with potential antiviral function. this is the case of gene ly6e, which has been shown to play an important role in viral infection, as have various orthologs of the same gene [50, 51] . mechanistic approaches often involve the ablation of the gene under study, as in the work by mar et al. [31] , where gene knockout was used to characterize the implications of ly6e in influenza a infection. as in the case of giotis et al. [52] , these studies often involve global transcriptome analyses, via rna-seq or microarrays, together with computational efforts, which intend to screen the key elements of the immune system that are required for the appropriate response. this approach ultimately guides experimental research through predictive analyses, as in the case of co-expression gene networks [53] . in the following subsections, the main methods and gcn reconstruction steps are addressed. first, in section 3.1, the original dataset used in the present work is described, together with the experimental design. then, in section 4.1, the data preprocessing steps are described. subsequently, in section 3.3, key genes controlling the infection progression are extracted through differential expression analyses. finally, the inference of gcns and their analysis are detailed in sections 3.4 and 3.5, respectively. the original experimental design can be described as follows. the progression of the mhv infection at the genetic level was evaluated in two genetic backgrounds: wild type (wt, ly6efl/fl) and ly6e knockout mutants (ko, ly6e ∆hsc ). the ablation of gene ly6e in all cell types is lethal, hence the ly6e ∆hsc strain contains a disrupted version of gene ly6e only in hematopoietic stem cells (hsc), which give rise to the myeloid and lymphoid progenitors of all blood cells. wild type and ly6e ∆hsc mice were injected intraperitoneally with 5000 pfu mhv-a59. at 3 and 5 days post-injection (d p.i.), mice were euthanized and biological samples for rna-seq were extracted. the overall effects of mhv infection in both wt and ko strains were assessed in liver and spleen. in total, 36 samples were analyzed, half of these corresponding to liver and spleen, respectively. from the 18 organ-specific samples, 6 samples correspond to mock infection (negative control), 6 to mhv-infected samples at 3 d p.i. and 6 to mhv-infected samples at 5 d p.i. for each sample, two technical replicates were obtained. libraries of cdna generated from the samples were sequenced using illumina novaseq 6000. further details on sample preparation can be found in the original article by pfaender et al. [54] . for the sake of simplicity, mhv-infected samples at 3 and 5 d p.i. will be termed 'cases', whereas mock-infection samples will be termed 'controls'. the original dataset consists of 72 files, one per sample replicate, obtained upon the mapping of the transcript reads to the reference genome. reads were recorded in three different ways, considering whether these mapped to introns, exons or total genes. then, a count table was retrieved from these files by selecting only the total gene counts of each sample replicate file. pre-processing was performed using the edger [55] r package.
the original dataset by pfaender et al. [54] was retrieved from geo (accession id: gse146074) using the geoquery [56] package. additional files on sample information and treatment were also used to assist the modeling process. by convention, a sequencing depth per gene below 10 is considered negligible [57, 58] . genes meeting this criterion are known as low expression genes, and are often removed since they add noise and computational burden to the following analyses [59] . in order to remove genes showing less than 10 reads across all conditions, counts per million (cpm) normalization was performed, so possible differences between library sizes for both replicates would not affect the result. afterwards, principal component analyses (pca) were performed over the data in order to detect the main sources of variability across samples. pca were accompanied by unsupervised k-medoid clustering analyses, in order to identify different groups of samples. in addition, multidimensional scaling (mds) plots were applied to further separate samples according to their features. last, between-sample similarities were assessed through hierarchical clustering. the analyses of differential expression served a twofold purpose: (i) the exploration of the directionality in the gene expression changes upon viral infection, and (ii) the identification of key regulatory elements for the subsequent network reconstruction. in the present application, differentially-expressed genes (deg) were filtered from the original dataset and passed on to the reconstruction process. this approach enabled the modeling of the genetic relationships that are considered of relevance in the presented comparison [60] [61] [62] . in the present work, mice samples were compared organ-wise depending on whether these corresponded to control, 3 d p.i. or 5 d p.i. conditions. the identification of deg was performed using the limma [63] r package, which provides non-parametric robust estimation of the gene expression variance. this package includes voom, a method that incorporates rna-seq count data into the limma workflow, originally designed for microarrays [64] . in this case, a minimum log2-fold-change (log2fc) of 2 was chosen, which corresponds to a four-fold change in the gene expression level. the p-value was adjusted by benjamini-hochberg [65] and the selected adjusted p-value cutoff was 0.05. in order to generate gene networks, the engnet algorithm was used. this technique, presented in gómez-vela et al. [33] , is able to compute gene co-expression networks with a competitive performance compared to other approaches from the literature. engnet performs a two-step process to infer gene networks: (a) an ensemble strategy for a reliable co-expression network generation, and (b) a greedy algorithm that optimizes both the size and the topological features of the network. these two features of engnet offer a reliable solution for generating gene networks. in fact, engnet relies on three statistical measures in order to obtain networks. in particular, the measures used are spearman, kendall and the normalized mutual information (nmi), which are widely used in the literature for inferring gene networks. engnet uses these measures simultaneously by applying an ensemble strategy based on majority voting, i.e., a relationship will be considered correct if at least 2 of the 3 measures evaluate the relationship as correct. the evaluation is based on different independent thresholds.
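the majority-voting scheme just described can be sketched in a few lines of r; this is our own minimal rendering of the idea, not engnet's implementation, and nmi_matrix() is a hypothetical helper returning the pairwise normalized mutual information.

```r
# expr: gene-by-sample expression matrix; th_*: per-measure thresholds
ensemble_edges <- function(expr, th_sp, th_kd, th_nmi) {
  sp <- abs(cor(t(expr), method = "spearman")) >= th_sp
  kd <- abs(cor(t(expr), method = "kendall"))  >= th_kd
  mi <- nmi_matrix(expr)                       >= th_nmi  # hypothetical helper
  votes <- sp + kd + mi
  diag(votes) <- 0          # discard self-comparisons
  votes >= 2                # accept an edge when at least 2 of 3 measures agree
}
```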
in this work, the different thresholds were set to the values originally used in [33] : 0.9, 0.8 and 0.7 for spearman, kendall and nmi, respectively. in addition, as mentioned above, engnet performs an optimization of the topological structure of the networks obtained. this reduction is based on two steps: (i) the pruning of the relations considered of least interest in the initial network, and (ii) the analysis of the hubs present in the network. for this second step of the final network reconstruction, we have selected the same threshold that was used in [33] , i.e., 0.7. through this optimization, the final network produced by engnet is easier to analyze computationally, due to its reduced size. networks were imported into r for the estimation of topology parameters and the addition of network features that are of interest for the later network analysis and interpretation. these attributes were added to the reconstructed networks to enrich the modeling using the igraph [66] r package. the networks were then imported into cytoscape [67] through rcy3 [68] for examination and analysis purposes. in this case, two kinds of analyses were performed: (i) a topological analysis and (ii) an enrichment analysis. regarding the topological analysis, clustering evaluation was performed in order to identify densely connected nodes, which, according to the literature, are often involved in a same biological process [69] . the chosen clustering method was community clustering (glay) [70] , implemented via cytoscape's clustermaker app [71] , which has yielded significant results in the identification of densely connected modules [72, 73] . among the topology parameters, degree and edge betweenness were estimated. the degree of a node refers to the number of its linking nodes. on the other hand, the betweenness of an edge refers to the number of shortest paths which go through that edge. both parameters are considered a measure of the importance of nodes and edges, respectively, in a given network. particularly, nodes whose degree exceeds the average network node degree, the so-called hubs, are considered key elements of the biological processes modeled by the network. in this particular case, the distribution of node degrees was analyzed so that those nodes whose degree exceeded a threshold were selected as hubs. this threshold is defined as q3 + 1.5 × iqr, where q3 is the third quartile and iqr the interquartile range of the degree distribution. this method has been widely used for the detection of upper outliers in non-parametric distributions [74, 75] , as is the case here. note, however, that the term outlier is not pejorative in this setting: precisely those nodes whose degree lies far above the median are the ones considered hubs. on the other hand, gene ontology (go) enrichment analysis provides valuable insights on the biological reality modeled by the reconstructed networks. the gene ontology consortium [76] is a database that seeks a unified nomenclature for biological entities. go has developed three different ontologies, which describe gene products in terms of the biological processes, cell components or molecular functions in which these are involved. ontologies are built out of go terms or annotations, which provide biological information on gene products. in this case, the clusterprofiler [77] r package allowed the identification of the statistically over-represented go terms in the gene sets of interest. additional enrichment analyses were performed using david [78] .
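a minimal sketch of such an over-representation test with clusterprofiler might look as follows, where cluster_genes and all_genes are hypothetical character vectors of gene symbols for one network cluster and for the background, respectively.

```r
library(clusterProfiler)
library(org.Mm.eg.db)   # mouse annotation database

# over-represented biological process terms in one network cluster
ego <- enrichGO(gene          = cluster_genes,
                universe      = all_genes,      # background gene set
                OrgDb         = org.Mm.eg.db,
                keyType       = "SYMBOL",
                ont           = "BP",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05)
head(as.data.frame(ego))
```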
for both analyses, the complete genome of mus musculus was selected as background. finally, further details on the interplay of the genes under study were examined using the string database [79] . the reconstruction of gene networks that adequately model viral infection involves multiple steps, which ultimately shape the final outcome. first, in section 4.1, exploratory analyses and data preprocessing are detailed, which prompted the modeling rationale. then, in section 4.2, differential expression is evaluated for the samples of interest. finally, network reconstruction and analysis are addressed in section 4.3. at the end, four networks were generated, in an organ- and genotype-wise manner. a schematic representation of the gcn reconstruction approach is shown in figure 1.

figure 1. general scheme for the reconstruction method. the preprocessed data was subjected to exploratory and differential expression analyses, which imposed the reconstruction rationale. four groups of samples were used to generate four independent networks, respectively modeling the immune response in the liver, both in the wt and the ko situations, and in the spleen, also in the wt and the ko scenarios.

in order to remove low expression genes, a sequencing depth of 10 was found to correspond to an average cpm of 0.5, which was selected as the threshold. hence, genes whose expression was found over 0.5 cpm in at least two samples of the dataset were maintained, ensuring that only genes which are truly being expressed in the tissue will be studied (a filtering sketch is given at the end of this subsection). the dataset was log2-normalized prior to the following analyses, in accordance with the recommendations posed in law et al. [64] . the results of both pca and k-medoid clustering are shown in figure 2a. clustering of the log2-normalized samples revealed clear differences between liver and spleen samples. also, for each organ, three subgroups of analogous samples that cluster together are identified. these groups correspond to mock infection, mhv-infected mice at 3 d p.i. and mhv-infected mice at 5 d p.i. (dashed lines in figure 2a). finally, subtle differences were observed in homologous samples of different genotypes (figure a1). organ-specific pca revealed major differences between mhv-infected samples for the ly6e ∆hsc and wt genotypes, at both 3 and 5 d p.i. these differences were not observed in the mock infection (control situation). organ-wise pca are shown in figure 2b,c. the distances between same-genotype samples illustrate the infection-prompted genetic perturbation from the uninfected status (control) to 5 d p.i., where clear signs of hepatitis were observed according to the original physiopathology studies [54] . on the other hand, the differences observed between both genotypes are indicative of the role of gene ly6e in the appropriate response to viral infection. these differences are subtle in control samples, but in case samples, some composition bias is observed depending on whether these are ko or wt, especially in spleen samples. the comparative analysis of the top 500 most variable genes confirmed the differences observed in the pca, as shown in figure a2. among the four different features of the samples under study, namely organ, genotype, sample type (case or control) and days post injection, the dissimilarities in terms of genotype were the subtlest. in the light of these exploratory findings, the network reconstruction approach was performed as follows. networks were reconstructed organ-wise, as the organs exhibit notable differences in gene expression.
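the filtering and normalization sketch announced above might look as follows with edger, where counts stands for the raw gene-by-sample count table; the variable names are ours.

```r
library(edgeR)

cpms <- cpm(counts)                     # counts-per-million normalization
keep <- rowSums(cpms > 0.5) >= 2        # >0.5 cpm in at least two samples
counts_filt <- counts[keep, ]

logexpr <- log2(cpm(counts_filt) + 1)   # log2 normalization (pseudocount of 1)
```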
additionally, a main objective of the present work is to evaluate the differences in the genetic response in the wt situation compared to the ly6e ∆hsc ko background, upon the viral infection onset in the two mentioned tissues. for each organ, log2-normalized samples were coerced to generate time-series-like data, i.e., for each genotype, 9 samples were considered as a set, namely 3 control samples, 3 case samples at 3 d p.i. and 3 case samples at 5 d p.i. both technical replicates were included. this rational design seeks a gene expression span representative of the infection progress. thereby, control samples may well be considered as a time zero for the viral infection, followed by the corresponding samples at 3 and 5 d p.i. the proposed rationale is supported by the exploratory findings, which position 3 d p.i. samples between control and 5 d p.i. samples. at the same time, the reconstruction of gene expression becomes more robust with an increasing number of samples. in this particular case, 18 measuring points are attained for the reconstruction of each one of the four intended networks, since two technical replicates were obtained per sample [80] . the differential expression analyses were performed over the four groups of 9 samples explained above, with the aim of examining the differences in the immune response between ly6e ∆hsc and wt samples. limma-voom differential expression analyses were performed over the log2-normalized counts, in order to evaluate the different genotypes whilst contrasting the three infection stages: control vs. cases at 3 d p.i., control vs. cases at 5 d p.i. and cases at 3 vs. 5 d p.i. the choice of a minimum absolute log2fc ≥ 2 enabled considering only those genes whose expression truly changes between wt and ly6e ∆hsc samples, whilst maintaining a relatively computer-manageable number of deg for network reconstruction. the latter is essential for the yield of accurate network sparseness values, as this is a main feature of gene networks [5] . for both genotypes and organs, the results of the differential expression analyses reveal that mhv injection triggers a progressive genetic program from the control situation to the mhv-infected scenario at 5 d p.i., as shown in figure 3a. the absolute number of deg between control vs. cases at 5 d p.i. was considerably larger than in the comparison between control vs. cases at 3 d p.i. furthermore, in all cases, most of the deg in control vs. cases at 3 d p.i. are also differentially expressed in the control vs. cases at 5 d p.i. comparison, as shown in figure 4. regarding gene fold changes, an overall genetic up-regulation is observed upon infection. around 70% of deg are upregulated for all the comparisons performed for wt samples, as shown in figure 3b. nonetheless, a dramatic reduction in this genetic up-regulation is observed, by contrast, in knockout samples, even limiting upregulated genes to nearly 50% in the control vs. cases at 3 d p.i. comparison of liver ly6e ∆hsc samples. the largest differences are observed in the comparison of controls vs. cases at 5 d p.i. (figures a3 and a4). these deg are of great interest for the understanding of the immune response of both wt and ko mice to viral infection. these genes were selected to filter the original dataset for later network reconstruction. the commonalities between wt and ko control samples for both organs were also verified through differential expression analysis following the same criteria (log2fc > 2, p-value < 0.05).
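the contrasts described above can be sketched with limma-voom as follows; the variable names are ours, and the three-level factor encodes the control, 3 d p.i. and 5 d p.i. stages of one organ and genotype sample set (6 columns each, including technical replicates).

```r
library(limma)
library(edgeR)

group  <- factor(c(rep("control", 6), rep("d3", 6), rep("d5", 6)))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

v     <- voom(DGEList(counts = counts_filt), design)   # voom precision weights
fit   <- lmFit(v, design)
contr <- makeContrasts(d3 - control, d5 - control, d5 - d3, levels = design)
fit2  <- eBayes(contrasts.fit(fit, contr))

# deg for control vs. cases at 5 d p.i.: |log2fc| >= 2, bh-adjusted p < 0.05
deg <- topTable(fit2, coef = 2, lfc = 2, p.value = 0.05,
                adjust.method = "BH", number = Inf)
```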
the number of deg between wt and ko liver control samples (2) and between wt and ko spleen control samples (20) was not considered significant, so samples were taken as analogous starting points for infection. as stated above, the samples were arranged both organ- and genotype-wise in order to generate networks which would model the progress of the disease in each scenario. gcns were inferred from the log2-normalized expression datasets. a count of 1 was added at log2 normalization so that problems with remaining zero values were avoided. each network was generated exclusively taking into consideration its corresponding deg at control vs. cases at 5 d p.i., where larger differences were observed. four networks were then reconstructed from these previously-identified deg for liver wt samples (1133 genes), liver ko samples (1153 genes), spleen wt samples (506 genes) and spleen ko samples (426 genes). this approach results in the modeling of only those relationships that are related to the viral infection. each sample set was then fed to engnet for the reconstruction of the subsequent network. genes that remained unconnected due to weak relationships, which did not overcome the set threshold, were removed from the networks. furthermore, engnet-generated models outperformed other well-known inference approaches in quality, as detailed in appendix b. topological parameters were estimated and added as node attributes using igraph, together with log2fc, prior to cytoscape import. additionally, networks were simplified by removing potential loops and multiple edges. the topological clustering analysis of the reconstructed networks revealed clear modules in all cases, as shown in figure a5. the number of clusters identified in each network, as well as the number of genes harbored in the clusters, is shown in table a1. as already mentioned, according to gene network theory, nodes contained within the same cluster are often involved in the same biological process [5, 81] . in this context, the go-based enrichment analyses over the identified clusters may well provide an idea of the affected functions. only clusters containing more than 10 genes were considered, since this is the minimum number of elements required by the enrichment tool clusterprofiler. the results of the enrichment analyses revealed that most go terms were not shared between wt and ko homologous samples, as shown in figure 5. in order to further explore the reconstructed networks, the intersection of the ko and wt networks of a same organ was computed. this refers to the genes and relationships that are shared between both genotypes for a specific organ. additionally, the genes and relationships that were exclusively present in the wt and ko samples were also estimated, as shown in figure a6. the enrichment analyses over the nodes, separated using this criterion, would reveal the biological processes that differ in ly6e ∆hsc mice compared to wt ones. the results of such analyses are shown in figure a7. finally, the exploration of the node degree distribution would reveal those genes that can be considered hubs. those nodes comprised within the top genes with highest degree (degree > q3 + 1.5 × iqr), also known as upper outliers in the node distribution, were considered hubs. a representation of the node degree distributions throughout the four reconstructed networks is shown in figure 6. these distributions are detailed in figure a8.
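the hub rule can be written in a few lines of igraph-based r; the sketch below is our own illustration, with g standing for one of the reconstructed networks.

```r
library(igraph)

deg    <- degree(g)                            # node degrees of the network
q      <- quantile(deg, c(0.25, 0.75))
cutoff <- unname(q[2] + 1.5 * (q[2] - q[1]))   # q3 + 1.5 * iqr
hubs   <- names(deg)[deg > cutoff]
```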
this method provided four cutoff values for the degree, 24, 39, 21 and 21, respectively for the liver wt and ko, and spleen wt and ko networks. above these thresholds, nodes were considered hubs in each network. these hubs are shown in tables a2-a5.

figure 5. enrichment analyses performed over the main clusters identified in the wt and ko networks of (a) the liver and (b) the spleen. gene ratio is defined as the number of genes used as input for the enrichment analyses associated with a particular go term divided by the total number of input genes.

figure 6. boxplots representative of the degree distributions for each one of the four reconstructed networks. identified hubs, according to the q3 + 1.5 × iqr criterion, are highlighted in red. the degree cutoffs, above which nodes would be considered hubs, were 24, 39, 21 and 21, respectively for the liver wt, liver ko, spleen wt and spleen ko networks. note that degree is represented in a log scale given that the reconstructed networks present a scale-free topology.

in this work four gene networks were reconstructed to model the genetic response to mhv infection in two tissues, liver and spleen, and in two different genetic backgrounds, wild type and ly6e ∆hsc . samples were initially explored in order to design an inference rationale. not only did the designed approach reveal major differences between the genetic programs in each organ, but also between different subgroups of samples, in a time-series-like manner. noticeably, disparities between wt and ly6e ∆hsc samples were observed in both tissues, and differential expression analyses revealed relevant differences in terms of the immune response generated. hereby, our results predict the impact of the ly6e ko on hscs, which resulted in an impaired immune response compared to the wt situation. overall, results indicate that the reconstruction rationale, elucidated from the exploratory findings, is suitable for the modeling of the viral progression. regarding the variance in gene expression in response to the virus, pca and k-medoid clustering revealed strong differences between samples corresponding to liver and spleen, respectively (figure 2a). these differences set the starting point for the modeling approach, in which samples corresponding to each organ were analyzed independently. this modus operandi is strongly supported by the tropism that viruses exhibit for certain tissues, which ultimately results in a differential viral incidence and load depending on the organ [82] . in particular, the liver is the target organ of mhv, identified as the main disease site [83] . on the other hand, the role of the spleen in innate and adaptive immunity against mhv has been widely addressed [84, 85] . the organization of this organ allows blood filtration for the presentation of antigens to cognate lymphocytes by the antigen presenting cells (apcs), which mediate the immune response exerted by t and b cells [86] . as stated before, pca revealed differences between the three sample groups in each organ: control and mhv-infected at 3 and 5 d p.i. interestingly, between-group differences are especially clear for liver samples (figure 2b), whereas spleen samples are displayed in a continuum-like way. this becomes more evident in the organ-wise pca (figure 2), and was later confirmed by the exploration of the top 500 most variable genes and the differential expression analyses (figure a2).
furthermore, clear differences between wt and ly6e ∆hsc samples are not observed in any of these analyses, although the examination of the differential expression and the network reconstruction did expose divergent immune responses for both genotypes. the differential expression analyses revealed the progressive genetic response to the virus for both organs and genotypes (figures 3a and 4). in a wt genetic background, mhv infection causes an overall rise in the expression level of certain genes, as most deg in cases vs. control samples are upregulated. however, in a ly6e ∆hsc genetic background, this upregulation is not as prominent as in a wt background, significantly reducing the number of upregulated genes (figure 3b). besides, the number of deg in each comparison varies from wt to ly6e ∆hsc samples. attending to the deg in the performed comparisons, for both the wt and ko genotypes, liver cases at 3 d p.i. are more similar to liver cases at 5 d p.i. than to liver controls, since the number of deg between the first two measuring points is significantly lower than the number of deg between control and case samples at 3 d p.i. (figure 4a,b). a different situation occurs in the spleen, where wt cases at 3 d p.i. are closer to control samples (figure 4c), whereas ko cases at 3 d p.i. seem to be more related to cases at 5 d p.i. (figure 4d). this was already suggested by the hierarchical clustering in the analysis of the top 500 most variable genes, and could be indicative of a different progression of the infection impact on both organs, which could be modulated by gene ly6e, at least for the spleen samples. moreover, the results of the deg analyses indicate that the sole knockout of gene ly6e in hscs considerably affects the upregulating genetic program normally triggered by viral infection in wild type individuals (in both liver and spleen). interestingly, there are some genes in each organ and genotype that are differentially expressed in every comparison between the possible three sample types: controls, cases at 3 d p.i. and cases at 5 d p.i. these genes, which we termed highly deg, could be linked to the progression of the infection, as changes in their expression level occur with days post injection, according to the data. the rest of the deg show an uprise or fall when comparing two sample types, which does not change significantly in the third sample type. alternatively, highly deg, shown in table a6, exhibited three different expression patterns: (i) their expression level, initially low, rises from control to cases at 3 d p.i. and then rises again in cases at 5 d p.i.; (ii) their expression level, initially high in control samples, falls at 3 d p.i. and falls even more in cases at 5 d p.i.; (iii) their expression level, initially low, rises from control to cases at 3 d p.i. but then falls in cases at 5 d p.i., when it is still higher than the initial expression level. these expression patterns, which are shown in figure a9, might be used to keep track of the disease progression, differentiating early from late infection stages. in some cases, these genes exhibited inconsistent expression levels, especially in cases at 5 d p.i., which indicates the need for further experimental designs targeting these genes. highly deg could be correlated with the progression of the disease, as in regulation types (i) and (ii), or, by contrast, be required exclusively at initial stages, as in regulation type (iii). notably, genes gm10800 and gm4756 are predicted genes which, to date, have been poorly described.
according to the string database [79] , gm10800 is associated with gene lst1 (leukocyte-specific transcript 1 protein), which has a possible role in modulating immune responses. in fact, gm10800 is homologous to human gene piro (progranulin-induced-receptor-like gene during osteoclastogenesis), related to bone homeostasis [87, 88] . thus, we hypothesize that bone marrow-derived cell lines, including erythrocytes and leukocytes (immunity effectors), could also be regulated by gm10800. on the other hand, gm4756 is not associated with any other gene according to string. protein gm4756 is homologous to the human protein dhrs7 (dehydrogenase/reductase sdr family member 7) isoform 1 precursor. nonetheless, and to the best of our knowledge, these genes have not been previously related to ly6e, and could play a role in the immune processes mediated by this gene. finally, highly deg were not found exclusively present in the wt nor the ko networks; instead, these were common nodes of these networks for each organ. this suggests that highly deg might be of core relevance upon mhv infection, with a role in those processes independent of ly6e ∆hsc . besides, genes hykk, ifit3 and ifit3b, identified as highly deg throughout liver ly6e ∆hsc samples, were also identified as hubs in the liver ko network. likewise, gene saa3, highly deg across spleen ly6e ∆hsc samples, was considered a hub in the spleen ko network. nevertheless, these highly deg require further experimental validation. the enrichment analyses of the identified clusters at each network revealed that most go terms are not shared between the two genotypes (figure 5), despite the considerable amount of shared genes between the two genotypes for a same organ. the network reconstructed from liver wt samples reflects a strong response to viral infection, involving leukocyte migration or cytokine and interferon signaling among others. these processes, much related to immune processes, are not observed in its ko counterpart. the liver wt network presented four clusters (figure a5a). its cluster 1 regulates processes related to leukocyte migration, showing the implication of receptor-ligand activity and cytokine signaling, which possibly mediates the migration of the involved cells. cluster 2 is related to interferon-gamma in the response to mhv, whereas cluster 3 is probably involved in the inflammatory response mediated by pro-inflammatory cytokines. last, cluster 4 is related to cell extravasation, or the exit of blood cells from blood vessels, with the participation of gene nipal1. the positive regulation observed across all clusters suggests the activation of these processes. overall, hub genes in this network have been related to the immune response to viral infection, as the innate immune response to the virus is mediated by interferons. meanwhile, the liver ko network showed three main clusters (figure a5b). its cluster 1 would also be involved in the defense response to the virus, but other processes observed in the liver wt network, like leukocyte migration or cytokine activity, are not observed in this cluster nor in the others. cluster 2 is then related to the catabolism of small molecules and cluster 3 is involved in acid biosynthesis. these processes are certainly ambiguous and do not correspond to the immune response observed in the wt situation, which suggests a decrease in the immune response to mhv as a result of the ly6e ablation in hscs.
on the other hand, spleen wt samples revealed high nuclear activity, potentially involving nucleosome remodeling complexes and changes in dna accessibility. histone modification is a type of epigenetic modulation which regulates gene expression. taking into account the central role of the spleen in the development of immune responses, the manifested relevance of chromatin organization could be accompanied by changes in the accessibility of certain dna regions with implications in the spleen-dependent immune response. this is supported by the reduced reaction capacity in the first days post-infection of ly6e ∆hsc samples compared to wt, as indicated by the number of deg between control and cases at 3 d p.i. for these genotypes. the spleen wt network displayed three clusters (figure a5c). cluster 1, whose genes were all upregulated in ly6e ∆hsc samples at 5 d p.i. compared to mock infection, is mostly involved in nucleosome organization and chromatin remodeling, together with cluster 3. cluster 2 would also be related to dna packaging complexes, possibly in response to interferon, similarly to the liver networks. instead, in the spleen ko network most genes take part in processes related to the extracellular matrix. in the spleen ko network, four clusters were identified (figure a5d). cluster 1 is related to the activation of an immune response, but also, alongside clusters 2 and 4, to the extracellular matrix, possibly in relation with collagen, highlighting its role in the response to mhv. cluster 3 is involved in protease binding. the dramatic shutdown in the ko network of the nuclear activity observed in the spleen wt network leads to the hypothesis that the chromatin remodeling activity observed could be related to the activation of certain immunoenhancer genes, modulated by gene ly6e. in any case, further experimental validation of these results would provide meaningful insights in the face of potential therapeutic approaches (see appendix a for more details). the exploration of node membership, depending on whether nodes exclusively belonged to the wt or ko networks or, by contrast, were present in both networks, helped to understand the impairment caused by ly6e ∆hsc . in this sense, go enrichment analyses over these three defined categories of the nodes in the liver networks revealed that genes at their intersection are mainly related to cytokine production, leukocyte migration and inflammatory response regulation, in accordance with the phenotype described for mhv infection [89] . however, a differential response to the virus is observed in wt mice compared to ly6e-ablated ones. the nodes exclusively present in the wt liver network are related to processes like the regulation of immune effector processes, leukocyte mediated immunity or the adaptive immune response. these processes, which are found at a relatively high gene ratio, are not represented by nodes exclusively present in the liver ko network. additionally, genes exclusively present in the wt network and the intersection network are upregulated in case samples with respect to controls (figure a6a), which suggests the activation of the previously mentioned biological processes. on the other hand, genes exclusively present in the liver ko network, mostly down-regulated, were found to be associated with catabolism. as for the spleen networks, genotype-wise go enrichment results revealed that the previously-mentioned intense nuclear activity involving protein-dna complexes and nucleosome assembly is mostly due to wt-exclusive genes.
actually, these biological processes could be pinpointing cell replication events. analogously to the liver case, genes that were found exclusively present in the wt network and the intersection network are mostly upregulated, whereas in the case of ko-exclusive genes the upregulation is not that extensive. interestingly, the latter are mostly related to extracellular matrix (ecm) organization, which suggests the relevance of ly6e in these processes. other lymphocyte antigen-6 (ly-6) superfamily members have been related to ecm remodeling processes, such as the urokinase receptor (upar), which participates in the proteolysis of ecm proteins [90] . however, and to the best of our knowledge, the implications of ly6e in the ecm have not been reported. the results presented are in the main consistent with those by pfaender et al. [54] , who observed a loss of genes associated with the type i ifn response, inflammation, antigen presentation, and b cells in infected ly6e ∆hsc mice. genes stat1 and ifit3, selected in their work for their high variation in the absence of ly6e, were identified as hub genes in the networks reconstructed from liver wild type and knockout samples, respectively. it is to be noticed that our approach significantly differs from the one carried out in the original study. in this particular case, we consider that the reconstruction of gcns enables a more comprehensive analysis of the data, potentially finding the key genes involved in the immune response onset and their relationships with other genes. for instance, the transcriptomic differences between liver and spleen upon ly6e ablation become more evident using gcns. altogether, the presented results show the relevance of gene ly6e in the immune response against the infection caused by mhv. the disruption of ly6e significantly reduced the immunogenic response, affecting signaling and cell effectors. these results, combining in vivo and in silico approaches, deepen our understanding of the immune response to viruses at the gene level, which could ultimately assist the development of new therapeutics. for example, based on these results, prospective studies on ly6e agonist therapies could be inspired, with the purpose of enhancing the gene expression level via gene delivery. given the relevance of ly6e in sars-cov-2 according to previous studies [54, 91] , the overall effects of ly6e ablation in hscs upon sars-cov-2 infection, with special interest in lung tissue, might show similarities with the deficient immune response observed in the present work. in this work we have presented an application of co-expression gene networks to analyze the global effects of ly6e ablation in the immune response to mhv coronavirus infection. to do so, the progression of the mhv infection at the genetic level was evaluated in two genetic backgrounds: wild type mice (wt, ly6efl/fl) and ly6e knockout mutant (ko, ly6e ∆hsc ) mice. for these, viral progression was assessed in two different organs, liver and spleen. the proposed reconstruction rationale revealed significant differences between mhv-infected wt and ly6e ∆hsc mice for both organs. in addition, we observed that mhv infection triggers a progressive genetic response of an upregulating nature in both liver and spleen. the results also suggest that the ablation of gene ly6e in hscs caused an impaired genetic response in both organs compared to wt mice. the impact of such ablation is more evident in the liver, consistently with the disease site.
at the same time, the immune response in the spleen, which seemed to be mediated by intense chromatin activity in the normal situation, is replaced by ecm remodeling in ly6e ∆hsc mice. we infer that the presence of ly6e limits the damage in the above-mentioned target sites. we believe that the characterization of these processes could motivate efforts towards novel antiviral approaches. finally, in the light of previous works, we hypothesize that ly6e ablation might show analogous detrimental effects on immunity upon infection caused by other viruses, including sars-cov, mers-cov and sars-cov-2. in future works, we plan to investigate whether the over-expression of ly6e in wt mice has an enhancing effect on immunity. in this direction, ly6e gene mimicking (agonist) therapies could represent a promising approach in the development of new antivirals. the authors declare no conflict of interest.

[figure: heatmaps of the top 500 most variable genes (row z-scores) across liver and spleen samples, annotated by organ, genotype and sample type.]

table a1. number of deg used as input to engnet for network reconstruction and their subsequent distribution in the inferred networks. genes that were not assigned to a cluster (or were comprised in minority clusters) were not taken into consideration for enrichment analyses.

                liver wt   liver ko   spleen wt   spleen ko
input genes     1133       1153       506         426
network genes   1118       1300       485         403
cluster 1       262        284        180         109
cluster 2       218        379        255         190
cluster 3       579        624        36          77
cluster 4       59         -          -           -

figure a7. enrichment analyses based on node exclusiveness of (a) liver and (b) spleen networks. wt refers to nodes exclusively present in networks reconstructed from wt samples; ko refers to nodes exclusively present in networks reconstructed from ly6e ∆hsc samples; both addresses shared nodes between wt and ko networks. gene ratio is defined by the number of genes used as input for the enrichment analyses associated with a particular go term divided by the total number of input genes.

figure a9. cpm-normalized expression values of highly deg identified across (a) liver wt samples, (b) liver ko samples, (c) spleen wt samples and (d) spleen ko samples. dashed lines separate samples from the three groups under study: controls, cases at 3 d p.i. and cases at 5 d p.i. note that sample order within the same group is exchangeable.

the reconstruction method employed in this case study was validated against three other well-known inference methods: aracne [93], wgcna [94] and wto [95]. the output of each reconstruction method (including engnet), using default parameter values, was compared to a gold standard (gs) retrieved from the string database. four different gss were taken into consideration, one for each set of deg identified in the comparison of control vs. case samples at 5 d p.i., as shown in section 4.2. these deg were mapped to string database gene identifiers, selecting mus musculus as model organism (taxid: 10090).
a variable percentage of deg (6-20%) could not be assigned to a string identifier and were thus removed from the analysis. the interactions exclusively involving the resulting deg in each case were retrieved from the string database; these interaction networks served as gss. the mapped deg (without unmapped identifiers) also served as input for the four reconstruction methods to be compared. the aracne networks were inferred using the spearman correlation coefficient, following the implementation in the minet [96] r package. in this case, mutual information values were normalized and scaled in the range 0-1. the wgcna networks were reconstructed following the original tutorial provided by the authors [97], with the soft-thresholding power set to 5. additionally, the wto networks were built using pearson correlation, in accordance with the documentation; absolute values were taken as relationship weights. finally, the engnet networks were inferred using the default parameters described in the original article by gómez-vela et al. [33]. for the comparison, the receiver operating characteristic (roc) curve was estimated using the proc [98] r package. roc curves are shown in figure a10. the area under the roc curve (auc) was also computed in each case for the quantitative comparison of the methods, as shown in figure a11a. the auc compares the reconstruction quality of each method against random prediction: an auc ≈ 1 corresponds to a perfect classifier, whereas an auc ≈ 0.5 corresponds to a random classifier. thus, the higher the auc, the better the predictions. on average, engnet provided the best auc results whilst maintaining a good discovery rate. in addition, engnet provided relatively sparse networks compared to wgcna, as shown in figure a11b. this is considered relevant given that sparseness is a main feature of gene networks [7].
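the gold-standard comparison described above can be sketched in a few lines; the snippet below is a minimal python re-implementation of the validation logic (the original analysis used the proc r package), in which both the predicted edge weights and the gold standard are invented stand-ins for the inferred networks and the string interactions.

```python
# Sketch of the gold-standard ROC/AUC comparison above. The original
# analysis used the pROC R package; scikit-learn plays the same role here,
# and both the predicted weights and the gold standard are invented.
from itertools import combinations
from sklearn.metrics import roc_curve, auc

# hypothetical inferred network: edge -> confidence weight in [0, 1]
predicted = {("ifit3", "stat1"): 0.92, ("irf7", "stat1"): 0.85,
             ("col1a1", "ifit3"): 0.10, ("col1a1", "irf7"): 0.05}

# hypothetical gold standard: STRING interactions among the same genes
gold = {("ifit3", "stat1"), ("irf7", "stat1")}

genes = sorted({g for edge in predicted for g in edge})

# every unordered gene pair is a candidate edge; unpredicted pairs score 0
y_true, y_score = [], []
for pair in combinations(genes, 2):  # pairs come out sorted, matching the keys
    y_score.append(predicted.get(pair, 0.0))
    y_true.append(int(pair in gold))

fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.2f}")  # ~1: perfect recovery, ~0.5: random
```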
hosts and sources of endemic human coronaviruses
identification and characterization of severe acute respiratory syndrome coronavirus replicase proteins
an orally bioavailable broad-spectrum antiviral inhibits sars-cov-2 in human airway epithelial cell cultures and multiple coronaviruses in mice
a first course in systems biology
computational methods for gene regulatory networks reconstruction and analysis: a review
gene network coherence based on prior knowledge using direct and indirect relationships
gene regulatory network inference: data integration in dynamic models-a review
structure optimization for large gene networks based on greedy strategy
comprehensive analysis of the long noncoding rna expression profile and construction of the lncrna-mrna co-expression network in colorectal cancer
a new cytoscape app to rate gene networks biological coherence using gene-gene indirect relationships
evaluation of gene association methods for coexpression network construction and biological knowledge discovery
a comparative study of statistical methods used to identify dependencies between gene expression signals
ranking genome-wide correlation measurements improves microarray and rna-seq based global and targeted co-expression networks
wisdom of crowds for robust gene network inference
comparison of co-expression measures: mutual information, correlation, and model based indices
mider: network inference with mutual information distance and entropy reduction
bioinformatics analysis and identification of potential genes related to pathogenesis of cervical intraepithelial neoplasia
lsd1 activates a lethal prostate cancer gene network independently of its demethylase function
diverse type 2 diabetes genetic risk factors functionally converge in a phenotype-focused gene network
survivin (birc5) cell cycle computational network in human no-tumor hepatitis/cirrhosis and hepatocellular carcinoma transformation
coexpression network analysis in chronic hepatitis b and c hepatic lesions reveals distinct patterns of disease progression to hepatocellular carcinoma
reverse genetics approaches for the development of influenza vaccines
how viral genetic variants and genotypes influence disease and treatment outcome of chronic hepatitis b. time for an individualised approach?
accessory proteins 8b and 8ab of severe acute respiratory syndrome coronavirus suppress the interferon signaling pathway by mediating ubiquitin-dependent rapid degradation of interferon regulatory factor 3
interferon-stimulated genes: a complex web of host defenses
distinct lymphocyte antigens 6 (ly6) family members ly6d, ly6e, ly6k and ly6h drive tumorigenesis and clinical outcome
emerging role of ly6e in virus-host interactions
identification of chicken lymphocyte antigen 6 complex, locus e (ly6e, alias sca2) as a putative marek's disease resistance gene via a virus-host protein interaction screen
polymorphisms in ly6 genes in msq1 encoding susceptibility to mouse adenovirus type 1
interferon-inducible ly6e protein promotes hiv-1 infection
ly6e mediates an evolutionarily conserved enhancement of virus infection by targeting a late entry step
flavivirus internalization is regulated by a size-dependent endocytic pathway
ensemble and greedy approach for the reconstruction of large gene co-expression networks
identification of candidate mirna biomarkers for pancreatic ductal adenocarcinoma by weighted gene co-expression network analysis
a comprehensive analysis on preservation patterns of gene co-expression networks during alzheimer's disease progression
gene co-expression network analysis for identifying modules and functionally enriched pathways in type 1 diabetes
gene co-expression analysis for functional classification and gene-disease predictions
systems analysis reveals complex biological processes during virus infection fate decisions
identifying novel biomarkers of the pediatric influenza infection by weighted co-expression network analysis
comprehensive innate immune profiling of chikungunya virus infection in pediatric cases
linking cell dynamics with gene coexpression networks to characterize key events in chronic virus infections
discovering preservation pattern from co-expression modules in progression of hiv-1 disease: an eigengene based approach
the effect of inhibition of pp1 and tnfα signaling on pathogenesis of sars coronavirus
the regulatory role of microrna-mrna co-expression in hepatitis b virus-associated acute liver failure
sars-cov-2 entry factors are highly expressed in nasal epithelial cells together with innate immune genes
murine hepatitis virus strain 1 produces a clinically relevant model of severe acute respiratory syndrome in a/j mice
the nucleocapsid proteins of mouse hepatitis virus and severe acute respiratory syndrome coronavirus share the same ifn-β antagonizing mechanism: attenuation of pact-mediated rig-i/mda5 activation
murine hepatitis virus nsp14 exoribonuclease activity is required for resistance to innate immunity
the interferon-stimulated gene ifitm3 restricts west nile virus infection and pathogenesis
organization, evolution and functions of the human and mouse ly6/upar family genes
interferon-stimulated gene ly6e enhances entry of diverse rna viruses
chicken interferome: avian interferon-stimulated genes identified by microarray and rna-seq of primary chick embryo fibroblasts treated with a chicken type i interferon (ifn-α)
integrative network biology framework elucidates molecular mechanisms of sars-cov-2 pathogenesis
edger: a bioconductor package for differential expression analysis of digital gene expression data
geoquery: a bridge between the gene expression omnibus (geo) and bioconductor
evaluation of statistical methods for normalization and differential expression in mrna-seq experiments
orchestrating high-throughput genomic analysis with bioconductor
heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences
systems approach identifies tga1 and tga4 transcription factors as important regulatory components of the nitrate response of arabidopsis thaliana roots
computational inference of gene co-expression networks for the identification of lung carcinoma biomarkers: an ensemble approach
step-by-step construction of gene co-expression networks from high-throughput arabidopsis rna sequencing data
limma powers differential expression analyses for rna-sequencing and microarray studies
precision weights unlock linear model analysis tools for rna-seq read counts
false discovery control with p-value weighting
the igraph software package for complex network research
cytoscape 2.8: new features for data integration and network visualization
network biology using cytoscape from within r
gene co-opening network deciphers gene functional relationships
community structure analysis of biological networks
a multi-algorithm clustering plugin for cytoscape
topological analysis and interactive visualization of biological networks and protein structures
selectivity determinants of gpcr-g-protein binding
boxplot-based outlier detection for the location-scale family
outlier detection: how to threshold outlier scores?
gene ontology consortium: going forward
clusterprofiler: an r package for comparing biological themes among gene clusters
systematic and integrative analysis of large gene lists using david bioinformatics resources
the string database in 2017: quality-controlled protein-protein association networks, made broadly accessible
massive-scale gene co-expression network construction and robustness testing using random matrix theory
uncovering biological network function via graphlet degree signatures
viral pathogenesis
structure-guided mutagenesis alters deubiquitinating activity and attenuates pathogenesis of a murine coronavirus
crosstalk of liver immune cells and cell death mechanisms in different murine models of liver injury and its clinical relevance
a disparate subset of double-negative t cells contributes to the outcome of murine fulminant viral hepatitis via effector molecule fibrinogen-like protein 2
structure and function of the immune system in the spleen
progranulin and a five transmembrane domain-containing receptor-like gene are the key components in receptor activator of nuclear factor κb (rank)-dependent formation of multinucleated osteoclasts
rank is essential for osteoclast and lymph node development
autologous intramuscular transplantation of engineered satellite cells induces exosome-mediated systemic expression of fukutin-related protein and rescues disease phenotype in a murine model of limb-girdle muscular dystrophy type 2i
the intriguing role of soluble urokinase receptor in inflammatory diseases
ly6e restricts the entry of human coronaviruses, including the currently pandemic sars-cov-2
aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context
wgcna: an r package for weighted correlation network analysis
wto: an r package for computing weighted topological overlap and a consensus network with integrated visualization tool
minet: a r/bioconductor package for inferring large transcriptional networks using mutual information
a general framework for weighted gene co-expression network analysis
proc: an open-source package for r and s+ to analyze and compare roc curves

key: cord-230294-bjy2ixcj authors: stella, massimo; restocchi, valerio; deyne, simon de title: #lockdown: network-enhanced emotional profiling at the times of covid-19 date: 2020-05-09 journal: nan doi: nan sha: doc_id: 230294 cord_uid: bjy2ixcj

the covid-19 pandemic forced countries all over the world to take unprecedented measures like nationwide lockdowns. to adequately understand the emotional and social repercussions, a large-scale reconstruction of how people perceived these unexpected events is necessary but currently missing. we address this gap through social media by introducing mercurial (multi-layer co-occurrence networks for emotional profiling), a framework which exploits linguistic networks of words and hashtags to reconstruct social discourse describing real-world events. we use mercurial to analyse 101,767 tweets from italy, the first country to react to the covid-19 threat with a nationwide lockdown. the data were collected between 11th and 17th march, immediately after the announcement of the italian lockdown and the who declaring covid-19 a pandemic. our analysis provides unique insights into the psychological burden of this crisis, focussing on: (i) the italian official campaign for self-quarantine (#iorestoacasa), (ii) national lockdown (#italylockdown), and (iii) social denounce (#sciacalli). our exploration unveils evidence for the emergence of complex emotional profiles, where anger and fear (towards political debates and socio-economic repercussions) coexisted with trust, solidarity, and hope (related to the institutions and local communities). we discuss our findings in relation to mental well-being issues and coping mechanisms, like instigation to violence, grieving, and solidarity.
we argue that our framework represents an innovative thermometer of emotional status, a powerful tool for policy makers to quickly gauge feelings in massive audiences and devise appropriate responses based on cognitive data. the stunningly quick spread of the covid-19 pandemic catalysed the attention of worldwide audiences, overwhelming individuals with a deluge of often contrasting content about the severity of the disease, the uncertainty of its transmission mechanisms, and the asperity of the measures taken by most countries to fight it [1, 2, 3, 4]. although these policies have been seen as necessary, they had a tremendous impact on the mental well-being of large populations [5] for a number of reasons. due to lockdowns, many are facing financial uncertainty, having lost or being on the verge of losing their source of income. moreover, there is much concern about the disease itself, and most people fear for their own health and that of their loved ones [6], further fueled by infodemics [2, 3, 1]. finally, additional distress is caused by the inability to maintain a normal life [7]. the extent of the impact of these factors is such that, in countries greatly struck by covid-19 such as china, the population started to develop symptoms of post-traumatic stress disorder [8]. during this time more than ever, people have shared their emotions on social media. these platforms provide an excellent emotional thermometer of the population, and have been widely explored in previous studies investigating how online social dynamics promote or hamper content diffusion [2, 9, 1, 10, 11] and the adoption of specific positive/negative attitudes and behaviours [9, 12, 13]. building on the above evidence, our goal is to draw a comprehensive quantitative picture of people's emotional profiles emerging during the covid-19 crisis, through a cognitive analysis of online social discourse. we achieve this by introducing mercurial (multi-layer co-occurrence networks for emotional profiling), a framework that combines cognitive network science [14, 15, 16] with computational social sciences [9, 12, 2, 17, 18]. before outlining the methods and main contributions of our approach, we briefly review existing research on understanding emotions in social media. much of the research on emotions in social media has been consolidated into two themes. on the one hand, there is the data science approach, which has mostly focused on large-scale positive/negative sentiment detection [9] and recently identified the relevance of tracing more complex affect patterns for understanding social dynamics [19, 20, 21, 16]. on the other hand, cognitive science research makes use of small-scale analysis tools, but explores the observed phenomena in much more detail in the light of its theoretical foundations [22, 23, 24]. specifically, in cognitive science the massive spread of semantic and emotional information through verbal communication represents two long-studied phenomena, known as cognitive contagion [24] and emotional contagion [24, 25, 23], respectively. this research suggests that ideas are composed of a cognitive component and an emotional content, much like viruses contain the genomic information necessary for their replication [1]. both these types of contagion happen when an individual is affected in their behaviour by an idea. emotions elicited by ideas can influence users' behaviour without their awareness, resulting in the emergence of specific behavioural patterns such as implicit biases [23].
unlike pathogen transmission, no direct contact is necessary for cognitive and emotional contagion to take place, since both are driven by information processing and diffusion, as happens through social media [26, 27]. in particular, during large-scale events, ripples of emotions can rapidly spread across information systems [27] and have dramatic effects, as has recently been demonstrated in elections and social movements [28, 12, 25]. at the intersection of data and cognitive science is emotional profiling, a set of techniques which enables the reconstruction of how concepts are emotionally perceived and assembled in user-generated content [17, 18, 9, 19, 15, 29]. emotional profiling conveys information about basic affective dimensions, such as how positive/negative or how arousing a message is, and also includes the analysis of more fine-grained emotions such as fear or trust that might be associated with the lockdown and people's hopes for the future [9, 30, 22]. recently, an emerging line of research has shown that reconstructing the knowledge embedded in messages through social and information network models [31, 14, 32] successfully highlights important phenomena in a number of contexts, ranging from the diffusion of hate speech during massive voting events [12] to reconstructing personality traits from social media [17]. importantly, to reconstruct knowledge embedded in tweets, recent work has successfully merged data science and cognitive science, introducing linguistic networks of co-occurrence relationships between words in sentences [16, 33, 21] and between hashtags in tweets [12]. however, an important shortfall of these works is that these two types of networked knowledge representations were not merged together, thus missing the important information revealed by studying their interdependence. we identify three important contributions that distinguish our paper from previous literature and make a further step towards consolidating cognitive network science [14] as a paradigm suitable for analysing people's emotions. first, we introduce a new framework exploiting the interdependence between hashtags and words, addressing the gap previously discussed. this framework, multi-layer co-occurrence networks for emotional profiling (mercurial), combines both the semantic structure encoded through the co-occurrence of hashtags and the textual message to construct a multi-layer lexical network [34]. this multi-layer network structure allows us to contextualise hashtags and, therefore, improve the analysis of their meaning. importantly, these networks can be used to identify which concepts or words contribute to different emotions and how central they are. second, in contrast to previous work, which largely revolved around english tweets [4, 20], the current study focusses on italian twitter messages. there are several reasons why the emotional response of italians is particularly interesting. specifically, (i) italy was the first western country to experience a vast number of covid-19 clusters; (ii) the italian government was the first to declare a national lockdown; (iii) the italian lockdown was announced on 10th march, one day before the world health organization (who) declared the pandemic status of covid-19. this enables us to address the urgent need of measuring the emotional perceptions and reactions to social distancing, lockdown, and, more generally, the covid-19 pandemic.
third, thanks to mercurial, we obtain richer and more complex emotional profiles that we analyse through the lens of established psychological theories of emotion. this is a fundamental step in going beyond positive/neutral/negative sentiment and providing accurate insights on the mental well-being of a population. to this end, we take into account three of the most trending hashtags, #iorestoacasa (english: "i stay at home"), #sciacalli (english: "jackals"), and #italylockdown, as representative of positive, negative, and neutral social discourse, respectively. we use these hashtags as a starting point to build multi-layer networks of word and hashtag co-occurrence, from which we derive our profiles. our results depict a complex map of emotions, suggesting that there is co-existence and polarisation of conflicting emotional states, importantly fear and trust towards the lockdown and social distancing. the combination of these emotions, further explored through semantic network analysis, indicates mournful submission and acceptance towards the lockdown, perceived as a measure for preventing contagion but with negative implications for the economy. as further evidence of the complexity of the emotional response to the crisis, we also find strong signals of hope and social bonding, mainly in relation to social flash mobs, interpreted here as psychological responses to deal with the distress caused by the threat of the pandemic. the paper is organised as follows. in the methods section we describe the data we used to perform our analysis, and describe mercurial in detail. in the results section we present the emotional profiles obtained from our data, which are then discussed in more detail in the discussion section. finally, the last section highlights the psychological implications of our exploratory investigation and its potential for follow-up monitoring of covid-19 perceptions in synergy with other datasets/approaches. we argue that our findings represent an important first step towards monitoring both mental well-being and emotional responses in real time, offering policy-makers a framework to make timely data-informed decisions. in this section we describe the methodology employed to collect our data and perform the emotional profiling analysis. first, we describe the dataset and how it was retrieved. then, we introduce co-occurrence networks, and specifically our novel method that combines hashtag co-occurrence with word co-occurrence on multi-layer networks. finally, we describe the cognitive science framework we used to perform the emotional profiling analysis on the so-obtained networks. we gathered 101,767 tweets in italian to monitor how online users perceived the covid-19 pandemic and its repercussions in italy. these tweets were gathered by crawling messages containing three trending hashtags of relevance for the covid-19 outbreak in italy, expressing three different sentiment polarities:

• #iorestoacasa (english: "i stay at home"), a positive-sentiment hashtag introduced by the italian government in order to promote a responsible attitude during the lockdown;
• #sciacalli (english: "jackals"), a negative-sentiment hashtag used by online users to denounce unfair behaviour arising during the health emergency;
• #italylockdown, a neutral-sentiment hashtag indicating the application of lockdown measures all over italy.

we refer to #iorestoacasa, #sciacalli and #italylockdown as focal hashtags to distinguish them from other hashtags.
we collected the tweets through complex science consulting (@complexconsult), which was authorised by twitter, and used the serviceconnect crawler implemented in mathematica 11.3. the collection comprises 39,943 tweets for #iorestoacasa, 26,999 for #sciacalli and 34,825 for #italylockdown. retweets of the same text message were not considered. for each tweet, the language was detected. pictures, links, and non-italian content were discarded, and stopwords (i.e. words without intrinsic meaning, such as "di" (english: "of") and "ma" (english: "but")) were removed. other interesting datasets with tweets about covid-19 are available in [35, 20]. word co-occurrence networks have been successfully used to characterise a wide variety of phenomena related to language acquisition and processing [16, 36, 37]. recently, researchers have also used hashtags to investigate various aspects of social discourse. for instance, stella et al. [12] showed that hashtag co-occurrence networks were able to characterise important differences in the social discourses promoted by opposing social groups during the catalan referendum. in this work we introduce mercurial (multi-layer co-occurrence networks for emotional profiling), a framework combining:

• hashtag co-occurrence networks (or hashtag networks) [12]. nodes represent hashtags and links indicate the co-occurrence of any two nodes in the same tweet.
• word co-occurrence networks (or word networks) [16]. nodes represent words and links represent the co-occurrence of any two words one after the other in a tweet from which stop-words have been removed.

we combine these two types of networks in a multi-layer network to exploit the interdependence between hashtags and words. this new, resulting network enables us to contextualise hashtags and capture their real meaning through context, thereby enhancing the accuracy of the emerging emotional profile. to build the multi-layer network, we first build the single hashtag and word layers. for the sake of simplicity, word networks are unweighted and undirected. note that the hashtag network was kept at a distinct level from word networks, e.g. common words were not explicitly linked with hashtags. as reported in figure 1, each co-occurrence link between any two hashtags a and b (#coronavirus and #restiamoacasa in the figure) is relative to a word network, including all words co-occurring in all tweets featuring hashtags a and b. the hashtag and word networks capture the co-occurrence of lexical entities within the structured online social discourse. words possess meaning in language [31] and their network assembly is evidently a linguistic network. similar to words in natural language, hashtags possess linguistic features that express a specific meaning and convey rich affect patterns [12]. the resulting networks capture the meaning of a collection of tweets by identifying which words/hashtags co-occurred together. the knowledge embedded in hashtag networks was used to identify the most relevant or central terms associated with a given collection of thematic tweets. rather than using frequency to indicate centrality, which makes it difficult to compare hashtags that do not co-occur in the same message, the current work relies on distance-based measures to detect how central a hashtag is in the network. the first measure that implements this notion is closeness centrality.
closeness c(i) quantifies how close node i is, in terms of network distance, to all other nodes, and is formalised as follows:

$$c(i) = \frac{N-1}{\sum_{j \neq i} d_{ij}},$$

where $d_{ij}$ is the network distance between i and j, i.e. the smallest number of links connecting nodes i and j, and N is the number of nodes. in co-occurrence networks, nodes (i.e. hashtags and words) with a higher closeness tend to co-occur more often with each other or with other relevant nodes at short network distance. we expect that rankings of closeness centrality will reveal the most central hashtags in the networks for #iorestoacasa, #sciacalli and #italylockdown, in line with previous work in which closeness centrality was used to measure language acquisition and processing [39, 14, 32]. importantly, closeness is a more comprehensive approach than simple frequency analysis. imagine a collection of hashtags a, b, c, d, .... computing the frequency of hashtag a co-occurring with hashtag b is informative about the frequency of the so-called 2-grams "ab" or "ba", but it does not consider how those hashtags co-occur with c, d, etc. in other words, a 2-gram captures the co-occurrence of two specific hashtags within tweets but does not provide the simultaneous structure of co-occurrences of all hashtags across tweets, for which a network of pairwise co-occurrences is required. on such a network, closeness can then highlight hashtags at short distance from all others, i.e. co-occurring in a number of contexts in the featured discourse.

figure 1: top: example of co-occurrence networks for different hashtags: #distantimauniti (english: distant but united) in #iorestoacasa on the left, #incapaci (english: inept) in #sciacalli in the middle, and #futuro (english: future) in #italylockdown on the right. clusters of co-occurring hashtags were obtained through spectral clustering [38]. these clusters highlight the co-occurrence of european-focused content, featuring hashtags like #bce (i.e. european central bank), #lagarde and #spread (i.e. spread between italian and german bonds) together with social distance practices related to #iorestoacasa. bottom: in mercurial, any link in a co-occurrence network of hashtags (left) corresponds to a collection of tweets whose words co-occur according to a word network (right). larger words have a higher closeness centrality.

in addition to closeness, we also use graph distance entropy to measure centrality. this measure captures which hashtags are uniformly closer to all other hashtags in a connected network. combining closeness with graph distance entropy led to successfully identifying words of relevance in conceptual networks with a few hundred nodes [32]. the main idea behind graph distance entropy is that it provides information about the spread of the distribution of network distances between nodes (i.e. shortest paths), a statistical quantity that cannot be extracted from closeness (which is, conversely, a mean inverse distance). considering the set $d^{(i)} \equiv (d_{i1}, ..., d_{ij}, ..., d_{iN})$ of distances between i and any other node j connected to it ($1 \leq j \leq N$) and $m_i = \max(d^{(i)})$, graph distance entropy is defined as:

$$h(i) = -\frac{1}{\log m_i} \sum_{k=1}^{m_i} p_k \log p_k,$$

where $p_k$ is the probability of finding a distance equal to k. therefore, h(i) is a shannon entropy of distances and ranges between 0 and 1. in general, the lower the entropy, the more a node resembles a star centre [34], at equal distances from all other nodes. thus, nodes with a lower h(i) and a higher closeness are more uniformly close to all other connected nodes in a network.
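to make these two metrics concrete, the sketch below builds a toy hashtag co-occurrence network and computes closeness and graph distance entropy for each node; the tweets, the networkx library choice and the helper function are illustrative assumptions, not the authors' mathematica implementation.

```python
# Sketch: build a toy hashtag co-occurrence network and rank nodes by
# closeness centrality and graph distance entropy (tweets are invented;
# the original study used Mathematica 11.3, networkx is our stand-in).
from itertools import combinations
from math import log

import networkx as nx

tweets = [["#iorestoacasa", "#coronavirus", "#andratuttobene"],
          ["#coronavirus", "#covid19"],
          ["#covid19", "#andratuttobene"]]

G = nx.Graph()
for hashtags in tweets:
    G.add_edges_from(combinations(set(hashtags), 2))  # pairwise co-occurrence links

closeness = nx.closeness_centrality(G)  # (N-1) / sum of distances, as in the text

def distance_entropy(G, i):
    """Normalised Shannon entropy of shortest-path distances from node i."""
    dists = [d for j, d in nx.shortest_path_length(G, source=i).items() if j != i]
    m = max(dists)
    if m == 1:  # star centre: every node at distance 1, zero spread
        return 0.0
    p = [dists.count(k) / len(dists) for k in range(1, m + 1)]
    return -sum(q * log(q) for q in p if q > 0) / log(m)

for node in sorted(G):
    print(f"{node:16s} closeness={closeness[node]:.2f} "
          f"entropy={distance_entropy(G, node):.2f}")
```

nodes with a high closeness and low entropy (uniformly close to everything else) are the ones the rankings in table 1 are designed to surface.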
words with simultaneously low graph distance entropy and high closeness were found to be prominent words for early word learning [34] and mindset characterisation [32]. in addition to hashtag networks, we also build word networks obtained from a collection of tweets containing any combination of the focal hashtags #iorestoacasa or #sciacalli and #coronavirus. for all tweets containing a given set of hashtags, we performed the following:

1. subdivide the tweet into sentences and delete all stop-words from each sentence, preserving the original ordering of words;
2. stem all the remaining words, i.e. identify the root or stem composing a given word. in a language such as italian, in which there are many ways of adding suffixes to words, word stemming is essential in order to recognise the same word even when it is inflected for a different gender, number or verb tense. for instance, abbandoneremo (we will abandon) and abbandono (abandon, abandonment) both reduce to the same stem abband;
3. draw links between a stemmed word and the subsequent one, and store the resulting edge list of word co-occurrences;
4. parse the syntactic structure of sentences containing a negation (i.e. "not") in an additional step, in order to identify the target of the negation (e.g. in "this is not peace", the negation refers to "peace"). syntactic dependencies were not used for network construction but intervened in the emotional profiling instead (see below).

the resulting word network also captures syntactic dependencies between words [16] related by online users to a specific hashtag or combination of hashtags. we used closeness centrality to detect the relevance of words for a given hashtag. text pre-processing such as word stemming and syntactic dependency parsing was performed in mathematica 11.3, which was also used to extract networks and compute network metrics. the presence of hashtags in word networks provided a way of linking words, which express common language, with hashtags, which express content but also summarise the topic of a tweet. consequently, by using this new approach, the meaning attributed by users to hashtags can be inferred not only from hashtag co-occurrence but also from word networks. an example of mercurial, featuring hashtag-hashtag and word-word co-occurrences, is reported in figure 1 (bottom). in this example, hashtags #coronavirus and #restiamoacasa co-occurred together (left) in tweets featuring many co-occurring words (right). the resulting word network shows relevant concepts such as "incoraggiamenti" (english: encouragement) and "problemi" (english: problems), highlighting a positive attitude towards facing problems related to the pandemic. more in general, the attribution and reconstruction of such meaning was explored by considering conceptual relevance and emotional profiling in one or several word networks related to a given region of a hashtag co-occurrence network. as a first data source for emotional profiling, this work used valence and arousal data from warriner and colleagues [40], whose combination can reconstruct emotional states according to the well-studied circumplex model of affect [41, 30]. in psycholinguistics, word valence expresses how positively/negatively a concept is perceived (equivalent to sentiment in computer science). the second dimension, arousal, indicates the alertness or lethargy inspired by a concept.
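before moving on to the emotional profiling, here is a minimal sketch of the word-network pipeline (steps 1-3 above); nltk's italian snowball stemmer, the tiny stop-word list and the example tweet are illustrative assumptions rather than the authors' mathematica setup, and step 4 (negation parsing) is omitted for brevity.

```python
# Sketch of the word-network pipeline above (steps 1-3; negation handling,
# step 4, is omitted). The Italian Snowball stemmer and the tiny stop-word
# list are illustrative assumptions, not the authors' Mathematica setup.
import networkx as nx
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("italian")
stopwords = {"a", "e", "di", "ma", "il", "la", "un"}  # toy stop-word list

tweet = "Restiamo a casa e combattiamo il virus. Andrà tutto bene"

G = nx.Graph()
for sentence in tweet.split("."):                     # step 1: split into sentences
    words = [w.lower() for w in sentence.split()
             if w.lower() not in stopwords]           # step 1: drop stop-words
    stems = [stemmer.stem(w) for w in words]          # step 2: stem remaining words
    G.add_edges_from(zip(stems, stems[1:]))           # step 3: link adjacent stems

print(sorted(G.edges()))  # stems such as ('cas', 'combatt') or ('rest', 'cas')
```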
having a high arousal and valence indicates excitement and joy, whereas a negative valence combined with a high arousal can result in anxiety and alarm [30]. finally, some studies also include dominance or potency as a measure of the degree of control experienced [40]. however, for reasons of conciseness, we focus on the two primary dimensions of affect: valence and arousal. going beyond the standard positive/negative/neutral sentiment intensity is of utmost importance for characterising the overall online perception of massive events [9]. beyond the primary affective dimension of sentiment, the affect associated with current events [12] can also be described in terms of arousal [42] and of basic emotions such as fear, disgust, anger, trust, joy, surprise, sadness, and anticipation. these emotions represent basic building blocks of many complex emotional states [23], and they are all self-explanatory except for anticipation, which indicates a projection into future events [18]. whereas fear, disgust, and anger (trust and joy) elicit negative (positive) feedback, surprise, sadness and anticipation have recently been evaluated as neutral emotions, including both positive and negative feedback reactions to events in the external world [43]. to attribute emotions to individual words, we use the nrc lexicon [18] and the circumplex model [30]. these two approaches allow us to quantify the emotional profile of a set of words related to hashtags or combinations of hashtags. the nrc lexicon enlists words eliciting a given emotion. the circumplex model attributes valence and arousal scores to words, which in turn determine their closest emotional states. because datasets of similar size were not available for italian, the data from the nrc lexicon and the warriner norms were translated from english to italian using a forward consensus translation of google translate, microsoft bing and deepl translator, which was successfully used in previous investigations with italian [44]. although the valence of some concepts might change across languages [40], word stemming related several scores to the same stem, e.g. scores for "studio" (english: "study") and "studiare" (english: "to study") were averaged together and the average attributed to the stem "stud". in this way, even if non-systematic cross-language valence shifting introduced inaccuracy in the score for one word (e.g. "studiare"), averaging over other words relative to the same stem reduced the influence of such inaccuracy. no statistically significant difference (α = 0.05) was found between the emotional profiles of 200 italian tweets, including 896 different stems, and their automatic translations into english, holding for each dimension separately (z-scores < 1.96). we then built emotional profiles by considering the distribution of words eliciting a given emotion/valence/arousal and associated with specific hashtags in tweets. assertive tweets with no negation were evaluated directly through a bag-of-words model, i.e. by directly considering the words composing them. tweets including negations underwent an additional intermediate step where words syntactically linked to the negation were substituted with their antonyms [45] and then evaluated. source-target syntactic dependencies were computed in mathematica 11.3, and all words targeted by a negation word (i.e. no, non and nessuno in italian) underwent substitution with their antonyms.
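as an illustration of the dimension-based profiling, the sketch below maps a few words onto circumplex regions from (valence, arousal) coordinates rescaled so that (0, 0) is the neutral centre of the disk; the scores are invented stand-ins, not the actual warriner norms or their italian translations.

```python
# Sketch: place words in the valence/arousal circumplex. The (valence,
# arousal) scores below are invented stand-ins for the translated
# Warriner norms, rescaled so that (0, 0) is the neutral centre.
lexicon = {
    "speranza": (0.55, 0.35),   # hope: positive valence, mild arousal
    "panico":   (-0.60, 0.60),  # panic: negative valence, high arousal
    "serena":   (0.50, -0.40),  # serene: positive valence, low arousal
    "miseria":  (-0.55, -0.30), # misery: negative valence, low arousal
}

def quadrant(valence: float, arousal: float) -> str:
    """Name the circumplex region of a (valence, arousal) pair."""
    if valence >= 0:
        return "excitement/joy" if arousal >= 0 else "calmness/serenity"
    return "anxiety/alarm" if arousal >= 0 else "sadness/lethargy"

for word, (v, a) in lexicon.items():
    print(f"{word:10s} valence={v:+.2f} arousal={a:+.2f} -> {quadrant(v, a)}")
```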
to determine whether the observed emotional intensity r(i) of a given emotion in a set s of words is compatible with random expectation, we performed a statistical z-test using the nrc dataset. emotional intensity is measured here in terms of richness, i.e. the count of words eliciting a given emotion in a given network. as a null model, we used random samples as follows: let m denote the number of words stemmed from s that are also in the nrc dataset. then, m words from the nrc lexicon are sampled uniformly at random and their emotional profile is compared against that of the empirical sample. we repeated this random sampling 1000 times for each observed emotional profile {r(i)}_i. to ensure the resulting profiles are indeed compatible with a gaussian distribution, we performed a kolmogorov-smirnov test (α = 0.05). all the tests we performed gave random distributions of emotional intensities compatible with a gaussian distribution, characterised by a mean random intensity for emotion i, r*(i), and a standard deviation σ*(i). for each emotion, a z-score was computed:

$$z(i) = \frac{r(i) - r^*(i)}{\sigma^*(i)}.$$

in the remainder of the manuscript, every emotional profile incompatible with random expectation is highlighted in black or marked with a check. since we used a two-tailed z-test (with a significance level of 0.05), an emotional richness can be either higher or lower than random expectation.
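a minimal sketch of this null model, with an invented mini-lexicon standing in for the translated nrc data, is given below before we move to the results.

```python
# Sketch of the z-test null model above: the observed richness of, e.g.,
# fear words in a network is compared against 1000 equally sized random
# samples from the lexicon. The mini-lexicon is an invented stand-in
# for the translated NRC data.
import random
import statistics

nrc_fear = {"panico", "caos", "criminale", "crollare", "banda"}
lexicon = sorted(nrc_fear | {"speranza", "serena", "aiutare", "rispetto",
                             "domani", "casa", "musica", "futuro"})

network_words = ["panico", "caos", "speranza", "banda", "criminale"]
r_obs = sum(w in nrc_fear for w in network_words)  # observed fear richness

random.seed(0)
null = [sum(w in nrc_fear for w in random.sample(lexicon, len(network_words)))
        for _ in range(1000)]

z = (r_obs - statistics.mean(null)) / statistics.stdev(null)
print(f"fear richness: observed={r_obs}, z={z:+.2f}")
# |z| > 1.96 -> incompatible with random expectation (two-tailed, alpha = 0.05)
```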
the investigated corpus of tweets represents a complex multilevel system, where conceptual knowledge and emotional perceptions are entwined on a number of levels. tweets are made of text and include words, which convey meaning [31]. from the analysis of word networks, we can obtain information on the organisation of knowledge proper to social media users, which is embedded in the content they generate [16]. however, tweets also convey meaning through the use of hashtags, which can either refer to specific words or point to the overall topic of the whole tweet. both words and hashtags can evoke emotions in different contexts, thus giving rise to complex patterns [17]. similar to words in natural language, the same hashtags can be perceived and used differently by different users, according to the context. the simultaneous presence of word and hashtag co-occurrences in tweets is representative of the knowledge shared by social media users when conveying specific content and ideas. this interconnected representation of knowledge can be exploited by simultaneously considering both hashtag-level and word-level information, since words specify the meaning attributed to hashtags. in this section we use mercurial to analyse the collected data. we do so by characterising the hashtag networks, both in terms of meaning and emotional profiles. precedence is given to hashtags as they not only convey meaning as individual linguistic units but also represent more general topics characterising the online discourse. then, we inter-relate hashtag networks with word networks. finally, we perform the emotional profiling of hashtags in specific contexts. the combination of word and hashtag networks specifies the perceptions embedded by online users around the same entities, e.g. coronavirus, in social discourses coming from different contexts. the largest connected components of the three hashtag networks included: 1000 hashtags and 8923 links for #italylockdown; 720 hashtags and 5915 links for #sciacalli; and 6665 hashtags and 53395 links for #iorestoacasa. all three networks are found to be highly clustered (mean local clustering coefficient [38] of 0.82), with an average distance between any two hashtags of 2.1. only 126 hashtags were present in all three networks. table 1 reports the most central hashtags included in each corpus of tweets thematically revolving around #iorestoacasa, #sciacalli and #italylockdown. the ranking relies on closeness centrality, which here quantifies the tendency for hashtags to co-occur with other hashtags expressing analogous concepts and, therefore, to be at short network distance from each other (see methods). hence, hashtags with a higher closeness centrality represent the prominent concepts in the social discourse. this result is similar to those showing that closeness centrality captures concepts which are relevant for early word acquisition [39] and production [46] in language. additional evidence that closeness can capture semantically central concepts is represented by the closeness ranking, which assigns top-ranked positions to #coronavirus and #covid-19 in all three twitter corpora. this is a consequence of the corpora being about the covid-19 outbreak (and of the network metric being able to capture semantic relevance). in the hashtag network built around #italylockdown, the most central hashtags are relative to the coronavirus, including a mix of negative hashtags such as #pandemia (english: "pandemic") and positive ones such as #italystaystrong. similarly, the hashtag network built around #sciacalli highlighted both positive (#facciamorete, english: "let's network") and negative (#irresponsabili, english: "irresponsible") hashtags. however, the social discourse around #sciacalli also featured prominent hashtags from politics, including references to specific italian politicians, to the italian government, and hashtags expressing protest and shame towards the acts of a prominent italian politician. conversely, the social discourse around #iorestoacasa included many positive hashtags, eliciting hope for a better future and the need to act responsibly (e.g. #andratuttobene, english: "everything will be fine", or #restiamoacasa, english: "let's stay at home"). the most prominent hashtags in each network (cf. table 1) indicate the prevalence of a positive social discourse around #iorestoacasa and the percolation of strong political debate in relation to the negative topics conveyed by #sciacalli. however, we want to extend these punctual observations of negative/positive valences of single hashtags to the overall global networks. to achieve this, we use emotional profiling.
as explained also in the figure caption, each word is endowed with an (x, y) table 1 : top-ranked hashtags in co-occurrence networks based on closeness centrality. higher ranked hashtags co-occurred with more topic-related concepts in the same tweet. in all three rankings, the most central hashtag was the one defining the topic (e.g. #italylockdown) and was omitted from the ranking. (-0.6,+0.6) represents a point of strong negative valence and positive arousal, i.e. alarm. figure 2 reports the emotional profiles of all hashtags featured in co-occurrence networks for #italylockdown (left), #sciacalli (middle) and #iorestoacasa (right). to represent the interquartile range of all words for which valence/arousal rating are available, we use a neutrality range. histograms falling outside of the neutrality range indicate specific emotional states expressed by words included within hashtags (e.g. #pandemia contains the word "pandemia" with negative valence and high arousal). in figure 2 (left, top), the peak of the emotional distribution for hashtags associated with #italylockdown falls within the neutrality range. this finding indicates that hashtags co-occurring with #italylockdown, a neutral hashtag by itself, were also mostly emotionally neutral conceptual entities. despite this main trend, the distribution also features deviations from the peak mostly in the areas of calmness and tranquillity (positive valence, lower arousal) and excitement (positive valence, higher arousal). weaker deviations (closer to the neutrality range) were present also in the area of anxiety. this reconstructed emotional profile indicates that the italian social discourse featuring #italylockdown was mostly calm and quiet, perceiving the lockdown as a positive measure for countering responsibly the covid-19 outbreak. not surprisingly, the social discourse around #sciacalli shows a less prominent positive emotional profile, with a higher probability of featuring hashtags eliciting anxiety, negative valence and increased states of arousal, as it can be seen in figure 2 (center, top) . this polarised emotional profile represents quantitative evidence for the coexistence of mildly positive and strongly negative content within the online discourse labelled by #sciacalli. this is further evidence that the negative hashtag #sciacalli was indeed used by italian users to denounce or raise alarm over the negative implications of the lockdown, especially in relation to politics and politicians' actions. however, the polarisation of political content and debate over social media platforms has been encountered in many other studies [21, 13, 12] and cannot be attributed to the covid-19 outbreak only. finally, figure 2 (top right) shows that positive perception was more prominently reflected in the emotional profile of #iorestoacasa, which was the hashtag massively promoted by the italian government for supporting the introduction of the nationwide lockdown in italy. the emotional figure 2 : emotional profiles of all hashtags featured in co-occurrence networks for #italylockdown (left), #sciacalli (middle) and #iorestoacasa (right). top: circumplex emotional profiling. all hashtags representing one or more words were considered. for each word, valence (x-coordinate) and arousal (y-coordinate) scores were attributed (see methods) resulting in a 2d density histogram (yellow overlay) relative to the probability of finding an hashtag in a given location in the circumplex, the higher the probability the stronger the colour. 
regions with the same probabilities are enclosed in grey lines. a neutrality range indicates where 50% of the words in the underlying valence/arousal dataset would fall, and it thus serves as a reference value for detecting abnormal emotional profiles. distributions falling outside of this range indicate deviations from the median behaviour (i.e. interquartile range, see methods). bottom: nrc-based emotional profiling, detecting how many hashtags inspired a given emotion in a hashtag network. results are normalised over the total number of hashtags in a network. emotions compatible with random expectation are highlighted in gray.

the emotional profile of the 6000 hashtags co-occurring with #iorestoacasa indicates a considerably positive and calm perception of domestic confinement, seen as a positive tool to stay safe and healthy. the prominence of hopeful hashtags in association with #iorestoacasa, as reported in the previous subsection, indicates that many italian twitter users were serene and hopeful about staying at home at the start of the lockdown. hashtag networks were emotionally profiled not only by using the circumplex model (see above) but also by using basic emotional associations taken from the nrc emotion lexicon (figure 2, bottom). across all hashtag networks, we find a statistically significant peak in trust, analogous to the peaks close to emotions of calmness and serenity observed in the circumplex models. however, all the hashtag networks also included negative emotions like anger and fear, which are natural human responses to unknown threats and were observed also with the circumplex representations. the intensity of fearful, alarming and angry emotions is stronger in the #sciacalli hashtag network, which was used by social users to denounce, complain and express alertness about the consequences of the lockdown. in addition to the politically-focused jargon highlighted by closeness centrality alone, by combining closeness with graph distance entropy (see methods and [32]) we identify other topics which are uniformly at short distance from others in the social discourse around #sciacalli, such as: #mascherine (english: "protective masks", which was also ranked high by using closeness only), #amuchina (the most popular brand of, and synonym for, hand sanitiser), and #supermercati (english: "supermarkets"). this result suggests an interesting interpretation of the negative emotions around #sciacalli. besides the inflaming political debate and the fear of the health emergency, a third element emerges: italian twitter users feared and were angry about the raiding and stockpiling of first aid items, symptoms of panic-buying in the wake of the lockdown. the above comparisons indicate consistency between dimension-based (i.e. the circumplex) and emotion-specific emotional profiling. since the latter also offers a more precise categorisation of words into emotions, we will focus on emotion-specific profiling. importantly, to fully understand the emotional profiles outlined above, it is necessary to identify the language expressed in tweets using a given combination of hashtags (see also figure 1, bottom). as the next step of the mercurial analysis, we gather all tweets featuring the focal hashtags #italylockdown, #sciacalli, or #iorestoacasa and any of their co-occurring hashtags, and build the corresponding word networks, as explained in the methods.
closeness centrality over these networks provided the relevance of each single word in the social discourse around the topic identified by a hashtag. only words with closeness higher than the median were reported. figure 3 shows the cloud of words appearing in all tweets that include #sciacalli, displayed according to their nrc emotional profile. similar to the emotional profile extracted from hashtags co-occurring with #sciacalli, the words used in tweets with this hashtag also display a polarised emotional profile with high levels of fear and trust. thanks to the multi-layer analysis, this dichotomy can now be better understood in terms of the individual concepts eliciting it. by using closeness on word networks, we identified concepts such as "competente" (english: "competent"), "continua" (english: "continue", "keep going"), and "comitato" (english: "committee") to be relevant for the trust-sphere. these words convey trust in the expert committees appointed by the italian government to face the pandemic and protect the citizens. we find that other prominent words contributing to make the discourse around #sciacalli trustful are "aiutare" (english: "to help"), "serena" (english: "serene"), "rispetto" (english: "respect") and "verità" (english: "truth"), which further validate a trustful, open-minded and fair perception of the political and emergency debate outlined above. this perception was mixed with negative elements, mainly eliciting fear but also sadness and anger. the jargon of a political debate emerges in the word cloud of fear: "difficoltà" (english: "difficulty"), "criminale" (english: "criminal"), "dannati" (english: "scoundrels"), "crollare" (english: "to break down"), "banda" (english: "gang"), "panico" (english: "panic") and "caos" (english: "chaos"). these words indicate that twitter users felt fear directed at specific targets. a speculative explanation is that exorcising fear can involve finding a scapegoat and then targeting it with anger. the word cloud of this emotion supports the occurrence of such a phenomenon by featuring words like "denuncia" (english: "denouncement"), "colpevoli" (english: "guilty"), "vergogna" (english: "shame"), "combattere" (english: "to fight") and "colpa" (english: "blame"). the above words are reflected also in other emotions like sadness, which features words like "cadere" (english: "to fall") and "miseria" (english: "misery", "out of grace"). these prominent words in the polarised emotional profile of #sciacalli suggest that twitter users feared criminal behaviour, possibly related to unwise political debates or improper stockpiling of supplies (as shown by the hashtag analysis). our findings also suggest that the reaction to such a fearful state, which also projects sadness about negative economic repercussions, was split into a strong, angry denunciation of criminal behaviour and messages of trust in the order promoted by competent organisations and committees. it is interesting to note that, according to ekman's theory of basic emotions [23], a combination of sadness and fear can be symptomatic of desperation, which is a critical emotional state for people in the midst of a pandemic-induced lockdown. the same analysis is reported in figure 4 for the social discourse of #italylockdown (top) and #iorestoacasa (bottom). in agreement with the circumplex profiling, for both #italylockdown and #iorestoacasa the intensity of fear is considerably lower than trust.
however, when investigated in conjunction with words, the overall emotional profile of #italylockdown appears to be more positive, displaying higher trust and joy and lower sadness, than the emotional profile of #iorestoacasa. although the difference is small, this suggests that hashtags alone are not enough to fully characterise the perception of a conceptual unit, and should always be analysed together with the natural language associated with them. figure 3: emotional profile and word cloud of the language used in tweets with #sciacalli. words are organised according to the emotion they evoke. font size is larger for words of higher closeness centrality in the word co-occurrence network relative to the hashtag (see methods). every emotion level incompatible with random expectation is highlighted with a check mark. the trust around #italylockdown comes from concepts like "consigli" (english: "tips", "advice"), "compagna" (english: "companion", "partner"), "chiara" (english: "clear"), "abbracci" (english: "hugs") and "canta" (english: "sing"). these words and the positive emotions they elicit suggest that italian users reacted to the early stages of the lockdown with a pervasive sense of commonality and companionship, reacting to the pandemic with externalisations of positive outlooks for the future, e.g. by playing music on the balconies. interestingly, this positive perception co-existed with a more complex and nuanced one. despite the overall positive reaction, the discourse on #italylockdown also shows fear for the difficult times facing the contagion ("contagi") and the lockdown restrictions ("restrizioni"), and also anger, identifying the current situation as a fierce battle ("battaglia") against the virus. the analysis of anticipation, the emotional state projecting desires and beliefs into the future, shows the emergence of concepts such as "speranza" (english: "hope"), "possibile" (english: "possible") and "domani" (english: "tomorrow"), suggesting a hopeful attitude towards a better future. the social discourse around #iorestoacasa brought to light a similar emotional profile, with slightly higher fear towards being quarantined at home: "quarantena" (english: "quarantine"), "comando" (english: "command", "order"), "emergenza" (english: "emergency"). both surprise and sadness were elicited by the word "confinamento" (english: "confinement"), which was prominently featured in the network structure arising from the tweets we analysed. figure 4: emotional profile and word cloud of the language used in tweets with #italylockdown (top) and #iorestoacasa (bottom). words are organised according to the emotion they evoke. font size is larger for words of higher closeness centrality in the word co-occurrence network relative to the hashtag (see methods). every emotion level incompatible with random expectation is highlighted with a check mark. in summary, the above emotional profiles of hashtags and words from the 101,767 tweets suggest that italians reacted to the lockdown measure with: 1. a fearful denunciation of criminal acts with political nuances and sadness/desperation about negative economic repercussions (from #sciacalli); 2. positive and trustful externalisations of fraternity and affect, combined with hopeful attitudes towards a better future (from #italylockdown and #iorestoacasa); 3. a mournful concern about the psychological weight of being confined at home, inspiring sadness and disgust towards the health emergency (from #iorestoacasa).
in the previous section we showed our findings on how italians perceived the early days of lockdown on social media. but what about their perception of the ultimate cause of such lockdown, covid-19? to better reconstruct the perception of #coronavirus, it is necessary to consider the different contexts where this hashtag occurs. figure 5 displays the reconstruction of the emotional profile of words used in tweets with #coronavirus and either #italylockdown, #sciacalli, or #iorestoacasa. our results suggest that the emotional profiles of language used in these three categories of tweets are different. for example, when considering tweets including #sciacalli, which the previous analysis revealed to be influenced by political and social denunciations of criminal acts, #coronavirus is perceived with a more polarised fear/trust dichotomy. although the trust towards #coronavirus was compatible with random expectation when co-occurring with #sciacalli (z-score: 1.69 < 1.96), #coronavirus was perceived with significantly higher trust when appearing in tweets with #iorestoacasa (z-score: 3.05 > 1.96) and #italylockdown (z-score: 3.51 > 1.96). to reinforce this picture, the intensity of fear towards #coronavirus was statistically significantly lower than random expectation in the discourse of #iorestoacasa (z-score: -2.35 < -1.96) and #italylockdown (z-score: -3.01 < -1.96). this difference is prominently reflected in both the circumplex model (figure 5, right) and the nrc emotional profile (figure 5, left), although in the latter both emotional intensities are compatible with random expectation. these quantitative comparisons provide data-driven evidence that twitter users perceived the same conceptual entity, i.e. covid-19, with higher trust when associating it with concrete means for hampering pathogen diffusion, like lockdown and house confinement, and with higher fear when denouncing the politics and economics behind the pandemic. however, social distancing, lockdown and house confinement clearly do not have only positive sides. rather, as suggested by our analysis, they bear complex emotional profiles, where sadness, anger and fear towards the current situation and future developments have been prominently expressed by italians on social media. this study delved into the massive information flow of italian social media users in reaction to the declaration of the pandemic status of covid-19 by who, and the announcement of the nationwide lockdown by the italian government in the first half of march 2020. we explored the emotional profiles of italians during this period by analysing the social discourse around the official lockdown hashtag promoted by the italian government (#iorestoacasa), together with one of the most trending hashtags of social protest (#sciacalli), and a general hashtag about the lockdown (#italylockdown). the fundamental premise of this work is that social media opens a window on the minds of millions of people [17]. monitoring social discourse on online platforms provides unprecedented opportunities for understanding how different categories of people react to real world events [9, 12, 15].
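the thresholds used above (|z| > 1.96, a two-tailed 5% significance level) can be checked with a resampling null model. the sketch below is illustrative and assumes a `lexicon` dict mapping each word to its set of nrc emotions; the paper's exact null model may differ in its details:

```python
import random

def emotion_zscore(words, lexicon, emotion, n_rand=1000, seed=42):
    """z-score of the observed emotion count against random word samples
    of equal size drawn from the whole lexicon (a simple null model)."""
    rng = random.Random(seed)
    vocab = list(lexicon)
    observed = sum(emotion in lexicon[w] for w in words if w in lexicon)
    counts = []
    for _ in range(n_rand):
        sample = rng.sample(vocab, min(len(words), len(vocab)))
        counts.append(sum(emotion in lexicon[w] for w in sample))
    mean = sum(counts) / n_rand
    sd = (sum((c - mean) ** 2 for c in counts) / n_rand) ** 0.5
    return (observed - mean) / sd if sd > 0 else 0.0

# |z| > 1.96 flags an emotion as incompatible with random expectation
```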
here we introduced a new framework, multi-layer co-occurrence networks for emotional profiling (mercurial), which is based on cognitive network science and allowed us to: (i) quantitatively structure social discourse as a multi-layer network of hashtag-hashtag and word-word co-occurrences in tweets; (ii) identify prominent discourse topics through network metrics backed up by cognitive interpretation [14]; (iii) reconstruct and cross-validate the emotional profile attributed to each hashtag or topic of conversation through the emotion lexicon and the circumplex model of affect from social psychology and cognitive neuroscience [30]. our interdisciplinary framework provides a first step in combining network and cognitive science principles to quantify sentiment for specific topics. our analysis also included extensive robustness checks (e.g. selecting words based on different centrality measures, statistical testing for emotions), further highlighting the potential of the framework. the analysis of concept network centrality identified hashtags of political denunciation and protest against irrational panic buying (e.g. of face masks and hand sanitiser) around #sciacalli but not in the hashtag networks for #italylockdown and #iorestoacasa. our results also suggest that the social discourse around #sciacalli was further characterised by fear, anger, and trust, whose emotional intensity was significantly stronger than random expectation. we also found that the most prominent concepts eliciting these emotions revolve around social denunciation (anger), concern for the collective well-being (fear), and the measures implemented by expert committees and authorities (trust). this interpretation is also supported by plutchik's wheel of emotions [22], according to which combinations of anger, disgust and anticipation can be symptoms of aggressiveness and contempt. however, within plutchik's wheel, trust and fear are not in direct opposition. the polarisation of positive/negative emotions observed around #sciacalli might be a direct consequence of a polarisation of different social users with heterogeneous beliefs, which is a phenomenon present in many social systems [21] but is also strongly present in social media through the creation of echo chambers enforcing specific narratives and discouraging the discussion of opposing views [13, 2, 47, 11, 10]. emotional polarisation might therefore be a symptom of a severe lack of social consensus across italian users in the early stages of the lockdown induced by covid-19. in social psychology, social consensus is a self-built perception that the beliefs, feelings, and actions of others are analogous to one's own [48]. destabilising this perception can have detrimental effects, such as reducing social commitment towards the public good, or even lead to a distorted perception of society, favouring self-distrust and conditions such as social anxiety [48]. instead, acts such as singing together from the balconies can reduce fear and enhance self-trust [42], as well as promote commitment and social bonding [49], which is also an evolutionary response that helps cope with a threat, in this case a pandemic, through social consensus. when interpreted under the lens of social psychology, the flash mobs documented by traditional media and identified here as relevant by semantic network analysis for #italylockdown and #iorestoacasa become important means of facing the distress induced by confinement [48, 42, 49].
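step (i) of mercurial amounts to building one co-occurrence layer per unit type. a minimal sketch, assuming networkx and invented toy tweets (the field names are illustrative): each tweet contributes a clique among its hashtags on one layer and among its words on the other, unweighted and undirected as in the paper:

```python
from itertools import combinations

import networkx as nx

def cooccurrence_layer(token_lists):
    """one layer of the multi-layer network: nodes are tokens, and an edge
    links two tokens whenever they appear in the same tweet."""
    G = nx.Graph()
    for tokens in token_lists:
        G.add_edges_from(combinations(set(tokens), 2))
    return G

tweets = [
    {"hashtags": ["iorestoacasa", "coronavirus"], "words": ["speranza", "domani"]},
    {"hashtags": ["sciacalli", "coronavirus"], "words": ["panico", "vergogna"]},
]
hashtag_layer = cooccurrence_layer(t["hashtags"] for t in tweets)
word_layer = cooccurrence_layer(t["words"] for t in tweets)
print(hashtag_layer.edges, word_layer.edges)
```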
anger and fear permeated not only #sciacalli but were found, to a lesser extent, also in association with other hashtags such as #iorestoacasa and #italylockdown. recent studies (cf. [50]) found that anger and fear can drastically reduce individuals' sense of agency, the subjective experience of being in control of our own actions, linking this behavioural/emotional pattern also to alterations in brain states. in turn, a reduced sense of agency can lead to losing control, potentially committing violent, irrational acts [50]. consequently, the strong signals of anger and fear detected here represent red flags for a building tension manifested by social users, which might contribute to the outbreak of violent acts or end up in serious psychological distress due to lowered self-control. one of the most direct implications of the detected strong signals of fear, anger and sadness is increased violent behaviour. in cognitive psychology, the general aggression model (gam) [51] is a well-studied model for predicting and understanding violent behaviour as the outcome of a variety of factors, including personality, situational context and the personal internal state of emotion and knowledge. according to gam, feeling emotions of anger in a situation of confinement can strongly promote violent behaviour. in italy, the emotions of anger and anxiety we detected through social media are well reflected in the dramatic rise in reported cases of domestic violence. for instance, the anti-violence centers of d.i.re (donne in rete contro la violenza) reported an anomalous increase of +74.5% in the number of women looking for help for domestic violence in march 2020 in italy. hence, monitoring social media can be insightful about potential tensions mediated and discussed by large populations, a topic in need of further research and with prominent practical repercussions for fighting covid-19. as discussed, we found the hashtag #coronavirus to be central across all considered hashtag networks. however, our analysis outlined different emotional nuances of #coronavirus across different networks. in psycholinguistics, contextual valence shifting [52] is a well-known phenomenon whereby the very same conceptual unit can be perceived wildly differently by people according to its context. this phenomenon suggests the importance of considering words in a contextual manner, by comparison to each other, as was done in this study, rather than in isolation. indeed, contexts can change the meaning and emotional perception of many words in language. we showed here that the same connotation shifting phenomenon [52] can also happen for hashtags. online users perceived #coronavirus with stronger intensities of trust and lower fear (than random expectation) when using that hashtag in the context of #iorestoacasa and #italylockdown, but not when associated with #sciacalli. this shifting underlines the importance of considering the contextual information surrounding a hashtag in order to better interpret its nuanced perception. to this aim, cognitive networks represent a powerful tool, providing quantitative metrics (such as graph distance entropy) that are otherwise not applicable with mainstream frequency-based approaches in psycholinguistics. mercurial facilitates a quantitative characterisation of the emotions attributed to hashtags and discourses. nonetheless, it is important to bear in mind that the analysis we conducted relies on some assumptions and limitations.
for instance, following previous work [12], we built unweighted and undirected networks, neglecting information on how many times hashtags co-occurred. including these weights would be important for detecting communities of hashtags, beyond network centrality. notice that including weights would come at the cost of not being able to use graph distance entropy, which is defined over unweighted networks and was successfully used here for exposing the denunciation of panic buying in #sciacalli. another limitation concerns the emotional profiling performed with the nrc lexicon, in which the same word can elicit multiple emotions. since we measured emotional intensity by counting words eliciting a given emotion (plus the negations, see methods), a consequence was the repetition of the same words across the sectors of the above word clouds. building or exploiting additional data about the predominance of a word in a given emotion would enable us to identify words which are peripheral to a given emotion, reduce repetitions and offer even more detailed emotional profiles. recently, forma mentis networks [29, 32] have been introduced as a method to detect the organisation of positive/negative words in the mindsets of different individuals. a similar approach might be followed for emotions in future research. acting upon specific emotions rather than using the circumplex model would also solve another problem, in that the attribution of arousal to individual words is prone to more noise, even in mega-studies, compared to detecting word valence [53]. another limitation is that emotional profiles might fluctuate over time. the insightful results outlined and discussed here were aggregated over a short time window, thus reducing the impact of aggregation itself. future analyses on longer time windows should adopt time-series methods for investigating emotional patterns, addressing key issues like non-stationary tweeting patterns over time and statistical scarcity due to tweet crawling (see also [12]). the current analysis has focused on aggregated tweets, but previous studies have shown both stable individual and intercultural differences in affect [54], especially for dimensions such as arousal. similarly, some emotions are harder to measure than others, which might affect reliability and thus underestimate their contribution. the current approach estimates emotional profiles on the basis of a large set of words, which will reduce some language-specific differences. the collection of currently missing large-scale italian normative datasets for lexical sentiment could further improve the accuracy of the findings. this study approaches the relation between emotions and mental distress mostly from the perspective that the attitudes and emotions of the author are conveyed in the linguistic content. however, the emotion profile might also have implications for readers, as recent research suggests that even just reading words of strong valence/arousal can have deep somatic and visceral effects, e.g. raising heart rate or promoting involuntary muscle tension [55]. furthermore, authors and readers participate in an information network, and quantifying which tweets are liked or retweeted depending on the structure of the social network can provide further insight on their potential impact [12, 21, 10, 4, 56], which calls for future approaches merging social networks, cognitive networks and emotional profiling.
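for readers who want to experiment with the unweighted-network constraint mentioned above, here is one plausible reading of graph distance entropy, sketched with networkx: the shannon entropy of a node's shortest-path distance distribution, normalised by its maximum. consult the cited distance-entropy work [32] for the authoritative definition:

```python
import math
from collections import Counter

import networkx as nx

def distance_entropy(G, node):
    """normalised shannon entropy of the shortest-path distance distribution
    from `node` to every other reachable node (unweighted graph)."""
    lengths = nx.single_source_shortest_path_length(G, node)
    distances = Counter(d for n, d in lengths.items() if n != node)
    total = sum(distances.values())
    probs = [c / total for c in distances.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    # normalise by the entropy of a uniform spread over the distinct distances
    max_entropy = math.log(len(distances)) if len(distances) > 1 else 1.0
    return entropy / max_entropy

G = nx.path_graph(6)
# a central node sees a narrower distance distribution (lower entropy)
print(distance_entropy(G, 2), distance_entropy(G, 0))
```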
finally, understanding the impact of nuanced emotional appraisals would also benefit from investigating how these are related to behavioural and societal outcomes, including the numbers of the contagion (e.g. hospitalisations, death rate, etc.) and compliance with physical distancing [57]. given the massive attention devoted to the covid-19 pandemic by social media, monitoring online discourse can offer an insightful thermometer of how individuals discussed and perceived the pandemic and the subsequent lockdown. our mercurial framework offered quantitative readings of the emotional profiles among italian twitter users during early covid-19 diffusion. the detected emotional signals of political and social denunciation, the trust in local authorities, the fear and anger towards the health and economic repercussions, and the positive initiatives of fraternity all outline a rich picture of emotional reactions from italians. importantly, the psychological interpretation of mercurial's results identified early signals of mental health distress and antisocial behaviour, both linked to violence and relevant for explaining increments in domestic abuse. future research will further explore and consolidate the behavioural implications of online cognitive and emotional profiles, building on the promising significance of our current results. our cognitive network science approach offers decision-makers the prospect of being able to successfully detect global issues and design timely, data-informed policies. especially under a crisis, when time constraints and pressure prevent even the richest and most organised governments from fully understanding the implications of their choices, an ethical and accurate monitoring of online discourses and emotional profiles constitutes a powerful support for facing global threats. m.s. acknowledges daniele quercia, nicola perra and andrea baronchelli for stimulating discussion.
references:
how to fight an infodemic
the covid-19 social media infodemic
assessing the risks of "infodemics" in response to covid-19 epidemics
covid-19 infodemic: more retweets for science-based information on coronavirus than for false information
immediate psychological responses and associated factors during the initial stage of the 2019 coronavirus disease (covid-19) epidemic among the general population in china
world health organization: mental health during covid-19 outbreak
the immediate mental health impacts of the covid-19 pandemic among people with or without quarantine managements
a longitudinal study on the mental health of general population during the covid-19 epidemic in china
quantifying the effect of sentiment on information diffusion in social media
phase transitions in information spreading on structured populations
beating the news using social media: the case study of american idol
bots increase exposure to negative and inflammatory content in online social systems
exposure to opposing views on social media can increase political polarization
cognitive network science: a review of research on cognition through the lens of network representations, processes, and dynamics
text-mining forma mentis networks reconstruct public perception of the stem gender gap in social media
probing the topological properties of complex networks modeling short written texts
our twitter profiles, our selves: predicting personality with twitter
emotions evoked by common words and phrases: using mechanical turk to create an emotion lexicon
semeval-2018 task 1: affect in tweets
measuring emotions in the covid-19 real world worry dataset
a complex network approach to political analysis: application to the brazilian chamber of deputies
the emotions. university press of america
the nature of emotion: fundamental questions
emotional contagion
the ripple effect: emotional contagion and its influence on group behavior
experimental evidence of massive-scale emotional contagion through social networks
the rippling dynamics of valenced messages in naturalistic youth chat
emotions and social movements: twenty years of theory and research
forma mentis networks quantify crucial differences in stem perception between students and experts
the circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology
large-scale network representations of semantics in the mental lexicon
forma mentis networks map how nursing and engineering students enhance their mindsets about innovation and health during professional growth
from topic networks to distributed cognitive maps: zipfian topic universes in the area of volunteered geographic information
distance entropy cartography characterises centrality in complex networks
covid-19 labelled network subgraphs reveal stylistic subtleties in written texts
predicting lexical norms: a comparison between a word association model and text-based word co-occurrence models
modelling early word acquisition through multiplex lexical networks and machine learning
norms of valence, arousal, and dominance for 13,915 english lemmas
a circumplex model of affect
the effects of group singing on mood
affect regulation, mentalization and the development of the self
forma mentis networks reconstruct how italian high schoolers and international stem experts perceive teachers, students, scientists, and school
wordnet: an electronic lexical database
the multiplex structure of the mental lexicon influences picture naming in people with aphasia
recursive patterns in online echo chambers
on the perception of social consensus
the ice-breaker effect: singing mediates fast social bonding
i just lost it! fear and anger reduce the sense of agency: a study using intentional binding
the general aggression model: theoretical extensions to violence
contextual valence shifters
obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words
the relation between valence and arousal in subjective experience varies with personality and culture
somatic and visceral effects of word valence, arousal and concreteness in a continuum lexical space
retweeting for covid-19: consensus building, information sharing, dissent, and lockdown life
the ids of the tweets analysed in this study are available on the open science foundation repository: https://osf.io/jy5kz/.
key: cord-273941-gu6nnv9d authors: chandran, uma; mehendale, neelay; patil, saniya; chaguturu, rathnam; patwardhan, bhushan title: chapter 5 network pharmacology date: 2017-12-31 journal: innovative approaches in drug discovery doi: 10.1016/b978-0-12-801814-9.00005-2 sha: doc_id: 273941 cord_uid: gu6nnv9d
abstract the one-drug/one-target/one-disease approach to drug discovery is presently facing many challenges of safety, efficacy, and sustainability. network biology and polypharmacology approaches gained appreciation recently as methods for omics data integration and multitarget drug development, respectively. the combination of these two approaches created a novel paradigm called network pharmacology (np) that looks at the effect of drugs on both the interactome and the diseasome level. ayurveda, the traditional system of indian medicine, uses intelligent formulations containing multiple ingredients and multiple bioactive compounds; however, the scientific rationale and mechanisms remain largely unexplored. np approaches can serve as a valuable tool for evidence-based ayurveda to understand the medicines' putative actions, indications, and mechanisms. this chapter discusses np and its potential to explore traditional medicine systems to overcome the drug discovery impasse. drug discovery, the process by which new candidate medications are discovered, initially began with the random searching of therapeutic agents from plants, animals, and naturally occurring minerals (burger, 1964). for this, early practitioners depended on the materia medica established by medicine men and priests of that era. this was followed by the origin of classical pharmacology, in which the desirable therapeutic effects of small molecules were tested on intact cells or whole organisms. later, the advent of human genome sequencing revolutionized the drug discovery process, which developed into target-based drug discovery, also known as reverse pharmacology. this relies on the hypothesis that the modulation of the activity of a specific protein will have therapeutic effects. the protein that the drug binds to or interacts with is also referred to as a "target." in this reductionist approach, small molecules from a chemical library are screened for their effect on the target's known or predicted function (hacker et al., 2009). once a small molecule is selected for a particular target, further modifications are carried out at the atomic level to improve the lock-and-key interactions. this one-drug/one-target/one-therapeutic approach was followed for the last several decades. the information technology revolution at the end of the 20th century metamorphosed the drug discovery process as well (clark and pickett, 2000).
advancements in omics technologies during this time were used to develop strategies for different phases of drug research (buriani et al., 2012). computational power was applied in the discovery process for predicting the drug-likeness of newly designed or discovered compounds and for ligand-protein docking to predict the binding affinity of a small molecule to a protein's three-dimensional structure. in silico tools were developed to predict other pharmacological properties of drug molecules such as absorption, distribution, metabolism, excretion, and toxicity, abbreviated together as admet (van de waterbeemd and gifford, 2003; clark and grootenhuis, 2002). these technological advancements steered discovery efforts toward ever more specific magic bullets, a direction completely at odds with the holistic approach of traditional medicine. this magic bullet approach is currently in decline. the major limitations of this drug discovery approach are side effects and the inability to tackle multifactorial diseases, mainly due to the linearity of the approach. during the historical peak of drug discovery and development, natural product-based drugs played a significant role due to their superior chemical diversity and safety over synthetic compound libraries (zimmermann et al., 2007). currently, it is estimated that more than one hundred new, natural product-based leads are in clinical development (harvey, 2008). many active compounds (bioactives) from traditional medicine sources could serve as good starting compounds and scaffolds for rational drug design. natural products normally act through the modulation of multiple targets rather than a single, highly specific target. but in drug discovery and development, technology was used to synthesize highly specific mono-targeted molecules that mimic the bioactives from natural compounds, rather than to understand the rationale behind their synergistic action and develop methods to isolate the bioactives from natural resources. researchers now understand that most diseases are due to the dysfunction of multiple proteins. thus, it is important to address the multiple targets emanating from a syndrome-related metabolic cascade, so that holistic management can be effectively achieved. it is therefore necessary to shift the strategy from one that focuses on a single-target new chemical entity to one of a multiple-target, synergistic formulation-discovery approach. this tempted the research world to go back and extensively explore natural sources, where modern pharmacology had begun. this renewed research focus indicates the need to rediscover the drug discovery process by integrating traditional knowledge with state-of-the-art technologies (patwardhan, 2014a). a new discipline called network pharmacology (np) has emerged which attempts to understand drug actions and interactions with multiple targets (hopkins, 2007). it uses computational power to systematically catalogue the molecular interactions of a drug molecule in a living cell. np has appeared as an important tool for understanding the underlying complex relationships between a botanical formula and the whole body (berger and iyengar, 2009). it also attempts to discover new drug leads and targets and to repurpose existing drug molecules for different therapeutic conditions by allowing an unbiased investigation of potential target spaces (kibble et al., 2015).
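as a small illustration of the drug-likeness side of such in silico screening, the sketch below applies lipinski's rule of five with rdkit, an open-source cheminformatics toolkit; the rule is only a rough oral-absorption heuristic, and the aspirin smiles is just a test input:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def lipinski_pass(smiles):
    """rough drug-likeness screen via lipinski's rule of five."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # unparsable structure
    return (Descriptors.MolWt(mol) <= 500
            and Crippen.MolLogP(mol) <= 5
            and Descriptors.NumHDonors(mol) <= 5
            and Descriptors.NumHAcceptors(mol) <= 10)

print(lipinski_pass("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
```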
however, these efforts require some guidance for selecting the right types of targets and new scaffolds of drug molecules. traditional knowledge can play a vital role in this process of formulation discovery and in repurposing existing drugs. by combining advances in systems biology and np, it might be possible to rationally design the next generation of promiscuous drugs (cho et al., 2012; hopkins, 2008; ellingson et al., 2014). np analysis not only opens up new therapeutic options, but also aims to improve the safety and efficacy of existing medications. the postgenomic era witnessed a rapid development of computational biology techniques to analyze and explore existing biological data. the key aim of postgenomic biomedical research was to systematically catalogue all molecules and their interactions within a living cell. it is essential to understand how these molecules and the interactions among them determine the function of this immensely complex machinery, both in isolation and when surrounded by other cells. this led to the emergence and advancement of network biology, which indicates that cellular networks are governed by universal laws and offers a new conceptual framework that could potentially revolutionize our view of biology and disease pathologies in the 21st century (barabási and oltvai, 2004). during the first decade of the 21st century, several approaches for biological network construction were put forward that used computational methods, and literature mining especially, to understand the relation between disease phenotypes and genotypes. as a consequence, lmma (literature mining and microarray analysis), a novel approach to reconstructing gene networks by combining literature mining and microarray analysis, was proposed (li et al., 2006; huang and li, 2010). with this, a global network was first derived using the literature-based co-occurrence method and then refined using microarray data. the lmma biological network approach enables researchers to keep themselves up to date with relevant literature on specialized biological topics and to make sense of relevant large-scale microarray datasets. lmma also serves as a useful tool for constructing specific biological networks and for experimental design. lmma-like representations enable systemic recognition of specific diseases in the context of complex gene interactions and are helpful for studying the regulation of various complex biological, physiological, and pathological systems. the significance of accumulated-data integration was appreciated by pharmacologists, and they began to look beyond the classic lock-and-key concept as a far more intricate picture of drug action became clear in the postgenomic era. the global mapping of pharmacological space uncovered promiscuity, the specific binding of a chemical to more than one target (paolini et al., 2006). just as there can be multiple keys for a single lock, a single key can fit into multiple locks. similarly, a ligand might interact with many targets, and a target may accommodate different types of ligands. this is referred to as "polypharmacology." the concept of network biology was used to integrate data from drugbank (re and valentini, 2013) and omim (hamosh et al., 2005), an online catalog of human genes and genetic disorders, to understand industry trends and the properties of drug targets, and to study how drug targets are related to disease-gene products.
in this way, when the first drug-target network was constructed, isolated drug-target pairs were expected on the basis of the prevailing one-drug/one-target/one-disease approach. instead, the authors observed a rich network of polypharmacology interactions between drugs and their targets (yildirim et al., 2007). an overabundance of "follow-on" drugs, that is, drugs that target already targeted proteins, was observed. this suggested a need to upgrade the single-target single-drug paradigm, as single-protein single-function relations are too limited to accurately describe the reality of cellular processes. advances in systems biology led to the realization that complex diseases cannot be effectively treated by intervention at single proteins. this made drug researchers accept the concept of polypharmacology, which they had previously regarded as an undesirable property that needed to be removed or reduced to produce clean drugs acting on single targets. according to network biology, simultaneous modulation of multiple targets is required for modifying phenotypes. developing methods to aid polypharmacology can help to improve efficacy and predict unwanted off-target effects. hopkins (hopkins, 2007, 2008) observed that network biology and polypharmacology can illuminate the understanding of drug action. he introduced the term "network pharmacology." this distinctive new approach to drug discovery can enable the paradigm shift from highly specific magic bullet-based drug discovery to multitargeted drug discovery. np has the potential to provide new treatments for multigenic complex diseases and can lead to the development of e-therapeutics, where the ligand formulation can be customized for each complex indication under every disease type. this can be expanded in the future and lead to customized and personalized therapeutics. integration of network biology and polypharmacology can tackle two major sources of attrition in drug development: efficacy and toxicity. also, this integration holds the promise of expanding the current opportunity space for druggable targets. hopkins proposed np as the next paradigm in drug discovery, with polypharmacology expanding the target space of the drug discovery approach. hopkins suggested three strategies to the designers of multitarget therapies: the first was to prescribe multiple individual medications as a multidrug combination cocktail. patient compliance and the danger of drug-drug interactions would be the expected drawbacks of this method. the second proposition was the development of multicomponent drug formulations. the change in metabolism, bioavailability, and pharmacokinetics of the formulation, as well as safety, would be the major concerns of this approach. the third strategy was to design a single compound with selective polypharmacology. according to hopkins, the third method is advantageous, as it would ease dosing studies. also, the regulatory barriers for a single compound are fewer compared to a formulation. an excellent example of this is metformin, the first-line drug for type ii diabetes that has been found to have cancer-inhibiting properties (leung et al., 2013). the following years witnessed applied research on np integrating network biology and polypharmacology.
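structurally, the drug-target network just described is a bipartite graph, and both promiscuity and the "follow-on drug" pattern can be read off node degrees. a minimal sketch with networkx and invented toy pairs (not drugbank data):

```python
import networkx as nx

# toy drug-target pairs in the spirit of the drugbank-based network
pairs = [
    ("drug_a", "EGFR"), ("drug_a", "ERBB2"),
    ("drug_b", "EGFR"), ("drug_c", "PTGS1"), ("drug_c", "PTGS2"),
]
B = nx.Graph(pairs)

drugs = {d for d, _ in pairs}
targets = {t for _, t in pairs}

# polypharmacology: drugs binding more than one target
promiscuous = [d for d in drugs if B.degree(d) > 1]
# "follow-on" pattern: targets already hit by several drugs
crowded_targets = [t for t in targets if B.degree(t) > 1]
print(promiscuous, crowded_targets)
```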
a computational framework based on a regression model that integrates human protein-protein interactions, disease phenotype similarities, and known gene-phenotype associations to capture the complex relationships between phenotypes and genotypes has been proposed. this was based on the assumption that phenotypically similar diseases are caused by functionally related genes. a tool named cipher (correlating protein interaction network and phenotype network to predict disease genes) has been developed that predicts and prioritizes disease-causing genes (wu et al., 2008). cipher helps to uncover known disease genes and predict novel susceptibility candidates. another application of this study is to predict a human disease landscape that can be exploited to study how related genes for related phenotypes cluster together in a molecular interaction network. this will facilitate the discovery of disease genes and help to analyze the cooperativity among genes. later, cipher-hit, a hitting-time-based method to measure global closeness between two nodes of a heterogeneous network, was developed (yao et al., 2011). a phenotype-genotype network can be explored using this method for detecting the genes related to a particular phenotype. network-based gene clustering and extension were used to identify responsive gene modules in a condition-specific gene network, aiming to provide useful resources to understand physiological responses (gu et al., 2010). np was also used to develop mirna-based biomarkers (lu et al., 2011). for this, a network of mirnas and their targets was constructed and further refined to study the data for specific diseases. this process, integrated with literature mining, was useful to develop potent mirna markers for diseases. np was also used to develop a drug-gene-disease comodule (zhao and li, 2012). initially, a drug-disease network was constructed from information gathered from databases, followed by the integration of gene data. gene closeness was studied by developing a mathematical model. this network inferred the association of multiple genes with most of the diseases and the target sharing of drugs and diseases. these kinds of networks give insight into new drug-disease associations and their molecular connections. during the progression period of network biology, natural products were gaining importance in the chemical space of drug discovery, as these have been economically designed and synthesized by nature over the course of evolution (wetzel et al., 2011). researchers began analyzing the logic behind traditional medicine systems and devised computational ways to ease the analysis. a comprehensive herbal medicine information system was developed that integrates information on more than 200 anticancer herbal recipes that have been used for the treatment of different types of cancer in the clinic, 900 individual ingredients, and 8500 small organic molecules isolated from herbal medicines (fang et al., 2005). this system, which was developed using an oracle database and internet technology, facilitates and promotes scientific research in herbal medicine. this was followed by the development of many databases that serve as sources of botanical information and as powerful tools that provide a bridge between traditional medicines and modern molecular biology. these kinds of databases and tools led researchers to conceive the idea of np of botanicals and their formulations to understand the underlying mechanisms of traditional medicines. we refer to such networks as "ethnopharmacological networks" and the technique as "network ethnopharmacology (nep)" (patwardhan and chandran, 2015).
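the core intuition behind cipher, correlating a disease's phenotype-similarity profile with a gene's network-proximity profile, can be sketched as follows. this is a deliberately simplified reading with invented toy inputs, assuming networkx and scipy; the published tool uses a specific regression formulation, so treat this purely as an illustration:

```python
import networkx as nx
from scipy.stats import pearsonr

def cipher_like_score(ppi, gene, phenotype, pheno_sim, known_genes):
    """correlate phenotype similarity with the gene's closeness to each
    phenotype's known gene set; a higher correlation suggests causality."""
    lengths = nx.single_source_shortest_path_length(ppi, gene)
    sims, proximities = [], []
    for p, genes in known_genes.items():
        sims.append(pheno_sim[phenotype][p])
        d = min(lengths.get(g, float("inf")) for g in genes)
        proximities.append(0.0 if d == float("inf") else 1.0 / (1.0 + d))
    r, _ = pearsonr(sims, proximities)
    return r

ppi = nx.Graph([("BRCA1", "TP53"), ("TP53", "MDM2"), ("MDM2", "EGFR")])
pheno_sim = {"P1": {"P1": 1.0, "P2": 0.6}}
known_genes = {"P1": {"BRCA1"}, "P2": {"EGFR"}}
print(cipher_like_score(ppi, "TP53", "P1", pheno_sim, known_genes))
```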
shao li pioneered this endeavor and proposed such networks as a tool to explain zheng (the syndrome concept of traditional chinese medicine (tcm)) and the multiple-target mechanisms of tcm (li, 2007). li et al. tried to provide a molecular basis for the 1000-year-old concept of zheng using a neuro-endocrine-immune (nei) network. zheng is the basic unit and key concept in tcm theory. it is also used as a guideline in disease classification in tcm. hot (re zheng in mandarin) and cold (han zheng) are the two statuses of zheng, which therapeutically direct the use of herbs in tcm. chinese herbs are classified as cooling, used to remedy hot zheng, or warming, used to remedy cold zheng. according to the authors, hormones may be related to hot zheng, immune factors may be related to cold zheng, and they may be interconnected by neurotransmitters. this study provides a methodical approach to understanding tcm within the framework of modern science. later they reconstructed the nei network by adding multilayer information, including data available in the kegg database related to signal transduction, metabolic pathways, protein-protein interactions, transcription factors, and micro rna regulation. they also connected drugs and diseases through multilayered interactions. the study of cold zheng emphasized its relation to energy metabolism, which is tightly correlated with the genes of neurotransmitters, hormones, and cytokines in the nei interaction network (ma et al., 2010). another database, tcmgenedit, provides information about tcms, genes, diseases, tcm effects, and tcm ingredients mined from a vast amount of biomedical literature. this would facilitate clinical research and elucidate the possible therapeutic mechanisms of tcms and gene regulation (fang et al., 2008). to study the combination rule of tcm formulae, an herb network was created using 3865 collaterals-related formulae. they developed a distance-based mutual-information model (dmim) to uncover the combination rule. dmim uses mutual-information entropy and "between-herb distance" to measure the tendency of two herbs to form an herb pair. they experimentally evaluated the combination of a few herbs for angiogenesis. understanding the combination rule of herbs in formulae will help the modernization of traditional medicine and also help to develop new formulae based on current requirements. a network target-based paradigm was proposed for the first time to understand synergistic combinations, and an algorithm termed "nims" (network target-based identification of multicomponent synergy) was also developed. this was a step that facilitated the development of multicomponent therapeutics using traditional wisdom. an innovative way to study the molecular mechanism of tcm was proposed during this time by integrating tcm experimental data with microarray gene expression data (wen et al., 2011). as a demonstrative example, the si-wu-tang formula was studied. rather than uncovering the molecular mechanism of action, this method would help to identify new health benefits of tcms. the initial years of the second decade of the 21st century witnessed the network ethnopharmacological exploration of tcm formulations. the scope of this new area attracted scientists, and they hoped nep could provide insight into multicompound drug discoveries that could help overcome the current impasse in drug discovery (patwardhan, 2014b).
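the mutual-information half of dmim is easy to illustrate; the "between-herb distance" term is omitted here. a minimal sketch over invented toy formulae, each represented as a set of herb names:

```python
import math

def herb_pair_pmi(formulae, herb_a, herb_b):
    """pointwise mutual information of two herbs co-occurring across
    a corpus of formulae; positive values suggest a recurring herb pair."""
    n = len(formulae)
    p_a = sum(herb_a in f for f in formulae) / n
    p_b = sum(herb_b in f for f in formulae) / n
    p_ab = sum(herb_a in f and herb_b in f for f in formulae) / n
    if 0.0 in (p_a, p_b, p_ab):
        return float("-inf")  # the pair (or an herb) never occurs
    return math.log(p_ab / (p_a * p_b))

formulae = [
    {"danggui", "chuanxiong"},
    {"danggui", "chuanxiong", "honghua"},
    {"honghua", "taoren"},
]
print(herb_pair_pmi(formulae, "danggui", "chuanxiong"))  # > 0: strong pair
```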
nep was used to study the antiinflammatory mechanism of qingfei xiaoyan, a tcm. the predicted results were used to design experiments and analyze the data. experimental confirmation of the predicted results provides an effective strategy for the study of traditional medicines. the potential of tcm formulations as multicompound drug candidates has been studied using formulation-based np. tcm formulations studied in this way are listed in table 5.1, an excerpt of which follows:
ejiao slurry: regulates cancer cell differentiation, growth, proliferation, and apoptosis, and shows an adjuvant therapeutic effect that enriches the blood and increases immunity (xu et al., 2014b).
xiao-chaihu decoction (xchd) and da-chaihu decoction (dchd): xchd treats diseases accompanied by symptoms of alternating fever and chills, no desire for food or drink, and dry pharynx, while dchd treats those with symptoms of fullness, pain in the abdomen, and constipation.
dragon's blood: used in colitis; acts through interaction with 26 putative targets (xu et al., 2014a).
construction of a database containing 197,201 natural product structures, followed by their docking to 332 target proteins of fda-approved drugs, shows the extent of chemical space shared between natural products and fda drugs (gu et al., 2013a). the molecular-docking technique plays a major role in np. the interaction of bioactives with molecular targets can be analyzed by this technique. molecular docking-based nep can be a useful tool to computationally elucidate the combinatorial effects of traditional medicine in intervening on disease networks (gu et al., 2013c). an approach that combines np and pharmacokinetics has been proposed to study the material basis of tcm formulations (pei et al., 2013). this can be extrapolated to study other traditional medicine formulations as well. in cancer research, numerous natural products have been demonstrated to have anticancer potential. natural products are gaining attention in anticancer research, as they show a favorable profile in terms of absorption and metabolism in the body with low toxicity. in one study, all of the known bioactives were docked for their ability to interact with 104 cancer targets (luo et al., 2014). it was inferred that many bioactives target multiple protein targets and thus are linked to many types of cancers. np coupled to sophisticated spectroscopic analysis, such as ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry (uplc-esi-ms/ms), is a useful approach to study the absolute molecular mechanism of action of botanical formulations based on their constituent bioactives (xu et al., 2014a). bioactive-target analysis has shown that some botanical formulations are more effective than their corresponding marketed drug-target interactions. this indicates the potential of np to better understand the power of botanical formulations and to develop efficient and economical treatment options. the holistic approach of botanical formulations can be better explained by np. a study has reported this property by exemplifying a tcm formulation against viral infectious disease. not only does the formulation target the proteins in the viral infection cycle, but it also regulates the proteins of the host defense system; thus, it acts in a very distinctive manner. this unique property of formulations is highly efficient for strengthening broad and nonspecific antipathogenic actions.
thus, network-based multitarget drugs can be developed by testing the efficacy of a formulation, identifying and isolating the major bioactives, and redeveloping a multicomponent therapeutic using the major bioactives based on synergism (leung et al., 2013). np also serves to document and analyze the clinical prescriptions of traditional medicine practitioners. a traditional medicine network that links bioactives to clinical symptoms through targets and diseases is a novel way to explore the basic principles of traditional medicines (luo et al., 2015). network-based approaches provide a systematic platform for the study of multicomponent traditional medicine and have applications for its beneficial modernization. this platform not only recovers traditional knowledge, but also provides new findings that can be used for resolving current problems in the drug industry. this section explains a handful of ethnopharmacological networks that were developed to understand the scientific rationale of traditional medicine. dragon's blood (db) tablets, which are made of resins from dracaena spp., daemonorops spp., croton spp., and pterocarpus spp., are an effective tcm for the treatment of colitis. in one study, an np-based approach was adopted to provide new insights relating to the active constituents and molecular mechanisms underlying the effects of db (xu et al., 2014a). the constituent chemicals of the formulation were identified using an ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry method. the known targets of the 48 identified compounds were mined from the literature, and putative targets were predicted with the help of computational tools. the compounds were further screened for bioavailability, followed by systematic analysis of the known and putative targets for colitis. the network evaluation revealed the mechanism of action of db bioactives for colitis through the modulation of proteins of the nod-like receptor signaling pathway (fig. 5.1). the antioxidant mechanism of zhi-zi-da-huang decoction as an approach to treat alcoholic liver disease was elucidated using np (an and feng, 2015). an endothelial cell proliferation assay was performed for an antiangiogenic alkaloid, sinomenine, to validate the network target-based identification of multicomponent synergy (nims) predictions. the study was aimed at evaluating the synergistic relationship between different pairs of therapeutics, and sinomenine was found to have a maximum inhibition rate with matrine, both in the network and in in vitro studies. the discovery of bioactives and elucidation of the mechanism of action of the herbal formulae qing-luo-yin and the liu-wei-di-huang pill using np has given insight into the design of validation experiments, accelerating the process of drug discovery. validation experiments based on the network findings regarding cold zheng and hot zheng in a rat model of collagen-induced arthritis showed that the cold zheng-oriented herbs tend to affect the hub nodes in the cold zheng network, and the hot zheng-oriented herbs tend to affect the hub nodes in the hot zheng network.
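the db evaluation above is essentially a traversal from compounds through targets to a pathway. a minimal networkx sketch; the compound-target edges below are invented for illustration (nlrp3 and casp1 are genuine members of the nod-like receptor signaling pathway, but the links to the named compounds are hypothetical):

```python
import networkx as nx

G = nx.DiGraph()
# hypothetical compound -> target edges, plus target -> pathway memberships
G.add_edges_from([
    ("loureirin_b", "NLRP3"), ("resveratrol_dimer", "CASP1"),
    ("NLRP3", "NOD-like receptor signaling"),
    ("CASP1", "NOD-like receptor signaling"),
])

# constituents converging on one pathway hint at a shared mechanism
pathway = "NOD-like receptor signaling"
compounds = {c for t in G.predecessors(pathway) for c in G.predecessors(t)}
print(compounds)
```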
np was also used to explain the addition and subtraction theory of tcm; two decoctions, xiao chaihu and da chaihu, were studied using an np approach to investigate this theory. according to the addition and subtraction theory, the addition or removal of one or more ingredients from a traditional formulation results in a modified formula, which plays a vital role in individualized medicine. compounds from additive herbs were observed to be more efficient on disease-associated targets (fig. 5.2). these additive compounds were found to act on 93 diseases through 65 drug targets (li et al., 2014a). experimental verification of the antithrombotic network of fufang xueshuantong (fxst) capsule was done through in vivo studies on a lipopolysaccharide-induced disseminated intravascular coagulation (dic) rat model. it was successfully shown that fxst significantly improves the activation of the coagulation system through 41 targets from four herbs (sheng et al., 2014). np analysis of the bushenhuoxue formula showed that six components, including rhein, tanshinone iia, curcumin, quercetin, and calycosin, acted through 62 targets for the treatment of chronic kidney disease. these predictions were validated using unilateral ureteral obstruction models, and it was observed that even though the individual botanicals showed a significant decrease in creatinine levels, the combination showed lower blood creatinine and urea nitrogen levels (shi et al., 2014). the antidiabetic effects of ge-gen-qin-lian decoction were investigated using an insulin secretion assay and an insulin-resistance model; 13 of the 19 ingredients examined showed antidiabetic activity in np studies (li et al., 2014b). to confirm the predictions of the network of the liu-wei-di-huang pill, four proteins-pparg, rara, ccr2, and esr1-that denote different functions and are targeted by different groups of ingredients were chosen. the interactions between various bioactives and their effect on the expression of these proteins showed that the np approach can accurately predict such interactions, giving hints regarding the mechanism of action of the compounds (liang et al., 2014). experimental results confirmed that the 30 core ingredients in modified simiaowan, obtained through network analysis, significantly increased huvec viability, attenuated the expression of icam-1, and proved to be effective in gout treatment (zhao et al., 2015). the role of anthraquinones and flavanols (catechin and epicatechin) in the therapeutic potential of rhubarb in renal interstitial fibrosis was examined using network analysis and by conventional assessment involving serum biochemistry, histopathological, and immunohistochemical assays (xiang et al., 2015). in silico analysis and experimental validation demonstrated that compounds 11/12 of fructus schisandrae chinensis target gba3/shbg. np is a valuable method to study the synergistic effects of bioactives of a traditional medicine formulation. this was experimentally shown on the sendeng-4 formulation for rheumatoid arthritis (fig. 5.3). data and network analysis have shown that the formulation acts synergistically through nine categories of targets (zi and yu, 2015). another network that studied three botanicals, salviae miltiorrhizae, ligusticum chuanxiong, and panax notoginseng, for coronary artery disease (cad) displayed their mode of action through 67 targets, out of which 13 are common among the botanicals (fig. 5.4). these common targets are associated with thrombosis, dyslipidemia, vasoconstriction, and inflammation. this gives insight into how these botanicals manage cad.
another approach using np is the construction of networks based on experimental data followed by literature mining. this method is very effective for large-scale data analysis and helps to derive the mechanism of action of a formulation. a network of the qishenyiqi formulation, which has cardioprotective effects, constructed based on microarray data and the published literature, showed that 9 main compounds act through 16 pathways, out of which 9 are immune- and inflammation-related (li et al., 2014c). the mechanism of action of the bushen zhuanggu formulation was proposed based on lc-ms/ms standardization, pharmacokinetic analysis, and np (pei et al., 2013). the efficacy of shenmai injection was evaluated using a rat model of myocardial infarction and a genome-wide transcriptomic experiment, followed by an np analysis. the overall trends in the ejection fraction and fractional shortening were consistent with the network-recovery index (nri) from the network. in order to develop an ethnopharmacological network, exploring the existing databases to gather information regarding bioactives and targets is the first step. further information, such as target-related diseases, tissue distribution, and pathways, is also to be collected depending on the type of study that is to be undertaken. the universal natural products database (unpd) (gu et al., 2013a) is one of the major databases that provides bioactives information. other databases that provide information regarding bioactives include cvdhd (gu et al., 2013b), tcmsp (ru et al., 2014), tcm@taiwan (sanderson, 2011), supernatural (banerjee et al., 2015), and dr. duke's phytochemical and ethnobotanical database (duke and beckstrom-sternberg, 1994). the molecular structures of bioactives are usually stored as "sd" files, and chemical information as smiles and inchikeys, in these databases. any of these file formats can be used as input to identify the targets in protein information databases. bindingdb (liu et al., 2007) and chembl (bento et al., 2014) are databases for predicting target proteins. bindingdb searches for the exact or similar compounds in the database and retrieves the target information for those compounds. the similarity search returns structurally similar compounds, scored by their degree of similarity to the queried structure. the information regarding both annotated and predicted targets can be collected in this way. this database is connected to numerous other databases, and these connections can be used to extract further information regarding the targets. the important databases linked to bindingdb are uniprot (bairoch et al., 2005), which gives information related to proteins and genes; reactome, a curated pathway database (croft et al., 2011); and the kyoto encyclopedia of genes and genomes (kegg), a knowledge base for the systematic analysis of gene functions and pathways (ogata et al., 1999). the therapeutic targets database (ttd) (zhu et al., 2012) gives fully referenced information on the diseases targeted by each protein, their pathway information, and the corresponding drugs directed at each target. disease and gene annotation (dga), a database that provides comprehensive and integrative annotation of human genes in disease networks, is useful in identifying the disease type that each indication belongs to (peng et al., 2013).
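the "similar compound" lookup just described rests on fingerprint similarity. a minimal rdkit sketch; the two smiles strings (gallic acid and ethyl gallate, two simple polyphenols) are illustrative test inputs rather than actual database queries:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b):
    """morgan-fingerprint tanimoto similarity, the kind of score a
    structure-similarity search reports for each retrieved compound."""
    fps = []
    for smi in (smiles_a, smiles_b):
        mol = Chem.MolFromSmiles(smi)
        fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

gallic_acid = "O=C(O)c1cc(O)c(O)c(O)c1"
ethyl_gallate = "CCOC(=O)c1cc(O)c(O)c(O)c1"
print(tanimoto(gallic_acid, ethyl_gallate))
```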
the human protein atlas (hpa) database (pontén et al., 2011) is an open database showing the spatial distribution of proteins in 44 different normal human tissues. information on the distribution of proteins in tissues can be gathered from hpa. the database also gives information regarding subcellular localization and protein class. an overall review of the methods to implement np for herbs and herbal formulations is also available, including a systematic review of the databases that one could use for the same (kibble et al., 2015; lagunin et al., 2014). integration of knowledge bases helps data gathering for network pharmacological studies; the inter-relationships among these knowledge bases are shown in fig. 5.5. the counts of entities, such as bioactives, targets, and diseases, can vary based on the knowledge bases that are relied on for data collection. an integration of knowledge bases can overcome this limitation. another factor that affects the counts of these entities is the time frame of data collection; this change occurs due to the ongoing, periodic updates of the databases. a network is a schematic representation of the interactions among various entities called nodes. in pharmacological networks, the nodes include bioactives, targets, tissues, tissue types, diseases, disease types, and pathways. these nodes are connected by lines termed edges, which represent the relationships between them (morris et al., 2012). building a network involves two opposite approaches: a bottom-up approach on the basis of established biological knowledge and a top-down approach starting with the statistical analysis of available data. at a more detailed level, there are several ways to build and illustrate a biological network. perhaps the most versatile and general way is the de novo assembly of a network from direct experimental or computational interactions, e.g., chemical/gene/protein screens. networks encompassing biologically relevant nodes (genes, proteins, metabolites), their connections (biochemical and regulatory), and modules (pathways and functional units) give an authentic idea of the real biological phenomena (xu and qu, 2011). cytoscape, a java-based open source software platform (shannon et al., 2003), is a useful tool for visualizing molecular interaction networks and integrating them with any type of attribute data. in addition to the basic set of features for data integration, analysis, and visualization, additional features are available in the form of apps, including network and molecular profiling analysis and links with other databases. in addition to cytoscape, a number of visualization tools are available. visual network pharmacology (vnp), which is specially designed to visualize the complex relationships among diseases, targets, and drugs, mainly contains three functional modules: drug-centric, target-centric, and disease-centric vnp. this disease-target-drug database documents known connections among diseases, targets, and us fda-approved drugs. users can search the database using disease, target, or drug name strings; chemical structures and substructures; or protein sequence similarity, and then obtain an online interactive network view of the retrieved records. in the obtained network view, each node is a disease, target, or drug, and each edge is a known connection between two of them.
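as a small illustration of such de novo assembly, the sketch below builds a bipartite bioactive-target graph with the open-source networkx library, ranks bioactives by how many targets they hit, and exports the graph in graphml format, which cytoscape can import. the edge list here is an illustrative assumption; in practice it would come from the database queries described earlier.

```python
# minimal sketch of de novo network assembly as described above: nodes are
# bioactives and targets, edges are known/predicted interactions. the edge
# list is illustrative only; real edges would come from bindingdb/chembl.
import networkx as nx

edges = [  # (bioactive, target) pairs -- assumed for demonstration
    ("quercetin", "bace1"), ("quercetin", "f2"), ("quercetin", "serpine1"),
    ("ellagic acid", "rad51"), ("ellagic acid", "f2"),
    ("kaempferol", "bace1"),
]

G = nx.Graph()
for compound, target in edges:
    G.add_node(compound, kind="bioactive")
    G.add_node(target, kind="target")
    G.add_edge(compound, target)

# rank bioactives by how many targets they hit (multitargeting property)
bioactives = [n for n, d in G.nodes(data=True) if d["kind"] == "bioactive"]
for n in sorted(bioactives, key=G.degree, reverse=True):
    print(n, "->", G.degree(n), "targets")

# export for visualization in cytoscape, which reads graphml natively
nx.write_graphml(G, "bioactive_target_network.graphml")
```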
the connectivity map, or cmap, tool allows the user to compare gene-expression profiles. the similarities or differences between the signature transcriptional expression profile and the small molecule transcriptional response profile may lead to the discovery of the mode of action of the small molecule. the response profile is also compared to the response profiles of drugs in the cmap database with respect to the similarity of transcriptional responses. a network is constructed, and the drugs that appear closest to the small molecule are selected to gain better insight into the mode of action. other software, such as gephi, an exploration platform for networks and complex systems, and cell illustrator, a java-based tool specialized in biological processes and systems, can also be used for building networks. ayurveda, the indian traditional medicine, offers many sophisticated formulations that have been used for hundreds of years. the traditional knowledge digital library (tkdl, http://www.tkdl.res.in) contains more than 36,000 classical ayurveda formulations. approximately 100 of these are popularly used at the community level and also as over-the-counter products. some of these drugs continue to be used as home remedies for preventive and primary health care in india. until recently, no research was carried out to explore ayurvedic wisdom using np, despite ayurveda holding a body of traditional medical knowledge equal to or greater than that of tcm. our group examined the use of np to study ayurvedic formulations, with the well-known ayurvedic formulation triphala as a demonstrative example (chandran et al., 2015a, b). in this chapter, we demonstrate the application of np in understanding and exploring traditional wisdom with triphala as a model. triphala is one of the most popular and widely used ayurvedic formulations. triphala contains fruits of three myrobalans: emblica officinalis (eo; amalaki), also known as phyllanthus emblica; terminalia bellerica (tb; vibhitaka); and terminalia chebula (tc; haritaki). triphala is the drug of choice for the treatment of several diseases, especially metabolic, dental, and skin conditions, and for the treatment of cancer (baliga, 2010). it has a very good effect on the health of the heart, skin, and eyes, and helps to delay degenerative changes such as cataracts (gupta et al., 2010). triphala can be used as an inexpensive and nontoxic natural product for the prevention and treatment of diseases where vascular endothelial growth factor a-induced angiogenesis is involved. the presence of numerous polyphenolic compounds empowers it with a broad antimicrobial spectrum (sharma, 2015). triphala is a constituent of about 1500 ayurveda formulations, and it can be used for several diseases. triphala combats degenerative and metabolic disorders possibly through lipid peroxide inhibition and free radical scavenging (sabu and kuttan, 2002). in a phase i clinical trial on healthy volunteers, immunostimulatory effects of triphala on cytotoxic t cells and natural killer cells have been reported (phetkate et al., 2012). triphala has been shown to induce apoptosis in tumor cells of the human pancreas, in both in vitro and in vivo models (shi et al., 2008). although the anticancer properties of triphala have been studied, the exact mechanism of action is still not known. the beneficial role of triphala in disease management of proliferative vitreoretinopathy has also been reported (sivasankar et al., 2015). one of the key ingredients of triphala is amalaki.
some studies have already shown the beneficial effect of amalaki rasayana in suppressing neurodegeneration in fly models of huntington's and alzheimer's diseases (dwivedi et al., 2012, 2013). triphala is an effective medicine to balance all three doshas. it is considered a good rejuvenator rasayana, which facilitates nourishment to all tissues, or dhatus. here we demonstrate the multidimensional properties of triphala using human proteome, diseasome, and microbial proteome targeting networks. the botanicals of triphala (eo, tb, and tc) contain 114, 25, and 63 bioactives, respectively, according to unpd data collected during june 2015. of these, a few bioactives are common among the three botanicals; thus, the triphala formulation as a whole contains 177 unique bioactives. out of these, 36 bioactives were score-1, based on a bindingdb search carried out during june 2015. eo, tb, and tc contain 20, 4, and 20 score-1 bioactives, respectively (fig. 5.6). the score-1 bioactives that are common among the three plants are chebulanin, ellagic acid, gallussaeure (gallic acid), 1,6-digalloyl-beta-d-glucopiranoside, methyl gallate, and tannic acid. this bioactive information is the basic step toward constructing human proteome and microbial proteome targeting networks. the 36 score-1 bioactives of triphala are shown to interact with 60 human protein targets in 112 combinations (fig. 5.7). quercetin, ellagic acid, 1,2,3,4,6-pentagalloylglucose, and 1,2,3,6-tetrakis-(o-galloyl)-beta-d-glucose are the four bioactives that interact with the largest numbers of targets (21, 16, and 7 targets among them). the other major bioactives that have a multitargeting property include catechin; epicatechin; gallocatechin; kaempferol; and trans-3,3',4',5,7-pentahydroxylflavane. the major protein targets of triphala include alkaline phosphatase (alpl); carbonic anhydrase 7 (ca7); coagulation factor x (f10); dna repair protein rad51 homolog 1 (rad51); gstm1 protein (gstm1); beta-secretase 1 (bace1); plasminogen activator inhibitor 1 (serpine1); prothrombin (f2); regulators of g-protein signaling (rgs) 4, 7, and 8; tissue-type plasminogen activator (plat); and tyrosine-protein phosphatase non-receptor type 2 (ptpn2). the 60 targets of triphala are associated with 24 disease types, which include 130 disease indications (fig. 5.8). the major disease types with which triphala targets are associated include cancers, cardiovascular diseases, nervous system diseases, and metabolic diseases. analysis of existing data indicates that targets of triphala bioactives are involved in 40 different types of cancers, making cancer the largest disease group involving triphala targets. this linkage is through the interaction of 25 bioactives and 27 target proteins in 46 different bioactive-target combinations. the types of cancers networked by triphala include pancreatic, prostate, breast, lung, colorectal, and gastric cancers, tumors, and more. quercetin, ellagic acid, prodelphinidin a1, and 1,2,3-benzenetriol are the important bioactives, and rad51, bace1, f2, mmp2, igf1r, and egfr are the important targets that play a role in cancer.
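the counts above follow simple set algebra: because some bioactives occur in more than one fruit, the number of unique bioactives in the formulation (177) is smaller than the plain sum of the per-botanical counts (114 + 25 + 63 = 202). the toy sets below illustrate the computation; the member names are stand-ins, and the real lists would be pulled from unpd.

```python
# illustrative set algebra behind the bioactive counts reported above:
# per-botanical lists overlap, so the union is smaller than the plain sum.
# these tiny sets are stand-ins for the real 114/25/63-member lists.
eo = {"quercetin", "ellagic acid", "methyl gallate", "tannic acid"}
tb = {"ellagic acid", "methyl gallate", "bellericagenin a1"}
tc = {"chebulanin", "ellagic acid", "methyl gallate", "tannic acid"}

formulation = eo | tb | tc    # unique bioactives in the whole formulation
shared_by_all = eo & tb & tc  # bioactives common to all three fruits

print(f"sum of per-botanical counts: {len(eo) + len(tb) + len(tc)}")
print(f"unique bioactives in formulation: {len(formulation)}")
print(f"common to all three botanicals: {sorted(shared_by_all)}")
```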
triphala shows links to 18 indications of cardiovascular diseases through 12 bioactives and 11 targets. the cardiovascular diseases covered in the triphala network include atherosclerosis, myocardial ischemia, infarction, cerebral vasospasm, thrombosis, and hypertension. the bioactives playing a major role in cardiovascular diseases are quercetin, 1,2,3,4,6-pentagalloylglucose, 1,2,3,6-tetrakis-(o-galloyl)-beta-d-glucose, bellericagenin a1, and prodelphinidin a1, whereas the targets playing an important role are serpine1, f10, f2, and fabp4. triphala's network to nervous system disorders contains 13 diseases, of which the significant ones are alzheimer's disease, parkinson's disease, diabetic neuropathy, and retinopathy. in this subnetwork, 14 bioactives interact with 11 targets through 21 different interactions. quercetin, 1,2,3,4,6-pentagalloylglucose, 1,2,3,6-tetrakis-(o-galloyl)-beta-d-glucose, and epigallocatechin-3-gallate are the most networked bioactives, whereas the most networked targets are bace1, serpine1, plat, aldr, and ca2. the association of triphala with metabolic disorders is determined by six bioactives that interact with seven targets. the major metabolic diseases in this link are obesity, diabetic complications, noninsulin-dependent diabetes, hypercholesterolemia, hyperlipidemia, and more. the bioactives having more interactions with targets are ellagic acid, quercetin, and bellericagenin a1, whereas the highly networked targets are igf1r, fabp5, aldr, and akr1b1. triphala bioactives are also linked to targets of other diseases comprising autoimmune diseases, ulcerative colitis, mccune-albright syndrome, psoriasis, gout, osteoarthritis, endometriosis, lung fibrosis, glomerulonephritis, and more. the proteome-targeting network of triphala thus shows its ability to synergistically modulate 60 targets that are associated with 130 disease indications. these data were generated from the available information, which covered only about one-fifth of the total number of bioactives. further logical analysis and experimental studies based on the network results are needed to explore the in-depth mechanism of action of triphala. for researchers in this area, these kinds of networks can give an immense amount of information that can be developed further to reveal the real mystery behind the actions of traditional medicine. triphala is also referred to as a "tridoshic rasayana," as it balances the three constitutional elements of life. it tonifies the gastrointestinal tract, improves digestion, and is known to exhibit antiviral, antibacterial, antifungal, and antiallergic properties (sharma, 2015; amala and jeyaraj, 2014; sumathi and parvathi, 2010). triphala mashi (mashi: black ash) was found to have nonspecific antimicrobial activity, as it showed a dose-dependent inhibition of gram-positive and gram-negative bacteria (biradar et al., 2008). hydroalcoholic, aqueous, and ether extracts of the three fruits of triphala were reported to show antibacterial activity against uropathogens, with the maximum efficacy recorded for the alcoholic extract (bag et al., 2013; prasad et al., 2009). the methanolic extract of triphala showed the presence of 10 active compounds by gc-ms and also showed potent antibacterial and antifungal activity (amala and jeyaraj, 2014). triphala has been well studied for its antimicrobial activity against gram-positive bacteria, gram-negative bacteria, fungal species, and different strains of salmonella typhi (amala and jeyaraj, 2014; sumathi and parvathi, 2010; gautam et al., 2012; srikumar et al., 2007).
triphala showed significant antimicrobial activity against enterococcus faecalis and streptococcus mutans grown on tooth substrate, thereby making it a suitable agent for the prevention of dental plaque (prabhakar et al., 2010, 2014). the application of triphala in commercial antimicrobial agents has been explored. a significant reduction in the colony-forming units of oral streptococci was observed after 6% triphala was incorporated in a mouthwash formulation (srinagesh et al., 2012). an ointment prepared from triphala (10% (w/w)) showed significant antibacterial and wound healing activity in rats infected with staphylococcus aureus, pseudomonas aeruginosa, and streptococcus pyogenes (kumar et al., 2008). the antiinfective network of triphala sheds light on the efficacy of the formulation in the simultaneous targeting of multiple microorganisms. also, this network provides information regarding some novel bioactive-target combinations that can be explored to combat the problem of multidrug resistance. among the bioactives of triphala, 24 score-1 bioactives target microbial proteins of 22 microorganisms. the botanicals of triphala (eo, tb, and tc) contain 19, 3, and 8 score-1 bioactives, respectively, which showed interactions with microbial proteins. they act through the modulation of 35 targets, which are associated with diseases such as leishmaniasis, malaria, tuberculosis, hepatitis c, acquired immunodeficiency syndrome (aids), cervical cancer, candidiasis, luminous vibriosis, yersiniosis, skin and respiratory infections, severe acute respiratory syndrome (sars), avian viral infection, bacteremia, sleeping sickness, and anthrax (fig. 5.9). the microorganisms captured in the triphala antiinfective network include candida albicans, hepatitis c virus, human immunodeficiency virus 1, human papillomavirus type 16, human sars coronavirus, leishmania amazonensis, mycobacterium tuberculosis, staphylococcus aureus, plasmodium falciparum, and yersinia enterocolitica. in mycobacterium tuberculosis, dtdp-4-dehydrorhamnose 3,5-epimerase (rmlc) is one of the four enzymes involved in the synthesis of dtdp-l-rhamnose, a precursor of l-rhamnose (giraud et al., 2000). the network shows that triphala has the potential to modulate this protein through four bioactives: punicalins, terflavin b, 4-o-(s)-flavogallonyl-6-o-galloyl-beta-d-glucopyranose, and 4,6-o-(s,s)-gallagyl-alpha/beta-d-glucopyranose. research on new therapeutics that target the mycobacterial cell wall is in progress. rhamnosyl residues play a structural role in the mycobacterial cell wall by acting as a linker connecting the arabinogalactan polymer to peptidoglycan, and they are not found in humans, which gives them a degree of therapeutic potential (ma et al., 2001). triphala can be considered in this line to develop novel antimycobacterial drugs. the network shows the potential of gallussaeure and 3-galloylgallic acid to modulate human immunodeficiency virus type 1 reverse transcriptase. inhibition of human immunodeficiency virus at the initial stage is crucial; thus, targeting human immunodeficiency virus type 1 reverse transcriptase at the preinitiation stage is considered to be an effective therapy. (figure 5.9: the microbial proteome-targeting network of triphala. dark green v-shaped nodes are the botanicals of triphala, and green oval nodes are the score-1 bioactives; targets, diseases, and microorganisms are represented by blue diamond, red triangle, and pink octagon nodes, respectively.)
protein e6 of human papillomavirus 16 (hpv16) prevents apoptosis of infected cells by binding to fadd and caspase 8 and is hence targeted for the development of antiviral drugs (yuan et al., 2012). kaempferol of triphala is found to target protein e6 of hpv16, which is a potential mechanism to control the replication of the virus. the network also shows triphala's potential to act on plasmodium falciparum. enoyl-acyl carrier protein reductase (enr) has been investigated as an attractive target due to its important role in membrane construction and energy production in plasmodium falciparum (nicola et al., 2007), while the parasite interacts with human erythrocyte spectrin and other membrane proteins through protein m18 aspartyl aminopeptidase (lauterbach and coetzer, 2008). trans-3,3',4',5,7-pentahydroxylflavane, epigallocatechin, and epicatechin can modulate both, while epigallocatechin 3-gallate can regulate enoyl-acyl carrier protein reductase, and quercetin and vanillic acid can act on m18 aspartyl aminopeptidase. epigallocatechin 3-gallate can also target 3-oxoacyl-acyl-carrier protein reductase, which is a potent therapeutic target because of its role in the type ii fatty acid synthase pathway of plasmodium falciparum (karmodiya and surolia, 2006). epigallocatechin 3-gallate and quercetin are the bioactives that have shown the maximum number of antimicrobial target interactions. epigallocatechin 3-gallate shows interactions with 3-oxoacyl-(acyl-carrier protein) reductase, cpg dna methylase, enoyl-acyl-carrier protein reductase, glucose-6-phosphate 1-dehydrogenase, hepatitis c virus serine protease ns3/ns4a, and yoph of plasmodium falciparum, saccharomyces cerevisiae, and spiroplasma monobiae; quercetin acts on 3c-like proteinase (3cl-pro), arginase, beta-lactamase ampc, glutathione reductase, m18 aspartyl aminopeptidase, malate dehydrogenase, and the tyrosine-protein kinase transforming protein fps of escherichia coli, fujinami sarcoma virus, human sars coronavirus (sars-cov), leishmania amazonensis, plasmodium falciparum, saccharomyces cerevisiae, and thermus thermophilus. np has gained impetus as a novel paradigm for drug discovery. this approach using in silico data is fast becoming popular due to its cost efficiency and comparably good predictability. thus, network analysis has various applications and promising future prospects with regard to the process of drug discovery and development. np has proven to be a boon for drug research and helps in the revival of traditional knowledge. however, there are a few limitations of using np for nep studies of traditional medicine, which would hopefully be resolved in the future. the major limitations and possible solutions are listed:
1. nep currently relies on various databases for literature and bioactive mining. databases, though curated, may show discrepancies due to the numerous sources of information and of theoretical and experimental data. moreover, botanicals that undergo certain preparatory procedures during the formulation of the medicine may have constituents that have chemically changed due to procedures like boiling, acid/alkali reactions, and interactions between the bioactives. a way to navigate around this problem is to make use of modern, high-throughput chemical identification techniques like ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry (uplc-esi-ms/ms). this technique will help to identify the exact bioactives or chemical constituents of the formulation and will enrich the subsequent network analysis; this matters because the bioactives form the foundation of any traditional medicine network.
2. absorption, distribution, metabolism, excretion, and toxic effects (admet) parameters associated with the bioactives/formulation when administered in the form of the medicine need to be considered in order to extrapolate in silico and cheminformatics data to in vitro and in vivo models. in silico tools that offer the prediction of these parameters can be relied on for this (a minimal sketch of such in silico screening is given at the end of this chapter). but traditional medicines are generally accompanied by a vehicle for delivery of the medicine. these vehicles, normally various solvents (water, milk, lemon juice, butter, ghee (clarified butter), honey) that alter the solubility of the bioactives, play a role in regulating admet parameters. experimental validation studies are required to evaluate this principle of traditional medicine.
3. target identification usually relies on a single or a few databases due to the limited availability of databases with free access. this can occasionally give incomplete results. also, there may be novel targets waiting to be discovered that could be a part of the mechanism of action of the bioactives. to deal with this discrepancy in the network, multiple databases should be considered for target identification. integration of databases serving similar functions can also be a solution to this problem. in addition, experimental validation of the target molecules using protein-protein interaction studies or gene expression studies will provide concrete testimony to the network predictions.
4. a number of traditional medicines act through multiple bioactives and targets. synergy in botanical drugs helps to balance out the extreme pharmacological effects that individual bioactives may have. the interactions of bioactives with various target proteins, their absorption into the body after possible enzyme degradation, their transport, and finally their physiological effect are a crucial part of traditional medicine (gilbert and alves, 2003). however, in vitro assays and in silico tools are unable to give a clear idea of the complete and exact interactions in a living organism. np is only the cardinal step toward understanding the mechanism of bioactives/formulations. but it gives an overview of the action of traditional medicine, which can be used to design in vivo experiments and clinical trials. this saves time and the cost of research and invention.
5. it is observed that formulations work by simultaneous modulation of multiple targets. this modulation includes activation of some targets and inhibition of others. in order to understand this complex synergistic activity of a formulation, investigative studies regarding the interactions of ligands with targets are to be carried out. this can be achieved by implementing high-throughput omics studies based on the network data.
network pharmacological analysis presents an immense scope for exploring traditional knowledge to find solutions for the current problems challenging the drug discovery industry. nep can also play a key role in new drug discovery, drug repurposing, and rational formulation discovery. many of the bioactive-target combinations have been experimentally studied. the data synthesis using np provides information regarding the mode of action of traditional medicine formulations, based on their constituent bioactives. this is a kind of reverse approach to deduce the molecular mechanism of action of formulations using modern, integrated technologies. the current network analysis is based on the studies that have been conducted and the literature that is available. hence, the data is inconclusive, as a number of studies are still underway and novel data is being generated continuously. despite its limitations, this still is a favorable approach, as it gives insight into the hidden knowledge of our ancient traditional medicine wisdom. np aids the logical analysis of this wisdom, which can be utilized to understand the knowledge as well as to invent novel solutions for current pharmacological problems.
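the following is the minimal sketch referred to in limitation 2: a first-pass, rule-of-five drug-likeness screen computed with the open-source rdkit toolkit. the smiles strings are assumed structures for two triphala-related bioactives, and descriptor filters like these are only a crude stand-in for dedicated admet prediction tools.

```python
# a minimal sketch of the in silico screening mentioned in limitation 2:
# lipinski's rule-of-five descriptors computed with rdkit. the smiles are
# illustrative assumptions; full admet prediction needs dedicated tools.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

compounds = {  # assumed smiles for two triphala-related bioactives
    "gallic acid": "O=C(O)c1cc(O)c(O)c(O)c1",
    "methyl gallate": "O=C(OC)c1cc(O)c(O)c(O)c1",
}

for name, smiles in compounds.items():
    mol = Chem.MolFromSmiles(smiles)
    mw = Descriptors.MolWt(mol)        # molecular weight
    logp = Descriptors.MolLogP(mol)    # octanol/water partition estimate
    hbd = Lipinski.NumHDonors(mol)     # hydrogen-bond donors
    hba = Lipinski.NumHAcceptors(mol)  # hydrogen-bond acceptors
    passes = mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10
    print(f"{name}: mw={mw:.1f} logp={logp:.2f} hbd={hbd} hba={hba} "
          f"rule-of-five {'pass' if passes else 'fail'}")
```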
references
determination of antibacterial, antifungal, bioactive constituents of triphala by ft-ir and gc-ms analysis
antibacterial potential of hydroalcoholic extracts of triphala components against multidrug-resistant uropathogenic bacteria - a preliminary report
triphala, ayurvedic formulation for treating and preventing cancer: a review
super natural ii - a database of natural products
network biology: understanding the cell's functional organization
the chembl bioactivity database: an update
network analyses in systems pharmacology
exploring of antimicrobial activity of triphala mashi - an ayurvedic formulation. evidence-based complement approaches to drug discovery
omic techniques in systems biology approaches to traditional chinese medicine research: present and future
network pharmacology: an emerging technique for natural product drug discovery and scientific research on ayurveda
network pharmacology of ayurveda formulation triphala with special reference to anti-cancer property
molecular mechanism research on simultaneous therapy of brain and heart based on data mining and network analysis
anti-inflammatory mechanism of qingfei xiaoyan wan studied with network pharmacology
chapter 5: network biology approach to complex diseases
progress in computational methods for the prediction of admet properties
reactome: a database of reactions, pathways and biological processes
mechanism study on preventive and curative effects of buyang huanwu decoction in qi deficiency and blood stasis diseases based on network analysis
an analysis of chemical ingredients network of chinese herbal formulae for the treatment of coronary heart disease
in vivo effects of traditional ayurvedic formulations in drosophila melanogaster model relate with therapeutic applications
ayurvedic amalaki rasayana and rasa-sindoor suppress neurodegeneration in fly models of huntington's and alzheimer's diseases
tcmgenedit: a database for associated traditional chinese medicine, gene and disease information using text mining
antifungal potential of triphala churna ingredients against aspergillus species associated with them during storage
rmlc, the third enzyme of dtdp-l-rhamnose pathway, is a new class of epimerase
identification of responsive gene modules by network-based gene clustering and extending: application to inflammation and angiogenesis
use of natural products as chemical library for drug discovery and network pharmacology
cvdhd: a cardiovascular disease herbal database for drug discovery and network pharmacology
understanding traditional chinese medicine antiinflammatory herbal formulae by simulating their regulatory functions in the human arachidonic acid metabolic network
evaluation of anticataract potential of triphala in selenite-induced cataract: in vitro and in vivo studies
pharmacology: principles and practice
online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders
network pharmacology: the next paradigm in drug discovery
vnp: interactive visual network pharmacology of diseases, targets, and drugs
detection of characteristic sub pathway network for angiogenesis based on the comprehensive pathway network
analyses of co-operative transitions in plasmodium falciparum beta-ketoacyl acyl carrier protein reductase upon co-factor and acyl carrier protein binding
network pharmacology applications to map the unexplored target space and therapeutic potential of natural products
triphala promotes healing of infected full-thickness dermal wound
chemo- and bioinformatics resources for in silico drug discovery from medicinal plants beyond their traditional use: a critical review
network-based drug discovery by integrating systems biology and computational technologies
systems pharmacology-based approach for dissecting the addition and subtraction theory of traditional chinese medicine: an example using xiao-chaihu-decoction and da-chaihu-decoction
a network pharmacology approach to determine active compounds and action mechanisms of ge-gen-qin-lian decoction for treatment of type 2 diabetes
traditional chinese medicine-based network pharmacology could lead to new multicompound drug discovery
framework and practice of network-based studies for chinese herbal formula
traditional chinese medicine network pharmacology: theory, methodology and application
constructing biological networks through combined literature mining and microarray analysis: a lmma approach
understanding zheng in traditional chinese medicine in the context of neuro-endocrine-immune network
herb network construction and comodule analysis for uncovering the combination rule of traditional chinese herbal formulae
network target for screening synergistic drug combinations with application to traditional chinese medicine
analysis on correlation between general efficacy and chemical constituents of danggui-chuanxiong herb pair based on artificial neural network
network pharmacology study on major active compounds of fufang danshen formula
a network pharmacology study of chinese medicine qishenyiqi to reveal its underlying multi-compound, multi-target, multi-pathway mode of action
herb network analysis for a famous tcm doctor's prescriptions on treatment of rheumatoid arthritis
a novel network pharmacology approach to analyse traditional herbal formulae: the liu-wei-di-huang pill as a case study
bindingdb: a web-accessible database of experimentally determined protein-ligand binding affinities
network pharmacology study on major active compounds of siwu decoction analogous formulae for treating primary dysmenorrhea of gynecology blood stasis syndrome
computational pharmacological comparison of salvia miltiorrhiza and panax notoginseng used in the therapy of cardiovascular diseases
triphala and its active constituent chebulinic acid are natural inhibitors of vascular endothelial growth factor-a mediated angiogenesis
computational identification of potential microrna network biomarkers for the progression stages of gastric cancer
systems pharmacology strategies for anticancer drug discovery based on natural products
multiscale modeling of drug-induced effects of reduning injection on human disease: from drug molecules to clinical symptoms of disease
drug targeting mycobacterium tuberculosis cell wall synthesis: genetics of dtdp-rhamnose synthetic enzymes and development of a microtiter plate-based screen for inhibitors of conversion of dtdp-glucose to dtdp-rhamnose
bridging the gap between traditional chinese medicine and systems biology: the connection of cold syndrome and nei network
analysis and visualization of biological networks with cytoscape
discovery of novel inhibitors targeting enoyl-acyl carrier protein reductase in plasmodium falciparum by structure-based virtual screening
global mapping of pharmacological space
rediscovering drug discovery
network ethnopharmacology approaches for formulation discovery
integrative approaches for health: biomedical research, ayurveda and yoga
material basis of chinese herbal formulas explored by combining pharmacokinetics with network pharmacology
the disease and gene annotations (dga): an annotation resource for human disease
significant increase in cytotoxic t lymphocytes and natural killer cells by triphala: a clinical phase i study
the human protein atlas as a proteomic resource for biomarker discovery
evaluation of antimicrobial efficacy of herbal alternatives (triphala and green tea polyphenols), mtad, and 5% sodium hypochlorite against enterococcus faecalis biofilm formed on tooth substrate: an in vitro study
evaluation of antimicrobial efficacy of triphala (an indian ayurvedic herbal formulation) and 0.2% chlorhexidine against streptococcus mutans biofilm formed on tooth substrate: an in vitro study
potent growth suppressive activity of curcumin in human breast cancer cells: modulation of wnt/beta-catenin signaling
tcmsp: a database of systems pharmacology for drug discovery from herbal medicines
anti-diabetic activity of medicinal plants and its relationship with their antioxidant property
databases aim to bridge the east-west divide of drug discovery
cytoscape: a software environment for integrated models of biomolecular interaction networks
network pharmacology analyses of the antithrombotic pharmacological mechanism of fufang xueshuantong capsule with experimental support using disseminated intravascular coagulation rats
a network pharmacology approach to understanding the mechanisms of action of traditional medicine: bushenhuoxue formula for treatment of chronic kidney disease
triphala inhibits both in vitro and in vivo xenograft growth of pancreatic tumor cells by inducing apoptosis
aqueous and alcoholic extracts of triphala and their active compounds chebulagic acid and chebulinic acid prevented epithelial to mesenchymal transition in retinal pigment epithelial cells, by inhibiting smad-3 phosphorylation
evaluation of the growth inhibitory activities of triphala against common bacterial isolates from hiv infected patients
antibacterial efficacy of triphala against oral streptococci: an in vivo study
antibacterial potential of the three medicinal fruits used in triphala: an ayurvedic formulation
dissection of mechanisms of chinese medicinal formula realgar-indigo naturalis as an effective treatment for promyelocytic leukemia
a network study of chinese medicine xuesaitong injection to elucidate a complex mode of action with multicompound, multitarget, and multipathway
phytochemical and pharmacological review of da chuanxiong formula: a famous herb pair composed of chuanxiong rhizoma and gastrodiae rhizoma for headache
in silico analysis and experimental validation of active compounds from fructus schisandrae chinensis in protection from hepatic injury
discovery of molecular mechanisms of traditional chinese medicinal formula si-wu-tang using gene expression microarray and connectivity map
biology-oriented synthesis
a network pharmacology approach to evaluating the efficacy of chinese medicine using genome-wide transcriptional expression data
identifying roles of "jun-chen-zuo-shi" component herbs of qishenyiqi formula in treating acute myocardial ischemia by network pharmacology
network-based global inference of human disease genes
the study on the material basis and the mechanism for anti-renal interstitial fibrosis efficacy of rhubarb through integration of metabonomics and network pharmacology
a systems biology-based approach to uncovering the molecular mechanisms underlying the effects of dragon's blood tablet in colitis, involving the integration of chemical analysis, adme prediction, and network pharmacology
study on action mechanism of adjuvant therapeutic effect compound ejiao slurry in treating cancers based on network pharmacology
alternative medicine. intech
network pharmacological research of volatile oil from zhike chuanbei pipa dropping pills in treatment of airway inflammation
navigating traditional chinese medicine network pharmacology and computational tools
modularity-based credible prediction of disease genes and detection of disease subtypes on the phenotype-gene heterogeneous network
small molecule inhibitors of the hpv16-e6 interaction with caspase 8
an integrative platform of tcm network pharmacology and its application on a herbal formula
network understanding of herb medicine via rapid identification of ingredient-target interactions
dbnei2.0: building multilayer network for drug-nei-disease
systems pharmacology dissection of the anti-inflammatory mechanism for the medicinal herb folium eriobotryae
network pharmacology study on the mechanism of traditional chinese medicine for upper respiratory tract infection
a network pharmacology approach to determine active ingredients and rationality of herb combinations of modified simiaowan for treatment of gout
a co-module approach for elucidating drug-disease associations and revealing their molecular basis
deciphering the underlying mechanisms of diesun miaofang in traumatic injury from a systems pharmacology perspective
network pharmacology-based prediction of the multi-target capabilities of the compounds in taohong siwu decoction, and their application in osteoarthritis
a network-based analysis of the types of coronary artery disease from traditional chinese medicine perspective: potential for therapeutics and drug discovery
therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery
multi-target therapeutics: when the whole is greater than the sum of the parts
key: cord-346606-bsvlr3fk authors: siriwardhana, yushan; gür, gürkan; ylianttila, mika; liyanage, madhusanka title: the role of 5g for digital healthcare against covid-19 pandemic: opportunities and challenges date: 2020-11-04 journal: nan doi: 10.1016/j.icte.2020.10.002 sha: doc_id: 346606 cord_uid: bsvlr3fk covid-19 pandemic caused a massive impact on healthcare, social life, and economies on a global scale. clearly, technology has a vital role to enable ubiquitous and accessible digital health services in pandemic conditions as well as against "re-emergence" of covid-19 disease in a post-pandemic era.
accordingly, 5g systems and 5g-enabled e-health solutions are paramount. this paper highlights methodologies to effectively utilize 5g for e-health use cases and its role in enabling relevant digital services. it also provides a comprehensive discussion of the implementation issues, possible remedies, and future research directions for 5g to alleviate the health challenges related to covid-19. the recent spread of coronavirus disease (covid-19) due to severe acute respiratory syndrome coronavirus 2 (sars-cov-2) [1] has caused substantial changes in the lifestyle of communities all over the world. by the end of june 2020, at the time of this writing, over eleven million positive cases of covid-19 had been recorded, causing over 500,000 deaths. countries have been facing a number of healthcare, financial, and societal challenges due to the covid-19 pandemic. healthcare facilities overwhelmed by the rapid growth of new covid-19 patients are experiencing interruptions in the provision of regular health services. moreover, healthcare personnel are also becoming vulnerable to covid-19, and this is taxing the healthcare resources even more. to curb the wide spread of the virus, governments impose strict restrictions and controls on travel within and between countries, negatively affecting the economies. while remote work was considered an alternative with limitations, certain jobs became obsolete. the increased unemployment is a burgeoning problem even for strong economies. apart from that, government expenditure on the unemployed workforce and lost income from sectors associated with tourism, such as airlines, hotels, local transport, and entertainment, were major challenges for the economies. governments had to introduce new guidelines on social distancing to prevent the spread of the virus. this resulted in closing schools, isolating cities, and even restricting public interactions, affecting the regular lifestyle of people. such disruption could lead to unprecedented consequences, such as losing physical and mental well-being. maintaining societal well-being during the covid-19 era is therefore a daunting task. technological advancement is one of the key strengths of the current era for overcoming the challenging circumstances of the covid-19 outbreak. the timely application of relevant technologies will be imperative not only to safeguard, but also to manage, the post-covid-19 world. novel ict technologies such as the internet of things (iot) [2], artificial intelligence (ai) [3], big data, 5g communications, cloud computing, and blockchain [4] can play a vital role in facilitating an environment that fosters the protection and improvement of people and economies. the capabilities they provide for pervasive and accessible health services are crucial to alleviate pandemic-related problems. 5g communications present a paradigm shift from the present mobile networks to provide universal high-rate connectivity and a seamless user experience [5]. 5g networks target delivering 1000x higher mobile data volume per area, 100x higher number of connected devices, 100x higher user data rate, 10x longer battery life for low-power massive machine communications, and 5x reduced end-to-end (e2e) latency [6].
these objectives will be realized by key technologies such as mmwave communication, small cell networks, massive multiple-input multiple-output (mimo), and beamforming [7]. by utilizing these technologies, 5g will mainly support three service classes, i.e., enhanced mobile broadband (embb), ultra-reliable low-latency communication (urllc), and massive machine-type communication (mmtc). the novel 5g networks will be built alongside fundamental technologies such as software defined networking (sdn), network function virtualization (nfv), multi-access edge computing (mec), and network slicing (ns). sdn and nfv enable programmable 5g networks to support the fast deployment and flexible management of 5g services. mec extends intelligence to the edge of the radio network along with higher processing and storage capabilities. ns creates logical networks on a common infrastructure to enable different types of services within 5g networks. these 5g technologies will enable ubiquitous digital health services combating covid-19, described in the following section as 5g based healthcare use cases. however, there are also implementation challenges which need to be mitigated for efficient and high-performance solutions with wide availability and user acceptance, as discussed in section 3. in this work, we elaborate on these aspects and provide an analysis of 5g for healthcare to fight against the covid-19 pandemic and its consequences. the capabilities of 5g technologies can be effectively utilized to address the challenges associated with covid-19 at present and in the post-covid-19 era. existing healthcare services should be tailored to fit the needs of the covid-19 era while developing novel solutions to address the specific issues that originated with the pandemic. in this section, the paper discusses several use cases where 5g is envisaged to play a significant role. these use cases are depicted in figure 1, and the technical requirements of the use cases are outlined in table 1. telehealth is the provision of healthcare services in a remote manner with the use of telecommunication technologies [8]. these services include remote clinical healthcare, health-related education, public health, and health administration, defining a broader scope of services. telemedicine [9] refers to remote clinical services such as healthcare delivery, diagnosis, consultation, and treatment, where a healthcare professional utilizes communication infrastructure to deliver care to a patient at a remote site. telenursing refers to the use of telecommunication technologies to deliver nursing care and conduct nursing practice. telepharmacy is defined as a service which delivers remote pharmaceutical care via telecommunications to patients who do not have direct contact with a pharmacist (e.g., remote delivery of prescription drugs). telesurgery [10] allows surgeons to perform surgical procedures over a remote distance. all these healthcare-related teleservices are highly encouraged in the post-covid-19 period for multiple reasons: the lack of resources (i.e., hospital capacity, human resources, protective equipment) in healthcare facilities due to existing covid-19 patients; social distancing guidelines imposed by authorities; the requirement to maintain regular healthcare services while adhering to new guidelines imposed by healthcare administrations; and the need to minimize the risk of healthcare professionals being exposed to covid-19.
these teleservices sometimes have strict requirements and call for sophisticated underlying technologies for proper functionality. as an example, a telemedicine follow-up visit between the patient and the doctor would require 4k/8k video streaming with low latency and low jitter. telehealth-based remote health education programs should be accessible to students from anywhere via an internet connection with adequate bandwidth. monitoring patients via telenursing also requires an uninterrupted hd/4k video stream between the patient and the nurse. remote delivery of drugs is possible via unmanned aerial vehicles (uavs), which requires assured connectivity with the base station to send/receive control instructions without delays. extreme use cases like telesurgery require ultra-low-latency communication (less than 20 ms e2e latency) between the surgeon and the patient, and connectivity among a number of devices such as cameras, sensors, robots, augmented reality (ar) devices, wearables, and haptic feedback devices [11]. future 5g networks will use the mmwave spectrum, which leads to the deployment of ultra-dense small cell networks, including network connectivity for indoor environments. technologies like massive mimo combined with beamforming will contribute to providing extremely high data rates for a large number of intended users. these technologies together provide better localization for indoor environments [12]. the new radio access technology developed for 5g networks, also known as 5g new radio (nr), supports urllc. the urllc service class helps to realize the ultra-low latency requirements of telesurgery applications. a local 5g operator (l5go) has its core and access network deployed locally on premises and serves the healthcare facility with multiple base stations deployed both outdoors and indoors to provide connectivity for case-specific needs. this deployment is beneficial for the telesurgery use case to achieve ultra-low latency, given the requirement that surgeon and patient be in separate rooms due to the pandemic situation. mec servers deployed at the 5g base stations can be utilized to host the control functions for uavs for proper payload deliveries. the fundamental design changes in 5g networks will enable the communication of a large number of iot devices, which usually transfer less data compared to human activities such as streaming. these mmtc services provide support to 5g-enabled medical iot (miot) devices that can be used to monitor and treat remote patients. mmtc will connect heterogeneous devices to the 5g network and enable communication between them so that they can operate in synchrony. a sensor in a wearable device of the patient can immediately send a signal to the remote nurse via the 5g network so that the nurse can activate special equipment in the patient's room using a mobile device. the use of 5g technologies in a hospital environment for telehealth use cases is illustrated in figure 2.
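to make the sub-20 ms telesurgery budget concrete, the sketch below compares a rough round-trip latency estimate for an on-premises l5go/mec deployment against a distant cloud server. every number is an illustrative assumption (e.g., a ~1 ms urllc radio target and fiber propagation at roughly 200 km/ms), not a measurement from the paper.

```python
# back-of-the-envelope e2e latency comparison motivating edge (mec/l5go)
# deployment for telesurgery. all numbers are illustrative assumptions.

SPEED_IN_FIBER_KM_PER_MS = 200  # ~2/3 of the speed of light, a common rule of thumb

def round_trip_latency_ms(radio_ms, distance_km, processing_ms):
    """radio access delay + two-way propagation (haptic feedback loop)
    + server processing."""
    propagation = 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS
    return radio_ms + propagation + processing_ms

# assumed: ~1 ms radio latency (urllc target), ~3 ms server processing
edge = round_trip_latency_ms(radio_ms=1, distance_km=2, processing_ms=3)
cloud = round_trip_latency_ms(radio_ms=1, distance_km=2000, processing_ms=3)

budget = 20  # ms, the telesurgery requirement cited in the text
print(f"on-premises mec: {edge:.1f} ms (within {budget} ms: {edge < budget})")
print(f"distant cloud:   {cloud:.1f} ms (within {budget} ms: {cloud < budget})")
```

under these assumptions the on-premises deployment stays a few milliseconds total, while a server 2000 km away consumes the whole budget in propagation alone, which is the intuition behind placing control functions on local mec servers.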
the spread of the covid-19 disease demands the rapid launching of new healthcare services/applications, changes in the way present healthcare services are provided [13], and the integration of modern tools such as ai and machine learning (ml) into the data analysis process [14]. a new application can collect the data of covid-19 patients from different healthcare centers, upload the data to a cloud server, and make the information available to the public so that others can rely on the information for different purposes. live video conferencing based interactive applications, which enable healthcare professionals to talk with patients and help them, are another example [15]. other applications would perform regular health monitoring of patients, such as follow-up visits, provide instructions on medical services, and spread knowledge on the present covid-19 situation and up-to-date precautions. the difficulty during the pandemic was that most of the regular work needed to be automated to minimize interaction between people, and new application development needs arose suddenly. this calls for a flexible network infrastructure which supports the development of such applications within a short period of time. in contrast to the present 4g networks, 5g supports the creation of new network services as softwarized network functions (nfs) by utilizing sdn and nfv technologies. these nfs can be hosted at cloud servers, operator premises, or the edge of the network, based on application demands. mec servers, which are equipped with storage and computing power and reside at the edge of the radio network, will be a suitable platform to host these applications. the deployment of such applications will be more flexible in 5g networks because of sdn and nfv. bringing the nfs towards the edge eliminates dependency on the infrastructure beyond the edge, making the applications more reliable. increasing the capacity of the 5g network is much easier because the network itself is programmable. 5g networks are capable of deploying network slices which create logical networks to cater to services with similar types of requirements, such as an iot slice or a low-latency slice, thereby serving applications with guaranteed service levels.
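the sketch below illustrates the slicing idea as a toy service catalog plus an admission check. slice/service types (sst) 1-3 follow the 3gpp convention (1 = embb, 2 = urllc, 3 = miot), but the kpi numbers are assumptions chosen to reflect the requirements discussed in this paper, not values from any standard or from the paper itself.

```python
# an illustrative (non-normative) slice catalog for the e-health services
# discussed above. sst values follow the common 3gpp convention; the kpi
# numbers are assumptions for demonstration only.
SLICE_CATALOG = {
    "telemedicine-video": {"sst": 1, "min_rate_mbps": 25.0, "max_latency_ms": 100},
    "telesurgery":        {"sst": 2, "min_rate_mbps": 10.0, "max_latency_ms": 20},
    "miot-monitoring":    {"sst": 3, "min_rate_mbps": 0.1,  "max_latency_ms": 500},
}

def admit(service: str, rate_mbps: float, latency_ms: float) -> bool:
    """toy admission check: does the offered flow meet the slice's kpis?"""
    kpi = SLICE_CATALOG[service]
    return rate_mbps >= kpi["min_rate_mbps"] and latency_ms <= kpi["max_latency_ms"]

# e.g., a 4k follow-up visit stream offered 30 mbps at 40 ms fits embb,
# but 35 ms is far too slow for the telesurgery slice:
print(admit("telemedicine-video", rate_mbps=30.0, latency_ms=40))  # True
print(admit("telesurgery", rate_mbps=12.0, latency_ms=35))         # False
```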
a surge in demand for personal protective equipment (ppe), ventilators, and certain drugs was observed at the beginning of the covid-19 spread, causing an imbalance in the regular supply chains [16]. manufacturing plants were unable to maintain regular production due to the shortage of raw materials and labor, and were therefore not capable of responding to the increased demand for goods. supplies of finished products were also delayed due to transport restrictions, and there were no proper alternative distribution mechanisms to ensure that the people who really needed the goods would receive them. n95 masks, hand sanitizers, and regular medicine are some of the goods where this imbalance of supply was often seen. those who reacted quickly could stock items in surplus, while others who were in need did not receive them. donations to the victims were not always distributed in a fair manner because of the absence of centralized management systems. delivery of items to the final consumer was a concern due to the risk of covid-19 spread and the restrictions imposed by the authorities to limit physical contact. it is a challenge for governments, healthcare authorities, and distributors to implement proper mechanisms to manage the supply chains of healthcare items in the covid-19 period. to address the issues in healthcare-related supply chains, industries can adopt smart manufacturing techniques equipped with iot sensor networks, automated production lines which dynamically adapt to variations in demand, and sophisticated monitoring systems. iot-based supply chains could be used to properly track products from the manufacturing plant to the end consumer, i.e., connected goods. uav-based automated delivery mechanisms are especially suited to the covid-19 situation, delivering medicine, vaccines, and masks to the end consumer while minimizing physical contact. 5g supports direct connectivity for iot and mmtc between iot devices. this will fuel the possibility of using large numbers of iot devices to increase the efficiency of supply chains. the deployment of l5gos to serve the needs of industries is a better way to integrate iot sensors, actuators, and robots directly into the 5g network, enabling a 5g-based smart manufacturing system. proper network connectivity for the sensors, actuators, and robots in manufacturing plants will be enabled by mmwave 5g small cells deployed indoors. massive mimo will provide connectivity for a large number of devices, and the beamforming technique ensures better quality of the network connection. the direct connectivity of goods to the 5g systems makes the supply chains more transparent. mec integrated with 5g can be used to process the data locally to improve the scalability of the systems as well as the security and privacy of collected data. moreover, mec integrated with 5g can easily be used to implement decentralized solutions via blockchain [17], [18]. the delivery of items to the final destination can be performed via beyond line-of-sight (blos) uavs guided by the 5g network. this could minimize unnecessary interactions in the covid-19 period and reduce human effort. real-time data is available to authorized users for monitoring and tracking, which increases the transparency of the operation, as sketched in the example below.
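the following is a minimal sketch of the "connected goods" tracking idea: each scan or sensor event appends a record, and the chain of custody for an item can be replayed end to end. the device ids, locations, and item names are made-up placeholders, and a production system would persist events through the 5g/mec infrastructure rather than an in-memory list.

```python
# minimal sketch of iot-based supply-chain tracking ("connected goods"):
# every checkpoint appends an event; trace() replays an item's custody.
# all identifiers are illustrative placeholders.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TrackingEvent:
    item_id: str    # e.g., a pallet of n95 masks
    location: str   # checkpoint reporting the event
    status: str     # "manufactured", "in_transit", "delivered", ...
    timestamp: str

LOG: list[TrackingEvent] = []

def record(item_id: str, location: str, status: str) -> None:
    LOG.append(TrackingEvent(item_id, location, status,
                             datetime.now(timezone.utc).isoformat()))

def trace(item_id: str) -> list[TrackingEvent]:
    """replay the chain of custody for one item, oldest first."""
    return [e for e in LOG if e.item_id == item_id]

record("mask-pallet-001", "factory-a", "manufactured")
record("mask-pallet-001", "warehouse-7", "in_transit")
record("mask-pallet-001", "hospital-3", "delivered")
for event in trace("mask-pallet-001"):
    print(event.location, event.status, event.timestamp)
```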
covid-19 positive patients with mild conditions are usually advised to self-isolate to prevent further spread. while self-isolation is a better alternative for managing the capacity of healthcare facilities, self-isolating individuals should be properly monitored to make sure that they follow the self-isolation guidelines. the challenge is to track every movement of the patient, which is currently impossible. in the event of a violation of self-isolation guidelines, control instructions should be sent. mobile device based self-isolation monitoring is possible via an application which sends gps data from the patient's mobile phone at random intervals to a cloud server. wearable devices attached to the patient's body use their sensors to measure the condition of the patient and upload the data via the mobile phone. uav-based solutions can monitor the conditions of patients from a distance; uavs can monitor body temperature via infrared thermography and identify the person via face recognition algorithms. moreover, contact tracing of identified positive cases is extremely important [19]. however, present contact tracing mechanisms involve significant human engagement and consist of a lot of manual work. this prevents the identification of all the possible close contacts and hinders the effectiveness of contact tracing; manual tracing does not guarantee that all possible close contacts are identified. bluetooth low energy (ble) based contact tracing applications use ble wearable devices, which advertise their ids periodically so that other compatible devices in close proximity can capture the id and store it with important details such as a timestamp and, optionally, gps location data. once an infected covid-19 patient is detected, the ble solution provides the ids of the close contacts over a defined period. ble-based solutions identify contacts within a range of a few meters, whereas pure gps-based solutions do not have that accuracy. role of 5g: mmtc in 5g is responsible for the massive connectivity of heterogeneous iot devices such as sensors, wearables, and robots. small cell networks equipped with mimo and beamforming in 5g will ensure better connectivity and positioning, including in indoor environments. hence, iot devices directly connected to the 5g network can be effectively used to monitor compliance with self-isolation. instead of using general mobile device data, patients can be equipped with low-power wearable devices which transfer data via ble technology. those sensor data can be uploaded to the cloud via the 5g network, and authorized parties can monitor the behavior of the patient. a similar concept can be applied to contact tracing, where the wearable ble devices collect data on nearby devices and upload it to the cloud via the 5g network. once a patient tests positive, all the close contact details are already in the cloud.
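as a final illustration of the server-side logic just described, the sketch below assumes wearables have already uploaded (observer, seen, timestamp) sighting records over 5g, and extracts the close contacts of a positive case. the sighting records and the 15-minute exposure threshold are illustrative assumptions, not part of any specific deployed protocol.

```python
# minimal sketch of ble contact-tracing logic: when one user tests positive,
# extract everyone whose accumulated co-presence meets an exposure threshold.
# the sightings and the 15-minute threshold are illustrative assumptions.
from collections import defaultdict

# (observer_id, advertised_id, minute_of_day) -- assumed uploaded sightings
sightings = [
    ("dev-a", "dev-b", 600), ("dev-a", "dev-b", 605), ("dev-a", "dev-b", 618),
    ("dev-a", "dev-c", 601),                      # brief, single sighting
    ("dev-d", "dev-a", 700), ("dev-d", "dev-a", 709), ("dev-d", "dev-a", 716),
]

def close_contacts(positive_id: str, min_minutes: int = 15) -> set[str]:
    """ids whose co-presence span with positive_id meets the threshold."""
    minutes = defaultdict(set)
    for observer, seen, t in sightings:
        if observer == positive_id:
            minutes[seen].add(t)
        elif seen == positive_id:
            minutes[observer].add(t)
    # approximate exposure duration by the span of sighting timestamps
    return {other for other, ts in minutes.items()
            if max(ts) - min(ts) >= min_minutes}

print(close_contacts("dev-a"))  # {'dev-b', 'dev-d'} under these assumptions
```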